-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Ignore failures during UCX testing #1189
Conversation
UCX testing in CI mostly fails which leads to noise in the CI results. I propose disabling the tests temporarily until the issue is root caused. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little bit against this. The UCX tests actually found bugs when I was developing the scan algorithms. We know the signature for the bugs, and though it is very annoying, we can usually just rerun the test so it passes.
Ok, that is fair. The failure occurs intermittently and always occurs in the UCX cma endpoint flush routine during shmem_transport_fini(). This indicates there is some form of memory corruption occurring, but it seems to only be reproducible in the GitHub CI test harness. How would you feel about this? I could define a configure flag to disable the flush on close in UCX transport (by disabling the UCP_EP_CLOSE_MODE_FLUSH flag from the call to ucp_ep_close_nb()). This disable flag would be enabled during the CI testing until we root cause the issue. This way the UCX testing still occurs but that specific bug is masked. |
If it is easy enough to do, I would very much prefer this solution, however, only if you think it is a simple fix. If it involves a lot of work, forget about it and we can disable the whole thing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, disabling an entire section in CI should be the last possible resort. Either adding a config to disable the failing portion or adding the ability to skip specific tests during make check
should be explored first
Another option may be to add an exception framework to CI for 'known issues' - when detected re-test. That way we don't bury a bug. |
Github actions has job setting that will convert errors into warnings: continue-on-error. This might be a good compromise. All of the tests are run, but erroneous results for UCX's make check are converted into warnings and are annotated in the results output. |
cfdfb3c
to
437608f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the compromise. I would suggest rewording the PR title, otherwise LGTM
Good catch! |
437608f
to
cc4cc2e
Compare
There two issues with UCX in CI Sandia-OpenSHMEM#1148 and Sandia-OpenSHMEM#1048 that are under debug. Temporarily ignore the failures in UCX until these tickets are root caused and resolved inorder to reduce noise in CI results. All of the UCX tests are run, but erroneous results for UCX's "make check" are converted into warnings and are annotated in the results output. Signed-off-by: Mark F. Brown <mark.f.brown@intel.com>
There two issues with UCX in CI #1148 and #1048 that are under debug. Temporarily disable the test until these tickets are root caused and resolved inorder to reduce noise in CI results.