Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Ignore failures during UCX testing #1189

Merged
merged 1 commit into from
Feb 27, 2025

Conversation

markbrown314
Copy link
Collaborator

There two issues with UCX in CI #1148 and #1048 that are under debug. Temporarily disable the test until these tickets are root caused and resolved inorder to reduce noise in CI results.

@markbrown314
Copy link
Collaborator Author

UCX testing in CI mostly fails which leads to noise in the CI results. I propose disabling the tests temporarily until the issue is root caused.

Copy link
Collaborator

@avincigu avincigu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little bit against this. The UCX tests actually found bugs when I was developing the scan algorithms. We know the signature for the bugs, and though it is very annoying, we can usually just rerun the test so it passes.

@markbrown314
Copy link
Collaborator Author

I am a little bit against this. The UCX tests actually found bugs when I was developing the scan algorithms. We know the signature for the bugs, and though it is very annoying, we can usually just rerun the test so it passes.

Ok, that is fair. The failure occurs intermittently and always occurs in the UCX cma endpoint flush routine during shmem_transport_fini(). This indicates there is some form of memory corruption occurring, but it seems to only be reproducible in the GitHub CI test harness.

How would you feel about this? I could define a configure flag to disable the flush on close in UCX transport (by disabling the UCP_EP_CLOSE_MODE_FLUSH flag from the call to ucp_ep_close_nb()). This disable flag would be enabled during the CI testing until we root cause the issue. This way the UCX testing still occurs but that specific bug is masked.

@avincigu
Copy link
Collaborator

I am a little bit against this. The UCX tests actually found bugs when I was developing the scan algorithms. We know the signature for the bugs, and though it is very annoying, we can usually just rerun the test so it passes.

Ok, that is fair. The failure occurs intermittently and always occurs in the UCX cma endpoint flush routine during shmem_transport_fini(). This indicates there is some form of memory corruption occurring, but it seems to only be reproducible in the GitHub CI test harness.

How would you feel about this? I could define a configure flag to disable the flush on close in UCX transport (by disabling the UCP_EP_CLOSE_MODE_FLUSH flag from the call to ucp_ep_close_nb()). This disable flag would be enabled during the CI testing until we root cause the issue. This way the UCX testing still occurs but that specific bug is masked.

If it is easy enough to do, I would very much prefer this solution, however, only if you think it is a simple fix. If it involves a lot of work, forget about it and we can disable the whole thing.

Copy link
Collaborator

@abrooks98 abrooks98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, disabling an entire section in CI should be the last possible resort. Either adding a config to disable the failing portion or adding the ability to skip specific tests during make check should be explored first

@bcmIntc
Copy link
Collaborator

bcmIntc commented Feb 27, 2025

Another option may be to add an exception framework to CI for 'known issues' - when detected re-test. That way we don't bury a bug.

@markbrown314
Copy link
Collaborator Author

markbrown314 commented Feb 27, 2025

Github actions has job setting that will convert errors into warnings: continue-on-error.

This might be a good compromise. All of the tests are run, but erroneous results for UCX's make check are converted into warnings and are annotated in the results output.

Copy link
Collaborator

@abrooks98 abrooks98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the compromise. I would suggest rewording the PR title, otherwise LGTM

@markbrown314
Copy link
Collaborator Author

Thanks for the compromise. I would suggest rewording the PR title, otherwise LGTM

Good catch!

@markbrown314 markbrown314 changed the title CI: Temporarily disable UCX testing CI: Ignore failures during UCX testing Feb 27, 2025
There two issues with UCX in CI Sandia-OpenSHMEM#1148 and Sandia-OpenSHMEM#1048 that are under debug.
Temporarily ignore the failures in UCX until these tickets are root caused and
resolved inorder to reduce noise in CI results.

All of the UCX tests are run, but erroneous results for UCX's "make check" are
converted into warnings and are annotated in the results output.

Signed-off-by: Mark F. Brown <mark.f.brown@intel.com>
@markbrown314 markbrown314 merged commit 767fe2a into Sandia-OpenSHMEM:main Feb 27, 2025
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants