-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get job participating clients from Job Meta #3270
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
nvidianz
reviewed
Mar 3, 2025
nvidianz
reviewed
Mar 3, 2025
nvidianz
reviewed
Mar 3, 2025
nvidianz
approved these changes
Mar 3, 2025
/build |
1 similar comment
/build |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes # .
Description
At the start of a job, the CJ sends a sync message to the server to get information about the job's participating clients. This mechanism has a few issues:
CJs are started at about the same time. If there are many clients (e.g. there could be thousands of clients for handling edge devices), the SP will be under a lot of pressure to process so many messages at the same time.
In case of Simulator, if the number of subprocesses is smaller than the number of clients, then some clients will be swapped in and out. Each time a client is swapped in, it needs to send the sync message again. But if this happens after the job is already done, it will fail, as reported by QA.
This PR gets rid of the sync mechanism from CJs. Once the SP starts to run a job, it already knows what clients will participate in the job. SP then includes such info in the job's metadata in the message to CPs to start the job. Once started, the CJ no longer needs to send the sync message to SP to get participating clients since it can get this info from the job's metadata.
This PR also cleans up some log messages by changing their log level to DEBUG.
NOTE: this PR does not change SJ's logic to get job clients from SP via the sync message, due to the large scope of changes. Since there is always only a single SJ for each job, it won't cost much for the SP to process one message from the SJ.
Types of changes
./runtest.sh
.