Describe the feature you'd like
The braket_container.py script that launches the user-provided algorithm script is currently not thread safe. When running a multi-node job parallelized through MPI (the hyperparameter sagemaker_mpi_enabled causes SageMaker to invoke braket_container.py with mpirun), this can create race conditions, in particular in the step that downloads, extracts, and makes available the user-provided code.
The braket_container.py script should be made thread safe to account for jobs running on multiple instances or (GPU) cores with sagemaker_mpi_enabled=True.
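One possible direction, as a minimal sketch rather than a proposed patch: guard the download-and-extract step with an exclusive file lock, so that only one MPI rank per node performs it while the others wait and then reuse the result. The function name run_once_per_node, the marker file names, and the setup callback are hypothetical and do not exist in braket_container.py. The sketch also assumes a POSIX system (fcntl) and a working directory that is local to each node.

```python
import fcntl
import os
from pathlib import Path
from typing import Callable


def run_once_per_node(workdir: str, setup: Callable[[], None]) -> bool:
    """Run `setup` at most once per working directory.

    Returns True if this process ran `setup`, False if another process
    (or an earlier call) had already completed it.
    Hypothetical helper for illustration; not part of braket_container.py.
    """
    Path(workdir).mkdir(parents=True, exist_ok=True)
    lock_path = os.path.join(workdir, ".setup.lock")
    done_path = os.path.join(workdir, ".setup.done")

    with open(lock_path, "w") as lock_file:
        # Blocks until every other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(done_path):
                # Another rank already downloaded and extracted the code.
                return False
            setup()  # e.g. download and extract the user-provided code
            Path(done_path).touch()  # mark completion only after success
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each rank would call run_once_per_node with the same directory and a callback that performs the download and extraction. Because the completion marker is written only after setup returns, a rank that crashes midway leaves no marker, and the next rank to acquire the lock retries the step.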
How would this feature be used? Please describe.
The user shouldn't have to worry about this feature and, specifically, shouldn't have to change the braket_container.py script if they want to use MPI support for the jobs.
Additional context
There is an example of a simple workaround for this issue in the amazon-braket-examples repository. I have created an issue there to document that it doesn't ultimately solve the problem. But, actually, I think this should be addressed here.