GPU Mapping #4326

Merged: 2 commits merged into AMReX-Codes:development from perlmutter_gpu_mapping on Feb 21, 2025
Conversation

WeiqunZhang (Member) commented:

For perlmutter and frontier, if there are multiple devices visible to each MPI rank, we will try to map each rank to the GPU closest to its core.

For an FFT test on perlmutter using 256 nodes, the correct mapping reduced the run time from 0.172 to 0.127. Note that you can achieve a similar effect with `srun ... bash -c "export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID)); ..."` by manually limiting the number of visible devices, but in this commit we try to do this automatically for the user. Also note that MPI appears to crash with `gpu-bind=closest` on perlmutter, so we need to use `gpu-bind=none`.
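
For illustration, here is a minimal C++ sketch of that per-rank device selection, assuming `SLURM_LOCALID` holds the node-local rank; the helper name and the exact reversal logic are hypothetical and only mirror the manual `CUDA_VISIBLE_DEVICES` trick above, not the commit's actual code:

```cpp
// Minimal sketch (not the commit's actual code): pick one device per rank,
// reversing the device order as in CUDA_VISIBLE_DEVICES=$((3-SLURM_LOCALID)).
#include <cuda_runtime.h>
#include <cstdlib>

void select_device_for_local_rank ()  // hypothetical helper name
{
    const char* s = std::getenv("SLURM_LOCALID");
    if (s == nullptr) { return; }  // not launched with srun; keep the default
    int local_rank = std::atoi(s);
    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);
    if (ndevices > 1) {
        // On perlmutter the GPU closest to a rank's cores is numbered in
        // the opposite order of the local rank, hence the reversed index.
        cudaSetDevice(ndevices - 1 - (local_rank % ndevices));
    }
}
```

In a real application this would run after MPI initialization and before any other CUDA calls, so that all subsequent allocations land on the selected device.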

For frontier, you could use `gpu-bind=closest`. But if you use `gpu-bind=none`, this commit will try to do the correct mapping for you.
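
As a sketch of what the frontier mapping might look like with `gpu-bind=none`, assuming 8 ranks per node bound to cores in order and the core-to-GCD affinity published in the OLCF user guide; the permutation table and helper name are assumptions, not the commit's actual code:

```cpp
// Hypothetical sketch for frontier with gpu-bind=none: map each node-local
// rank to the GCD closest to its cores, per the assumed node topology.
#include <hip/hip_runtime.h>
#include <cstdlib>

void select_frontier_device ()  // hypothetical helper name
{
    // Assumed closest-GCD permutation for ranks bound to cores in order:
    // cores 0-7 -> GCD 4, 8-15 -> 5, 16-23 -> 2, 24-31 -> 3,
    // 32-39 -> 6, 40-47 -> 7, 48-55 -> 0, 56-63 -> 1.
    static const int closest_gcd[8] = {4, 5, 2, 3, 6, 7, 0, 1};
    if (const char* s = std::getenv("SLURM_LOCALID")) {
        int local_rank = std::atoi(s);
        int ndevices = 0;
        hipGetDeviceCount(&ndevices);
        if (ndevices == 8) {  // all 8 GCDs visible, i.e., gpu-bind=none
            hipSetDevice(closest_gcd[local_rank % 8]);
        }
    }
}
```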

In this commit, we also removed the old machine-related code and added new code for machine detection.
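
A minimal sketch of what environment-based machine detection could look like, assuming `NERSC_HOST` is set on perlmutter and `LMOD_SYSTEM_NAME` on frontier (both env vars are assumptions; the returned strings follow the `Machine::name()` values used in the diff below):

```cpp
// Sketch of environment-based machine detection; the env vars checked here
// are assumptions, and the returned strings match Machine::name() in the diff.
#include <cstdlib>
#include <string>

std::string detect_machine ()  // hypothetical helper name
{
    if (const char* s = std::getenv("NERSC_HOST")) {
        if (std::string(s) == "perlmutter") { return "nersc.perlmutter"; }
    }
    if (const char* s = std::getenv("LMOD_SYSTEM_NAME")) {
        if (std::string(s) == "frontier") { return "olcf.frontier"; }
    }
    return "unknown";
}
```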

@WeiqunZhang requested a review from @atmyers on February 7, 2025
```cpp
if ((Machine::name() != "nersc.perlmutter") &&
    (Machine::name() != "olcf.frontier"))
{
    amrex::Warning("Multiple GPUs are visible to each MPI rank. This is usually not an issue. But this may lead to incorrect or suboptimal rank-to-GPU mapping.");
```
A reviewer (Member) commented on this hunk:

I think with the implementation below, we need to be more precise now: for the machines we implement logic for, we should post something like: "Fixing GPU assignment for Frontier according to heuristics..." or so?
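
Something along these lines, as a sketch of the suggested message (wording assumed); it uses `Machine::name()` as it appears in the diff above:

```cpp
// Hypothetical realization of the reviewer's suggestion: announce the
// heuristic mapping when it is actually applied, instead of a generic warning.
amrex::Print() << "Fixing GPU assignment for " << Machine::name()
               << " according to heuristics...\n";
```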

@WeiqunZhang force-pushed the perlmutter_gpu_mapping branch from d779130 to 947955a on February 14, 2025
@ax3l (Member) left a comment:

Thank you! LGTM 👍

@ax3l self-assigned this on Feb 21, 2025
@ax3l merged commit bfd1f11 into AMReX-Codes:development on Feb 21, 2025. 75 checks passed.