Due to various issues with the `torchrun`/`torch.distributed.run` API on Jülich Supercomputing Centre (JSC) systems, this package provides a new launcher called `torchrun_jsc` that wraps the old one as a drop-in replacement. This package really just provides a fixed `torchrun`, so contrary to the name of this package, it is portable across all machines. In other words, `torchrun_jsc` supports a superset of the machines that `torchrun` supports, so there is no need to special-case `torchrun_jsc` in your scripts.
The only requirements for its usage are Slurm and a PyTorch version ≥1.9 (earlier versions do not implement the `torchrun` API). The current solution is hoped to be forward-compatible, but the package will emit warnings if a PyTorch version ≥3 is used.
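To quickly check which PyTorch version is installed in your environment:

```bash
# Print the installed PyTorch version; torchrun_jsc needs >=1.9.
python -c "import torch; print(torch.__version__)"
```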
```bash
python -m pip install --no-cache-dir torchrun_jsc
```

or, to install directly from the repository:

```bash
python -m pip install git+https://github.com/HelmholtzAI-FZJ/torchrun_jsc.git
```
Modify your execution like the following:

Old:

```bash
torchrun [...]
# or
python -m torch.distributed.run [...]
```

New:

```bash
torchrun_jsc [...]
# or
python -m torchrun_jsc [...]
```
Please remember to use `srun` to start your Python processes, otherwise necessary Slurm variables will not be set.
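For illustration, a minimal Slurm batch script could look like the following sketch. The resource numbers, the rendezvous port, and `train.py` are placeholder assumptions, not values prescribed by `torchrun_jsc`:

```bash
#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4

# Use the first node in the allocation as the rendezvous endpoint.
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"

# srun starts one torchrun_jsc launcher per node and sets the Slurm
# variables (e.g., SLURM_NODEID) that torchrun_jsc relies on.
srun torchrun_jsc \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR":29500 \
    train.py
```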
You can configure the following environment variables to customize `torchrun_jsc`'s behavior. This can, for example, be useful if you are not using Slurm or cannot rely on its environment variables.
- `TORCHRUN_JSC_PREFER_ARG_PATCHING`: whether to use argument patching or an alternative monkey-patching method. The alternative method does not rely on Slurm but adds another surface for breakage because it depends on private, internal PyTorch APIs. Also, the alternative method is not portable and may not be supported on your system. If this is set to `0`, `TORCHRUN_JSC_NODE_RANK` and `TORCHRUN_JSC_HOST_NODE_RANK` (see below for both) will not be used. Defaults to `1` ("true", i.e., use argument patching).
- `TORCHRUN_JSC_NODE_RANK`: should be set to the executing node's rank. Defaults to the value of `SLURM_NODEID`; if that variable isn't set either, behavior depending on this value is ignored.
- `TORCHRUN_JSC_HOST_NODE_RANK`: should be set to the node rank of the host node (i.e., the one that is listed as the rendezvous endpoint). Defaults to `0`.
- `TORCHRUN_JSC_PREFER_OLD_SOLUTION`: whether to always use an old patching solution even if it could be avoided. The new solution is less intrusive but only available for PyTorch ≥2.5. Defaults to `0` ("false", i.e., do not use the old solution unless necessary).
First, if `TORCHRUN_JSC_PREFER_ARG_PATCHING=1` (the default), the `torchrun_jsc` launcher will patch `torchrun`'s `--rdzv_conf` argument's `is_host` configuration so that the correct process is recognized as the host process for setting up the communication server. If `TORCHRUN_JSC_PREFER_ARG_PATCHING=0`, the function for recognizing the host machine will be patched instead.
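Conceptually, the argument patching is comparable to the following hand-written shell logic. This is only an illustrative sketch; `torchrun_jsc` performs the equivalent internally in Python:

```bash
# Determine this node's rank; if neither variable is set, torchrun_jsc
# skips the rank-dependent behavior (simplified to 0 here).
NODE_RANK="${TORCHRUN_JSC_NODE_RANK:-${SLURM_NODEID:-0}}"
HOST_NODE_RANK="${TORCHRUN_JSC_HOST_NODE_RANK:-0}"

# Only the node matching the host node rank hosts the rendezvous server.
if [ "$NODE_RANK" -eq "$HOST_NODE_RANK" ]; then
    IS_HOST=1
else
    IS_HOST=0
fi

# Hand the result to torchrun via its rendezvous configuration.
torchrun --rdzv_conf="is_host=$IS_HOST" [...]
```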
After that, depending on your PyTorch version, one of several modes is used to ensure that the correct address is used for rendezvous:
- PyTorch ≥3:
  - If using the "new solution": additionally patch the `--local_addr` argument on the host node to be the same as the given rendezvous endpoint and emit a warning.
  - If using the "old solution": monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, and the function setting up node metadata, and emit a warning.
- PyTorch ≥2.5, <3:
  - If using the "new solution": additionally patch the `--local_addr` argument on the host node to be the same as the given rendezvous endpoint (see the sketch after this list).
  - If using the "old solution": monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, and the function setting up node metadata. (With minor differences in the patching depending on the PyTorch version.)
- PyTorch ≥2.4, <2.5: monkey-patch the function used to obtain the rendezvous hostname and the function setting up rendezvous metadata. (With minor differences in the patching depending on the PyTorch version.)
- PyTorch ≥1.9, <2.4: monkey-patch the function used to obtain the rendezvous hostname.
- PyTorch <1.9: if this package is somehow installed for a non-matching PyTorch version, it will error out because the `torchrun` API does not exist in these versions.
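As a rough sketch of the "new solution", its effect on the host node is comparable to passing `--local_addr` by hand. The endpoint value is an illustrative assumption:

```bash
RDZV_ENDPOINT="node0:29500"       # placeholder rendezvous endpoint
RDZV_HOST="${RDZV_ENDPOINT%%:*}"  # host part, without the port

if [ "${TORCHRUN_JSC_NODE_RANK:-${SLURM_NODEID:-0}}" -eq \
     "${TORCHRUN_JSC_HOST_NODE_RANK:-0}" ]; then
    # Host node: bind under the same address that the other nodes
    # will use to reach the rendezvous endpoint.
    torchrun --rdzv_endpoint="$RDZV_ENDPOINT" \
        --local_addr="$RDZV_HOST" [...]
else
    torchrun --rdzv_endpoint="$RDZV_ENDPOINT" [...]
fi
```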