Due to various issues with the `torchrun`/`torch.distributed.run` API on Jülich Supercomputing Centre (JSC) systems, this package provides a new launcher called `torchrun_jsc` that wraps the old one as a drop-in replacement. This package really just provides a fixed `torchrun`, so contrary to the name of this package, it is portable across all machines. In other words, `torchrun_jsc` supports a superset of the machines that `torchrun` supports, so there is no need to special-case `torchrun_jsc` in your scripts.
The only requirements for its usage are Slurm and a PyTorch version ≥1.9 (earlier versions do not implement the `torchrun` API). The current solution is hoped to be forward-compatible, but the package will emit warnings if a PyTorch version ≥3 is used.
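To quickly check which PyTorch version is installed in your environment:

```bash
# Print the installed PyTorch version; torchrun_jsc needs >=1.9.
python -c "import torch; print(torch.__version__)"
```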
```bash
python -m pip install --no-cache-dir torchrun_jsc
```

or, to install directly from the repository:

```bash
python -m pip install git+https://github.com/HelmholtzAI-FZJ/torchrun_jsc.git
```
Modify your execution like the following:

Old:

```bash
torchrun [...]
# or
python -m torch.distributed.run [...]
```

New:

```bash
torchrun_jsc [...]
# or
python -m torchrun_jsc [...]
```
Please remember to use `srun` to start your Python processes, otherwise necessary Slurm variables will not be set.
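For illustration, a minimal Slurm batch script could look like the following sketch. The resource numbers, the rendezvous port, and `train.py` are placeholder assumptions, not values prescribed by `torchrun_jsc`:

```bash
#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4

# Use the first node in the allocation as the rendezvous endpoint.
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"

# srun starts one torchrun_jsc launcher per node and sets the Slurm
# variables (e.g., SLURM_NODEID) that torchrun_jsc relies on.
srun torchrun_jsc \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR":29500 \
    train.py
```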
You can configure the following environment variables to customize `torchrun_jsc`'s behavior. This can, for example, be useful if you are not using Slurm or cannot rely on its environment variables.
- `TORCHRUN_JSC_PREFER_ARG_PATCHING`: whether to use argument patching or an alternative monkey-patching method. The alternative method does not rely on Slurm but adds another surface for breakage because it depends on private, internal PyTorch APIs. Also, the alternative method is not portable and may not be supported on your system. If this is set to `0`, `TORCHRUN_JSC_NODE_RANK` and `TORCHRUN_JSC_HOST_NODE_RANK` (see below for both) will not be used. Defaults to `1` ("true", i.e., use argument patching).
- `TORCHRUN_JSC_NODE_RANK`: should be set to the executing node's rank. Defaults to the value of `SLURM_NODEID`; if that variable isn't set either, behavior depending on this value is ignored.
- `TORCHRUN_JSC_HOST_NODE_RANK`: should be set to the node rank of the host node (i.e., the one that is listed as the rendezvous endpoint). Defaults to `0`.
- `TORCHRUN_JSC_PREFER_OLD_SOLUTION`: whether to always use an old patching solution even if it could be avoided. The new solution is less intrusive but only available for PyTorch ≥2.5. Defaults to `0` ("false", i.e., do not use the old solution unless necessary).
First, if `TORCHRUN_JSC_PREFER_ARG_PATCHING=1` (the default), the `torchrun_jsc` launcher will patch `torchrun`'s `--rdzv_conf` argument's `is_host` configuration so that the correct process is recognized as the host process for setting up the communication server. If `TORCHRUN_JSC_PREFER_ARG_PATCHING=0`, the function for recognizing the host machine will be patched instead.
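Conceptually, the argument patching is comparable to the following hand-written shell logic. This is only an illustrative sketch; `torchrun_jsc` performs the equivalent internally in Python:

```bash
# Determine this node's rank; if neither variable is set, torchrun_jsc
# skips the rank-dependent behavior (simplified to 0 here).
NODE_RANK="${TORCHRUN_JSC_NODE_RANK:-${SLURM_NODEID:-0}}"
HOST_NODE_RANK="${TORCHRUN_JSC_HOST_NODE_RANK:-0}"

# Only the node matching the host node rank hosts the rendezvous server.
if [ "$NODE_RANK" -eq "$HOST_NODE_RANK" ]; then
    IS_HOST=1
else
    IS_HOST=0
fi

# Hand the result to torchrun via its rendezvous configuration.
torchrun --rdzv_conf="is_host=$IS_HOST" [...]
```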
After that, depending on your PyTorch version, one of several modes is used to ensure that the correct address is used for rendezvous:
- PyTorch ≥3:
  - If using the "new solution": additionally patch the `--local_addr` argument on the host node to be the same as the given rendezvous endpoint and emit a warning.
  - If using the "old solution": monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, and the function setting up node metadata, and emit a warning.
- PyTorch ≥2.5, <3:
  - If using the "new solution": additionally patch the `--local_addr` argument on the host node to be the same as the given rendezvous endpoint (see the sketch after this list).
  - If using the "old solution": monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, and the function setting up node metadata. (With minor differences in the patching depending on the PyTorch version.)
- PyTorch ≥2.4, <2.5: monkey-patch the function used to obtain the rendezvous hostname and the function setting up rendezvous metadata. (With minor differences in the patching depending on the PyTorch version.)
- PyTorch ≥1.9, <2.4: monkey-patch the function used to obtain the rendezvous hostname.
- PyTorch <1.9: if this package is somehow installed for a non-matching PyTorch version, it will error out because the `torchrun` API does not exist in these versions.
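As a rough sketch of the "new solution", its effect on the host node is comparable to passing `--local_addr` by hand. The endpoint value is an illustrative assumption:

```bash
RDZV_ENDPOINT="node0:29500"       # placeholder rendezvous endpoint
RDZV_HOST="${RDZV_ENDPOINT%%:*}"  # host part, without the port

if [ "${TORCHRUN_JSC_NODE_RANK:-${SLURM_NODEID:-0}}" -eq \
     "${TORCHRUN_JSC_HOST_NODE_RANK:-0}" ]; then
    # Host node: bind under the same address that the other nodes
    # will use to reach the rendezvous endpoint.
    torchrun --rdzv_endpoint="$RDZV_ENDPOINT" \
        --local_addr="$RDZV_HOST" [...]
else
    torchrun --rdzv_endpoint="$RDZV_ENDPOINT" [...]
fi
```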