torchrun_jsc: torchrun on Jülich Supercomputing Centre

Due to various issues with the torchrun/torch.distributed.run API on Jülich Supercomputing Centre (JSC) systems, this package provides a new launcher called torchrun_jsc that wraps the old one as a drop-in replacement. Despite its name, the package really just provides a fixed torchrun and is therefore portable across all machines. In other words, torchrun_jsc supports a superset of the machines that torchrun supports, so there is no need to special-case torchrun_jsc in your scripts.

The only requirements for its usage are Slurm and a PyTorch version ≥1.9 (earlier versions do not implement the torchrun API). The current solution is hoped to be forward-compatible, but the package will emit warnings if a PyTorch version ≥3 is used.

Installation

PyPI

python -m pip install --no-cache-dir torchrun_jsc

Source

python -m pip install git+https://github.com/HelmholtzAI-FZJ/torchrun_jsc.git

Usage

Modify your launch command as follows:

Old

torchrun [...]
# or
python -m torch.distributed.run [...]

New

torchrun_jsc [...]
# or
python -m torchrun_jsc [...]

Please remember to use srun to start your Python processes; otherwise, the necessary Slurm variables will not be set.
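
For example, a minimal Slurm batch script could look like the following sketch (the resource requests, the port, and the train.py script are placeholders to adjust for your job):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4

# Launch one torchrun_jsc process per node via srun so that the
# necessary Slurm variables (e.g., SLURM_NODEID) are set for each of them.
srun torchrun_jsc \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --nproc_per_node=gpu \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)":29500 \
    train.py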

Advanced usage

You can configure the following environment variables to customize torchrun_jsc's behavior. This can be useful, for example, if you are not using Slurm or cannot rely on its environment variables (see the example after this list).

  • TORCHRUN_JSC_PREFER_ARG_PATCHING: whether to use argument patching or an alternative monkey-patching method. The alternative method does not rely on Slurm but adds another surface for breakage because it depends on private, internal PyTorch APIs. Also, the alternative method is not portable and may not be supported on your system. If this is set to 0, TORCHRUN_JSC_NODE_RANK and TORCHRUN_JSC_HOST_NODE_RANK (see below for both) will not be used. Defaults to 1 ("true", i.e., use argument patching).
  • TORCHRUN_JSC_NODE_RANK: should be set to the executing node's rank. Defaults to the value of SLURM_NODEID; if that variable is not set either, any behavior that depends on this value is skipped.
  • TORCHRUN_JSC_HOST_NODE_RANK: should be set to the node rank of the host node (i.e., the one that is listed as the rendezvous endpoint). Defaults to 0.
  • TORCHRUN_JSC_PREFER_OLD_SOLUTION: whether to always use an old patching solution even if it could be avoided. The new solution is less intrusive, but only available for PyTorch ≥2.5. Defaults to 0 ("false", i.e., do not use the old solution unless necessary).
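
For example, to keep the default argument-patching method on a cluster without Slurm, you can supply the node ranks yourself. A sketch, assuming the rendezvous endpoint runs on node rank 0:

# On the node hosting the rendezvous endpoint:
export TORCHRUN_JSC_NODE_RANK=0
export TORCHRUN_JSC_HOST_NODE_RANK=0
torchrun_jsc [...]

# On every other node, set TORCHRUN_JSC_NODE_RANK to that node's
# rank (1, 2, ...) instead.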

How does it work?

First, if TORCHRUN_JSC_PREFER_ARG_PATCHING=1 (the default), the torchrun_jsc launcher will patch the is_host setting in torchrun's --rdzv_conf argument so that the correct process is recognized as the host process for setting up the communication server. If TORCHRUN_JSC_PREFER_ARG_PATCHING=0, the function for recognizing the host machine will be monkey-patched instead.
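
Conceptually, for a two-node job whose rendezvous endpoint lies on node rank 0, this argument patching amounts to something like the following sketch (all remaining arguments are passed through untouched):

# Effective command on node rank 0 (the host node):
torchrun --rdzv_conf is_host=1 [...]
# Effective command on all other nodes:
torchrun --rdzv_conf is_host=0 [...]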

After that, depending on your PyTorch version, there are several modes of operation to ensure that the correct address is used for rendezvous (the new solution's --local_addr patching is illustrated after this list):

  1. PyTorch ≥3:
    • If using "new solution": Additionally patch the --local_addr argument on the host node to be the same as the given rendezvous endpoint and emit a warning.
    • If using "old solution": Monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, the function setting up node metadata, and emit a warning.
  2. PyTorch ≥2.5 <3:
    • If using "new solution": Additionally patch the --local_addr argument on the host node to be the same as the given rendezvous endpoint.
    • If using "old solution": Monkey-patch the function used to obtain the rendezvous hostname, the function setting up rendezvous metadata, and the function setting up node metadata. (With minor differences in the patching depending on the PyTorch version.)
  3. PyTorch ≥2.4 <2.5: Monkey-patch the function used to obtain the rendezvous hostname and the function setting up rendezvous metadata. (With minor differences in the patching depending on the PyTorch version.)
  4. PyTorch ≥1.9 <2.4: Monkey-patch the function used to obtain the rendezvous hostname.
  5. PyTorch <1.9: If this package is somehow installed for a non-matching PyTorch version, it will error out because the torchrun API does not exist in these versions.
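
To illustrate the new solution, assume the rendezvous endpoint was given as node042:29500 (a placeholder host name and port). The effective command on the host node would then look roughly like this sketch:

# Effective command on the host node only; --local_addr is patched to
# match the host part of the rendezvous endpoint.
torchrun --local_addr node042 --rdzv_endpoint node042:29500 [...]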
