diff --git a/docs/hpc/6lonestar/ml.md b/docs/hpc/6lonestar/ml.md
index 15329a4..61f060b 100644
--- a/docs/hpc/6lonestar/ml.md
+++ b/docs/hpc/6lonestar/ml.md
@@ -1,160 +1,73 @@
 ## Machine Learning on LS6 { #ml }

-Lonestar6 is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and Tensorflow. We recommend using the Python virtual environment to manage machine learning packages.
+Lonestar6 is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and Tensorflow. We recommend using the Python virtual environment to manage machine learning packages. Below we detail how to install PyTorch on our systems with a virtual environment:

-### Running PyTorch { #ml-pytorch }
-
-Install Pytorch and TensorBoard.
-
-1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:
+### Install PyTorch
+1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using TACC's [`idev`][TACCIDEV] utility:

     ```cmd-line
     login$ idev -p gpu-a100-dev -N 1 -n 1 -t 1:00:00
     ```

-1. Create a Python virtual environment:
-
+1. Create a Python virtual environment:
     ```cmd-line
-    c123-456.ls6$ module load python3/3.9.7
-    c123-456.ls6$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
+    c123-456$ module load python3/3.9.7
+    c123-456$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
     ```

 1. Activate the Python virtual environment:
-
     ```cmd-line
-    c123-456.ls6$ source /path/to/virtual-env/bin/activate
+    c123-456$ source /path/to/virtual-env/bin/activate
     ```

-1. Now install PyTorch and TensorBoard:
-
+1. Now install PyTorch:
     ```cmd-line
-    c123-456.ls6$ pip3 install torch==1.12.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
-    c123-456.ls6$ pip3 install tensorboard
+    c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
     ```

-#### Single-Node { #ml-pytorch-singlnode }
+### Testing PyTorch Installation

-1. Download the benchmark:
+To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: [Distributed Data Parallel in PyTorch](https://pytorch.org/tutorials/beginner/ddp_series_intro.html). These tutorials include several scripts set up to run single-node training and multi-node training.
+#### Single-Node
+
+1. Download the benchmark:

     ```cmd-line
-    c123-456.ls6$ cd $SCRATCH
-    c123-456.ls6$ git clone https://github.com/gpauloski/kfac-pytorch.git
-    c123-456.ls6$ cd kfac-pytorch
-    c123-456.ls6$ git checkout tags/v0.3.2
-    c123-456.ls6$ pip3 install -e .
-    c123-456.ls6$ pip3 install torchinfo tqdm Pillow
-    c123-456.ls6$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
+    c123-456$ cd $SCRATCH
+    c123-456$ git clone https://github.com/pytorch/examples.git
     ```

 1. Run the benchmark on one node (3 GPUs):
-
     ```cmd-line
-    c123-456.ls6$ python3 -m torch.distributed.launch --nproc_per_node=3 examples/torch_cifar10_resnet.py --kfac-update-freq 0
+    c123-456$ torchrun --nproc_per_node=3 examples/distributed/ddp-tutorial-series/multigpu_torchrun.py 50 10
     ```
+
+#### Multi-Node

-#### Multi-Node { #ml-pytorch-multinode }
-
-1. Request two nodes in the `gpu-a100-dev` queue using the [`idev`](../../software/idev) utility:
-
+1. Request two nodes in the [`gpu-a100-dev`](#queues) queue using TACC's [`idev`][TACCIDEV] utility:
     ```cmd-line
     login2.ls6$ idev -N 2 -n 2 -p gpu-a100-dev -t 01:00:00
     ```

-1. Activate the Python virtual environment:
-
-    ```cmd-line
-    c123-456.ls6$ source /path/to/virtual-env/bin/activate
-    ```
-
 1. Move to the benchmark directory:
-
     ```cmd-line
-    c123-456.ls6$ cd $SCRATCH/kfac-pytorch
+    c123-456$ cd $SCRATCH
     ```

-1. Create a script called "`run.sh`". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
+1. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".

-    ```job-script
+    ```file
     #!/bin/bash
     HOST=$1
     NODES=$2
     LOCAL_RANK=${PMI_RANK}
-    python3 -m torch.distributed.launch --nproc_per_node=3 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
-    examples/torch_cifar10_resnet.py --kfac-update-freq 0
+    torchrun --nproc_per_node=3 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
+    examples/distributed/ddp-tutorial-series/multinode.py 50 10
     ```

 1. Run multi-gpu training:
-
-    ```cmd-line
-    c123-456.ls6$ ibrun -np 2 ./run.sh c123-456 2
-    ```
-
-### Running Tensorflow { #ml-tensorflow }
-
-Follow these instructions to install and run TensorFlow benchmarks on Lonestar6's A100. Lonestar6's A100 runs TensorFlow 2.8.2 with Python 3.7.13. Lonestar6's supports CUDA/11.3, CUDA/11.4, and CUDA/12.0. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.
-
-1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:
-
-    ```cmd-line
-    login2.ls6$ idev -N 1 -n 1 -p gpu-a100-dev -t 01:00:00
-    ```
-
-1. Create a Python virtual environment:
-
-    ```cmd-line
-    c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
-    c123-456.ls6$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test
-    ```
-
-1. Activate the Python virtual environment:
-
-    ```cmd-line
-    c123-456.ls6$ source /path/to/virtual-env/bin/activate
-    ```
-
-1. Install TensorFlow and Horovod:
-
-    ```cmd-line
-    c123-456.ls6$ pip3 install tensorflow-gpu==2.8.2
-    ```
-
-    We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.
-
-    ```cmd-line
-    c123-456.ls6$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
-    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0
-    ```
-
-#### Single-Node { #ml-tensorflow-singlenode }
-
-1. Download the tensorflow benchmark to your `$SCRATCH` directory, then check out the branch that matches your tensorflow version.
-
-    ```cmd-line
-    c123-456.ls6$ cds; git clone https://github.com/tensorflow/benchmarks.git
-    c123-456.ls6$ cd benchmarks
-    c123-456.ls6$ git checkout 51d647f # master head as of 08/18/2022
-    ```
-
-1. Load modules and activate the Python virtual environment:
-
-    ```cmd-line
-    c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
-    c123-456.ls6$ source /path/to/virtual-env/bin/activate
-    ```
-
-1. Benchmark the performance with synthetic dataset on 1 GPU:
-
-    ```cmd-line
-    c123-456.ls6$ cd scripts/tf_cnn_benchmarks
-    c123-456.ls6$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
-    ```
-
-1. Benchmark the performance with synthetic dataset on 3 GPUs:
-
     ```cmd-line
-    c123-456.ls6$ cd scripts/tf_cnn_benchmarks
-    c123-456.ls6$ ibrun -np 3 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
-    --model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
+    c123-456$ ibrun -np 2 ./run.sh c123-456 2
     ```
diff --git a/docs/hpc/6lonestar/notices.md b/docs/hpc/6lonestar/notices.md
index 35d2d4e..51a6fcd 100644
--- a/docs/hpc/6lonestar/notices.md
+++ b/docs/hpc/6lonestar/notices.md
@@ -1,5 +1,5 @@
 # Lonestar6 User Guide
-*Last update: June 28, 2024*
+*Last update: September 18, 2024*

 ## Notices { #notices }
diff --git a/docs/hpc/frontera.md b/docs/hpc/frontera.md
index eabc1d2..4fbd01c 100644
--- a/docs/hpc/frontera.md
+++ b/docs/hpc/frontera.md
@@ -1,5 +1,5 @@
 # Frontera User Guide
-Last update: September 12, 2024
+Last update: September 18, 2024