Lonestar6, Frontera, Vista - updated ML sections

TACC · Sep 18, 2024 · 802a8b5
1 parent 0690a63
commit 802a8b5

Showing 9 changed files with 119 additions and 453 deletions.
diff --git a/docs/hpc/6lonestar/ml.md b/docs/hpc/6lonestar/ml.md
@@ -1,160 +1,73 @@
 ## Machine Learning on LS6 { #ml }
 
-Lonestar6 is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and Tensorflow. We recommend using the Python virtual environment to manage machine learning packages.
+Lonestar6 is well equipped to provide researchers with the latest machine learning frameworks, such as PyTorch and TensorFlow. We recommend using a Python virtual environment to manage machine learning packages. The steps below show how to install PyTorch in a virtual environment:
 
-### Running PyTorch  { #ml-pytorch }
-
-Install Pytorch and TensorBoard.
-
-1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:
+### Install PyTorch 
 
+1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using TACC's [`idev`][TACCIDEV] utility:
 	```cmd-line
 	login$ idev -p gpu-a100-dev -N 1 -n 1 -t 1:00:00
 	```
 
-1. Create a Python virtual environment:
-
+1. Create a Python virtual environment: 
 	```cmd-line
-	c123-456.ls6$ module load python3/3.9.7
-	c123-456.ls6$ python3 -m venv /path/to/virtual-env  # (e.g., $SCRATCH/python-envs/test)
+	c123-456$ module load python3/3.9.7
+	c123-456$ python3 -m venv /path/to/virtual-env  # (e.g., $SCRATCH/python-envs/test)
 	```
 
 1. Activate the Python virtual environment:
-
 	```cmd-line
-	c123-456.ls6$ source /path/to/virtual-env/bin/activate
+	c123-456$ source /path/to/virtual-env/bin/activate
 	```
 
-1. Now install PyTorch and TensorBoard:
-
+1. Now install PyTorch: 
 	```cmd-line
-	c123-456.ls6$ pip3 install torch==1.12.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
-	c123-456.ls6$ pip3 install tensorboard
+	c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
 	```
 
-#### Single-Node { #ml-pytorch-singlnode }
+### Testing PyTorch Installation 
 
-1. Download the benchmark:
+To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training.  See PyTorch's documentation: [Distributed Data Parallel in PyTorch](https://pytorch.org/tutorials/beginner/ddp_series_intro.html). These tutorials include several scripts set up to run single-node training and multi-node training.
 
+#### Single-Node
+
+1. Download the benchmark:
 	```cmd-line
-	c123-456.ls6$ cd $SCRATCH
-	c123-456.ls6$ git clone https://github.com/gpauloski/kfac-pytorch.git
-	c123-456.ls6$ cd kfac-pytorch
-	c123-456.ls6$ git checkout tags/v0.3.2
-	c123-456.ls6$ pip3 install -e .
-	c123-456.ls6$ pip3 install torchinfo tqdm Pillow
-	c123-456.ls6$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
+	c123-456$ cd $SCRATCH
+	c123-456$ git clone https://github.com/pytorch/examples.git
 	```
 
 1. Run the benchmark on one node (3 GPUs):
-
 	```cmd-line
-	c123-456.ls6$ python3 -m torch.distributed.launch --nproc_per_node=3 examples/torch_cifar10_resnet.py --kfac-update-freq 0
+	c123-456$ torchrun --nproc_per_node=3 examples/distributed/ddp-tutorial-series/multigpu_torchrun.py 50 10
 	```
+
+#### Multi-Node
 
-#### Multi-Node { #ml-pytorch-multinode }
-
-1. Request two nodes in the `gpu-a100-dev` queue using the [`idev`](../../software/idev) utility:
-
+1. Request two nodes in the [`gpu-a100-dev`](#queues) queue using TACC's [`idev`][TACCIDEV] utility:
 	```cmd-line
 	login2.ls6$ idev -N 2 -n 2 -p gpu-a100-dev -t 01:00:00
 	```
 
-1. Activate the Python virtual environment:
-
-	```cmd-line
-	c123-456.ls6$ source /path/to/virtual-env/bin/activate
-	```
-
 1. Move to the benchmark directory:
-
 	```cmd-line
-	c123-456.ls6$ cd $SCRATCH/kfac-pytorch
+	c123-456$ cd $SCRATCH 
 	```
 
-1. Create a script called "`run.sh`". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
+1. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
 
-	```job-script
+	```file
 	#!/bin/bash
 	HOST=$1
 	NODES=$2
 	LOCAL_RANK=${PMI_RANK}
-	python3 -m torch.distributed.launch --nproc_per_node=3  --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
-    	examples/torch_cifar10_resnet.py --kfac-update-freq 0
+	torchrun --nproc_per_node=3  --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
+		examples/distributed/ddp-tutorial-series/multinode.py 50 10
 	```
 
 1. Run multi-gpu training:
-
-	```cmd-line
-	c123-456.ls6$ ibrun -np 2 ./run.sh c123-456 2
-	```
-
-### Running Tensorflow  { #ml-tensorflow }
-
-Follow these instructions to install and run TensorFlow benchmarks on Lonestar6's A100. Lonestar6's A100 runs TensorFlow 2.8.2 with Python 3.7.13. Lonestar6's supports CUDA/11.3, CUDA/11.4, and CUDA/12.0. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.
-
-1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:
-
-	```cmd-line
-	login2.ls6$ idev -N 1 -n 1 -p gpu-a100-dev -t 01:00:00
-	```
-
-1. Create a Python virtual environment:
-
-	```cmd-line
-	c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
-	c123-456.ls6$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test
-	```
-
-1. Activate the Python virtual environment:
-
-	```cmd-line
-	c123-456.ls6$ source /path/to/virtual-env/bin/activate
-	```
-
-1. Install TensorFlow and Horovod:
-
-	```cmd-line
-	c123-456.ls6$ pip3 install tensorflow-gpu==2.8.2
-	```
-
-	We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.
-
-	```cmd-line
-	c123-456.ls6$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
-   	HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0
-	```
-
-#### Single-Node { #ml-tensorflow-singlenode }
-
-1. Download the tensorflow benchmark to your `$SCRATCH` directory, then check out the branch that matches your tensorflow version.
-
-	```cmd-line
-	c123-456.ls6$ cds; git clone https://github.com/tensorflow/benchmarks.git
-	c123-456.ls6$ cd benchmarks 
-	c123-456.ls6$ git checkout 51d647f     # master head as of 08/18/2022
-	```
-
-1. Load modules and activate the Python virtual environment:
-
-	```cmd-line
-	c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
-	c123-456.ls6$ source /path/to/virtual-env/bin/activate
-	```
-
-1. Benchmark the performance with synthetic dataset on 1 GPU:
-
-	```cmd-line
-	c123-456.ls6$ cd scripts/tf_cnn_benchmarks
-	c123-456.ls6$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
-	```
-
-1. Benchmark the performance with synthetic dataset on 3 GPUs:
-
 	```cmd-line
-	c123-456.ls6$ cd scripts/tf_cnn_benchmarks
-	c123-456.ls6$ ibrun -np 3 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
-   		--model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
+	c123-456$ ibrun -np 2 ./run.sh c123-456 2
 	```
 
 
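The `run.sh` wrapper in the hunk above is started once per node by `ibrun`, and each copy invokes `torchrun` with that node's rank taken from `PMI_RANK`. As a rough illustration only (not part of the commit), the Python sketch below assembles the same argument list and computes the total number of training processes. The function names `build_torchrun_command` and `world_size` are hypothetical, and the default rank of 0 when `PMI_RANK` is unset is an assumption made so the sketch runs outside `ibrun`.

```python
import os
import shlex


def build_torchrun_command(master_host, num_nodes, gpus_per_node,
                           script, script_args, env=None):
    """Assemble the torchrun argument list that run.sh uses on one node.

    Hypothetical helper for illustration; it only mirrors the flags that
    appear in the run.sh script shown in the diff.
    """
    env = os.environ if env is None else env
    # run.sh reads the node rank from PMI_RANK. Falling back to 0 is an
    # assumption so that this sketch also works outside of ibrun.
    node_rank = int(env.get("PMI_RANK", "0"))
    return [
        "torchrun",
        f"--nproc_per_node={gpus_per_node}",
        f"--nnodes={num_nodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_host}",
        script,
        *script_args,
    ]


def world_size(num_nodes, gpus_per_node):
    """Total training processes: one per GPU on every node."""
    return num_nodes * gpus_per_node


if __name__ == "__main__":
    # Values taken from the Lonestar6 example: 2 nodes, 3 GPUs per node.
    cmd = build_torchrun_command(
        "c123-456", 2, 3,
        "examples/distributed/ddp-tutorial-series/multinode.py",
        ["50", "10"],
        env={"PMI_RANK": "1"},
    )
    print(shlex.join(cmd))
    print("total training processes:", world_size(2, 3))
```

With the Lonestar6 values (two nodes, three GPUs each) this corresponds to six training processes in total. On Frontera's `rtx-dev` nodes the same reasoning applies with four GPUs per node.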
diff --git a/docs/hpc/6lonestar/notices.md b/docs/hpc/6lonestar/notices.md
@@ -1,5 +1,5 @@
 # Lonestar6 User Guide
-*Last update: June 28, 2024*
+*Last update: September 18, 2024*
 
 
 ## Notices { #notices }

diff --git a/docs/hpc/frontera.md b/docs/hpc/frontera.md
@@ -1,5 +1,5 @@
 # Frontera User Guide
-Last update: September 12, 2024
+Last update: September 18, 2024
 <!-- SDL <a href="https://frontera-xortal.tacc.utexas.edu/user-guide/docs/user-guide.pdf">Download PDF <i class="fa fa-file-pdf-o"></i></a></span>-->
 
 <!-- 
@@ -1533,156 +1533,82 @@ When using the Intel Fortran compiler, **compile with [`-assume buffered_io`](ht
 
 ## Machine Learning { #ml }
 
-Frontera is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and Tensorflow. We recommend using the Python virtual environment to manage machine learning packages.
+Frontera is well equipped to provide researchers with the latest machine learning frameworks, for example PyTorch. We recommend using a Python virtual environment to manage machine learning packages. The steps below show how to install PyTorch in a virtual environment:
 
-### Running PyTorch  { #ml-pytorch }
-
-Install Pytorch and TensorBoard.
-
-1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
+### Install PyTorch
 
+1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
 	```cmd-line
 	login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00
 	```
-
 1. Create a Python virtual environment: 
-
 	```cmd-line
 	c123-456$ module load python3/3.9.2
 	c123-456$ python3 -m venv /path/to/virtual-env  # (e.g., $SCRATCH/python-envs/test)
 	```
-
 1. Activate the Python virtual environment:
-
 	```cmd-line
 	c123-456$ source /path/to/virtual-env/bin/activate
 	```
-
-1. Now install PyTorch and TensorBoard: 
-
+1. Now install PyTorch: 
 	```cmd-line
-	c123-456$ pip3 install torch==1.12.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
-	c123-456$ pip3 install tensorboard
+	c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
 	```
 
-#### Single-Node { #ml-pytorch-singlnode }
+### Testing PyTorch Installation
 
-1. Download the benchmark:
+To test your installation of PyTorch, we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: [Distributed Data Parallel in PyTorch](https://pytorch.org/tutorials/beginner/ddp_series_intro.html). The tutorials include scripts set up to run single-node multi-GPU training as well as multi-node training, which we demonstrate below:
+
+#### Single-Node
 
+1. Download the benchmark:
 	```cmd-line
-	c123-456$ cd $SCRATCH
-	c123-456$ git clone https://github.com/gpauloski/kfac-pytorch.git
-	c123-456$ cd kfac-pytorch
-	c123-456$ git checkout tags/v0.3.2
-	c123-456$ pip3 install -e .
-	c123-456$ pip3 install torchinfo tqdm Pillow
-	c123-456$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
+	c123-456$ cd $SCRATCH   # or another directory on scratch where you want this repo to reside
+	c123-456$ git clone https://github.com/pytorch/examples.git
 	```
-
 1. Run the benchmark on one node (4 GPUs):
-
 	```cmd-line
-	c123-456$ python3 -m torch.distributed.launch --nproc_per_node=4 examples/torch_cifar10_resnet.py --kfac-update-freq 0
+	c123-456$ torchrun --nproc_per_node=4 examples/distributed/ddp-tutorial-series/multigpu_torchrun.py 50 10
 	```
-
-#### Multi-Node { #ml-pytorch-multinode }
-
-1. Request two nodes in the `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
-
+
+#### Multi-Node
+1. Request two nodes in the `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
 	```cmd-line
 	login2.frontera$ idev -N 2 -n 2 -p rtx-dev -t 02:00:00
 	```
-
 1. Move to the benchmark directory:
-
 	```cmd-line
-	c123-456$ cd $SCRATCH/kfac-pytorch
+	c123-456$ cd $SCRATCH   # or the directory on scratch where this repo resides
 	```
+1. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
 
-1. Create a script called "`run.sh`". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
-
-	```job-script
+	```file
 	#!/bin/bash
 	HOST=$1
 	NODES=$2
 	LOCAL_RANK=${PMI_RANK}
-	python3 -m torchdistributed.launch --nproc_per_node=4  --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
-		examples/torch_cifar10_resnet.py --kfac-update-freq 0
+	torchrun --nproc_per_node=4  --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
+		examples/distributed/ddp-tutorial-series/multinode.py 50 10
 	```
 
 1. Run multi-gpu training:
-
 	```cmd-line
 	c123-456$ ibrun -np 2 ./run.sh c123-456 2
 	```
 
 
-### Running Tensorflow  { #ml-tensorflow }
-
-Follow these instructions to install and run TensorFlow benchmarks on Frontera RTX. Frontera RTX runs TensorFlow 2.8.0 with Python 3.8.2. Frontera supports CUDA/10.1, CUDA/11.0, and CUDA/11.1. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.
-
-1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
-
-	```cmd-line
-	login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00
-	```
-
-1. Create a Python virtual environment:
-
-	```cmd-line
-	c123-456$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test
-	```
-
-1. Activate the Python virtual environment:
 
-	```cmd-line
-	c123-456$ source /path/to/virtual-env/bin/activate
-	```
-
-1. Install TensorFlow and Horovod:
-
-	```cmd-line
-	c123-456$ module load cuda/11.3 cudnn nccl
-	c123-456$ pip3 install tensorflow-gpu==2.8.2
-	```
 
-	We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.
 
-	```cmd-line
-	c123-456$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
-    	HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0
-	```
 
-#### Single-Node { #ml-tensorflow-singlenode }
 
-1. Download the tensorflow benchmark to your $SCRATCH directory, then check out the branch that matches your tensorflow version.
 
-	```cmd-line
-	c123-456$ cds; git clone https://github.com/tensorflow/benchmarks.git
-	c123-456$ cd benchmarks 
-	c123-456$ git checkout 51d647f     # master head as of 08/18/2022
-	```
 
-1. Activate the Python virtual environment:
 
-	```cmd-line
-	c123-456$ source /path/to/virtual-env/bin/activate
-	```
 
-1. Benchmark the performance with synthetic dataset on 1 GPU:
 
-	```cmd-line
-	c123-456$ cd scripts/tf_cnn_benchmarks
-	c123-456$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
-	```
 
-1. Benchmark the performance with synthetic dataset on 4 GPUs:
 
-	```cmd-line
-	c123-456$ cd scripts/tf_cnn_benchmarks
-	c123-456$ ibrun -np 4 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
-    	--model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
-	```
 
 ## Visualization and VNC Sessions { #vis }