Commit

Lonestar6, Frontera, Vista - updated ML sections
susanunit committed Sep 18, 2024
1 parent 0690a63 commit 802a8b5
Showing 9 changed files with 119 additions and 453 deletions.
139 changes: 26 additions & 113 deletions docs/hpc/6lonestar/ml.md
@@ -1,160 +1,73 @@
## Machine Learning on LS6 { #ml }

Lonestar6 is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and TensorFlow. We recommend using a Python virtual environment to manage machine learning packages.
Lonestar6 is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and TensorFlow. We recommend using a Python virtual environment to manage machine learning packages. Below we detail how to install PyTorch on our systems with a virtual environment:

### Running PyTorch { #ml-pytorch }

Install PyTorch and TensorBoard.

1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:
### Install PyTorch

1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using TACC's [`idev`][TACCIDEV] utility:
```cmd-line
login$ idev -p gpu-a100-dev -N 1 -n 1 -t 1:00:00
```

1. Create a Python virtual environment:

1. Create a Python virtual environment:
```cmd-line
c123-456.ls6$ module load python3/3.9.7
c123-456.ls6$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
c123-456$ module load python3/3.9.7
c123-456$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
```

1. Activate the Python virtual environment:

```cmd-line
c123-456.ls6$ source /path/to/virtual-env/bin/activate
c123-456$ source /path/to/virtual-env/bin/activate
```

1. Now install PyTorch and TensorBoard:

1. Now install PyTorch:
```cmd-line
c123-456.ls6$ pip3 install torch==1.12.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
c123-456.ls6$ pip3 install tensorboard
c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
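
Once `pip3` finishes, a quick check from inside the activated virtual environment can confirm that the wheel imports and sees the node's GPUs. The following is a minimal sketch, not part of the TACC instructions: the helper name `describe_torch_install` is hypothetical, and it only uses standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`).

```python
import importlib.util


def describe_torch_install():
    """Return a small report about the PyTorch install in the active environment.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        # torch cannot be found: the virtual environment is probably not activated
        return {"installed": False}

    import torch  # imported lazily so the check above can fail gracefully

    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version of the wheel, None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

Run it on a compute node rather than a login node, since GPUs are only visible inside the `idev` session. If `cuda_available` is reported as `False` there, the installed wheel and the node's driver may not match.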

#### Single-Node { #ml-pytorch-singlnode }
### Testing PyTorch Installation

1. Download the benchmark:
To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: [Distributed Data Parallel in PyTorch](https://pytorch.org/tutorials/beginner/ddp_series_intro.html). These tutorials include several scripts set up to run single-node training and multi-node training.

#### Single-Node

1. Download the benchmark:
```cmd-line
c123-456.ls6$ cd $SCRATCH
c123-456.ls6$ git clone https://github.com/gpauloski/kfac-pytorch.git
c123-456.ls6$ cd kfac-pytorch
c123-456.ls6$ git checkout tags/v0.3.2
c123-456.ls6$ pip3 install -e .
c123-456.ls6$ pip3 install torchinfo tqdm Pillow
c123-456.ls6$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
c123-456$ cd $SCRATCH
c123-456$ git clone https://github.com/pytorch/examples.git
```

1. Run the benchmark on one node (3 GPUs):

```cmd-line
c123-456.ls6$ python3 -m torch.distributed.launch --nproc_per_node=3 examples/torch_cifar10_resnet.py --kfac-update-freq 0
c123-456$ torchrun --nproc_per_node=3 examples/distributed/ddp-tutorial-series/multigpu_torchrun.py 50 10
```
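
`torchrun` starts one process per GPU and passes each process its identity through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows how a training script might read them; the function name `read_torchrun_env` and the single-process defaults are assumptions made for illustration, not code from the tutorial scripts.

```python
import os


def read_torchrun_env(env=None):
    """Read the process identity that torchrun exports to each worker.

    Hypothetical helper: the defaults describe a plain single-process run.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index of this process on its node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"process {info['rank']} of {info['world_size']} uses GPU index {info['local_rank']}")
```

With `--nproc_per_node=3` on a single node, the three workers would see `LOCAL_RANK` values 0, 1, and 2 and a `WORLD_SIZE` of 3.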

#### Multi-Node

#### Multi-Node { #ml-pytorch-multinode }

1. Request two nodes in the `gpu-a100-dev` queue using the [`idev`](../../software/idev) utility:

1. Request two nodes in the [`gpu-a100-dev`](#queues) queue using TACC's [`idev`][TACCIDEV] utility:
```cmd-line
login2.ls6$ idev -N 2 -n 2 -p gpu-a100-dev -t 01:00:00
```

1. Activate the Python virtual environment:

```cmd-line
c123-456.ls6$ source /path/to/virtual-env/bin/activate
```

1. Move to the benchmark directory:

```cmd-line
c123-456.ls6$ cd $SCRATCH/kfac-pytorch
c123-456$ cd $SCRATCH
```

1. Create a script called "`run.sh`". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".
1. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".

```job-script
```file
#!/bin/bash
HOST=$1
NODES=$2
LOCAL_RANK=${PMI_RANK}
python3 -m torch.distributed.launch --nproc_per_node=3 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
examples/torch_cifar10_resnet.py --kfac-update-freq 0
torchrun --nproc_per_node=3 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
examples/distributed/ddp-tutorial-series/multinode.py 50 10
```

1. Run multi-gpu training:

```cmd-line
c123-456.ls6$ ibrun -np 2 ./run.sh c123-456 2
```
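
The `run.sh` wrapper above starts one launcher per node and gives each a different `--node_rank`. As a rough illustration of how those flags fit together, the sketch below assembles the same `torchrun` command line in Python. The function `build_torchrun_command` is hypothetical and only mirrors the flags already used in `run.sh`; it does not launch anything.

```python
import shlex


def build_torchrun_command(master_host, num_nodes, node_rank, gpus_per_node, script, script_args=()):
    """Assemble the torchrun argument list for one node (hypothetical helper)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank {node_rank} is outside 0..{num_nodes - 1}")
    return [
        "torchrun",
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU on the node
        f"--nnodes={num_nodes}",
        f"--node_rank={node_rank}",           # run.sh takes this value from PMI_RANK
        f"--master_addr={master_host}",
        script,
        *[str(arg) for arg in script_args],
    ]


if __name__ == "__main__":
    script = "examples/distributed/ddp-tutorial-series/multinode.py"
    for rank in range(2):
        # Print the command each of the two nodes would run
        print(shlex.join(build_torchrun_command("c123-456", 2, rank, 3, script, (50, 10))))
```

The total number of worker processes is `num_nodes * gpus_per_node`, so two nodes with three GPUs each give six workers.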

### Running Tensorflow { #ml-tensorflow }

Follow these instructions to install and run TensorFlow benchmarks on Lonestar6's A100. Lonestar6's A100 runs TensorFlow 2.8.2 with Python 3.7.13. Lonestar6 supports CUDA/11.3, CUDA/11.4, and CUDA/12.0. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.

1. Request a single compute node in Lonestar6's `gpu-a100-dev` queue using the [idev](../../software/idev) utility:

```cmd-line
login2.ls6$ idev -N 1 -n 1 -p gpu-a100-dev -t 01:00:00
```

1. Create a Python virtual environment:

```cmd-line
c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
c123-456.ls6$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test
```

1. Activate the Python virtual environment:

```cmd-line
c123-456.ls6$ source /path/to/virtual-env/bin/activate
```

1. Install TensorFlow and Horovod:

```cmd-line
c123-456.ls6$ pip3 install tensorflow-gpu==2.8.2
```

We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.

```cmd-line
c123-456.ls6$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0
```

#### Single-Node { #ml-tensorflow-singlenode }

1. Download the tensorflow benchmark to your `$SCRATCH` directory, then check out the branch that matches your tensorflow version.

```cmd-line
c123-456.ls6$ cds; git clone https://github.com/tensorflow/benchmarks.git
c123-456.ls6$ cd benchmarks
c123-456.ls6$ git checkout 51d647f # master head as of 08/18/2022
```

1. Load modules and activate the Python virtual environment:

```cmd-line
c123-456.ls6$ module load python3/3.7.13 cuda/11.3 cudnn nccl
c123-456.ls6$ source /path/to/virtual-env/bin/activate
```

1. Benchmark the performance with synthetic dataset on 1 GPU:

```cmd-line
c123-456.ls6$ cd scripts/tf_cnn_benchmarks
c123-456.ls6$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
```

1. Benchmark the performance with synthetic dataset on 3 GPUs:

```cmd-line
c123-456.ls6$ cd scripts/tf_cnn_benchmarks
c123-456.ls6$ ibrun -np 3 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
--model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
c123-456$ ibrun -np 2 ./run.sh c123-456 2
```


2 changes: 1 addition & 1 deletion docs/hpc/6lonestar/notices.md
@@ -1,5 +1,5 @@
# Lonestar6 User Guide
*Last update: June 28, 2024*
*Last update: September 18, 2024*


## Notices { #notices }
122 changes: 24 additions & 98 deletions docs/hpc/frontera.md
@@ -1,5 +1,5 @@
# Frontera User Guide
Last update: September 12, 2024
Last update: September 18, 2024
<!-- SDL <a href="https://frontera-xortal.tacc.utexas.edu/user-guide/docs/user-guide.pdf">Download PDF <i class="fa fa-file-pdf-o"></i></a></span>-->

<!--
@@ -1533,156 +1533,82 @@ When using the Intel Fortran compiler, **compile with [`-assume buffered_io`](ht

## Machine Learning { #ml }

Frontera is well equipped to provide researchers with the latest in Machine Learning frameworks, PyTorch and TensorFlow. We recommend using a Python virtual environment to manage machine learning packages.
Frontera is well equipped to provide researchers with the latest in machine learning frameworks, such as PyTorch. We recommend using a Python virtual environment to manage machine learning packages. Below we detail how to install PyTorch on our systems with a virtual environment:

### Running PyTorch { #ml-pytorch }

Install PyTorch and TensorBoard.

1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
### Install PyTorch

1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
```cmd-line
login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00
```

1. Create a Python virtual environment:

```cmd-line
c123-456$ module load python3/3.9.2
c123-456$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
```

c123-456$ python3 -m venv /path/to/virtual-env # (e.g., $SCRATCH/python-envs/test)
1. Activate the Python virtual environment:

```cmd-line
c123-456$ source /path/to/virtual-env/bin/activate
```

1. Now install PyTorch and TensorBoard:

1. Now install PyTorch:
```cmd-line
c123-456$ pip3 install torch==1.12.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
c123-456$ pip3 install tensorboard
c123-456$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

#### Single-Node { #ml-pytorch-singlnode }
### Testing PyTorch Installation

1. Download the benchmark:
To test your installation of PyTorch we point you to a few benchmark calculations that are part of PyTorch's tutorials on multi-GPU and multi-node training. See PyTorch's documentation: [Distributed Data Parallel in PyTorch](https://pytorch.org/tutorials/beginner/ddp_series_intro.html). The tutorial includes scripts set up to run single-node multi-GPU training as well as multi-node training, which we demonstrate below:

#### Single-Node

1. Download the benchmark:
```cmd-line
c123-456$ cd $SCRATCH
c123-456$ git clone https://github.com/gpauloski/kfac-pytorch.git
c123-456$ cd kfac-pytorch
c123-456$ git checkout tags/v0.3.2
c123-456$ pip3 install -e .
c123-456$ pip3 install torchinfo tqdm Pillow
c123-456$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
c123-456$ cd $SCRATCH # (or the directory on scratch where you want this repo to reside)
```

c123-456$ git clone https://github.com/pytorch/examples.git
1. Run the benchmark on one node (4 GPUs):

```cmd-line
c123-456$ python3 -m torch.distributed.launch --nproc_per_node=4 examples/torch_cifar10_resnet.py --kfac-update-freq 0
c123-456$ torchrun --nproc_per_node=4 examples/distributed/ddp-tutorial-series/multigpu_torchrun.py 50 10
```
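
The trailing `50 10` in the command above are positional arguments to the tutorial script. As far as we can tell from the `pytorch/examples` repository, they are the total number of epochs and the snapshot interval, with an optional `--batch_size` flag. The sketch below reproduces that command-line shape with `argparse`; treat the argument names and the default batch size as assumptions and confirm them with the script's `--help` output.

```python
import argparse


def parse_training_args(argv):
    """Parse a command line shaped like the tutorial script's (assumed layout)."""
    parser = argparse.ArgumentParser(description="sketch of the tutorial script's command line")
    parser.add_argument("total_epochs", type=int, help="total number of epochs to train")
    parser.add_argument("save_every", type=int, help="save a snapshot every N epochs")
    parser.add_argument("--batch_size", type=int, default=32, help="batch size on each GPU")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_training_args(["50", "10"])
    print(f"train for {args.total_epochs} epochs, snapshot every {args.save_every}, batch size {args.batch_size}")
```

For a quick smoke test of a new installation, much smaller values than `50 10` are usually enough to show that every GPU participates.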

#### Multi-Node { #ml-pytorch-multinode }

1. Request two nodes in the `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:


#### Multi-Node

1. Request two nodes in the `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:
```cmd-line
login2.frontera$ idev -N 2 -n 2 -p rtx-dev -t 02:00:00
login2.frontera$ idev -N 2 -n 2 -p rtx-dev -t 02:00:00
```

1. Move to the benchmark directory:

```cmd-line
c123-456$ cd $SCRATCH/kfac-pytorch
c123-456$ cd $SCRATCH # (or the directory on scratch where this repo resides)
```
1. Create a script called "run.sh". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".

1. Create a script called "`run.sh`". This script needs two parameters, the hostname of the master node and the number of nodes. Add execution permission for the file "run.sh".

```job-script
```file
#!/bin/bash
HOST=$1
NODES=$2
LOCAL_RANK=${PMI_RANK}
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
examples/torch_cifar10_resnet.py --kfac-update-freq 0
torchrun --nproc_per_node=4 --nnodes=$NODES --node_rank=${LOCAL_RANK} --master_addr=$HOST \
examples/distributed/ddp-tutorial-series/multinode.py 50 10
```

1. Run multi-gpu training:

```cmd-line
c123-456$ ibrun -np 2 ./run.sh c123-456 2
```


### Running Tensorflow { #ml-tensorflow }

Follow these instructions to install and run TensorFlow benchmarks on Frontera RTX. Frontera RTX runs TensorFlow 2.8.0 with Python 3.8.2. Frontera supports CUDA/10.1, CUDA/11.0, and CUDA/11.1. By default, we use CUDA/11.3. Select the appropriate CUDA version for your TensorFlow version.

1. Request a single compute node in Frontera's `rtx-dev` queue using the [`idev`](https://docs.tacc.utexas.edu/software/idev) utility:

```cmd-line
login2.frontera$ idev -N 1 -n 1 -p rtx-dev -t 02:00:00
```

1. Create a Python virtual environment:

```cmd-line
c123-456$ python3 -m venv /path/to/virtual-env # e.g., $SCRATCH/python-envs/test
```

1. Activate the Python virtual environment:

```cmd-line
c123-456$ source /path/to/virtual-env/bin/activate
```

1. Install TensorFlow and Horovod:

```cmd-line
c123-456$ module load cuda/11.3 cudnn nccl
c123-456$ pip3 install tensorflow-gpu==2.8.2
```

We suggest installing Horovod version 0.25.0. If you wish to install other versions of Horovod, please submit a support ticket with the subject "Request for Horovod" and TACC staff will provide special instructions.

```cmd-line
c123-456$ HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 pip3 install horovod==0.25.0
```

#### Single-Node { #ml-tensorflow-singlenode }

1. Download the tensorflow benchmark to your $SCRATCH directory, then check out the branch that matches your tensorflow version.

```cmd-line
c123-456$ cds; git clone https://github.com/tensorflow/benchmarks.git
c123-456$ cd benchmarks
c123-456$ git checkout 51d647f # master head as of 08/18/2022
```

1. Activate the Python virtual environment:

```cmd-line
c123-456$ source /path/to/virtual-env/bin/activate
```

1. Benchmark the performance with synthetic dataset on 1 GPU:

```cmd-line
c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ python3 tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --num_batches 200
```

1. Benchmark the performance with synthetic dataset on 4 GPUs:

```cmd-line
c123-456$ cd scripts/tf_cnn_benchmarks
c123-456$ ibrun -np 4 python3 tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 \
--model resnet50 --batch_size 32 --num_batches 200 --allow_growth=True
```

## Visualization and VNC Sessions { #vis }
