Update IPEX MultiNode Docs (#228)

Tyler Titsworth · web-flow · commit e59f2e6e23df · 2024-07-08T23:48:34.000Z
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
diff --git a/pytorch/README.md b/pytorch/README.md
@@ -105,16 +105,18 @@ After running the command above, copy the URL (something like `http://127.0.0.1:
 
 The images below additionally include [Intel® oneAPI Collective Communications Library] (oneCCL) and Neural Compressor ([INC]):
 
-| Tag(s)                | Pytorch  | IPEX         | oneCCL               | INC       | Dockerfile      |
-| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
-| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1]  | [v0.4.0-Beta]   |
-| `2.2.0-pip-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1]  | [v0.3.4]        |
-| `2.1.0-pip-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1]  | [v0.2.3]        |
-| `2.0.0-pip-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1]  | [v0.1.0]        |
+| Tag(s)                  | Pytorch  | IPEX           | oneCCL               | INC       | Dockerfile     |
+| ---------------------   | -------- | ------------   | -------------------- | --------- | -------------- |
+| `2.3.0-pip-multinode`   | [v2.3.0] | [v2.3.0+cpu]   | [v2.3.0][ccl-v2.3.0] | [v2.6]    | [v0.4.0-Beta]  |
+| `2.2.0-pip-multinode`   | [v2.2.2] | [v2.2.0+cpu]   | [v2.2.0][ccl-v2.2.0] | [v2.6]    | [v0.4.0-Beta]  |
+| `2.1.100-pip-multinode` | [v2.1.2] | [v2.1.100+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.6]    | [v0.4.0-Beta]  |
+| `2.0.100-pip-multinode` | [v2.0.1] | [v2.0.100+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.6]    | [v0.4.0-Beta]  |
+
+> [!NOTE]
+> Passwordless SSH connection is also enabled in the image, but the container does not contain any SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
 
-> **Note:** Passwordless SSH connection is also enabled in the image.
-> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
-> Since the SSH key is not owned by default user account in docker, please also do "chmod 600 authorized_keys; chmod 600 id_rsa" to grant read access for default user account.
+> [!TIP]
+> Before mounting any keys, restrict the permissions of those files with `chmod 600 authorized_keys; chmod 600 id_rsa` so that only the owning user can read and write them. SSH rejects private keys with looser permissions.
 
 #### Setup and Run IPEX Multi-Node Container
 
@@ -132,30 +134,52 @@ To add these files correctly please follow the steps described below.
 
 1. Setup ID Keys
 
-    You can use the commands provided below to [generate the Identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
+    You can use the commands provided below to [generate the identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
 
     ```bash
     ssh-keygen -q -N "" -t rsa -b 4096 -f ./id_rsa
     touch authorized_keys
     cat id_rsa.pub >> authorized_keys
     ```
 
-2. Configure the permissions and ownership for all of the files you have created so far.
+2. Configure the permissions and ownership for all of the files you have created so far
 
     ```bash
     chmod 600 id_rsa config authorized_keys
     chown root:root id_rsa.pub id_rsa config authorized_keys
     ```
 
-3. Setup hostfile. The hostfile is needed for running torch distributed using `ipexrun` utility. If you're not using `ipexrun` you can skip this step.
+3. Create a hostfile for `torchrun` or `ipexrun`. (Optional)
 
     ```txt
-    <Host 1 IP/Hostname>
-    <Host 2 IP/Hostname>
+    Host host1
+        HostName <Hostname of host1>
+        IdentitiesOnly yes
+        IdentityFile ~/.ssh/id_rsa
+        Port <SSH Port>
+    Host host2
+        HostName <Hostname of host2>
+        IdentitiesOnly yes
+        IdentityFile ~/.ssh/id_rsa
+        Port <SSH Port>
     ...
     ```
 
-4. Now start the workers and execute DDP on the launcher.
+4. Configure [Intel® oneAPI Collective Communications Library] in your Python script
+
+    ```python
+    import oneccl_bindings_for_pytorch  # registers the "ccl" backend
+    import torch.distributed as dist
+    import os
+
+    dist.init_process_group(
+        backend="ccl",
+        init_method="tcp://127.0.0.1:3022",
+        world_size=int(os.environ.get("WORLD_SIZE")),
+        rank=int(os.environ.get("RANK")),
+    )
+    ```
+
+5. Now start the workers and execute DDP on the launcher
 
     1. Worker run command:
 
@@ -182,65 +206,36 @@ To add these files correctly please follow the steps described below.
             bash -c 'ipexrun cpu  --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port 3022 /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
         ```
 
-5. Start SSH server with a custom port.
-    If the user wants to define their own port to start the SSH server, it can be done so using the commands described below.
-
-    1. Worker command:
-
-        ```bash
-        export SSH_PORT=<User SSH Port>
-        docker run -it --rm \
-            --net=host \
-            -v $PWD/authorized_keys:/etc/ssh/authorized_keys \
-            -v $PWD/tests:/workspace/tests \
-            -e SSH_PORT=${SSH_PORT} \
-            -w /workspace \
-            intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
-            bash -c '/usr/sbin/sshd -D -p ${SSH_PORT}'
-        ```
-
-    2. Add hosts to config. (**Note:** This is an optional step)
-
-        User can optionally mount their own custom client config file to define a list of hosts and ports where the SSH server is running inside the container. An example of a hostfile is provided below. This file is supposed to be mounted in the launcher container at `/etc/ssh/ssh_config`.
+> [!NOTE]
+> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
 
-        ```bash
-        touch config
-        ```
+#### Enable [DeepSpeed*] optimizations
 
-       ```txt
-        Host host1
-            HostName <Hostname of host1>
-            IdentitiesOnly yes
-            IdentityFile ~/.root/id_rsa
-            Port <SSH Port>
-        Host host2
-            HostName <Hostname of host2>
-            IdentitiesOnly yes
-            IdentityFile ~/.root/id_rsa
-            Port <SSH Port>
-        ...
-        ```
+To enable [DeepSpeed*] optimizations with [Intel® oneAPI Collective Communications Library], add the following to your Python script:
 
-    3. Launcher run command:
+```python
+import deepspeed
 
-        ```bash
-        docker run -it --rm \
-            --net=host \
-            -v $PWD/id_rsa:/root/.ssh/id_rsa \
-            -v $PWD/config:/etc/ssh/ssh_config \
-            -v $PWD/hostfile:/workspace/hostfile \
-            -v $PWD/tests:/workspace/tests \
-            -e SSH_PORT=${SSH_PORT} \
-            -w /workspace \
-            intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
-            bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port ${SSH_PORT} /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
-        ```
+# Rather than dist.init_process_group(), use deepspeed.init_distributed()
+deepspeed.init_distributed(backend="ccl")
+```
 
-> [!NOTE]
-> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
+Additionally, if you have a [DeepSpeed* configuration](https://www.deepspeed.ai/getting-started/#deepspeed-configuration), you can use the command below as your launcher to run your script with that configuration:
 
-> [!TIP]
-> Additionally, [DeepSpeed*] optimizations can be utilized in place of ipexrun with the `ccl` backend for multi-node training.
+```bash
+docker run -it --rm \
+    --net=host \
+    -v $PWD/id_rsa:/root/.ssh/id_rsa \
+    -v $PWD/tests:/workspace/tests \
+    -v $PWD/hostfile:/workspace/hostfile \
+    -v $PWD/ds_config.json:/workspace/ds_config.json \
+    -w /workspace \
+    intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
+    bash -c 'deepspeed --launcher IMPI \
+    --master_addr 127.0.0.1 --master_port 3022 \
+    --deepspeed_config ds_config.json --hostfile /workspace/hostfile \
+    /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl --deepspeed'
+```
 
 ---
 
@@ -277,7 +272,7 @@ The images below additionally include [Intel® oneAPI Collective Communications
 
 | Tag(s)                | Pytorch  | IPEX         | oneCCL               | INC       | Dockerfile      |
 | --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
-| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1]  | [v0.4.0-Beta]   |
+| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6]    | [v0.4.0-Beta]   |
 | `2.2.0-idp-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1]  | [v0.3.4]        |
 | `2.1.0-idp-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1]  | [v0.2.3]        |
 | `2.0.0-idp-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1]  | [v0.1.0]        |
@@ -354,19 +349,23 @@ It is the image user's responsibility to ensure that any use of The images below
 [v2.0.110+xpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
 
 [v2.3.0]: https://github.com/pytorch/pytorch/releases/tag/v2.3.0
+[v2.2.2]: https://github.com/pytorch/pytorch/releases/tag/v2.2.2
 [v2.2.0]: https://github.com/pytorch/pytorch/releases/tag/v2.2.0
+[v2.1.2]: https://github.com/pytorch/pytorch/releases/tag/v2.1.2
 [v2.1.0]: https://github.com/pytorch/pytorch/releases/tag/v2.1.0
 [v2.0.1]: https://github.com/pytorch/pytorch/releases/tag/v2.0.1
 [v2.0.0]: https://github.com/pytorch/pytorch/releases/tag/v2.0.0
 
-[v2.5.1]: https://github.com/intel/neural-compressor/releases/tag/v2.5.1
+[v2.6]: https://github.com/intel/neural-compressor/releases/tag/v2.6
 [v2.4.1]: https://github.com/intel/neural-compressor/releases/tag/v2.4.1
 [v2.3.1]: https://github.com/intel/neural-compressor/releases/tag/v2.3.1
 [v2.1.1]: https://github.com/intel/neural-compressor/releases/tag/v2.1.1
 
 [v2.3.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.3.0%2Bcpu
 [v2.2.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.2.0%2Bcpu
+[v2.1.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.100%2Bcpu
 [v2.1.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.0%2Bcpu
+[v2.0.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.100%2Bcpu
 [v2.0.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.0%2Bcpu
 
 [ccl-v2.3.0]: https://github.com/intel/torch-ccl/releases/tag/v2.3.0%2Bcpu
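
The `init_process_group` call added in step 4 passes `os.environ.get("WORLD_SIZE")` and `os.environ.get("RANK")` straight to `int()`, which raises an opaque `TypeError` when the launcher has not exported them. As a minimal sketch, the arguments could be validated first. The helper name `ccl_init_kwargs` and its default address and port are hypothetical: they are taken from the example commands in this diff and are not part of IPEX or oneCCL.

```python
import os


def ccl_init_kwargs(env=None, master_addr="127.0.0.1", master_port=3022):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper for illustration only; the defaults mirror the
    address and port used in the README example commands.
    """
    env = os.environ if env is None else env

    # Fail early with a readable message if the launcher did not set these.
    missing = [name for name in ("WORLD_SIZE", "RANK") if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    world_size = int(env["WORLD_SIZE"])
    rank = int(env["RANK"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside [0, {world_size})")

    return {
        "backend": "ccl",
        "init_method": f"tcp://{master_addr}:{master_port}",
        "world_size": world_size,
        "rank": rank,
    }
```

In a training script this would be used as `dist.init_process_group(**ccl_init_kwargs())`, after importing `oneccl_bindings_for_pytorch` and `torch.distributed as dist`, as in step 4.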