Commit e59f2e6

Author: Tyler Titsworth
Update IPEX MultiNode Docs (#228)
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
1 parent 431abb6 commit e59f2e6

1 file changed

+69
-70
lines changed

pytorch/README.md

+69-70
@@ -105,16 +105,18 @@ After running the command above, copy the URL (something like `http://127.0.0.1:
105105

106106
The images below additionally include [Intel® oneAPI Collective Communications Library] (oneCCL) and Neural Compressor ([INC]):
107107

108-
| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
109-
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
110-
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
111-
| `2.2.0-pip-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
112-
| `2.1.0-pip-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
113-
| `2.0.0-pip-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
108+
| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
109+
| --------------------- | -------- | ------------ | -------------------- | --------- | -------------- |
110+
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
111+
| `2.2.0-pip-multinode` | [v2.2.2] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.6] | [v0.4.0-Beta] |
112+
| `2.1.100-pip-multinode` | [v2.1.2] | [v2.1.100+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.6] | [v0.4.0-Beta] |
113+
| `2.0.100-pip-multinode` | [v2.0.1] | [v2.0.100+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.6] | [v0.4.0-Beta] |
114+
115+
> [!NOTE]
116+
> Passwordless SSH connection is also enabled in the image, but the container does not contain any SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
114117
115-
> **Note:** Passwordless SSH connection is also enabled in the image.
116-
> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
117-
> Since the SSH key is not owned by default user account in docker, please also do "chmod 600 authorized_keys; chmod 600 id_rsa" to grant read access for default user account.
118+
> [!TIP]
119+
> Before mounting any keys, restrict their permissions with `chmod 600 authorized_keys; chmod 600 id_rsa` so that only the owning account can read and write them. OpenSSH refuses to use a private key that other users can read.
118120
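The permission requirement in the tip above can be checked before the keys are mounted. The following is a minimal sketch, not part of the README: `has_owner_only_mode` is a hypothetical helper, and the file names are assumed to match the ones used in this section.

```python
import os
import stat


def has_owner_only_mode(path: str) -> bool:
    """Return True when the file mode is exactly 0o600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


if __name__ == "__main__":
    # File names follow the README; adjust the paths if your keys live elsewhere.
    for name in ("id_rsa", "authorized_keys"):
        if not os.path.exists(name):
            print(f"{name}: not found")
        elif has_owner_only_mode(name):
            print(f"{name}: ok (600)")
        else:
            print(f"{name}: run `chmod 600 {name}` before mounting")
```

Running the script in the directory that holds the keys reports which files still need `chmod 600`.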
119121
#### Setup and Run IPEX Multi-Node Container
120122

@@ -132,30 +134,52 @@ To add these files correctly please follow the steps described below.
132134

133135
1. Setup ID Keys
134136

135-
You can use the commands provided below to [generate the Identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
137+
You can use the commands provided below to [generate the identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
136138

137139
```bash
138140
ssh-keygen -q -N "" -t rsa -b 4096 -f ./id_rsa
139141
touch authorized_keys
140142
cat id_rsa.pub >> authorized_keys
141143
```
142144

143-
2. Configure the permissions and ownership for all of the files you have created so far.
145+
2. Configure the permissions and ownership for all of the files you have created so far
144146

145147
```bash
146148
chmod 600 id_rsa config authorized_keys
147149
chown root:root id_rsa.pub id_rsa config authorized_keys
148150
```
149151

150-
3. Setup hostfile. The hostfile is needed for running torch distributed using `ipexrun` utility. If you're not using `ipexrun` you can skip this step.
152+
3. Create a hostfile for `torchrun` or `ipexrun`. (Optional)
151153

152154
```txt
153-
<Host 1 IP/Hostname>
154-
<Host 2 IP/Hostname>
155+
Host host1
156+
HostName <Hostname of host1>
157+
IdentitiesOnly yes
158+
IdentityFile ~/.ssh/id_rsa
159+
Port <SSH Port>
160+
Host host2
161+
HostName <Hostname of host2>
162+
IdentitiesOnly yes
163+
IdentityFile ~/.ssh/id_rsa
164+
Port <SSH Port>
155165
...
156166
```
157167

158-
4. Now start the workers and execute DDP on the launcher.
168+
4. Configure [Intel® oneAPI Collective Communications Library] in your Python script
169+
170+
```python
171+
import oneccl_bindings_for_pytorch
172+
import os
import torch.distributed as dist  # provides dist.init_process_group
173+
174+
dist.init_process_group(
175+
backend="ccl",
176+
init_method="tcp://127.0.0.1:3022",
177+
world_size=int(os.environ.get("WORLD_SIZE")),
178+
rank=int(os.environ.get("RANK")),
179+
)
180+
```
181+
182+
5. Now start the workers and execute DDP on the launcher
159183

160184
1. Worker run command:
161185

@@ -182,65 +206,36 @@ To add these files correctly please follow the steps described below.
182206
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port 3022 /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
183207
```
184208
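The `ccl` initialization in step 4 reads `WORLD_SIZE` and `RANK` from the environment, so it fails with an unhelpful `TypeError` when the launcher has not set them. The sketch below shows one way to validate those variables first. `ccl_init_kwargs` is a hypothetical helper, not part of IPEX or oneCCL, and the default address and port are simply the values used in the commands above.

```python
import os


def ccl_init_kwargs(env=None, master_addr="127.0.0.1", master_port=3022):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: the defaults mirror the address and port in the README.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("WORLD_SIZE", "RANK") if name not in env]
    if missing:
        raise RuntimeError(f"launcher did not set: {', '.join(missing)}")
    return {
        "backend": "ccl",
        "init_method": f"tcp://{master_addr}:{master_port}",
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


# Inside the container, the training script would then call:
#   import torch.distributed as dist
#   import oneccl_bindings_for_pytorch  # registers the "ccl" backend
#   dist.init_process_group(**ccl_init_kwargs())
```

Keeping the argument construction separate from the `init_process_group` call lets the values be checked on a machine without PyTorch or oneCCL installed.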

185-
5. Start SSH server with a custom port.
186-
If the user wants to define their own port to start the SSH server, it can be done so using the commands described below.
187-
188-
1. Worker command:
189-
190-
```bash
191-
export SSH_PORT=<User SSH Port>
192-
docker run -it --rm \
193-
--net=host \
194-
-v $PWD/authorized_keys:/etc/ssh/authorized_keys \
195-
-v $PWD/tests:/workspace/tests \
196-
-e SSH_PORT=${SSH_PORT} \
197-
-w /workspace \
198-
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
199-
bash -c '/usr/sbin/sshd -D -p ${SSH_PORT}'
200-
```
201-
202-
2. Add hosts to config. (**Note:** This is an optional step)
203-
204-
User can optionally mount their own custom client config file to define a list of hosts and ports where the SSH server is running inside the container. An example of a hostfile is provided below. This file is supposed to be mounted in the launcher container at `/etc/ssh/ssh_config`.
209+
> [!NOTE]
210+
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
205211

206-
```bash
207-
touch config
208-
```
212+
#### Enable [DeepSpeed*] optimizations
209213

210-
```txt
211-
Host host1
212-
HostName <Hostname of host1>
213-
IdentitiesOnly yes
214-
IdentityFile ~/.root/id_rsa
215-
Port <SSH Port>
216-
Host host2
217-
HostName <Hostname of host2>
218-
IdentitiesOnly yes
219-
IdentityFile ~/.root/id_rsa
220-
Port <SSH Port>
221-
...
222-
```
214+
To enable [DeepSpeed*] optimizations with [Intel® oneAPI Collective Communications Library], add the following to your Python script:
223215

224-
3. Launcher run command:
216+
```python
217+
import deepspeed
225218
226-
```bash
227-
docker run -it --rm \
228-
--net=host \
229-
-v $PWD/id_rsa:/root/.ssh/id_rsa \
230-
-v $PWD/config:/etc/ssh/ssh_config \
231-
-v $PWD/hostfile:/workspace/hostfile \
232-
-v $PWD/tests:/workspace/tests \
233-
-e SSH_PORT=${SSH_PORT} \
234-
-w /workspace \
235-
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
236-
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port ${SSH_PORT} /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
237-
```
219+
# Rather than dist.init_process_group(), use deepspeed.init_distributed()
220+
deepspeed.init_distributed(backend="ccl")
221+
```
238222

239-
> [!NOTE]
240-
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
223+
Additionally, if you have a [DeepSpeed* configuration](https://www.deepspeed.ai/getting-started/#deepspeed-configuration), you can use the command below as your launcher to run your script with that configuration:
241224

242-
> [!TIP]
243-
> Additionally, [DeepSpeed*] optimizations can be utilized in place of ipexrun with the `ccl` backend for multi-node training.
225+
```bash
226+
docker run -it --rm \
227+
--net=host \
228+
-v $PWD/id_rsa:/root/.ssh/id_rsa \
229+
-v $PWD/tests:/workspace/tests \
230+
-v $PWD/hostfile:/workspace/hostfile \
231+
-v $PWD/ds_config.json:/workspace/ds_config.json \
232+
-w /workspace \
233+
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
234+
bash -c 'deepspeed --launcher IMPI \
235+
--master_addr 127.0.0.1 --master_port 3022 \
236+
--deepspeed_config ds_config.json --hostfile /workspace/hostfile \
237+
/workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl --deepspeed'
238+
```
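The launcher command above mounts a `ds_config.json` that the README does not show. As a rough sketch only, the snippet below writes a minimal file using two keys from the DeepSpeed configuration JSON. The helper name `write_ds_config` and the values are placeholders, not settings taken from this repository.

```python
import json


def write_ds_config(path="ds_config.json"):
    """Write a minimal DeepSpeed configuration file and return its contents."""
    # Placeholder values; tune them for your model and cluster.
    config = {
        "train_micro_batch_size_per_gpu": 8,  # batch size per process
        "gradient_accumulation_steps": 1,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config


if __name__ == "__main__":
    write_ds_config()
```

Run it in the directory from which you start the launcher container so that `$PWD/ds_config.json` exists when it is mounted.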
244239

245240
---
246241

@@ -277,7 +272,7 @@ The images below additionally include [Intel® oneAPI Collective Communications
277272

278273
| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
279274
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
280-
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
275+
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
281276
| `2.2.0-idp-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
282277
| `2.1.0-idp-multinode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
283278
| `2.0.0-idp-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
@@ -354,19 +349,23 @@ It is the image user's responsibility to ensure that any use of The images below
354349
[v2.0.110+xpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
355350
356351
[v2.3.0]: https://github.com/pytorch/pytorch/releases/tag/v2.3.0
352+
[v2.2.2]: https://github.com/pytorch/pytorch/releases/tag/v2.2.2
357353
[v2.2.0]: https://github.com/pytorch/pytorch/releases/tag/v2.2.0
354+
[v2.1.2]: https://github.com/pytorch/pytorch/releases/tag/v2.1.2
358355
[v2.1.0]: https://github.com/pytorch/pytorch/releases/tag/v2.1.0
359356
[v2.0.1]: https://github.com/pytorch/pytorch/releases/tag/v2.0.1
360357
[v2.0.0]: https://github.com/pytorch/pytorch/releases/tag/v2.0.0
361358
362-
[v2.5.1]: https://github.com/intel/neural-compressor/releases/tag/v2.5.1
359+
[v2.6]: https://github.com/intel/neural-compressor/releases/tag/v2.6
363360
[v2.4.1]: https://github.com/intel/neural-compressor/releases/tag/v2.4.1
364361
[v2.3.1]: https://github.com/intel/neural-compressor/releases/tag/v2.3.1
365362
[v2.1.1]: https://github.com/intel/neural-compressor/releases/tag/v2.1.1
366363
367364
[v2.3.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.3.0%2Bcpu
368365
[v2.2.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.2.0%2Bcpu
366+
[v2.1.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.100%2Bcpu
369367
[v2.1.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.0%2Bcpu
368+
[v2.0.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.100%2Bcpu
370369
[v2.0.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.0%2Bcpu
371370
372371
[ccl-v2.3.0]: https://github.com/intel/torch-ccl/releases/tag/v2.3.0%2Bcpu
