Sampling not working with multiple gpus #1

Open
finkga opened this issue Jun 23, 2022 · 2 comments

finkga commented Jun 23, 2022

Thank you for providing such a well-written and understandable example. I did have problems running on multiple GPUs (which sometimes seems like voodoo anyway). Here's my command line and output, including the traceback:

$ python main.py --accelerator gpu --devices auto --workers 0 --epochs 2 --bs 32 --pretrained False
Accelerator: gpu
Using all 2 GPUs:
 - Quadro RTX 5000
 - Quadro RTX 5000
Number of workers used: 0
Maximum number of epochs: 2
Batch size: 32
Initial learning rate: 0.1
Pretrained: False
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/d3p692/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.7M/44.7M [00:01<00:00, 36.2MB/s]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: logs/lightning_logs
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./data/cifar-100-python.tar.gz
169001984it [00:03, 42873341.00it/s]                                                                                                                                                        
Extracting ./data/cifar-100-python.tar.gz to ./data
Files already downloaded and verified
Missing logger folder: logs/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name              | Type            | Params
------------------------------------------------------
0 | feature_extractor | Sequential      | 11.2 M
1 | classifier        | Linear          | 51.3 K
2 | test_confmat      | ConfusionMatrix | 0     
------------------------------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.911    Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/d3p692/CIFAR-100/main.py", line 70, in <module>
    trainer.fit(lm, ldm)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 78, in launch
    mp.spawn(
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 101, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1406, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 242, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1965, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 389, in _reset_eval_dataloader
    dataloaders = [self._prepare_dataloader(dl, mode=mode) for dl in dataloaders if dl is not None]
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 389, in <listcomp>
    dataloaders = [self._prepare_dataloader(dl, mode=mode) for dl in dataloaders if dl is not None]
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 299, in _prepare_dataloader
    sampler = self._resolve_sampler(dataloader, shuffle=shuffle, mode=mode)
  File "/home/d3p692/miniconda3/envs/dr/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 313, in _resolve_sampler
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You seem to have configured a sampler in your DataLoader. This will be replaced by `DistributedSampler` since `replace_sampler_ddp` is True and you are using distributed training. Either remove the sampler from your DataLoader or set `replace_sampler_ddp=False` if you want to use your custom sampler.

Using a single GPU trains just fine. Thank you again.

--Glenn
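
The exception itself names two fixes: drop the custom sampler from the DataLoader and let Lightning inject a `DistributedSampler`, or keep the sampler and turn off the replacement. A minimal sketch of the second option, assuming a Trainer built roughly as in main.py (PyTorch Lightning 1.6.x API, where the flag is still called `replace_sampler_ddp`; it was later renamed):

import pytorch_lightning as pl

# Keep the sampler configured in the DataLoader and stop Lightning from
# swapping it for a DistributedSampler under DDP. With this flag off, the
# sampler must shard the data across ranks itself, otherwise each GPU
# processes the full dataset every epoch.
trainer = pl.Trainer(
    accelerator="gpu",
    devices="auto",
    max_epochs=2,
    replace_sampler_ddp=False,
)
trainer.fit(lm, ldm)  # lm / ldm as defined in main.py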

Antoine101 (Owner) commented

Hi Glenn,

I'm glad you find my work useful!

I did not implement multi-GPU support in this code, as I only have a single GPU myself and didn't have time to make the example exhaustive.
Unfortunately, I probably won't have time to implement it in the near future either.

You can fork my work, start from there with the help of Lightning's docs, and perhaps suggest a modification to my code.

Cheers

Antoine
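
For anyone picking this up, a sketch of the other remedy named in the exception: build the validation DataLoader without an explicit sampler and let Lightning wrap it with a `DistributedSampler` on its own. The names below are illustrative, not the repo's actual code:

from torch.utils.data import DataLoader

def val_dataloader(self):
    # No sampler= argument: under DDP, Lightning inserts a DistributedSampler
    # automatically because replace_sampler_ddp defaults to True.
    return DataLoader(
        self.val_dataset,        # hypothetical attribute holding the CIFAR-100 val split
        batch_size=self.batch_size,
        shuffle=False,
        num_workers=self.num_workers,
    )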


finkga commented Oct 11, 2022 via email
