ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device

See original GitHub issue

I am running a .py file from a console. The environment has a pre-installed tf2.3_py3.6 kernel and 2 GPUs.

  • PyTorch Lightning version: 1.4.6
  • PyTorch version: 1.6.0+cu101
  • Python version: 3.6
  • OS: Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: see the nvidia-smi output below
  • How you installed PyTorch (conda, pip, source): pip

Additional context

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    57W / 300W |   2842MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Code:

# SRTagger, warmup_steps, total_training_steps, N_EPOCHS and data_module are
# defined earlier in the script (not shown here).
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger

model = SRTagger(
  n_classes=100,
  n_warmup_steps=warmup_steps,
  n_training_steps=total_training_steps
)

criterion = nn.BCELoss()



checkpoint_callback = ModelCheckpoint(
  dirpath="checkpoints_sample_2",
  filename="best-checkpoint",
  save_top_k=1,
  verbose=True,
  monitor="val_loss",
  mode="min"
)

logger = TensorBoardLogger("lightning_logs_2", name="SmartReply2")

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  progress_bar_refresh_rate=50,
  amp_level='O3',
  accelerator="ddp2"
)

print("here")

trainer.fit(model, data_module)

Error:

distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
2021-09-23 01:31:40.545020: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
**LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]**
Traceback (most recent call last):
  File "20210923_passDis_model_pc1.py", line 331, in <module>
    trainer.fit(model, data_module)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in _run
    self._pre_dispatch()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _pre_dispatch
    self.accelerator.pre_dispatch(self)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 104, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 342, in pre_dispatch
    self.configure_ddp()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 304, in configure_ddp
    LightningDistributedModule(self.model), device_ids=self.determine_ddp_device_ids(), **self._ddp_kwargs
  File "/home/pc/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 425, in __init__
    {p.device for p in module.parameters()},
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device None, and module parameters {device(type='cpu')}.
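
For reference, this ValueError is raised directly by torch.nn.parallel.DistributedDataParallel: device_ids points at GPU 0, but the parameters of the wrapped module are still on the CPU. Below is a minimal raw-PyTorch sketch of what DDP expects; Lightning normally performs this step internally, and the launcher-provided environment variables and the wrap_ddp helper name are only illustrative.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(model: torch.nn.Module) -> DDP:
    # Assumes a launcher (e.g. python -m torch.distributed.launch --use_env)
    # has set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    # The module's parameters must already live on that single GPU before
    # wrapping; a CPU module combined with device_ids=[0] raises exactly the
    # ValueError shown above.
    model = model.to(device)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)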

cc @justusschock @awaelchli

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

tchaton commented, Oct 22, 2021 (3 reactions)

Dear @aivanou,

I would advise to always use ddp.

  • ddp is distributed data parallel.
  • ddp2 is DP across multiple nodes.
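
A hedged sketch of that recommendation, written against the Trainer arguments from the report above (model, data_module, logger, the two callbacks and N_EPOCHS are the objects defined there; the 1.4-era string accelerator="ddp" is assumed):

import pytorch_lightning as pl

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  accelerator="ddp",  # one training process per GPU on this single node
)
trainer.fit(model, data_module)

The amp_level='O3' argument from the original snippet is left out here, since the O-levels are Apex optimization levels and only apply with the apex AMP backend.
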
awaelchli commented, Dec 5, 2021 (2 reactions)

Hi. OK, so the issue here is two-fold:

  1. Yes, it turns out ddp2 is broken right now. Some things got lost over time, it seems, and because this plugin has virtually no testing and very few interested users, we didn’t notice until now. I have managed to make adjustments to fix it locally (just using duct tape and WD-40), but only for PyTorch < 1.9. For higher PyTorch versions, there is the following issue.

  2. I noticed that PyTorch recently stopped accepting multiple entries in device_ids for the DistributedDataParallel module. The error was added upstream and released with PyTorch 1.9 (ValueError: device_ids can only be None or contain a single element.); more info under the section “Distributed”. My understanding is that passing multiple device ids is fundamental to how the DDP2 plugin works, and without it we would probably need quite a large workaround. I do not yet know how to simulate the previous behaviour of our DDP2 under PyTorch >= 1.9; it looks like the DDP2 plugin would have to be reimplemented.

In the short term I unfortunately cannot offer you a workaround. Note, however, that as stated before there is no reason to use DDP2 over DP for single-node use, so please use DP in that case.

If there is a desire for it, I can polish my fix and open a PR, but as I said, that would only apply to PyTorch < 1.9 (indeed, @pratikchhapolika reported this issue with PyTorch 1.6.0).
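
For the single-node case, the DP suggestion above would look roughly like this (again just a sketch reusing the objects from the report; DP runs in a single process and splits each batch across the two GPUs, so no DistributedDataParallel wrapping is involved at all):

import pytorch_lightning as pl

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  accelerator="dp",  # single process; each batch is split across both GPUs
)
trainer.fit(model, data_module)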
