ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device

See original GitHub issue

I am running a .py file from a console. The environment has a pre-installed tf2.3_py3.6 kernel and 2 GPUs.

  • PyTorch Lightning version: 1.4.6
  • PyTorch version: 1.6.0+cu101
  • Python version: 3.6
  • OS: Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: see the nvidia-smi output below
  • How you installed PyTorch (conda, pip, source): pip

Additional context

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    57W / 300W |   2842MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Code:

# SRTagger, warmup_steps, total_training_steps, N_EPOCHS and data_module are
# defined earlier in the script (not shown here).
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger

model = SRTagger(
  n_classes=100,
  n_warmup_steps=warmup_steps,
  n_training_steps=total_training_steps
)

criterion = nn.BCELoss()



checkpoint_callback = ModelCheckpoint(
  dirpath="checkpoints_sample_2",
  filename="best-checkpoint",
  save_top_k=1,
  verbose=True,
  monitor="val_loss",
  mode="min"
)

logger = TensorBoardLogger("lightning_logs_2", name="SmartReply2")

early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  progress_bar_refresh_rate=50,
  amp_level='O3',
  accelerator="ddp2"
)

print("here")

trainer.fit(model, data_module)

Error:

distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
2021-09-23 01:31:40.545020: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
**LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]**
Traceback (most recent call last):
  File "20210923_passDis_model_pc1.py", line 331, in <module>
    trainer.fit(model, data_module)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in _run
    self._pre_dispatch()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _pre_dispatch
    self.accelerator.pre_dispatch(self)
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 104, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 342, in pre_dispatch
    self.configure_ddp()
  File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 304, in configure_ddp
    LightningDistributedModule(self.model), device_ids=self.determine_ddp_device_ids(), **self._ddp_kwargs
  File "/home/pc/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 425, in __init__
    {p.device for p in module.parameters()},
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device None, and module parameters {device(type='cpu')}.
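
For reference, this ValueError is raised directly by torch.nn.parallel.DistributedDataParallel: device_ids points at GPU 0, but the parameters of the wrapped module are still on the CPU. Below is a minimal raw-PyTorch sketch of what DDP expects; Lightning normally performs this step internally, and the launcher-provided environment variables and the wrap_ddp helper name are only illustrative.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(model: torch.nn.Module) -> DDP:
    # Assumes a launcher (e.g. python -m torch.distributed.launch --use_env)
    # has set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    # The module's parameters must already live on that single GPU before
    # wrapping; a CPU module combined with device_ids=[0] raises exactly the
    # ValueError shown above.
    model = model.to(device)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)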

cc @justusschock @awaelchli

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

tchaton commented, Oct 22, 2021 (3 reactions)

Dear @aivanou,

I would advise to always use ddp.

  • ddp is distributed data parallel.
  • ddp2 is DP across multiple nodes.
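
A hedged sketch of that recommendation, written against the Trainer arguments from the report above (model, data_module, logger, the two callbacks and N_EPOCHS are the objects defined there; the 1.4-era string accelerator="ddp" is assumed):

import pytorch_lightning as pl

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  accelerator="ddp",  # one training process per GPU on this single node
)
trainer.fit(model, data_module)

The amp_level='O3' argument from the original snippet is left out here, since the O-levels are Apex optimization levels and only apply with the apex AMP backend.
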
awaelchli commented, Dec 5, 2021 (2 reactions)

Hi. OK, so the issue here is two-fold:

  1. Yes, it turns out ddp2 is broken right now. Some things got lost over time, it seems, and because this plugin has virtually no testing and very few interested users, we didn’t notice until now. I have managed to make adjustments to fix it locally (just using duct tape and WD-40), but only for PyTorch < 1.9. For higher PyTorch versions, there is the following issue.

  2. I noticed that PyTorch recently stopped accepting multiple entries in device_ids for the DistributedDataParallel module. The error was added upstream and released with PyTorch 1.9 (ValueError: device_ids can only be None or contain a single element.); more info under the section “Distributed”. My understanding is that passing multiple device ids is fundamental to how the DDP2 plugin works, and without it we would probably need quite a large workaround. I do not yet know how to simulate the previous behaviour of our DDP2 under PyTorch >= 1.9; it looks like the DDP2 plugin would have to be reimplemented.

In the short term I unfortunately cannot offer you a workaround. Note, however, that as stated before there is no reason to use DDP2 over DP for single-node use, so please use DP in that case.

If there is a desire for it, I can polish my fix and open a PR, but as I said, that would only apply to PyTorch < 1.9 (indeed, @pratikchhapolika reported this issue with PyTorch 1.6.0).
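
For the single-node case, the DP suggestion above would look roughly like this (again just a sketch reusing the objects from the report; DP runs in a single process and splits each batch across the two GPUs, so no DistributedDataParallel wrapping is involved at all):

import pytorch_lightning as pl

trainer = pl.Trainer(
  logger=logger,
  callbacks=[early_stopping_callback, checkpoint_callback],
  max_epochs=N_EPOCHS,
  gpus=[0, 1],
  accelerator="dp",  # single process; each batch is split across both GPUs
)
trainer.fit(model, data_module)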
