ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device
I am running a .py file from a Console that has a pre-installed tf2.3_py3.6 kernel; the machine has 2 GPUs.
- PyTorch Lightning Version (e.g., 1.3.0): 1.4.6
- PyTorch Version (e.g., 1.8): 1.6.0+cu101
- Python version: 3.6
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: 11.2
- GPU models and configuration: Mentioned below
- How you installed PyTorch (conda, pip, source): pip
Additional context
NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:62:00.0 Off | 0 |
| N/A 36C P0 57W / 300W | 2842MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 32C P0 43W / 300W | 3MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Code:
# Imports inferred from the snippet (SRTagger, data_module, N_EPOCHS,
# warmup_steps and total_training_steps are defined elsewhere in the script).
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger

model = SRTagger(
    n_classes=100,
    n_warmup_steps=warmup_steps,
    n_training_steps=total_training_steps
)
criterion = nn.BCELoss()

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints_sample_2",
    filename="best-checkpoint",
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min"
)
logger = TensorBoardLogger("lightning_logs_2", name="SmartReply2")
early_stopping_callback = EarlyStopping(monitor="val_loss", patience=2)

trainer = pl.Trainer(
    logger=logger,
    callbacks=[early_stopping_callback, checkpoint_callback],
    max_epochs=N_EPOCHS,
    gpus=[0, 1],
    progress_bar_refresh_rate=50,
    amp_level='O3',
    accelerator="ddp2"  # this setting triggers the error below
)
print("here")
trainer.fit(model, data_module)
Error:
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
2021-09-23 01:31:40.545020: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Traceback (most recent call last):
File "20210923_passDis_model_pc1.py", line 331, in <module>
trainer.fit(model, data_module)
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
self._run(model)
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in _run
self._pre_dispatch()
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _pre_dispatch
self.accelerator.pre_dispatch(self)
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 104, in pre_dispatch
self.training_type_plugin.pre_dispatch()
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 342, in pre_dispatch
self.configure_ddp()
File "/home/pc/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 304, in configure_ddp
LightningDistributedModule(self.model), device_ids=self.determine_ddp_device_ids(), **self._ddp_kwargs
File "/home/pc/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 425, in __init__
{p.device for p in module.parameters()},
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device None, and module parameters {device(type='cpu')}.
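For context, the exception is raised by a sanity check inside torch.nn.parallel.DistributedDataParallel itself: device_ids may only be passed when the wrapped module already sits on a single GPU, whereas here Lightning handed DDP a module whose parameters were still on the CPU (see the module parameters {device(type='cpu')} in the message). The following is a hypothetical, minimal sketch of that check in isolation, using a single process and the gloo backend; it is not code from the issue, and on older PyTorch releases the same condition may surface as an AssertionError with slightly different wording.

# Hypothetical minimal repro (not from the issue): wrapping a module whose
# parameters are still on the CPU while also passing device_ids trips the
# same check inside DistributedDataParallel.
import os
import torch.distributed as dist
import torch.nn as nn

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

cpu_module = nn.Linear(8, 2)  # parameters live on the CPU, never moved to a GPU
ddp = nn.parallel.DistributedDataParallel(cpu_module, device_ids=[0])
# -> ValueError: DistributedDataParallel device_ids and output_device arguments
#    only work with single-device/multiple-device GPU modules or CPU modules,
#    but got device_ids [0], output_device None, and module parameters
#    {device(type='cpu')}.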
Dear @aivanou, I would advise to always use ddp.

Hi, OK, so the issue here is two-fold:
Yes, it turns out ddp2 is broken right now. Some things got lost over time, it seems, and due to virtually no testing of this plugin plus very few users who have interest in it, we didn't notice until now. I have managed to make adjustments to get it fixed locally (just using duct tape and WD-40), however, only for pytorch < 1.9. For higher pytorch versions, there is the following issue.
I noticed that pytorch recently stopped supporting device_ids in the DistributedDataParallel module. They added the error here, which was released with pytorch 1.9 (ValueError: device_ids can only be None or contain a single element.). More info here under the section "Distributed". My understanding is that this is fundamental to how the DDP2 plugin works, and without it we would probably need quite a large workaround. I do not yet know how to simulate the previous behavior of our DDP2 under pytorch >= 1.9; it looks like the DDP2 plugin would have to be reimplemented.

For the short term, I can unfortunately not offer you a workaround. However, note that, as stated before, there is no reason to use DDP2 over DP for single-node use. So please use DP for that case.
If there is a desire for it, I can polish my fix and open a PR, but as I said, that would only apply to pytorch < 1.9 (indeed, @pratikchhapolika reported this issue with pytorch 1.6.0).
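For anyone who lands here with the same single-node setup, below is a minimal sketch of the switch the maintainer recommends. It is editor-added rather than maintainer code: it reuses the model, callbacks and logger from the snippet above and simply replaces accelerator="ddp2" with "dp" (the apex amp_level flag is omitted for brevity).

# Hypothetical single-node workaround (reusing model, logger and callbacks
# from the earlier snippet): avoid the broken ddp2 plugin entirely.
trainer = pl.Trainer(
    logger=logger,
    callbacks=[early_stopping_callback, checkpoint_callback],
    max_epochs=N_EPOCHS,
    gpus=[0, 1],
    progress_bar_refresh_rate=50,
    accelerator="dp",  # or "ddp"; both work for a single node with 2 GPUs
)
trainer.fit(model, data_module)

With "dp", a single process feeds split batches to the two V100s on this machine, which is effectively what ddp2 would have done on one node anyway.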