Usage with distributed pytorch + NCCL
I’m trying to use LocalCUDACluster with the PyTorch (1.7.0) native distributed framework, with NCCL (2.7.8) as the communication backend. I have 2 GPUs, and I spawn 2 workers and initialize a process group. However, NCCL complains that
user:568:606 [1] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 3000
user:567:605 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 3000
and process group initialization fails.
Here is a minimal snippet to reproduce:
# test_bug.py
def run():
    import torch
    import dask
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    dask.config.set({"distributed.worker.daemon": False})
    # dask.config.set({"distributed.scheduler.work-stealing": False})

    cluster = LocalCUDACluster()
    client = Client(cluster)

    workers_info = client.scheduler_info().get("workers")
    master_addr = "localhost"
    master_port = 2345
    world_size = len(workers_info)
    print("world_size: ", world_size)

    # Submit one task per worker; each task joins the torch.distributed group.
    futures = [
        client.submit(
            dask_spawner,
            i=i,
            master_addr=master_addr,
            master_port=master_port,
            world_size=world_size,
            workers_info=workers_info,
        )
        for i in range(world_size)
    ]
    client.gather(futures)


def dask_spawner(i, master_addr, master_port, world_size, workers_info):
    import os
    import torch
    import torch.distributed as dist
    from dask.distributed import get_worker

    # Use the Dask worker's id as the torch.distributed rank.
    worker = get_worker()
    worker_address = worker.address
    rank = workers_info[worker_address].get("id")
    local_rank = rank

    init_method = f"tcp://{master_addr}:{master_port}"
    dist.init_process_group("nccl", init_method=init_method, rank=rank, world_size=world_size)
    dist.barrier()
    torch.rand(10, 3, 64, 64, device="cuda")
    dist.destroy_process_group()


if __name__ == "__main__":
    run()
Run it like NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python test_bug.py
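For reference, and not necessarily the cause here: in a plain, non-Dask multi-process setup this NCCL warning usually means every rank defaulted to the same device, and the standard mitigation is to pin each rank with torch.cuda.set_device before init_process_group. A minimal sketch of that pattern, independent of Dask-CUDA and only illustrative:
# standalone_pin.py -- illustrative only, not part of the reproduction above
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # Pin this process to its own GPU; without this every rank ends up on cuda:0
    # and NCCL reports "Duplicate GPU detected".
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        "nccl",
        init_method="tcp://localhost:2345",
        rank=local_rank,
        world_size=world_size,
    )
    dist.barrier()
    x = torch.rand(10, 3, 64, 64, device="cuda")
    dist.all_reduce(x)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)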
Versions:
torch 1.8.0.dev20201117
torchvision 0.9.0.dev20201125
dask 2.30.0
dask-cuda 0.16.0
Any hints on this, please?
EDIT:
The above code works (does not raise an error) with lower versions of PyTorch/NCCL (1.6.0 and ~2.4.x), or if the backend is changed from “nccl” to “gloo”. Maybe Gloo is more permissive.
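For clarity, the Gloo variant mentioned above only requires changing the backend string in dask_spawner; the rest of the snippet stays the same:
# only change in dask_spawner: use the Gloo backend instead of NCCL
dist.init_process_group("gloo", init_method=init_method, rank=rank, world_size=world_size)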
I think the original question has been answered, so I’m tentatively closing this. @vfdev-5, please feel free to reopen if there are follow-up questions, or open a new issue should you want to discuss other ways to use Dask-CUDA with PyTorch.
To be honest, I don’t know what the appropriate way to deal with that would be, but using client.submit seems like a good starting point. In case you haven’t seen those yet, I think the following links may provide you with some additional inspiration too:
https://docs.dask.org/en/latest/gpu.html
https://ml.dask.org/pytorch.html
https://examples.dask.org/machine-learning/torch-prediction.html
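As a rough illustration of the client.submit route (a general Dask pattern, not something verified for this exact setup): each rank’s task can be pinned to one specific worker via the workers= and allow_other_workers= keywords of client.submit, so the scheduler cannot place two ranks on the same GPU worker. A sketch assuming the same run()/dask_spawner structure as the snippet above:
# Sketch: pin each submitted task to one specific Dask-CUDA worker so that
# scheduling/work stealing cannot put two ranks on the same worker/GPU.
futures = [
    client.submit(
        dask_spawner,
        i=i,
        master_addr=master_addr,
        master_port=master_port,
        world_size=world_size,
        workers_info=workers_info,
        workers=[address],          # run this task only on this worker
        allow_other_workers=False,  # do not let the scheduler move it
        pure=False,                 # side-effecting task, do not deduplicate
    )
    for i, address in enumerate(workers_info)
]
client.gather(futures)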