Usage with distributed pytorch + NCCL


I’m trying to use LocalCUDACluster with PyTorch’s (1.7.0) native distributed framework, with NCCL (2.7.8) as the communication backend. I have 2 GPUs, so I spawn 2 workers and initialize a process group. However, NCCL complains:

user:568:606 [1] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 3000
user:567:605 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 3000 

and process group initialization fails.
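To confirm which physical GPU each worker actually sees (the warning says both ranks landed on the same device), a quick per-worker check like the one below can help. This is only a diagnostic sketch, separate from the failing program; as far as I understand, LocalCUDACluster rotates CUDA_VISIBLE_DEVICES per worker, so device 0 should map to a different physical GPU on each worker.

# check_devices.py -- diagnostic sketch only, not part of the failing program
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def visible_devices():
    # runs on a worker; reports the GPUs that worker is allowed to use
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    # client.run executes the function once on every worker and returns a dict
    # keyed by worker address, e.g. {"tcp://127.0.0.1:34567": "0,1", ...}
    print(client.run(visible_devices))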

Here is a minimal snippet to reproduce:

# test_bug.py

def run():

    import torch
    import dask
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    dask.config.set({"distributed.worker.daemon": False})
    # dask.config.set({"distributed.scheduler.work-stealing": False})

    cluster = LocalCUDACluster()
    client = Client(cluster)

    workers_info = client.scheduler_info().get("workers")
    master_addr = "localhost"
    master_port = 2345
    world_size = len(workers_info)

    print("world_size: ", world_size)

    futures = [
        client.submit(
            dask_spawner,
            i=i,
            master_addr=master_addr,
            master_port=master_port,
            world_size=world_size,
            workers_info=workers_info
        )
        for i in range(world_size)
    ]
    client.gather(futures)


def dask_spawner(i, master_addr, master_port, world_size, workers_info):

    import os
    import torch
    import torch.distributed as dist

    from dask.distributed import get_worker

    # use the Dask worker id as the distributed rank
    worker = get_worker()
    worker_address = worker.address
    rank = workers_info[worker_address].get("id")
    local_rank = rank

    init_method = f"tcp://{master_addr}:{master_port}"
    dist.init_process_group("nccl", init_method=init_method, rank=rank, world_size=world_size)
    dist.barrier()
    torch.rand(10, 3, 64, 64, device="cuda")
    dist.destroy_process_group()    

if __name__ == "__main__":
    run()

Run it with: NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python test_bug.py

Versions:

torch                  1.8.0.dev20201117
torchvision            0.9.0.dev20201125
dask                   2.30.0
dask-cuda              0.16.0

Any hints on this, please?

EDIT:

The above code can work (it does not raise an error) with an older version of PyTorch/NCCL (1.6.0 and ~2.4.x), or when changing the backend from “nccl” to “gloo”. Maybe Gloo is simply more permissive.
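One thing I have not verified for this issue, but that might help on the NCCL side: pin each rank’s future to a distinct worker (without pinning, nothing forces the scheduler to put the two tasks on different workers, and the warning says both ranks ended up on the same device). A sketch of the submit loop, using the standard workers=/allow_other_workers= arguments of Client.submit:

# Workaround sketch (untested for this issue): submit rank i only to the worker
# whose id is i, so two ranks can never share a GPU.
addresses = sorted(workers_info, key=lambda a: workers_info[a]["id"])

futures = [
    client.submit(
        dask_spawner,
        i=i,
        master_addr=master_addr,
        master_port=master_port,
        world_size=world_size,
        workers_info=workers_info,
        workers=[addresses[i]],       # run this task only on that worker
        allow_other_workers=False,    # keep work stealing from moving it
    )
    for i in range(world_size)
]
client.gather(futures)

Inside dask_spawner, as far as I understand, each dask-cuda worker sees its assigned GPU as CUDA device 0, so calling torch.cuda.set_device(0) before init_process_group should also be safe; I have not confirmed that this alone resolves the error.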

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented, Nov 30, 2020

I think the original question has been answered, so I’m tentatively closing this. @vfdev-5, please feel free to reopen if there are follow-up questions, or open a new issue should you want to discuss other ways to use Dask-CUDA with PyTorch.

1 reaction
pentschev commented, Nov 27, 2020

To be honest, I don’t know what the appropriate way to deal with that would be, but using client.submit seems like a good starting point. In case you haven’t seen those yet, I think the following links may provide some additional inspiration too:

https://docs.dask.org/en/latest/gpu.html
https://ml.dask.org/pytorch.html
https://examples.dask.org/machine-learning/torch-prediction.html
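
As a minimal illustration of that client.submit starting point (a sketch, not taken from the thread or from those links): an independent PyTorch computation on each GPU worker, with no process group involved, can look like this:

# Sketch: one independent PyTorch task per GPU worker, no process group.
import torch
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def gpu_task(seed):
    # each dask-cuda worker sees its assigned GPU as the default CUDA device
    torch.manual_seed(seed)
    x = torch.rand(1024, 1024, device="cuda")
    return float(x.sum())


if __name__ == "__main__":
    client = Client(LocalCUDACluster())
    futures = [client.submit(gpu_task, seed) for seed in range(2)]
    print(client.gather(futures))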
