Usage with distributed pytorch + NCCL


I’m trying to use LocalCUDACluster with PyTorch’s (1.7.0) native distributed framework, with NCCL (2.7.8) as the communication backend. I have 2 GPUs, so I spawn 2 workers and initialize a process group. However, NCCL complains:

user:568:606 [1] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 3000
user:567:605 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 3000 

and process group initialization fails.
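To confirm which physical GPU each worker actually sees (the warning says both ranks landed on the same device), a quick per-worker check like the one below can help. This is only a diagnostic sketch, separate from the failing program; as far as I understand, LocalCUDACluster rotates CUDA_VISIBLE_DEVICES per worker, so device 0 should map to a different physical GPU on each worker.

# check_devices.py -- diagnostic sketch only, not part of the failing program
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def visible_devices():
    # runs on a worker; reports the GPUs that worker is allowed to use
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    # client.run executes the function once on every worker and returns a dict
    # keyed by worker address, e.g. {"tcp://127.0.0.1:34567": "0,1", ...}
    print(client.run(visible_devices))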

Here is a minimal snippet to reproduce:

# test_bug.py

def run():

    import torch
    import dask
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    dask.config.set({"distributed.worker.daemon": False})
    # dask.config.set({"distributed.scheduler.work-stealing": False})

    cluster = LocalCUDACluster()
    client = Client(cluster)

    workers_info = client.scheduler_info().get("workers")
    master_addr = "localhost"
    master_port = 2345
    world_size = len(workers_info)

    print("world_size: ", world_size)

    futures = [
        client.submit(
            dask_spawner,
            i=i,
            master_addr=master_addr,
            master_port=master_port,
            world_size=world_size,
            workers_info=workers_info
        )
        for i in range(world_size)
    ]
    client.gather(futures)


def dask_spawner(i, master_addr, master_port, world_size, workers_info):

    import os
    import torch
    import torch.distributed as dist

    from dask.distributed import get_worker

    # use the Dask worker id as the distributed rank
    worker = get_worker()
    worker_address = worker.address
    rank = workers_info[worker_address].get("id")
    local_rank = rank

    init_method = f"tcp://{master_addr}:{master_port}"
    dist.init_process_group("nccl", init_method=init_method, rank=rank, world_size=world_size)
    dist.barrier()
    torch.rand(10, 3, 64, 64, device="cuda")
    dist.destroy_process_group()    

if __name__ == "__main__":
    run()

Run it with: NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python test_bug.py

Versions:

torch                  1.8.0.dev20201117
torchvision            0.9.0.dev20201125
dask                   2.30.0
dask-cuda              0.16.0

Any hints on this, please?

EDIT:

The above code can work (it does not raise an error) with an older version of PyTorch/NCCL (1.6.0 and ~2.4.x), or when changing the backend from “nccl” to “gloo”. Maybe Gloo is simply more permissive.
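One thing I have not verified for this issue, but that might help on the NCCL side: pin each rank’s future to a distinct worker (without pinning, nothing forces the scheduler to put the two tasks on different workers, and the warning says both ranks ended up on the same device). A sketch of the submit loop, using the standard workers=/allow_other_workers= arguments of Client.submit:

# Workaround sketch (untested for this issue): submit rank i only to the worker
# whose id is i, so two ranks can never share a GPU.
addresses = sorted(workers_info, key=lambda a: workers_info[a]["id"])

futures = [
    client.submit(
        dask_spawner,
        i=i,
        master_addr=master_addr,
        master_port=master_port,
        world_size=world_size,
        workers_info=workers_info,
        workers=[addresses[i]],       # run this task only on that worker
        allow_other_workers=False,    # keep work stealing from moving it
    )
    for i in range(world_size)
]
client.gather(futures)

Inside dask_spawner, as far as I understand, each dask-cuda worker sees its assigned GPU as CUDA device 0, so calling torch.cuda.set_device(0) before init_process_group should also be safe; I have not confirmed that this alone resolves the error.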

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented, Nov 30, 2020

I think the original question has been answered, so I’m tentatively closing this. @vfdev-5, please feel free to reopen if there are follow-up questions, or open a new issue should you want to discuss other ways to use Dask-CUDA with PyTorch.

1 reaction
pentschev commented, Nov 27, 2020

To be honest, I don’t know what the appropriate way to deal with that would be, but using client.submit seems like a good starting point. In case you haven’t seen those yet, I think the following links may provide some additional inspiration too:

https://docs.dask.org/en/latest/gpu.html
https://ml.dask.org/pytorch.html
https://examples.dask.org/machine-learning/torch-prediction.html
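
As a minimal illustration of that client.submit starting point (a sketch, not taken from the thread or from those links): an independent PyTorch computation on each GPU worker, with no process group involved, can look like this:

# Sketch: one independent PyTorch task per GPU worker, no process group.
import torch
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def gpu_task(seed):
    # each dask-cuda worker sees its assigned GPU as the default CUDA device
    torch.manual_seed(seed)
    x = torch.rand(1024, 1024, device="cuda")
    return float(x.sum())


if __name__ == "__main__":
    client = Client(LocalCUDACluster())
    futures = [client.submit(gpu_task, seed) for seed in range(2)]
    print(client.gather(futures))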
