NVMLError_Unknown: Unknown Error

See original GitHub issue

Issue description

I installed RAPIDS on WSL2 using the installation guide here. cuDF is working fine, but when I try to create a client using dask-cuda, I get the error NVMLError_Unknown: Unknown Error.

Steps to reproduce the issue

  1. Installed NVIDIA CUDA-WSL driver
  2. Installed WSL2 with Ubuntu 18.04
  3. Installed Miniconda on Ubuntu
  4. Installed RAPIDS and cudatoolkit 11.2 with the command
conda create -n rapids-21.10 -c rapidsai -c nvidia -c conda-forge \
    rapids-blazing=21.10 python=3.8 cudatoolkit=11.2
  5. Opened a Jupyter notebook and ran the following code:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
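
For reference, the same failure can be reproduced without Dask by calling pynvml directly, which is roughly what distributed/diagnostics/nvml.py does under the hood. A minimal diagnostic sketch (assuming pynvml is installed in the same conda environment):

import pynvml

# Initialize NVML; distributed does this before querying GPUs
pynvml.nvmlInit()
print("NVML device count:", pynvml.nvmlDeviceGetCount())

# distributed calls nvmlDeviceGetHandleByIndex on the first visible GPU;
# on this WSL2 setup, this is the call that raises NVMLError_Unknown
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU 0:", pynvml.nvmlDeviceGetName(handle))
print("Total memory:", pynvml.nvmlDeviceGetMemoryInfo(handle).total)

pynvml.nvmlShutdown()

If this script also fails with NVMLError_Unknown, the problem is in the NVML/driver layer on WSL2 rather than in dask-cuda itself.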

What’s the expected result?

A LocalCUDACluster starts and the client connects to it.

What’s the actual result?

Error: NVMLError_Unknown: Unknown Error

Additional details / screenshot

My GPU is an NVIDIA GeForce RTX 2060.

Full traceback:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/initialize.py", line 42, in _create_cuda_context
    ctx = has_cuda_context()
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 76, in has_cuda_context
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42977 instead
  warnings.warn(
---------------------------------------------------------------------------
NVMLError_Unknown                         Traceback (most recent call last)
/tmp/ipykernel_17853/3542590344.py in <module>
      2 from dask.distributed import Client
      3 
----> 4 cluster = LocalCUDACluster()
      5 client = Client(cluster)

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, jit_unspill, log_spilling, worker_class, **kwargs)
    344             )
    345 
--> 346         super().__init__(
    347             n_workers=0,
    348             threads_per_worker=threads_per_worker,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/local.py in __init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
    234         workers = {i: worker for i in range(n_workers)}
    235 
--> 236         super().__init__(
    237             name=name,
    238             scheduler=scheduler,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
    281         if not self.asynchronous:
    282             self._loop_runner.start()
--> 283             self.sync(self._start)
    284             self.sync(self._correct_state)
    285 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    212             return future
    213         else:
--> 214             return sync(self.loop, func, *args, **kwargs)
    215 
    216     def _log(self, log):

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    324     if error[0]:
    325         typ, exc, tb = error[0]
--> 326         raise exc.with_traceback(tb)
    327     else:
    328         return result[0]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in f()
    307             if callback_timeout is not None:
    308                 future = asyncio.wait_for(future, callback_timeout)
--> 309             result[0] = yield future
    310         except Exception:
    311             error[0] = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in _start(self)
    309             if isinstance(cls, str):
    310                 cls = import_term(cls)
--> 311             self.scheduler = cls(**self.scheduler_spec.get("options", {}))
    312             self.scheduler = await self.scheduler
    313         self.scheduler_comm = rpc(

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, loop, delete_interval, synchronize_worker_interval, services, service_kwargs, allowed_failures, extensions, validate, scheduler_file, security, worker_ttl, idle_timeout, interface, host, port, protocol, dashboard_address, dashboard, http_prefix, preload, preload_argv, plugins, **kwargs)
   3797         connection_limit = get_fileno_limit() / 2
   3798 
-> 3799         super().__init__(
   3800             aliases=aliases,
   3801             handlers=self.handlers,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, aliases, clients, workers, host_info, resources, tasks, unrunnable, validate, **kwargs)
   1978         self._transition_counter = 0
   1979 
-> 1980         super().__init__(**kwargs)
   1981 
   1982     @property

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/core.py in __init__(self, handlers, blocked_handlers, stream_handlers, connection_limit, deserialize, serializers, deserializers, connection_args, timeout, io_loop)
    158         self._comms = {}
    159         self.deserialize = deserialize
--> 160         self.monitor = SystemMonitor()
    161         self.counters = None
    162         self.digests = None

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/system_monitor.py in __init__(self, n)
     57 
     58         if nvml.device_get_count() > 0:
---> 59             gpu_extra = nvml.one_time()
     60             self.gpu_name = gpu_extra["name"]
     61             self.gpu_memory_total = gpu_extra["memory-total"]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in one_time()
     91 
     92 def one_time():
---> 93     h = _pynvml_handles()
     94     return {
     95         "memory-total": pynvml.nvmlDeviceGetMemoryInfo(h).total,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in _pynvml_handles()
     61         cuda_visible_devices = list(range(count))
     62     gpu_idx = cuda_visible_devices[0]
---> 63     return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
     64 
     65 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetHandleByIndex(index)
   1574     fn = _nvmlGetFunctionPointer("nvmlDeviceGetHandleByIndex_v2")
   1575     ret = fn(c_index, byref(device))
-> 1576     _nvmlCheckReturn(ret)
   1577     return device
   1578 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_Unknown: Unknown Error

One thing I noticed is that, despite installing CUDA 11.2 from here, nvidia-smi reports that I am using CUDA 11.6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P5    20W /  N/A |    164MiB /  6144MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
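
As an aside, the "CUDA Version: 11.6" in the nvidia-smi header reflects the highest CUDA version the installed driver supports, not the cudatoolkit 11.2 package inside the conda environment, so the two numbers can legitimately differ. A small sketch to read the driver-side values via pynvml (assuming pynvml is available and exposes nvmlSystemGetCudaDriverVersion, which returns an integer such as 11060 for 11.6):

import pynvml

pynvml.nvmlInit()
# Driver version string, as shown in the nvidia-smi header
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
# Highest CUDA version the driver supports, encoded as major*1000 + minor*10
cuda = pynvml.nvmlSystemGetCudaDriverVersion()
print("Driver CUDA support: %d.%d" % (cuda // 1000, (cuda % 1000) // 10))
pynvml.nvmlShutdown()

The toolkit version inside the environment can be checked separately with conda list cudatoolkit.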

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
quasiben commented, Oct 27, 2021

Thanks @pentschev for helping to resolve this. While we don’t know the root cause, I’m going to close this for now while we continue working on better WSL2 support.

0 reactions
lmeyerov commented, Dec 24, 2021

@quasiben @pentschev I drilled down a bit and traced it to this: https://github.com/rapidsai/cudf/issues/9955


