NVMLError_Unknown: Unknown Error

See original GitHub issue

Issue description

I installed RAPIDS on WSL2 using the installation guide here. cuDF is working fine, but when I try to create a client using dask-cuda, I get the error NVMLError_Unknown: Unknown Error.

Steps to reproduce the issue

  1. Installed NVIDIA CUDA-WSL driver
  2. Installed WSL2 with Ubuntu 18.04
  3. Installed Miniconda on Ubuntu
  4. Installed RAPIDS and cudatoolkit 11.2 with the command
conda create -n rapids-21.10 -c rapidsai -c nvidia -c conda-forge \
    rapids-blazing=21.10 python=3.8 cudatoolkit=11.2
  5. Opened a Jupyter notebook and ran the following code:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
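
For reference, the same failure can be reproduced without Dask by calling pynvml directly, which is roughly what distributed/diagnostics/nvml.py does under the hood. A minimal diagnostic sketch (assuming pynvml is installed in the same conda environment):

import pynvml

# Initialize NVML; distributed does this before querying GPUs
pynvml.nvmlInit()
print("NVML device count:", pynvml.nvmlDeviceGetCount())

# distributed calls nvmlDeviceGetHandleByIndex on the first visible GPU;
# on this WSL2 setup, this is the call that raises NVMLError_Unknown
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("GPU 0:", pynvml.nvmlDeviceGetName(handle))
print("Total memory:", pynvml.nvmlDeviceGetMemoryInfo(handle).total)

pynvml.nvmlShutdown()

If this script also fails with NVMLError_Unknown, the problem is in the NVML/driver layer on WSL2 rather than in dask-cuda itself.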

What’s the expected result?

A LocalCUDACluster starts and the client connects to it.

What’s the actual result?

Error: NVMLError_Unknown: Unknown Error

Additional details / screenshot

My GPU is an NVIDIA GeForce RTX 2060.

Full traceback:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/initialize.py", line 42, in _create_cuda_context
    ctx = has_cuda_context()
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 76, in has_cuda_context
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42977 instead
  warnings.warn(
---------------------------------------------------------------------------
NVMLError_Unknown                         Traceback (most recent call last)
/tmp/ipykernel_17853/3542590344.py in <module>
      2 from dask.distributed import Client
      3 
----> 4 cluster = LocalCUDACluster()
      5 client = Client(cluster)

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, jit_unspill, log_spilling, worker_class, **kwargs)
    344             )
    345 
--> 346         super().__init__(
    347             n_workers=0,
    348             threads_per_worker=threads_per_worker,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/local.py in __init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
    234         workers = {i: worker for i in range(n_workers)}
    235 
--> 236         super().__init__(
    237             name=name,
    238             scheduler=scheduler,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
    281         if not self.asynchronous:
    282             self._loop_runner.start()
--> 283             self.sync(self._start)
    284             self.sync(self._correct_state)
    285 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    212             return future
    213         else:
--> 214             return sync(self.loop, func, *args, **kwargs)
    215 
    216     def _log(self, log):

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    324     if error[0]:
    325         typ, exc, tb = error[0]
--> 326         raise exc.with_traceback(tb)
    327     else:
    328         return result[0]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in f()
    307             if callback_timeout is not None:
    308                 future = asyncio.wait_for(future, callback_timeout)
--> 309             result[0] = yield future
    310         except Exception:
    311             error[0] = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in _start(self)
    309             if isinstance(cls, str):
    310                 cls = import_term(cls)
--> 311             self.scheduler = cls(**self.scheduler_spec.get("options", {}))
    312             self.scheduler = await self.scheduler
    313         self.scheduler_comm = rpc(

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, loop, delete_interval, synchronize_worker_interval, services, service_kwargs, allowed_failures, extensions, validate, scheduler_file, security, worker_ttl, idle_timeout, interface, host, port, protocol, dashboard_address, dashboard, http_prefix, preload, preload_argv, plugins, **kwargs)
   3797         connection_limit = get_fileno_limit() / 2
   3798 
-> 3799         super().__init__(
   3800             aliases=aliases,
   3801             handlers=self.handlers,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, aliases, clients, workers, host_info, resources, tasks, unrunnable, validate, **kwargs)
   1978         self._transition_counter = 0
   1979 
-> 1980         super().__init__(**kwargs)
   1981 
   1982     @property

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/core.py in __init__(self, handlers, blocked_handlers, stream_handlers, connection_limit, deserialize, serializers, deserializers, connection_args, timeout, io_loop)
    158         self._comms = {}
    159         self.deserialize = deserialize
--> 160         self.monitor = SystemMonitor()
    161         self.counters = None
    162         self.digests = None

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/system_monitor.py in __init__(self, n)
     57 
     58         if nvml.device_get_count() > 0:
---> 59             gpu_extra = nvml.one_time()
     60             self.gpu_name = gpu_extra["name"]
     61             self.gpu_memory_total = gpu_extra["memory-total"]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in one_time()
     91 
     92 def one_time():
---> 93     h = _pynvml_handles()
     94     return {
     95         "memory-total": pynvml.nvmlDeviceGetMemoryInfo(h).total,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in _pynvml_handles()
     61         cuda_visible_devices = list(range(count))
     62     gpu_idx = cuda_visible_devices[0]
---> 63     return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
     64 
     65 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetHandleByIndex(index)
   1574     fn = _nvmlGetFunctionPointer("nvmlDeviceGetHandleByIndex_v2")
   1575     ret = fn(c_index, byref(device))
-> 1576     _nvmlCheckReturn(ret)
   1577     return device
   1578 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_Unknown: Unknown Error

One thing I noticed is that, despite installing CUDA 11.2 from here, nvidia-smi reports that I am using CUDA 11.6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P5    20W /  N/A |    164MiB /  6144MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
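
As an aside, the "CUDA Version: 11.6" in the nvidia-smi header reflects the highest CUDA version the installed driver supports, not the cudatoolkit 11.2 package inside the conda environment, so the two numbers can legitimately differ. A small sketch to read the driver-side values via pynvml (assuming pynvml is available and exposes nvmlSystemGetCudaDriverVersion, which returns an integer such as 11060 for 11.6):

import pynvml

pynvml.nvmlInit()
# Driver version string, as shown in the nvidia-smi header
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
# Highest CUDA version the driver supports, encoded as major*1000 + minor*10
cuda = pynvml.nvmlSystemGetCudaDriverVersion()
print("Driver CUDA support: %d.%d" % (cuda // 1000, (cuda % 1000) // 10))
pynvml.nvmlShutdown()

The toolkit version inside the environment can be checked separately with conda list cudatoolkit.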

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
quasiben commented, Oct 27, 2021

Thanks @pentschev for helping to resolve this. While we don’t know the root cause, I’m going to close this for now while we continue working on better WSL2 support.

0 reactions
lmeyerov commented, Dec 24, 2021

@quasiben @pentschev I drilled down a bit and traced it to this: https://github.com/rapidsai/cudf/issues/9955


