NVMLError_Unknown: Unknown Error
Issue description
I installed RAPIDS on WSL2 using the installation guide here. cuDF is working fine, but when I try to create a client using dask-cuda, I get the error NVMLError_Unknown: Unknown Error.
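As a quick illustration of "cuDF is working fine" (the snippet below is a sketch, not taken from the original report), a basic cuDF operation runs without error in the same environment:

import cudf

# Hypothetical sanity check: a simple cuDF operation executes on the GPU
df = cudf.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())  # prints 6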
Steps to reproduce the issue
- Installed NVIDIA CUDA-WSL driver
- Installed WSL2 with Ubuntu 18.04
- Installed Miniconda on Ubuntu
- Installed RAPIDS and cudatoolkit 11.2 with the command
conda create -n rapids-21.10 -c rapidsai -c nvidia -c conda-forge \
rapids-blazing=21.10 python=3.8 cudatoolkit=11.2
- Opened a Jupyter notebook and ran the following code:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
What’s the expected result?
The CUDA cluster starts and the client connects to it.
What’s the actual result?
Error: NVMLError_Unknown: Unknown Error
Additional details / screenshot
My GPU is an NVIDIA GeForce RTX 2060.
Full traceback:
Unable to start CUDA Context
Traceback (most recent call last):
File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/initialize.py", line 42, in _create_cuda_context
ctx = has_cuda_context()
File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 76, in has_cuda_context
handle = pynvml.nvmlDeviceGetHandleByIndex(index)
File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
_nvmlCheckReturn(ret)
File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42977 instead
warnings.warn(
---------------------------------------------------------------------------
NVMLError_Unknown Traceback (most recent call last)
/tmp/ipykernel_17853/3542590344.py in <module>
2 from dask.distributed import Client
3
----> 4 cluster = LocalCUDACluster()
5 client = Client(cluster)
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, jit_unspill, log_spilling, worker_class, **kwargs)
344 )
345
--> 346 super().__init__(
347 n_workers=0,
348 threads_per_worker=threads_per_worker,
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/local.py in __init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
234 workers = {i: worker for i in range(n_workers)}
235
--> 236 super().__init__(
237 name=name,
238 scheduler=scheduler,
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
281 if not self.asynchronous:
282 self._loop_runner.start()
--> 283 self.sync(self._start)
284 self.sync(self._correct_state)
285
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
212 return future
213 else:
--> 214 return sync(self.loop, func, *args, **kwargs)
215
216 def _log(self, log):
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
324 if error[0]:
325 typ, exc, tb = error[0]
--> 326 raise exc.with_traceback(tb)
327 else:
328 return result[0]
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in f()
307 if callback_timeout is not None:
308 future = asyncio.wait_for(future, callback_timeout)
--> 309 result[0] = yield future
310 except Exception:
311 error[0] = sys.exc_info()
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in _start(self)
309 if isinstance(cls, str):
310 cls = import_term(cls)
--> 311 self.scheduler = cls(**self.scheduler_spec.get("options", {}))
312 self.scheduler = await self.scheduler
313 self.scheduler_comm = rpc(
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, loop, delete_interval, synchronize_worker_interval, services, service_kwargs, allowed_failures, extensions, validate, scheduler_file, security, worker_ttl, idle_timeout, interface, host, port, protocol, dashboard_address, dashboard, http_prefix, preload, preload_argv, plugins, **kwargs)
3797 connection_limit = get_fileno_limit() / 2
3798
-> 3799 super().__init__(
3800 aliases=aliases,
3801 handlers=self.handlers,
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, aliases, clients, workers, host_info, resources, tasks, unrunnable, validate, **kwargs)
1978 self._transition_counter = 0
1979
-> 1980 super().__init__(**kwargs)
1981
1982 @property
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/core.py in __init__(self, handlers, blocked_handlers, stream_handlers, connection_limit, deserialize, serializers, deserializers, connection_args, timeout, io_loop)
158 self._comms = {}
159 self.deserialize = deserialize
--> 160 self.monitor = SystemMonitor()
161 self.counters = None
162 self.digests = None
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/system_monitor.py in __init__(self, n)
57
58 if nvml.device_get_count() > 0:
---> 59 gpu_extra = nvml.one_time()
60 self.gpu_name = gpu_extra["name"]
61 self.gpu_memory_total = gpu_extra["memory-total"]
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in one_time()
91
92 def one_time():
---> 93 h = _pynvml_handles()
94 return {
95 "memory-total": pynvml.nvmlDeviceGetMemoryInfo(h).total,
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in _pynvml_handles()
61 cuda_visible_devices = list(range(count))
62 gpu_idx = cuda_visible_devices[0]
---> 63 return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
64
65
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetHandleByIndex(index)
1574 fn = _nvmlGetFunctionPointer("nvmlDeviceGetHandleByIndex_v2")
1575 ret = fn(c_index, byref(device))
-> 1576 _nvmlCheckReturn(ret)
1577 return device
1578
~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
741 def _nvmlCheckReturn(ret):
742 if (ret != NVML_SUCCESS):
--> 743 raise NVMLError(ret)
744 return ret
745
NVMLError_Unknown: Unknown Error
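The traceback bottoms out in pynvml's nvmlDeviceGetHandleByIndex, so the failure can be reproduced without dask-cuda or distributed at all. A minimal isolation sketch (not from the original report, assuming pynvml is importable from the same conda environment):

import pynvml

pynvml.nvmlInit()
print("device count:", pynvml.nvmlDeviceGetCount())
# This is the call that raises NVMLError_Unknown in the traceback above; if it
# also fails here, the problem is NVML on WSL2 rather than dask-cuda itself.
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("name:", pynvml.nvmlDeviceGetName(handle))
print("total memory:", pynvml.nvmlDeviceGetMemoryInfo(handle).total)
pynvml.nvmlShutdown()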
One thing I noticed is that despite installing CUDA 11.2 from here, nvidia-smi reports that I am on CUDA 11.6:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00 Driver Version: 510.06 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 47C P5 20W / N/A | 164MiB / 6144MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
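On the version mismatch: the CUDA version printed by nvidia-smi is the CUDA API version supported by the driver, while cudatoolkit=11.2 in the conda environment is a separate runtime, so seeing 11.6 here alongside an 11.2 toolkit is expected on its own. A small sketch (not from the original report) that queries the driver-side versions directly through pynvml:

import pynvml

pynvml.nvmlInit()
# nvidia-smi's "CUDA Version" comes from the driver, not from the conda package
print("driver version:", pynvml.nvmlSystemGetDriverVersion())
print("CUDA driver API:", pynvml.nvmlSystemGetCudaDriverVersion())  # e.g. 11060 -> 11.6
pynvml.nvmlShutdown()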
Thanks @pentschev for helping to resolve this. While we don't know the root cause, I'm going to close this for now while we continue working on better WSL2 support.
@quasiben @pentschev I drilled down a bit and traced it to this: https://github.com/rapidsai/cudf/issues/9955