Failed to create cublas handle

See original GitHub issue

Description

This code appears to be correct and runs fine in Colab:

https://colab.research.google.com/drive/1b3XnflgL1yttHA5cOFKb3uSHPcHU64Hv?usp=share_link

But on an HP Victus laptop with an Intel Core CPU and an RTX 3050 GPU, it fails with this error message:

2022-12-03 10:29:49.497205: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:219] failed to create cublas handle: cublas error
2022-12-03 10:29:49.497835: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:221] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2022-12-03 10:29:49.498309: E external/org_tensorflow/tensorflow/compiler/xla/status_macros.cc:57] INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:327) stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms) 
*** Begin stack trace ***
	... (unnamed native frames)
	_PyObject_MakeTpCall
	... (unnamed native frame)
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	... (unnamed native frames)
	_PyObject_MakeTpCall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	_PyEval_EvalFrameDefault
	... (unnamed native frame)
	PyEval_EvalCode
	... (unnamed native frames)
	_PyRun_SimpleFileObject
	_PyRun_AnyFileObject
	Py_RunMain
	Py_BytesMain
	... (unnamed native frame)
	__libc_start_main
	_start
*** End stack trace ***

Traceback (most recent call last):
  File "/home/reza/jjj3.py", line 17, in <module>
    yy = (xx(3,4,5))
  File "/home/reza/jjj3.py", line 15, in xx
    return (A,B, jax.numpy.matmul(A,B))
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/api.py", line 622, in cache_miss
    execute = dispatch._xla_call_impl_lazy(fun_, *tracers, **params)
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 236, in _xla_call_impl_lazy
    return xla_callable(fun, device, backend, name, donated_invars, keep_unused,
  File "/home/reza/.local/lib/python3.10/site-packages/jax/linear_util.py", line 303, in memoized_fun
    ans = call(fun, *args)
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 360, in _xla_callable_uncached
    keep_unused, *arg_specs).compile().unsafe_call
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 996, in compile
    self._executable = XlaCompiledComputation.from_xla_computation(
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 1194, in from_xla_computation
    compiled = compile_or_get_cached(backend, xla_computation, options,
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 1077, in compile_or_get_cached
    return backend_compile(backend, serialized_computation, compile_options,
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/profiler.py", line 314, in wrapper
    return func(*args, **kwargs)
  File "/home/reza/.local/lib/python3.10/site-packages/jax/_src/dispatch.py", line 1012, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:327) stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms)

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/reza/jjj3.py", line 17, in <module>
    yy = (xx(3,4,5))
  File "/home/reza/jjj3.py", line 15, in xx
    return (A,B, jax.numpy.matmul(A,B))
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:327) stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms) 

If you replace the random matrices with ones, it works. Generating the random matrices on its own also seems to work. It is only the matmul on the random matrices that fails.
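
For reference, a minimal reproducer along these lines (reconstructed from the traceback below; the function name, shapes, PRNG usage, and the jit decoration are my guesses, not the exact original script):

from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.jit, static_argnums=(0, 1, 2))
def xx(m, n, k):
    # Two random matrices; the jnp.matmul hits the cuBLAS GEMM path that
    # fails with "failed to create cublas handle" on this machine.
    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    A = jax.random.normal(k1, (m, n))
    B = jax.random.normal(k2, (n, k))
    return (A, B, jnp.matmul(A, B))

yy = xx(3, 4, 5)  # fails as above; with jnp.ones instead of random matrices it works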

I have installed jax with

pip install --upgrade pip

pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

and CUDA 11.8 via the .deb install from NVIDIA.

The CUDA matmul sample code works: it multiplies ~3000x3000 matrices with no trouble.

I also tried explicitly writing “import jax.numpy as jnp” and still got the error.

What’s the problem?! Do I basically have to compile jax for my machine?
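
A quick sanity check (my suggestion, not part of the original report) is to confirm that the CUDA build of jaxlib is the one being imported and that JAX sees the GPU at all:

import jax
import jaxlib

print(jax.__version__, jaxlib.__version__)
print(jax.default_backend())  # should print "gpu" on a working CUDA install
print(jax.devices())          # should include a GPU device, not only CpuDevice entries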

What jax/jaxlib version are you using?

jax 0.3.25, jaxlib 0.3.25

Which accelerator(s) are you using?

RTX 3050 CUDA

Additional system info

Python 3.10.6, Ubuntu 22.04 (latest updates), CUDA 11.8

NVIDIA GPU info

reza@HP:~$ nvidia-smi 
Sat Dec  3 10:19:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   28C    P0    N/A /  N/A |      5MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2014      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
reza@HP:~$ 

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
shaneacton commented, Dec 12, 2022

@RezaRob I fixed the issue on my side. First I downgraded to jax==0.3.22 following @tanmoyio; this didn’t solve the error but changed it to jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Attempting to perform BLAS operation using StreamExecutor without BLAS support. Googling that led me to the actual fix, which was to set gpu_options.allow_growth = True.

Full code:

import tensorflow as tf
print("executing TF bug workaround")
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.8))
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(session)

This needs to be executed at the start of your program; it is a common TF bug workaround.
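
For a pure JAX program (no TensorFlow in the picture), the analogous approach — my addition, not part of shaneacton’s comment — is to change XLA’s preallocation behaviour through the environment variables JAX reads at startup, so that cuBLAS still has free memory when it creates its handle:

import os
# Must be set before jax is imported (or exported in the shell).
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # don't preallocate most of GPU memory up front
# or, instead, cap the preallocated fraction:
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.8"

import jax
import jax.numpy as jnp

print(jnp.matmul(jnp.ones((3, 4)), jnp.ones((4, 5))).shape)

Whether this fixes the RTX 3050 case in this issue I can’t confirm, but it targets the same OOM-at-cuBLAS-initialization failure mode that the error message describes.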

1 reaction
shaneacton commented, Dec 12, 2022

Jax and TF should host a masterclass in bad error reporting
