Unable to use GPU-accelerated Optimum ONNX transformer model for inference

See original GitHub issue

System Info

Optimum version: 1.5.0
Platform: Ubuntu 20.04 (Linux)
Python version: 3.8

Who can help?

@JingyaHuang @echarlaix When following the documentation at https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/gpu with Optimum 1.5.0, we get the following error:


RuntimeError                              Traceback (most recent call last)
<ipython-input-7-8429fcab1e09> in <module>
     19     "education",
     20     "music"]
---> 21 pred = onnx_z0(sequence_to_classify, candidate_labels, multi_class=False)

8 frames
/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in bind_input(self, name, device_type, device_id, element_type, shape, buffer_ptr)
    454         :param buffer_ptr: memory pointer to input data
    455         """
--> 456         self._iobinding.bind_input(
    457             name,
    458             C.OrtDevice(

RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]

This is reproducible on a Google Colab GPU instance as well. The error appears starting with version 1.5.0; 1.4.1 works as expected.
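Until the bug is fixed, two workarounds follow from the report: pin optimum to 1.4.1, or disable the IO binding path that raises the error. A minimal sketch, assuming the use_io_binding keyword argument is exposed by from_pretrained in the 1.5.x API (an assumption, not confirmed in this issue):

!pip install optimum[onnxruntime-gpu]==1.4.1  # workaround A: pin the last known-good version

from optimum.onnxruntime import ORTModelForSequenceClassification

# Workaround B: keep 1.5.x but skip IO binding, which is the code path
# that raises the bind_input error above (use_io_binding is assumed here).
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "philschmid/tiny-bert-sst2-distilled",
    from_transformers=True,
    provider="CUDAExecutionProvider",
    use_io_binding=False,
)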

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

!pip install optimum[onnxruntime-gpu]==1.5.1
!pip install transformers onnx

from optimum.onnxruntime import ORTModelForSequenceClassification

ort_model = ORTModelForSequenceClassification.from_pretrained(
    "philschmid/tiny-bert-sst2-distilled",
    from_transformers=True,
    provider="CUDAExecutionProvider",
)

from optimum.pipelines import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/tiny-bert-sst2-distilled")

pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer)
result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
print(result)
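As a quick sanity check (not part of the original report), it can help to confirm that onnxruntime-gpu is installed and that the session actually loaded the CUDA provider; that ort_model.model holds the underlying InferenceSession is an assumption about Optimum's internals:

import onnxruntime

print(onnxruntime.get_available_providers())  # should include "CUDAExecutionProvider" on a GPU install
print(ort_model.model.get_providers())        # providers the loaded session is actually using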

Expected behavior

Inference should run on the GPU without error; instead it fails with the device-binding error shown above.

Issue Analytics

  • State: closed
  • Created: 9 months ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
fxmarty commented, Dec 13, 2022

For sure, thanks a lot! Don’t hesitate if you need any guidance!

0 reactions
fxmarty commented, Dec 20, 2022

@smiraldr As I understand it, this was in fact a device indexing issue, which @JingyaHuang fixed in https://github.com/huggingface/optimum/pull/613. So your PR looks good as is; moving the discussion there!
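For context on what a device indexing issue means here: ONNX Runtime's IOBinding expects the device_id passed to bind_input to match the device the input tensor actually lives on; a mismatch makes ORT attempt a copy between OrtDevices with no registered transfer, which is the RuntimeError above. A minimal standalone sketch (the model path and the input/output names input_ids/logits are illustrative, not taken from this issue):

import numpy as np
import torch
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = torch.ones((1, 16), dtype=torch.int64, device="cuda:0")
binding.bind_input(
    name="input_ids",                  # illustrative input name
    device_type="cuda",
    device_id=input_ids.device.index,  # must match the tensor's real device index
    element_type=np.int64,
    shape=tuple(input_ids.shape),
    buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output("logits", device_type="cuda", device_id=0)  # illustrative output name
session.run_with_iobinding(binding)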


