The exported Mask-RCNN onnx model is not correct
Bugs in the exported ONNX model
Based on the comments, I was under the impression that Mask R-CNN to ONNX export had already been pipe-cleaned. I also saw that torchvision has test code here that covers different parts of Mask R-CNN. But when I try exporting the model myself, the resulting ONNX model only “works” on the image that I used to export it. If I run it on an image with a different size, ONNX Runtime fails with errors like this:
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/pytoch_onnx/tests/test_onnx_rpn_filter_proposals.py", line 156, in test_rpn_head_anchor_generator_filter_proposal
"feat_pool" : features_g["pool"]
File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,179046} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
----------------------------------------------------------------------
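The root cause (detailed in “More information” below) is that several shape-dependent values get frozen into the graph at export time. Here is a pure-Python sketch of the effect; the numbers are hypothetical, chosen only to illustrate how trace-time constants go stale:

```python
def compute_strides(image_size, grid_sizes):
    """Eager mode: strides are recomputed for every input image."""
    return [image_size // g for g in grid_sizes]

# Tracing replays recorded tensor operations. Plain Python ints computed
# during the trace are not recorded as operations, so they are frozen
# into the graph as constants:
FROZEN = compute_strides(800, [100, 50, 25])  # -> [8, 16, 32]

def traced_compute_strides(image_size, grid_sizes):
    # The arguments are ignored: the trace-time result was baked in.
    return FROZEN

# Eager mode adapts to a new image size; the "traced" version does not:
assert compute_strides(1000, [125, 63, 31]) == [8, 15, 32]
assert traced_compute_strides(1000, [125, 63, 31]) == [8, 16, 32]
```

The same pattern (a Python int or list of ints computed once at trace time) is behind each of the concrete failures reported below.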
Environment
- Python 3.7.4
- torchvision==0.5.0a0+07cbb46 (built from scratch)
- torch==1.4.0 (downloaded from here)
- onnx==1.6.0
- onnxruntime==1.1.0
- Base image: nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04
Reproduction
Code
The script I used to test the mrcnn onnx export: gist
Steps
1. Put these 2 files under the reference/detection directory
2. Run python test_onnx_export.py
Error Message
You will see that when we validate the ONNX model on the COCO validation set with ONNX Runtime, only the first image passes. The second image fails with an error like the one below:
Score batch 0 start
Score batch 0 finish
Test: [ 0/5000] eta: 9:22:21 model_time: 6.4258 (6.4258) evaluator_time: 0.1232 (0.1232) time: 6.7483 data: 0.1986
Score batch 1 start
2020-01-13 04:59:12.794733672 [E:onnxruntime:, sequential_executor.cc:183 Execute] Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
Traceback (most recent call last):
....
File "/pytoch_onnx/engine.py", line 150, in evaluate_onnx
ort_output = ort_session.run(None, ort_image)
File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
More information
Given that Mask R-CNN’s input images come in different sizes, I believe more errors like this are hiding in the exported model. Here is what I found when I tried to separate the RPN (region proposal network) module out and debug the export process:
- The `strides` in the anchor generator’s forward function are created as a list of Python ints, so during export they are traced as constants in the ONNX graph. But these values should vary from image to image.
- The `image_size` is used to clip the bboxes, but the log shows that ONNX export also treats this value as constant:
%1106 : Float(),
%1107 : Float(),
%1108 : Float(),
%1109 : Float()):
...
%977 : Float(4741, 2) = onnx::Clip(%967, %1106, %1107) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:115:0
%982 : Float(4741, 2) = onnx::Clip(%972, %1108, %1109) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:116:0
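To illustrate the `image_size` problem, here is a pure-Python model of the clipping step; the export-time size (800×1066) is hypothetical, not taken from the log above:

```python
def clip_boxes(boxes, image_size):
    """Eager-mode semantics: clip [x1, y1, x2, y2] boxes to the image."""
    h, w = image_size
    return [
        (min(max(x1, 0), w), min(max(y1, 0), h),
         min(max(x2, 0), w), min(max(y2, 0), h))
        for x1, y1, x2, y2 in boxes
    ]

# At export time the tracer saw one image, so its height and width end up
# as the constant min/max bounds of the exported Clip nodes:
EXPORT_H, EXPORT_W = 800, 1066  # hypothetical export-time image size

def traced_clip_boxes(boxes, image_size):
    # image_size is ignored: the Clip bounds were baked into the graph.
    return clip_boxes(boxes, (EXPORT_H, EXPORT_W))

# A box inside a larger image is wrongly clipped to the export-time size:
boxes = [(0, 0, 1200, 900)]
assert clip_boxes(boxes, (1000, 1333)) == [(0, 0, 1200, 900)]
assert traced_clip_boxes(boxes, (1000, 1333)) == [(0, 0, 1066, 800)]
```

So on a larger image the exported model silently clips proposals to the wrong bounds instead of raising an error, which is arguably worse than the Split failure below.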
- The `num_anchors_per_level` in the RPN forward is also traced as a set of constants. The exported Split op uses `num_anchors_per_level` as its attribute, and this is the root cause of the “Split” error above. `num_anchors_per_level` should not be a constant: different image sizes produce different values.
%ob.1 : Float(1, 182400), %ob.2 : Float(1, 45600), %ob.3 : Float(1, 11400), %ob.4 : Float(1, 2850), %ob : Float(1, 741) = onnx::Split[axis=1, split=[182400, 45600, 11400, 2850, 741]](%811)
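The failure can be reproduced with a pure-Python model of the ONNX Split op’s fixed `split` attribute, using the per-level sizes from the exported node above:

```python
def onnx_split(values, split_sizes):
    """Minimal model of the ONNX Split op with a fixed 'split' attribute."""
    if sum(split_sizes) != len(values):
        raise ValueError(
            f"Cannot split using values in 'split' attribute: sum of sizes "
            f"({sum(split_sizes)}) != size of selected axis ({len(values)})"
        )
    out, start = [], 0
    for s in split_sizes:
        out.append(values[start:start + s])
        start += s
    return out

# The exporter baked in the per-level anchor counts of the export image:
BAKED_SPLIT = [182400, 45600, 11400, 2850, 741]  # sums to 242991

# The export-time image still works...
assert len(onnx_split(list(range(242991)), BAKED_SPLIT)) == 5

# ...but a new image yields a different total anchor count (e.g. 340176,
# as in the error above), so the baked split no longer covers the axis
# and the runtime raises:
try:
    onnx_split(list(range(340176)), BAKED_SPLIT)
except ValueError:
    pass  # mirrors the ONNX Runtime "Cannot split" failure
```

This matches the runtime error exactly: the `split` attribute sums to 242991 while the new image produces an axis of size 340176.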
It seems to me that we are still some way from a working Mask R-CNN ONNX export; a few more bugs are hiding down there. Are there plans to fix these issues?
Issue Analytics
- Created 4 years ago
- Comments: 11 (6 by maintainers)
Hi, @fmassa
I have created an initial PR for the mrcnn’s region proposal network. See #1749.
@jinh574 this is still being worked on