The exported Mask-RCNN onnx model is not correct
Bugs in the exported ONNX model
Based on the comments, I was under the impression that Mask R-CNN to ONNX export had already been pipe-cleaned. I also saw that torchvision has test code here that covers different parts of Mask R-CNN. But when I try exporting the model myself, the resulting ONNX model only “works” on the image that I used to export it. If I run it on an image with a different size, ONNX Runtime fails with errors like this:
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/pytoch_onnx/tests/test_onnx_rpn_filter_proposals.py", line 156, in test_rpn_head_anchor_generator_filter_proposal
"feat_pool" : features_g["pool"]
File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,179046} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
----------------------------------------------------------------------
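The root cause (detailed in “More information” below) is that several shape-dependent values get frozen into the graph at export time. Here is a pure-Python sketch of the effect; the numbers are hypothetical, chosen only to illustrate how trace-time constants go stale:

```python
def compute_strides(image_size, grid_sizes):
    """Eager mode: strides are recomputed for every input image."""
    return [image_size // g for g in grid_sizes]

# Tracing replays recorded tensor operations. Plain Python ints computed
# during the trace are not recorded as operations, so they are frozen
# into the graph as constants:
FROZEN = compute_strides(800, [100, 50, 25])  # -> [8, 16, 32]

def traced_compute_strides(image_size, grid_sizes):
    # The arguments are ignored: the trace-time result was baked in.
    return FROZEN

# Eager mode adapts to a new image size; the "traced" version does not:
assert compute_strides(1000, [125, 63, 31]) == [8, 15, 32]
assert traced_compute_strides(1000, [125, 63, 31]) == [8, 16, 32]
```

The same pattern (a Python int or list of ints computed once at trace time) is behind each of the concrete failures reported below.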
Environment
- Python 3.7.4
- torchvision==0.5.0a0+07cbb46 (built from scratch)
- torch==1.4.0 (downloaded from here)
- onnx==1.6.0
- onnxruntime==1.1.0
- Base image: nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04
Reproduction
Code
The script I used to test the mrcnn onnx export: gist
Steps
1. Put these 2 files under the reference/detection directory
2. Run python test_onnx_export.py
Error Message
You will see that when we validate the ONNX model on the COCO validation set with ONNX Runtime, only the first image passes. The second image fails with an error like the one below:
Score batch 0 start
Score batch 0 finish
Test: [ 0/5000] eta: 9:22:21 model_time: 6.4258 (6.4258) evaluator_time: 0.1232 (0.1232) time: 6.7483 data: 0.1986
Score batch 1 start
2020-01-13 04:59:12.794733672 [E:onnxruntime:, sequential_executor.cc:183 Execute] Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
Traceback (most recent call last):
....
File "/pytoch_onnx/engine.py", line 150, in evaluate_onnx
ort_output = ort_session.run(None, ort_image)
File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
More information
Given that Mask R-CNN’s input images come in different sizes, I believe more errors like this are hiding in the exported model. Here is what I found when I tried to separate the RPN (region proposal network) module out and debug the export process:
- The `strides` in the anchor generator’s forward function are created as a list of Python ints, so during export they are traced as constants in the ONNX graph. But these values should vary from image to image.
- The `image_size` is used to clip the bboxes, but the log shows that ONNX export also treats this value as constant:
%1106 : Float(),
%1107 : Float(),
%1108 : Float(),
%1109 : Float()):
...
%977 : Float(4741, 2) = onnx::Clip(%967, %1106, %1107) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:115:0
%982 : Float(4741, 2) = onnx::Clip(%972, %1108, %1109) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:116:0
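To illustrate the `image_size` problem, here is a pure-Python model of the clipping step; the export-time size (800×1066) is hypothetical, not taken from the log above:

```python
def clip_boxes(boxes, image_size):
    """Eager-mode semantics: clip [x1, y1, x2, y2] boxes to the image."""
    h, w = image_size
    return [
        (min(max(x1, 0), w), min(max(y1, 0), h),
         min(max(x2, 0), w), min(max(y2, 0), h))
        for x1, y1, x2, y2 in boxes
    ]

# At export time the tracer saw one image, so its height and width end up
# as the constant min/max bounds of the exported Clip nodes:
EXPORT_H, EXPORT_W = 800, 1066  # hypothetical export-time image size

def traced_clip_boxes(boxes, image_size):
    # image_size is ignored: the Clip bounds were baked into the graph.
    return clip_boxes(boxes, (EXPORT_H, EXPORT_W))

# A box inside a larger image is wrongly clipped to the export-time size:
boxes = [(0, 0, 1200, 900)]
assert clip_boxes(boxes, (1000, 1333)) == [(0, 0, 1200, 900)]
assert traced_clip_boxes(boxes, (1000, 1333)) == [(0, 0, 1066, 800)]
```

So on a larger image the exported model silently clips proposals to the wrong bounds instead of raising an error, which is arguably worse than the Split failure below.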
- The `num_anchors_per_level` in the RPN forward is also traced as a set of constants. The exported Split op uses `num_anchors_per_level` as its attribute, and this is the root cause of the “Split” error above. `num_anchors_per_level` should not be a constant: different image sizes produce different values.
%ob.1 : Float(1, 182400), %ob.2 : Float(1, 45600), %ob.3 : Float(1, 11400), %ob.4 : Float(1, 2850), %ob : Float(1, 741) = onnx::Split[axis=1, split=[182400, 45600, 11400, 2850, 741]](%811)
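The failure can be reproduced with a pure-Python model of the ONNX Split op’s fixed `split` attribute, using the per-level sizes from the exported node above:

```python
def onnx_split(values, split_sizes):
    """Minimal model of the ONNX Split op with a fixed 'split' attribute."""
    if sum(split_sizes) != len(values):
        raise ValueError(
            f"Cannot split using values in 'split' attribute: sum of sizes "
            f"({sum(split_sizes)}) != size of selected axis ({len(values)})"
        )
    out, start = [], 0
    for s in split_sizes:
        out.append(values[start:start + s])
        start += s
    return out

# The exporter baked in the per-level anchor counts of the export image:
BAKED_SPLIT = [182400, 45600, 11400, 2850, 741]  # sums to 242991

# The export-time image still works...
assert len(onnx_split(list(range(242991)), BAKED_SPLIT)) == 5

# ...but a new image yields a different total anchor count (e.g. 340176,
# as in the error above), so the baked split no longer covers the axis
# and the runtime raises:
try:
    onnx_split(list(range(340176)), BAKED_SPLIT)
except ValueError:
    pass  # mirrors the ONNX Runtime "Cannot split" failure
```

This matches the runtime error exactly: the `split` attribute sums to 242991 while the new image produces an axis of size 340176.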
It seems to me that we are still some way from a working Mask R-CNN ONNX export; a few more bugs are hiding down there. Are there plans to fix these issues?
Issue Analytics
- Created 4 years ago
- Comments: 11 (6 by maintainers)
Hi, @fmassa
I have created an initial PR for the mrcnn’s region proposal network. See #1749.
@jinh574 this is still being worked on