ONNX test fail on Linux/Mac with torch==1.10 #556
Comments
But also fails for
In our test in
Why are we suddenly getting hit with an indices-vs-data mismatch inside the ONNX C++ Gather code? This seems to indicate that something about the shape of the tensor changed, which looks vaguely related to the breaking change in the new torch, but we don't use
Flipping back to torch==1.9.1 (which runs beautifully), input tensors to the function call on
Thoughts @interesaaat ?
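The error above is what ONNX's Gather kernel raises when an index value is at least as large as the gathered axis. NumPy's `take` performs the same bounds check, so it makes a handy stand-in sketch of the failure mode (NumPy here is purely illustrative; it is not what the runtime executes):

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0])  # gather axis has size 3

# In-range indices behave like ONNX Gather along axis 0.
print(np.take(data, [0, 2]))  # [10. 30.]

# An index >= the axis size triggers the same kind of
# "indices out of range" failure the ONNX runtime reports.
try:
    np.take(data, [3])
except IndexError as e:
    print("out of bounds:", e)
```

If the new torch changed a tensor's shape (say, from `[n]` to `[1]`), previously valid indices would suddenly "overflow the data" exactly like this.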
After a fun day of digging, it seems this issue is due to symbolic dimensions in onnxruntime not working in our test case (_topology:279) with the new torch. Maybe it's related to pytorch/pytorch#64642, which mentions zero-dim tensors; maybe not. For now, we'll use static dimensions in our tests to get unblocked.
Ok, the PR as-is currently fixes the dimensions problem shown above. Now all of the tests pass... but never return. The NEW problem (also described in PR #554) is that the ONNX tests involving strings hang indefinitely after printing the "all tests pass" message. For example:

kasaur@p100-2:~/hummingbird$ gdb python
(gdb) run tests/test_onnxml_label_encoder_converter.py
Starting program: /usr/bin/python tests/test_onnxml_label_encoder_converter.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff35e9700 (LWP 9780)]
[New Thread 0x7ffff2de8700 (LWP 9781)]
[New Thread 0x7fffee5e7700 (LWP 9782)]
[New Thread 0x7fffebde6700 (LWP 9783)]
[New Thread 0x7fffeb5e5700 (LWP 9784)]
[Thread 0x7fffeb5e5700 (LWP 9784) exited]
[Thread 0x7fffebde6700 (LWP 9783) exited]
[Thread 0x7fffee5e7700 (LWP 9782) exited]
[Thread 0x7ffff2de8700 (LWP 9781) exited]
[Thread 0x7ffff35e9700 (LWP 9780) exited]
[New Thread 0x7fffeb5e5700 (LWP 9862)]
[New Thread 0x7fffebde6700 (LWP 9863)]
[New Thread 0x7fffee5e7700 (LWP 9864)]
[New Thread 0x7ffff2de8700 (LWP 9865)]
[New Thread 0x7fff6cf4d700 (LWP 9866)]
[New Thread 0x7fff6c74c700 (LWP 9867)]
[New Thread 0x7fff6bf4b700 (LWP 9868)]
[New Thread 0x7fff6b74a700 (LWP 9869)]
[New Thread 0x7fff6af49700 (LWP 9870)]
[New Thread 0x7fff6a748700 (LWP 9871)]
[Thread 0x7fffebde6700 (LWP 9863) exited]
[Thread 0x7fffeb5e5700 (LWP 9862) exited]
[Thread 0x7ffff2de8700 (LWP 9865) exited]
[Thread 0x7fffee5e7700 (LWP 9864) exited]
[Thread 0x7fff6cf4d700 (LWP 9866) exited]
[New Thread 0x7fff6cf4d700 (LWP 9872)]
[New Thread 0x7ffff2de8700 (LWP 9873)]
[New Thread 0x7fffee5e7700 (LWP 9874)]
[New Thread 0x7fffebde6700 (LWP 9875)]
[New Thread 0x7fff69f47700 (LWP 9876)]
[New Thread 0x7fff69746700 (LWP 9877)]
[New Thread 0x7fff68f45700 (LWP 9878)]
[Thread 0x7fffee5e7700 (LWP 9874) exited]
[Thread 0x7fff69f47700 (LWP 9876) exited]
[Thread 0x7fffebde6700 (LWP 9875) exited]
[Thread 0x7ffff2de8700 (LWP 9873) exited]
[Thread 0x7fff6cf4d700 (LWP 9872) exited]
[Thread 0x7fff68f45700 (LWP 9878) exited]
[Thread 0x7fff69746700 (LWP 9877) exited]
.
----------------------------------------------------------------------
Ran 1 test in 0.190s
OK

At this point it stalls indefinitely (as so beautifully demonstrated in our pipelines!), so I hit
Starting with line 6 (
The unanswered question is: why does this line of code pass just fine for scikit-learn but stall forever with ONNX? I verified that the very same tensor is created in both the ONNX and SKL tests.
Trying to sort out if it's only with
Looking quickly at two test files (
SKL: finishes all
ONNX hangs:
ONNX finishes:
Turns out, in _label_encoder_implementations.py#L44, changing:

self.condition_tensors = torch.nn.Parameter(torch.IntTensor((classes_conv)), requires_grad=False)

to

from copy import deepcopy
self.condition_tensors = torch.nn.Parameter(torch.IntTensor(deepcopy(classes_conv)), requires_grad=False)

works! Which... yay! But also... yikes! This seems bad. Also, it does not explain why this is only necessary in the ONNX code path but not SKL.
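The only thing `deepcopy` changes here is aliasing: the tensor is built from a container that shares nothing with `classes_conv`, so later mutation of the original cannot touch it. A stdlib-only sketch of that guarantee (the variable names just mirror the fix above; why the ONNX path needs this remains unexplained):

```python
from copy import deepcopy

classes_conv = [3, 1, 2]           # stand-in for the real class list
snapshot = deepcopy(classes_conv)  # what the fix feeds to torch.IntTensor

# Mutating the original no longer affects the copy, so whatever later
# touches classes_conv cannot corrupt the tensor's source data.
classes_conv[0] = 99
print(snapshot)  # [3, 1, 2]
```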
Ok, using best tensor-copying practices (ex: this commit), it's passing. It's not clear how the ONNX session uses the tensor in a way that requires this copy (versus SKL, which doesn't), but fortunately this extra clone operation is on the convert path, not the forward (inference-time) path.
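The property the commit relies on is that `clone()` allocates fresh storage, so the resulting `Parameter` cannot alias whatever buffer the source tensor came from. A minimal sketch, assuming torch is available (the variable names are illustrative, not the library's actual code); since the clone happens once at convert time, inference cost is unchanged:

```python
import torch

classes = torch.tensor([0, 1, 2], dtype=torch.int32)

# clone() copies into new storage, so the Parameter built from it is
# independent of the original buffer. requires_grad=False is needed
# because integer tensors cannot carry gradients.
param = torch.nn.Parameter(classes.clone(), requires_grad=False)

classes[0] = 99         # mutate the original buffer
print(param[0].item())  # 0 -- the cloned parameter is unaffected
```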
Four of our ONNX tests fail when we move from torch==1.8 to torch==1.10 with onnx on Linux/Mac. Windows passes. (All three OSes use the same ONNX versions.) This blocks PR #554.
See build run 4451697032
It seems index values overflow the data. Example:
Nothing immediately obvious in the pytorch release log, although there are some breaking changes.