ONNX tests fail on Linux/Mac with torch==1.10 #556

Closed
ksaur opened this issue Dec 8, 2021 · 7 comments · Fixed by #554

Comments


ksaur commented Dec 8, 2021

Four of our ONNX tests fail when we move from torch==1.8 to torch==1.10 on Linux/Mac. Windows passes. (All three OSes use the same ONNX version.)

This blocks PR #554.

See build run 4451697032

=========================== short test summary info ============================
FAILED tests/test_extra_conf.py::TestExtraConf::test_onnx_classification_batch
FAILED tests/test_extra_conf.py::TestExtraConf::test_onnx_iforest_batch - onn...
FAILED tests/test_extra_conf.py::TestExtraConf::test_onnx_regression_batch - ...
FAILED tests/test_extra_conf.py::TestExtraConf::test_pandas_batch_onnx - onnx...
==== 4 failed, 593 passed, 12 skipped, 168992 warnings in 749.19s (0:12:29) ====

It looks like index values are going out of bounds of the data. Example:

self = <onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x149afadd0>
output_names = ['variable']
input_feed = {'input_0': array([[0.39217296, 0.04115659, 0.92330056, 0.40623498, 0.9442822 ,
        0.7227245 , 0.91831994, 0.8232...14, 0.8295387 , 0.15067951,
        0.19215123, 0.01701861, 0.98203224, 0.8742442 , 0.24865562]],
      dtype=float32)}
run_options = None

    def run(self, output_names, input_feed, run_options=None):
        """
        Compute the predictions.
    
        :param output_names: name of the outputs
        :param input_feed: dictionary ``{ input_name: input_value }``
        :param run_options: See :class:`onnxruntime.RunOptions`.
    
        ::
    
            sess.run([output_name], {input_name: x})
        """
        num_required_inputs = len(self._inputs_meta)
        num_inputs = len(input_feed)
        # the graph may have optional inputs used to override initializers. allow for that.
        if num_inputs < num_required_inputs:
            raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
        if not output_names:
            output_names = [output.name for output in self._outputs_meta]
        try:
>           return self._sess.run(output_names, input_feed, run_options)
E           onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GatherElements node. Name:'GatherElements_11' Status Message: GatherElements op: 'indices' shape should have values within bounds of 'data' shape. Invalid value in indices shape is: 10

Nothing is immediately obvious in the pytorch release log, although there are some breaking changes.
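
(For anyone reproducing this: a quick way to see what shapes the exported model actually expects, versus the batch fed to run(), is to inspect the session metadata. A minimal sketch, assuming a saved model file; the path is a placeholder.)

import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # placeholder path
for inp in sess.get_inputs():
    # A symbolic batch dimension shows up as a string (e.g. "batch_size"),
    # a static one as a plain int (e.g. 10).
    print(inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print(out.name, out.shape, out.type)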


ksaur commented Dec 8, 2021

pytest test_extra_conf.py -k test_onnx_classification_batch passes for torch==1.9.1.

But it also fails with the nightly build (https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html).


ksaur commented Dec 8, 2021

In our test test_extra_conf:test_onnx_classification_batch, when we call hb.model.predict:

in batch_container.py:119, we reach i=10, with start=100 and end=103 (total_size is 103; iterations is 11).

  • Then we hit line 127 (i=10 and iterations - 1 = 10), so we call the remainder function with the last 3 rows, all as expected.

  • That's where we die (in skl_api_containers:222, named_inputs is [0:3], shape (3, 200)). But somewhere it still expects 10, like the previous batches ("Invalid value in indices shape is: 10").

Why are we suddenly hitting an indices-vs-data shape mismatch inside the ONNX C++ gather code? This suggests that something about the shape of the tensor changed, which looks vaguely related to the breaking change in the new torch, but we don't use torch.any/torch.all on this path, so I am not sure.

Flipping back to torch==1.9.1 (which runs beautifully), the input tensors to the function call at skl_api_container:222 don't change. Is there some state inside the session?
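
(To make the batching above concrete, here is a minimal sketch of the loop as I understand it; this is not the actual batch_container.py code, and predict_fn/remainder_fn are made-up names.)

import numpy as np

def run_in_batches(data, predict_fn, remainder_fn, batch_size=10):
    total_size = len(data)
    iterations = total_size // batch_size + (1 if total_size % batch_size else 0)
    outputs = []
    for i in range(iterations):
        start = i * batch_size
        end = min(start + batch_size, total_size)
        if i == iterations - 1 and end - start < batch_size:
            # Remainder batch (rows 100:103 in the failing test) takes a separate
            # path; this is where the GatherElements error surfaces.
            outputs.append(remainder_fn(data[start:end]))
        else:
            outputs.append(predict_fn(data[start:end]))
    return np.concatenate(outputs)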

Thoughts @interesaaat ?


ksaur commented Dec 10, 2021

After a fun day of digging, it seems that this issue is due to the symbolic dimensions in onnxruntime not working in our test case (_topology:279) with the new torch. Maybe it's related to pytorch/pytorch#64642, which mentions zero-dim tensors; maybe not.

For now, we'll use static dimensions in our tests to get unblocked.
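
(For reference, a minimal sketch of the difference between a symbolic and a static batch dimension at export time, using a toy module; this is not the hummingbird export code, and the file/tensor names are placeholders.)

import torch

model = torch.nn.Linear(200, 1)
dummy = torch.randn(10, 200)

# Symbolic batch dimension: the exported graph accepts any batch size.
torch.onnx.export(
    model, dummy, "dynamic.onnx",
    input_names=["input_0"], output_names=["variable"],
    dynamic_axes={"input_0": {0: "batch_size"}, "variable": {0: "batch_size"}},
)

# Static dimensions (the temporary workaround in our tests): the batch size is
# baked in as 10, so every batch fed at inference time must have exactly 10 rows.
torch.onnx.export(
    model, dummy, "static.onnx",
    input_names=["input_0"], output_names=["variable"],
)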

ksaur linked a pull request Dec 10, 2021 that will close this issue

ksaur commented Dec 15, 2021

Ok, the PR as-is fixes the dimensions problem shown above. Now all of the tests pass... but then the process never returns.

The NEW problem (also described in PR #554) is that the ONNX tests involving strings hang indefinitely after printing the "all tests pass" message.

For example, test_model_label_encoder_int_onnxml (the int test for the label encoder) passes just fine and does not hang, but test_model_label_encoder_str_onnxml (the string test) causes the hang. Looking at gdb with ONLY the failing test uncommented, it's somehow related to tensor_from_numpy, which we call to handle the strings. GDB:

kasaur@p100-2:~/hummingbird$ gdb python
(gdb) run tests/test_onnxml_label_encoder_converter.py
Starting program: /usr/bin/python tests/test_onnxml_label_encoder_converter.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff35e9700 (LWP 9780)]
[New Thread 0x7ffff2de8700 (LWP 9781)]
[New Thread 0x7fffee5e7700 (LWP 9782)]
[New Thread 0x7fffebde6700 (LWP 9783)]
[New Thread 0x7fffeb5e5700 (LWP 9784)]
[Thread 0x7fffeb5e5700 (LWP 9784) exited]
[Thread 0x7fffebde6700 (LWP 9783) exited]
[Thread 0x7fffee5e7700 (LWP 9782) exited]
[Thread 0x7ffff2de8700 (LWP 9781) exited]
[Thread 0x7ffff35e9700 (LWP 9780) exited]
[New Thread 0x7fffeb5e5700 (LWP 9862)]
[New Thread 0x7fffebde6700 (LWP 9863)]
[New Thread 0x7fffee5e7700 (LWP 9864)]
[New Thread 0x7ffff2de8700 (LWP 9865)]
[New Thread 0x7fff6cf4d700 (LWP 9866)]
[New Thread 0x7fff6c74c700 (LWP 9867)]
[New Thread 0x7fff6bf4b700 (LWP 9868)]
[New Thread 0x7fff6b74a700 (LWP 9869)]
[New Thread 0x7fff6af49700 (LWP 9870)]
[New Thread 0x7fff6a748700 (LWP 9871)]
[Thread 0x7fffebde6700 (LWP 9863) exited]
[Thread 0x7fffeb5e5700 (LWP 9862) exited]
[Thread 0x7ffff2de8700 (LWP 9865) exited]
[Thread 0x7fffee5e7700 (LWP 9864) exited]
[Thread 0x7fff6cf4d700 (LWP 9866) exited]
[New Thread 0x7fff6cf4d700 (LWP 9872)]
[New Thread 0x7ffff2de8700 (LWP 9873)]
[New Thread 0x7fffee5e7700 (LWP 9874)]
[New Thread 0x7fffebde6700 (LWP 9875)]
[New Thread 0x7fff69f47700 (LWP 9876)]
[New Thread 0x7fff69746700 (LWP 9877)]
[New Thread 0x7fff68f45700 (LWP 9878)]
[Thread 0x7fffee5e7700 (LWP 9874) exited]
[Thread 0x7fff69f47700 (LWP 9876) exited]
[Thread 0x7fffebde6700 (LWP 9875) exited]
[Thread 0x7ffff2de8700 (LWP 9873) exited]
[Thread 0x7fff6cf4d700 (LWP 9872) exited]
[Thread 0x7fff68f45700 (LWP 9878) exited]
[Thread 0x7fff69746700 (LWP 9877) exited]
.
----------------------------------------------------------------------
Ran 1 test in 0.190s

OK

At this point it stalls indefinitely (as so beautifully demonstrated in our pipelines!) so I hit ctrl-c:

^C
(gdb) t
[Current thread is 1 (Thread 0x7ffff7fe6740 (LWP 9768))]
(gdb) bt
#0  0x00007ffff77d1065 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffffdc70, expected=0, futex_word=0xaa6a98 <_PyRuntime+1336>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fffffffdc70, mutex=0xaa6aa0 <_PyRuntime+1344>, cond=0xaa6a70 <_PyRuntime+1296>) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=0xaa6a70 <_PyRuntime+1296>, mutex=0xaa6aa0 <_PyRuntime+1344>, abstime=0x7fffffffdc70) at pthread_cond_wait.c:667
#3  0x000000000054e4f5 in ?? ()
#4  0x000000000063a382 in PyEval_AcquireThread ()
#5  0x00007fffd3b2099e in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007fffd463cd44 in std::_Function_handler<void (void*), torch::utils::tensor_from_numpy(_object*, bool)::{lambda(void*)#2}>::_M_invoke(std::_Any_data const&, void*&&) () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#7  0x00007fff771410fc in c10::deleteInefficientStdFunctionContext(void*) () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libc10.so
#8  0x00007fff77149314 in c10::TensorImpl::release_resources() () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libc10.so
#9  0x00007fffd43fcd33 in std::_Hashtable<std::string, std::pair<std::string const, at::Tensor>, std::allocator<std::pair<std::string const, at::Tensor> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#10 0x00007fffd440f465 in torch::jit::ConstantValueMap::~ConstantValueMap() () from /home/kasaur/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#11 0x00007ffff7a25161 in __run_exit_handlers (status=0, listp=0x7ffff7dcd718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#12 0x00007ffff7a2525a in __GI_exit (status=<optimized out>) at exit.c:139
#13 0x00000000006342f7 in Py_Exit ()
#14 0x0000000000630e28 in ?? ()
#15 0x0000000000630e5c in PyErr_PrintEx ()
#16 0x00000000006312f3 in PyRun_SimpleFileExFlags ()
#17 0x0000000000654232 in ?? ()
#18 0x000000000065458e in _Py_UnixMain ()
#19 0x00007ffff7a03bf7 in __libc_start_main (main=0x4bbbf0 <main>, argc=2, argv=0x7fffffffe168, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffe158) at ../csu/libc-start.c:310
#20 0x00000000005e0cca in _start ()

Starting with frame #6 (tensor_from_numpy), after some debugging with @interesaaat, we found an offending tensor_from_numpy call (one of many) at _label_encoder_implementations.py:41 (classes_conv = torch.from_numpy(np...). Hard-coding this array makes the test end normally, so this is the offending line! But... why?

The unanswered question is: why does this line of code pass just fine for scikit-learn but stall forever with ONNX? I verified that the very same tensor is created in both the ONNX and SKL tests.
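
(Possibly relevant context: torch.from_numpy returns a zero-copy view over the numpy buffer, and frames #5/#6 above show that view's deleter re-acquiring the GIL during interpreter shutdown. A minimal sketch of the sharing behavior, with arbitrary values:)

import numpy as np
import torch

arr = np.array([3, 0, 1, 2], dtype=np.int64)

shared = torch.from_numpy(arr)          # zero-copy view over arr's buffer
copied = torch.from_numpy(arr).clone()  # independent storage

arr[0] = 99
print(shared[0].item())  # 99 -- the view sees the mutation
print(copied[0].item())  # 3  -- the clone does not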


ksaur commented Dec 15, 2021

Trying to sort out whether it's only torch.from_numpy in the ONNX code path, or whether there is some other issue.

Looking quickly at two test files (extra and label encoder):

SKL: finishes all

ONNX Hangs:

  • test_extra_conf.py: (most of these don't use strings, yet they still hang)
    • test_onnx_one_thread
    • test_onnx_deafault_n_threads
    • test_onnx_batch_transform
    • test_onnx_regression_batch
    • test_onnx_classification_batch
    • test_onnx_iforest_batch
    • test_lightgbm_pytorch_extra_config
  • test_onnxml_label_encoder_converter.py:
    • test_model_label_encoder_str_onnxml

ONNX Finishes:

  • test_extra_conf.py:
    • test_pandas_batch_onnx (does not use numpy)
  • test_onnxml_label_encoder_converter.py:
    • test_model_label_encoder_int_onnxml (uses numpy but no strings)


ksaur commented Dec 15, 2021

Turns out, in _label_encoder_implementations.py#L44, changing:

self.condition_tensors = torch.nn.Parameter(torch.IntTensor((classes_conv)), requires_grad=False)

to

from copy import deepcopy
self.condition_tensors = torch.nn.Parameter(torch.IntTensor(deepcopy(classes_conv)), requires_grad=False)

works! Which... yay! But also... yikes! This seems bad. Also, it does not explain why this is only necessary in the ONNX code path but not in SKL.


ksaur commented Dec 16, 2021

Ok, using best tensor-copying practices (e.g., this commit), it's passing.

It's not clear how the ONNX session uses the tensor in a way that requires this copy (vs. SKL, which doesn't), but fortunately the extra clone operation is on the convert path and not the forward (inference-time) path.
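
(For the record, a minimal sketch of the copying pattern, assuming it matches the linked commit, which isn't reproduced here; the classes values are placeholders.)

import numpy as np
import torch

classes = np.array([2, 0, 1], dtype=np.int64)  # placeholder values
classes_conv = torch.from_numpy(classes)       # zero-copy view (the original pattern)

# clone().detach() gives the Parameter its own storage instead of a view that
# references the numpy-owned buffer.
condition_tensors = torch.nn.Parameter(
    classes_conv.int().clone().detach(), requires_grad=False
)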
