GIL-related deadlock with PyTorch 1.10.1 #3352

Closed
maxhgerlach opened this issue Jan 7, 2022 · 0 comments · Fixed by #3353
maxhgerlach commented Jan 7, 2022

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.10.1
  3. Python version: 3.7.10

I built Horovod like this:

# Install PyTorch 1.10.1 (CPU build) with matching torchvision and PyTorch Lightning
pip install -U torch==1.10.1+cpu pytorch-lightning==1.3.8 torchvision==0.11.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
# Build Horovod from source with Gloo (no MPI) and PyTorch/MXNet/TensorFlow support, debug symbols enabled
HOROVOD_DEBUG=1 HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_GLOO=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITH_TENSORFLOW=1 pip install -v -e .

Bug report:
I've been looking at some test failures in PR #3261 that popped up after rebasing to master, with PyTorch 1.10.1 and Gloo. PR #3351 should fix some of those problems, I think, but while debugging locally I would occasionally run into hangs that were harder to understand. Roughly 1 in 5 invocations of horovodrun -np 2 -H localhost:2 --gloo pytest -v -x test/parallel/test_torch.py hit such a hang. I've seen this with various test cases: test_async_sparse_allreduce, test_horovod_grouped_allreduce_grad_process_sets, and maybe test_broadcast_state (not sure if that one really had the same cause).

Example for test_horovod_grouped_allreduce_grad with HOROVOD_LOG_LEVEL=TRACE:

# ...
[0]<stdout>:[T horovod/common/operations.cc:783] [0]: Performing grouped_allreduce.noname.13_1of5, grouped_allreduce.noname.13_2of5, grouped_allreduce.noname.13_3of5, grouped_allreduce.noname.13_4of5, grouped_allreduce.noname.13_5of5
[1]<stdout>:[T horovod/common/operations.cc:783] [1]: Performing grouped_allreduce.noname.16_1of5, grouped_allreduce.noname.16_2of5, grouped_allreduce.noname.16_3of5, grouped_allreduce.noname.16_4of5, grouped_allreduce.noname.16_5of5
[0]<stdout>:[T horovod/common/operations.cc:785] [0]: Processing 5 tensors
[1]<stdout>:[T horovod/common/operations.cc:785] [1]: Processing 5 tensors
[0]<stdout>:[T horovod/common/operations.cc:788] [0]: Finished performing grouped_allreduce.noname.13_1of5, grouped_allreduce.noname.13_2of5, grouped_allreduce.noname.13_3of5, grouped_allreduce.noname.13_4of5, grouped_allreduce.noname.13_5of5
[1]<stdout>:[T horovod/common/operations.cc:788] [1]: Finished performing grouped_allreduce.noname.16_1of5, grouped_allreduce.noname.16_2of5, grouped_allreduce.noname.16_3of5, grouped_allreduce.noname.16_4of5, grouped_allreduce.noname.16_5of5
[0]<stdout>:[T horovod/common/operations.cc:1496] [0]: Enqueued grouped_allreduce.noname.14_1of5; grouped_allreduce.noname.14_2of5; grouped_allreduce.noname.14_3of5; grouped_allreduce.noname.14_4of5; grouped_allreduce.noname.14_5of5;
[1]<stdout>:[T horovod/common/operations.cc:1496] [1]: Enqueued grouped_allreduce.noname.17_1of5; grouped_allreduce.noname.17_2of5; grouped_allreduce.noname.17_3of5; grouped_allreduce.noname.17_4of5; grouped_allreduce.noname.17_5of5;
[0]<stdout>:[T horovod/common/controller.cc:187] [0]: Sent 5 messages to coordinator.
[1]<stdout>:[T horovod/common/controller.cc:187] [0]: Sent 5 messages to coordinator.
[0]<stdout>:[T horovod/common/controller.cc:262] Adding messages from process-set rank 0
[1]<stdout>:[T horovod/common/controller.cc:262] Adding messages from process-set rank 0
[0]<stdout>:[T horovod/common/controller.cc:947] Created response of size 98260
[1]<stdout>:[T horovod/common/controller.cc:947] Created response of size 98260
[0]<stdout>:[T horovod/common/controller.cc:450] Sending ready responses as grouped_allreduce.noname.14_1of5, grouped_allreduce.noname.14_2of5, grouped_allreduce.noname.14_3of5, grouped_allreduce.noname.14_4of5, grouped_allreduce.noname.14_5of5;
[1]<stdout>:[T horovod/common/controller.cc:450] Sending ready responses as grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5;
[0]<stdout>:[T horovod/common/operations.cc:782] [0]: Process set id 1
[1]<stdout>:[T horovod/common/operations.cc:782] [1]: Process set id 2
[0]<stdout>:[T horovod/common/operations.cc:783] [0]: Performing grouped_allreduce.noname.14_1of5, grouped_allreduce.noname.14_2of5, grouped_allreduce.noname.14_3of5, grouped_allreduce.noname.14_4of5, grouped_allreduce.noname.14_5of5
[1]<stdout>:[T horovod/common/operations.cc:783] [1]: Performing grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5
[0]<stdout>:[T horovod/common/operations.cc:785] [0]: Processing 5 tensors
[1]<stdout>:[T horovod/common/operations.cc:785] [1]: Processing 5 tensors
[0]<stdout>:[T horovod/common/operations.cc:1496] [0]: Enqueued grouped_allreduce.noname.15_1of5; grouped_allreduce.noname.15_2of5; grouped_allreduce.noname.15_3of5; grouped_allreduce.noname.15_4of5; grouped_allreduce.noname.15_5of5;
[1]<stdout>:[T horovod/common/operations.cc:788] [1]: Finished performing grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5
# Hangs

Investigating with GDB, I found that rank 0 was stuck in a blocking collective operation (in this case ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady, which is called at the beginning of each Horovod step), while rank 1 was stuck at the end of PerformOperation in the destructor of the local std::vector<TensorTableEntry> entries, which in Horovod's code ultimately invokes the destructor of a ::torch::Tensor.
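
To make that destructor chain concrete, here is a much-simplified sketch of the ownership involved (the *Sketch names are made up for illustration; the real types are horovod::torch::TorchTensor in adapter_v2.h and horovod::common::TensorTableEntry in common.h, and the GIL acquisition happens inside PyTorch's TensorImpl::release_resources, not in Horovod itself):

// Simplified sketch of the ownership chain from rank 1's backtrace below.
// Not Horovod's actual code; it only illustrates why dropping `entries`
// can end up needing the Python GIL with PyTorch 1.10.1.
#include <memory>
#include <vector>
#include <ATen/ATen.h>

// Horovod wraps each PyTorch tensor in an adapter (see adapter_v2.h).
struct TorchTensorSketch {
  at::Tensor tensor_;  // Destroying this may drop the last reference to a
                       // Python-owned TensorImpl; TensorImpl::release_resources
                       // then calls back into Python (concrete_decref_fn),
                       // which takes pybind11::gil_scoped_acquire.
};

// Each queued operation owns its tensors via shared_ptr (see common.h).
struct TensorTableEntrySketch {
  std::shared_ptr<TorchTensorSketch> tensor;
};

void PerformOperationSketch(std::vector<TensorTableEntrySketch> entries) {
  // ... execute the collective on `entries` ...
}  // `entries` is destroyed here, on the Horovod background thread, so that
   // thread blocks until it can acquire the GIL.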

Backtraces for rank 0:

(gdb) t 18
[Switching to thread 18 (Thread 0x7fa34c3ee700 (LWP 291931))]
#0  0x00007fa440cb3709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa440cb3709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa317b97566 in gloo::transport::tcp::UnboundBuffer::waitSend(int*, std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2  0x00007fa317b6ca3a in gloo::allgather(gloo::AllgatherOptions&) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fa30ebe0d2d in horovod::common::GlooController::Allgather2Ints (this=0x592ad40, values=..., recv_values=std::vector of length 4, capacity 4 = {...}) at /mnt/data/max_temp/horovod/horovod/common/gloo/gloo_controller.cc:304
#4  0x00007fa30eb2ea36 in horovod::common::ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady_<horovod::common::GlooContext> (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, global_context=..., removal_status=...) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:187
#5  0x00007fa30eb2c505 in horovod::common::ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, global_gloo_context=..., status=...) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:299
#6  0x00007fa30eaff479 in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:737
#7  0x00007fa30eafee5b in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:651
#8  0x00007fa30eb1ea9e in std::__invoke_impl<void, void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__f=@0x58f3fd0: 0x7fa30eafe19c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:60
#9  0x00007fa30eb1e9f9 in std::__invoke<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__fn=@0x58f3fd0: 0x7fa30eafe19c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:95
#10 0x00007fa30eb1e959 in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::_M_invoke<0ul, 1ul> (this=0x58f3fc8) at /usr/include/c++/9/thread:244
#11 0x00007fa30eb1e8ff in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::operator() (this=0x58f3fc8) at /usr/include/c++/9/thread:251
#12 0x00007fa30eb1e888 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > > >::_M_run (this=0x58f3fc0) at /usr/include/c++/9/thread:195
#13 0x00007fa432f1072f in execute_native_thread_routine () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#14 0x00007fa440cad6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fa43fe9051d in clone () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) t 1
[Switching to thread 1 (Thread 0x7fa4410d3700 (LWP 287801))]
#0  0x00007fa440cb626d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa440cb626d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa440cafe42 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa30eb05082 in __gthread_mutex_lock (__mutex=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3  0x00007fa30eb050d2 in __gthread_recursive_mutex_lock (__mutex=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4  0x00007fa30eb052f2 in std::recursive_mutex::lock (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/c++/9/mutex:106
#5  0x00007fa30eb08836 in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7fffcebaa080, __m=...) at /usr/include/c++/9/bits/std_mutex.h:159
#6  0x00007fa30eb2cb70 in horovod::common::ProcessSetTable::Contains (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, id=2) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:369
#7  0x00007fa30eb0143e in horovod::common::EnqueueTensorAllreduces(std::vector<std::shared_ptr<horovod::common::OpContext>, std::allocator<std::shared_ptr<horovod::common::OpContext> > >&, std::vector<std::shared_ptr<horovod::common::Tensor>, std::allocator<std::shared_ptr<horovod::common::Tensor> > >&, std::vector<std::shared_ptr<horovod::common::Tensor>, std::allocator<std::shared_ptr<horovod::common::Tensor> > >&, std::vector<horovod::common::ReadyEventList, std::allocator<horovod::common::ReadyEventList> >&, std::vector<std::string, std::allocator<std::string> >&, int, std::vector<std::function<void (horovod::common::Status const&)>, std::allocator<std::function<void (horovod::common::Status const&)> > >&, horovod::common::ReduceOp, double, double, int) (contexts=std::vector of length 5, capacity 5 = {...}, tensors=std::vector of length 5, capacity 5 = {...}, outputs=std::vector of length 5, capacity 5 = {...}, ready_event_lists=std::vector of length 5, capacity 5 = {...}, names=std::vector of length 5, capacity 5 = {...}, device=-1, callbacks=std::vector of length 5, capacity 5 = {...}, reduce_op=horovod::common::SUM, prescale_factor=1, postscale_factor=1, process_set_id=2) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:1398
#8  0x00007fa30ebfeb04 in horovod::torch::DoGroupedAllreduce (tensors=std::vector of length 5, capacity 5 = {...}, outputs=std::vector of length 5, capacity 5 = {...}, divisor=1, name="", reduce_op_int=1, prescale_factor=1, postscale_factor=1, process_set_id=2) at /mnt/data/max_temp/horovod/horovod/torch/mpi_ops_v2.cc:225
#9  0x00007fa30ec34d32 in pybind11::detail::argument_loader<std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int>::call_impl<int, int (*&)(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, pybind11::detail::void_type>(int (*&)(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul>, pybind11::detail::void_type&&) && (this=0x7fffcebaa6d0, f=@0x54da568: 0x7fa30ebfe4a9 <horovod::torch::DoGroupedAllreduce(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int)>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2042
# ...

Backtraces for rank 1:

(gdb) t 18
[Switching to thread 18 (Thread 0x7ff43bce1700 (LWP 291928))]
#0  0x00007ff4585a6709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ff4585a6709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000061a240 in ?? ()
#2  0x000000000061a802 in PyEval_AcquireThread ()
#3  0x00007ff4499457ae in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#4  0x00007ff44a188a6c in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*, bool) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ff4495af9bb in c10::TensorImpl::release_resources() () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#6  0x00007ff327734fd1 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_ (this=0x4b2d998) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/c10/util/intrusive_ptr.h:268
#7  0x00007ff32772eb2a in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/c10/util/intrusive_ptr.h:349
#8  0x00007ff327720434 in at::TensorBase::~TensorBase (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/ATen/core/TensorBase.h:76
#9  0x00007ff32772065c in at::Tensor::~Tensor (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:75
#10 0x00007ff3277588a8 in horovod::torch::TorchTensor::~TorchTensor (this=0x4b2d990, __in_chrg=<optimized out>) at /mnt/data/max_temp/horovod/horovod/torch/adapter_v2.h:42
#11 0x00007ff327753b23 in __gnu_cxx::new_allocator<horovod::torch::TorchTensor>::destroy<horovod::torch::TorchTensor> (this=0x4b2d990, __p=0x4b2d990) at /usr/include/c++/9/ext/new_allocator.h:152
#12 0x00007ff327753a65 in std::allocator_traits<std::allocator<horovod::torch::TorchTensor> >::destroy<horovod::torch::TorchTensor> (__a=..., __p=0x4b2d990) at /usr/include/c++/9/bits/alloc_traits.h:496
#13 0x00007ff327753737 in std::_Sp_counted_ptr_inplace<horovod::torch::TorchTensor, std::allocator<horovod::torch::TorchTensor>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x4b2d980) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#14 0x00007ff3275eff56 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x4b2d980) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#15 0x00007ff3275ed83f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ff320031430, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#16 0x00007ff32761d700 in std::__shared_ptr<horovod::common::Tensor, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ff320031428, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#17 0x00007ff32761d742 in std::shared_ptr<horovod::common::Tensor>::~shared_ptr (this=0x7ff320031428 = {...}, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#18 0x00007ff32761d9a6 in horovod::common::TensorTableEntry::~TensorTableEntry (this=0x7ff320031410, __in_chrg=<optimized out>) at /mnt/data/max_temp/horovod/horovod/common/common.h:348
#19 0x00007ff32762f9b2 in std::_Destroy<horovod::common::TensorTableEntry> (__pointer=0x7ff320031410) at /usr/include/c++/9/bits/stl_construct.h:98
#20 0x00007ff32762cb27 in std::_Destroy_aux<false>::__destroy<horovod::common::TensorTableEntry*> (__first=0x7ff320031410, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:108
#21 0x00007ff327628322 in std::_Destroy<horovod::common::TensorTableEntry*> (__first=0x7ff320031200, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:137
#22 0x00007ff327623293 in std::_Destroy<horovod::common::TensorTableEntry*, horovod::common::TensorTableEntry> (__first=0x7ff320031200, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:206
#23 0x00007ff32761f39f in std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> >::~vector (this=0x7ff43bce0610 = {...}, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:677
#24 0x00007ff327614c37 in horovod::common::(anonymous namespace)::PerformOperation (response=..., process_set=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:292
#25 0x00007ff3276169a8 in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:787
#26 0x00007ff327615e5b in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:651
#27 0x00007ff327635a9e in std::__invoke_impl<void, void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__f=@0x4c862e0: 0x7ff32761519c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:60
#28 0x00007ff3276359f9 in std::__invoke<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__fn=@0x4c862e0: 0x7ff32761519c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:95
#29 0x00007ff327635959 in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::_M_invoke<0ul, 1ul> (this=0x4c862d8) at /usr/include/c++/9/thread:244
#30 0x00007ff3276358ff in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::operator() (this=0x4c862d8) at /usr/include/c++/9/thread:251
#31 0x00007ff327635888 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > > >::_M_run (this=0x4c862d0) at /usr/include/c++/9/thread:195
#32 0x00007ff4495df72f in execute_native_thread_routine () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#33 0x00007ff4585a06ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#34 0x00007ff45778351d in clone () from /lib/x86_64-linux-gnu/libc.so.6


Thread 1 (Thread 0x7ff4589c6700 (LWP 287804) "pytest"):
#0  0x00007ff457766927 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff327661937 in __gthread_yield () at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:693
#2  0x00007ff327661977 in std::this_thread::yield () at /usr/include/c++/9/thread:356
#3  0x00007ff3277196cf in horovod::torch::WaitAndClear (handle=15) at /mnt/data/max_temp/horovod/horovod/torch/mpi_ops_v2.cc:609
#4  0x00007ff32774cda2 in pybind11::detail::argument_loader<int>::call_impl<void, void (*&)(int), 0ul, pybind11::detail::void_type>(void (*&)(int), std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) && (this=0x7fff2481f6ec, f=@0x4a679d8: 0x7ff32771969a <horovod::torch::WaitAndClear(int)>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2042
#...

Analysis:

Rank 0: Thread 1 is blocked waiting for the ProcessSetTable mutex (see ProcessSetTable::Contains in the backtrace). Thread 18 (the Horovod background loop) holds that mutex and is itself blocked in a Gloo allgather, waiting for rank 1 to join.

Rank 1: Thread 1 holds the Python GIL and is busy-waiting (in synchronize() from mpi_ops.py, via WaitAndClear). Thread 18 (the background loop) is waiting to acquire the GIL, which it apparently needs to release a PyTorch tensor.

It's unclear if the behavior of PyTorch has changed here in a recent release.

=> Releasing the GIL in thread 1 while it yields in the wait loop should help: thread 18 on rank 1 could then release its tensors and proceed, which in turn would unblock rank 0.
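
A minimal sketch of that change, roughly in the shape of WaitAndClear from mpi_ops_v2.cc (poll_handle and clear_handle are hypothetical stand-ins for Horovod's handle-manager calls, not the real API; the actual fix is in PR #3353):

// Sketch of the proposed fix for the busy-wait in horovod/torch/mpi_ops_v2.cc.
// poll_handle() and clear_handle() are hypothetical stand-ins; the point is
// only where the GIL gets released around the spin loop.
#include <thread>
#include <pybind11/pybind11.h>

namespace py = pybind11;

bool poll_handle(int /*handle*/) { return true; }  // hypothetical: true once the op completed
void clear_handle(int /*handle*/) {}               // hypothetical: drop the finished handle

void WaitAndClearSketch(int handle) {
  {
    // Drop the GIL while spinning so that the background thread can acquire
    // it to release its torch::Tensor references (see rank 1's backtrace).
    py::gil_scoped_release release;
    while (!poll_handle(handle)) {
      std::this_thread::yield();
    }
  }
  // The GIL is re-acquired when `release` goes out of scope, before any
  // Python objects are touched again.
  clear_handle(handle);
}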
