GIL-related deadlock with PyTorch 1.10.1 #3352

Closed
maxhgerlach opened this issue Jan 7, 2022 · 0 comments · Fixed by #3353
maxhgerlach commented Jan 7, 2022

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.10.1
  3. Python version: 3.7.10

I built Horovod like this:

# Install PyTorch 1.10.1 (CPU build) with matching torchvision and PyTorch Lightning
pip install -U torch==1.10.1+cpu pytorch-lightning==1.3.8 torchvision==0.11.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
# Build Horovod from source with Gloo (no MPI) and PyTorch/MXNet/TensorFlow support, debug symbols enabled
HOROVOD_DEBUG=1 HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_GLOO=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITH_TENSORFLOW=1 pip install -v -e .

Bug report:
I've been looking at some test failures in PR #3261 that popped up after rebasing to master, with PyTorch 1.10.1 and Gloo. PR #3351 should fix some of those problems, I think, but while debugging locally I would occasionally run into hangs that were harder to understand. Roughly 1 in 5 invocations of horovodrun -np 2 -H localhost:2 --gloo pytest -v -x test/parallel/test_torch.py hit such a hang. I've seen this with various test cases: test_async_sparse_allreduce, test_horovod_grouped_allreduce_grad_process_sets, and maybe test_broadcast_state (not sure if that one really had the same cause).

Example for test_horovod_grouped_allreduce_grad with HOROVOD_LOG_LEVEL=TRACE:

# ...
[0]<stdout>:[T horovod/common/operations.cc:783] [0]: Performing grouped_allreduce.noname.13_1of5, grouped_allreduce.noname.13_2of5, grouped_allreduce.noname.13_3of5, grouped_allreduce.noname.13_4of5, grouped_allreduce.noname.13_5of5
[1]<stdout>:[T horovod/common/operations.cc:783] [1]: Performing grouped_allreduce.noname.16_1of5, grouped_allreduce.noname.16_2of5, grouped_allreduce.noname.16_3of5, grouped_allreduce.noname.16_4of5, grouped_allreduce.noname.16_5of5
[0]<stdout>:[T horovod/common/operations.cc:785] [0]: Processing 5 tensors
[1]<stdout>:[T horovod/common/operations.cc:785] [1]: Processing 5 tensors
[0]<stdout>:[T horovod/common/operations.cc:788] [0]: Finished performing grouped_allreduce.noname.13_1of5, grouped_allreduce.noname.13_2of5, grouped_allreduce.noname.13_3of5, grouped_allreduce.noname.13_4of5, grouped_allreduce.noname.13_5of5
[1]<stdout>:[T horovod/common/operations.cc:788] [1]: Finished performing grouped_allreduce.noname.16_1of5, grouped_allreduce.noname.16_2of5, grouped_allreduce.noname.16_3of5, grouped_allreduce.noname.16_4of5, grouped_allreduce.noname.16_5of5
[0]<stdout>:[T horovod/common/operations.cc:1496] [0]: Enqueued grouped_allreduce.noname.14_1of5; grouped_allreduce.noname.14_2of5; grouped_allreduce.noname.14_3of5; grouped_allreduce.noname.14_4of5; grouped_allreduce.noname.14_5of5;
[1]<stdout>:[T horovod/common/operations.cc:1496] [1]: Enqueued grouped_allreduce.noname.17_1of5; grouped_allreduce.noname.17_2of5; grouped_allreduce.noname.17_3of5; grouped_allreduce.noname.17_4of5; grouped_allreduce.noname.17_5of5;
[0]<stdout>:[T horovod/common/controller.cc:187] [0]: Sent 5 messages to coordinator.
[1]<stdout>:[T horovod/common/controller.cc:187] [0]: Sent 5 messages to coordinator.
[0]<stdout>:[T horovod/common/controller.cc:262] Adding messages from process-set rank 0
[1]<stdout>:[T horovod/common/controller.cc:262] Adding messages from process-set rank 0
[0]<stdout>:[T horovod/common/controller.cc:947] Created response of size 98260
[1]<stdout>:[T horovod/common/controller.cc:947] Created response of size 98260
[0]<stdout>:[T horovod/common/controller.cc:450] Sending ready responses as grouped_allreduce.noname.14_1of5, grouped_allreduce.noname.14_2of5, grouped_allreduce.noname.14_3of5, grouped_allreduce.noname.14_4of5, grouped_allreduce.noname.14_5of5;
[1]<stdout>:[T horovod/common/controller.cc:450] Sending ready responses as grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5;
[0]<stdout>:[T horovod/common/operations.cc:782] [0]: Process set id 1
[1]<stdout>:[T horovod/common/operations.cc:782] [1]: Process set id 2
[0]<stdout>:[T horovod/common/operations.cc:783] [0]: Performing grouped_allreduce.noname.14_1of5, grouped_allreduce.noname.14_2of5, grouped_allreduce.noname.14_3of5, grouped_allreduce.noname.14_4of5, grouped_allreduce.noname.14_5of5
[1]<stdout>:[T horovod/common/operations.cc:783] [1]: Performing grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5
[0]<stdout>:[T horovod/common/operations.cc:785] [0]: Processing 5 tensors
[1]<stdout>:[T horovod/common/operations.cc:785] [1]: Processing 5 tensors
[0]<stdout>:[T horovod/common/operations.cc:1496] [0]: Enqueued grouped_allreduce.noname.15_1of5; grouped_allreduce.noname.15_2of5; grouped_allreduce.noname.15_3of5; grouped_allreduce.noname.15_4of5; grouped_allreduce.noname.15_5of5;
[1]<stdout>:[T horovod/common/operations.cc:788] [1]: Finished performing grouped_allreduce.noname.17_1of5, grouped_allreduce.noname.17_2of5, grouped_allreduce.noname.17_3of5, grouped_allreduce.noname.17_4of5, grouped_allreduce.noname.17_5of5
# Hangs

Investigating with GDB, I found that rank 0 was stuck in a blocking collective operation (in this case ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady, which is called at the beginning of each Horovod step), while rank 1 was stuck at the end of PerformOperation in the destructor of the local std::vector<TensorTableEntry> entries, which in Horovod's code ultimately invokes the destructor of a ::torch::Tensor.
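
To make that destructor chain concrete, here is a much-simplified sketch of the ownership involved (the *Sketch names are made up for illustration; the real types are horovod::torch::TorchTensor in adapter_v2.h and horovod::common::TensorTableEntry in common.h, and the GIL acquisition happens inside PyTorch's TensorImpl::release_resources, not in Horovod itself):

// Simplified sketch of the ownership chain from rank 1's backtrace below.
// Not Horovod's actual code; it only illustrates why dropping `entries`
// can end up needing the Python GIL with PyTorch 1.10.1.
#include <memory>
#include <vector>
#include <ATen/ATen.h>

// Horovod wraps each PyTorch tensor in an adapter (see adapter_v2.h).
struct TorchTensorSketch {
  at::Tensor tensor_;  // Destroying this may drop the last reference to a
                       // Python-owned TensorImpl; TensorImpl::release_resources
                       // then calls back into Python (concrete_decref_fn),
                       // which takes pybind11::gil_scoped_acquire.
};

// Each queued operation owns its tensors via shared_ptr (see common.h).
struct TensorTableEntrySketch {
  std::shared_ptr<TorchTensorSketch> tensor;
};

void PerformOperationSketch(std::vector<TensorTableEntrySketch> entries) {
  // ... execute the collective on `entries` ...
}  // `entries` is destroyed here, on the Horovod background thread, so that
   // thread blocks until it can acquire the GIL.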

Backtraces for rank 0:

(gdb) t 18
[Switching to thread 18 (Thread 0x7fa34c3ee700 (LWP 291931))]
#0  0x00007fa440cb3709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa440cb3709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa317b97566 in gloo::transport::tcp::UnboundBuffer::waitSend(int*, std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2  0x00007fa317b6ca3a in gloo::allgather(gloo::AllgatherOptions&) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fa30ebe0d2d in horovod::common::GlooController::Allgather2Ints (this=0x592ad40, values=..., recv_values=std::vector of length 4, capacity 4 = {...}) at /mnt/data/max_temp/horovod/horovod/common/gloo/gloo_controller.cc:304
#4  0x00007fa30eb2ea36 in horovod::common::ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady_<horovod::common::GlooContext> (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, global_context=..., removal_status=...) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:187
#5  0x00007fa30eb2c505 in horovod::common::ProcessSetTable::InitializeRegisteredAndRemoveMarkedIfReady (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, global_gloo_context=..., status=...) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:299
#6  0x00007fa30eaff479 in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:737
#7  0x00007fa30eafee5b in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:651
#8  0x00007fa30eb1ea9e in std::__invoke_impl<void, void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__f=@0x58f3fd0: 0x7fa30eafe19c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:60
#9  0x00007fa30eb1e9f9 in std::__invoke<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__fn=@0x58f3fd0: 0x7fa30eafe19c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:95
#10 0x00007fa30eb1e959 in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::_M_invoke<0ul, 1ul> (this=0x58f3fc8) at /usr/include/c++/9/thread:244
#11 0x00007fa30eb1e8ff in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::operator() (this=0x58f3fc8) at /usr/include/c++/9/thread:251
#12 0x00007fa30eb1e888 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > > >::_M_run (this=0x58f3fc0) at /usr/include/c++/9/thread:195
#13 0x00007fa432f1072f in execute_native_thread_routine () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#14 0x00007fa440cad6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fa43fe9051d in clone () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) t 1
[Switching to thread 1 (Thread 0x7fa4410d3700 (LWP 287801))]
#0  0x00007fa440cb626d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa440cb626d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa440cafe42 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa30eb05082 in __gthread_mutex_lock (__mutex=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3  0x00007fa30eb050d2 in __gthread_recursive_mutex_lock (__mutex=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4  0x00007fa30eb052f2 in std::recursive_mutex::lock (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>) at /usr/include/c++/9/mutex:106
#5  0x00007fa30eb08836 in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7fffcebaa080, __m=...) at /usr/include/c++/9/bits/std_mutex.h:159
#6  0x00007fa30eb2cb70 in horovod::common::ProcessSetTable::Contains (this=0x7fa312740cd0 <horovod::common::(anonymous namespace)::horovod_global+58722512>, id=2) at /mnt/data/max_temp/horovod/horovod/common/process_set.cc:369
#7  0x00007fa30eb0143e in horovod::common::EnqueueTensorAllreduces(std::vector<std::shared_ptr<horovod::common::OpContext>, std::allocator<std::shared_ptr<horovod::common::OpContext> > >&, std::vector<std::shared_ptr<horovod::common::Tensor>, std::allocator<std::shared_ptr<horovod::common::Tensor> > >&, std::vector<std::shared_ptr<horovod::common::Tensor>, std::allocator<std::shared_ptr<horovod::common::Tensor> > >&, std::vector<horovod::common::ReadyEventList, std::allocator<horovod::common::ReadyEventList> >&, std::vector<std::string, std::allocator<std::string> >&, int, std::vector<std::function<void (horovod::common::Status const&)>, std::allocator<std::function<void (horovod::common::Status const&)> > >&, horovod::common::ReduceOp, double, double, int) (contexts=std::vector of length 5, capacity 5 = {...}, tensors=std::vector of length 5, capacity 5 = {...}, outputs=std::vector of length 5, capacity 5 = {...}, ready_event_lists=std::vector of length 5, capacity 5 = {...}, names=std::vector of length 5, capacity 5 = {...}, device=-1, callbacks=std::vector of length 5, capacity 5 = {...}, reduce_op=horovod::common::SUM, prescale_factor=1, postscale_factor=1, process_set_id=2) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:1398
#8  0x00007fa30ebfeb04 in horovod::torch::DoGroupedAllreduce (tensors=std::vector of length 5, capacity 5 = {...}, outputs=std::vector of length 5, capacity 5 = {...}, divisor=1, name="", reduce_op_int=1, prescale_factor=1, postscale_factor=1, process_set_id=2) at /mnt/data/max_temp/horovod/horovod/torch/mpi_ops_v2.cc:225
#9  0x00007fa30ec34d32 in pybind11::detail::argument_loader<std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int>::call_impl<int, int (*&)(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, pybind11::detail::void_type>(int (*&)(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul>, pybind11::detail::void_type&&) && (this=0x7fffcebaa6d0, f=@0x54da568: 0x7fa30ebfe4a9 <horovod::torch::DoGroupedAllreduce(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int, std::string const&, int, double, double, int)>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2042
# ...

Backtraces for rank 1:

(gdb) t 18
[Switching to thread 18 (Thread 0x7ff43bce1700 (LWP 291928))]
#0  0x00007ff4585a6709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ff4585a6709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000061a240 in ?? ()
#2  0x000000000061a802 in PyEval_AcquireThread ()
#3  0x00007ff4499457ae in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#4  0x00007ff44a188a6c in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*, bool) () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ff4495af9bb in c10::TensorImpl::release_resources() () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#6  0x00007ff327734fd1 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_ (this=0x4b2d998) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/c10/util/intrusive_ptr.h:268
#7  0x00007ff32772eb2a in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/c10/util/intrusive_ptr.h:349
#8  0x00007ff327720434 in at::TensorBase::~TensorBase (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/ATen/core/TensorBase.h:76
#9  0x00007ff32772065c in at::Tensor::~Tensor (this=0x4b2d998, __in_chrg=<optimized out>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:75
#10 0x00007ff3277588a8 in horovod::torch::TorchTensor::~TorchTensor (this=0x4b2d990, __in_chrg=<optimized out>) at /mnt/data/max_temp/horovod/horovod/torch/adapter_v2.h:42
#11 0x00007ff327753b23 in __gnu_cxx::new_allocator<horovod::torch::TorchTensor>::destroy<horovod::torch::TorchTensor> (this=0x4b2d990, __p=0x4b2d990) at /usr/include/c++/9/ext/new_allocator.h:152
#12 0x00007ff327753a65 in std::allocator_traits<std::allocator<horovod::torch::TorchTensor> >::destroy<horovod::torch::TorchTensor> (__a=..., __p=0x4b2d990) at /usr/include/c++/9/bits/alloc_traits.h:496
#13 0x00007ff327753737 in std::_Sp_counted_ptr_inplace<horovod::torch::TorchTensor, std::allocator<horovod::torch::TorchTensor>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x4b2d980) at /usr/include/c++/9/bits/shared_ptr_base.h:557
#14 0x00007ff3275eff56 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x4b2d980) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#15 0x00007ff3275ed83f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ff320031430, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#16 0x00007ff32761d700 in std::__shared_ptr<horovod::common::Tensor, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ff320031428, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#17 0x00007ff32761d742 in std::shared_ptr<horovod::common::Tensor>::~shared_ptr (this=0x7ff320031428 = {...}, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#18 0x00007ff32761d9a6 in horovod::common::TensorTableEntry::~TensorTableEntry (this=0x7ff320031410, __in_chrg=<optimized out>) at /mnt/data/max_temp/horovod/horovod/common/common.h:348
#19 0x00007ff32762f9b2 in std::_Destroy<horovod::common::TensorTableEntry> (__pointer=0x7ff320031410) at /usr/include/c++/9/bits/stl_construct.h:98
#20 0x00007ff32762cb27 in std::_Destroy_aux<false>::__destroy<horovod::common::TensorTableEntry*> (__first=0x7ff320031410, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:108
#21 0x00007ff327628322 in std::_Destroy<horovod::common::TensorTableEntry*> (__first=0x7ff320031200, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:137
#22 0x00007ff327623293 in std::_Destroy<horovod::common::TensorTableEntry*, horovod::common::TensorTableEntry> (__first=0x7ff320031200, __last=0x7ff320031570) at /usr/include/c++/9/bits/stl_construct.h:206
#23 0x00007ff32761f39f in std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> >::~vector (this=0x7ff43bce0610 = {...}, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:677
#24 0x00007ff327614c37 in horovod::common::(anonymous namespace)::PerformOperation (response=..., process_set=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:292
#25 0x00007ff3276169a8 in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:787
#26 0x00007ff327615e5b in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /mnt/data/max_temp/horovod/horovod/common/operations.cc:651
#27 0x00007ff327635a9e in std::__invoke_impl<void, void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__f=@0x4c862e0: 0x7ff32761519c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:60
#28 0x00007ff3276359f9 in std::__invoke<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > (__fn=@0x4c862e0: 0x7ff32761519c <horovod::common::(anonymous namespace)::BackgroundThreadLoop(horovod::common::HorovodGlobalState&)>) at /usr/include/c++/9/bits/invoke.h:95
#29 0x00007ff327635959 in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::_M_invoke<0ul, 1ul> (this=0x4c862d8) at /usr/include/c++/9/thread:244
#30 0x00007ff3276358ff in std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > >::operator() (this=0x4c862d8) at /usr/include/c++/9/thread:251
#31 0x00007ff327635888 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(horovod::common::HorovodGlobalState&), std::reference_wrapper<horovod::common::HorovodGlobalState> > > >::_M_run (this=0x4c862d0) at /usr/include/c++/9/thread:195
#32 0x00007ff4495df72f in execute_native_thread_routine () from /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/lib/libc10.so
#33 0x00007ff4585a06ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#34 0x00007ff45778351d in clone () from /lib/x86_64-linux-gnu/libc.so.6


Thread 1 (Thread 0x7ff4589c6700 (LWP 287804) "pytest"):
#0  0x00007ff457766927 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff327661937 in __gthread_yield () at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:693
#2  0x00007ff327661977 in std::this_thread::yield () at /usr/include/c++/9/thread:356
#3  0x00007ff3277196cf in horovod::torch::WaitAndClear (handle=15) at /mnt/data/max_temp/horovod/horovod/torch/mpi_ops_v2.cc:609
#4  0x00007ff32774cda2 in pybind11::detail::argument_loader<int>::call_impl<void, void (*&)(int), 0ul, pybind11::detail::void_type>(void (*&)(int), std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) && (this=0x7fff2481f6ec, f=@0x4a679d8: 0x7ff32771969a <horovod::torch::WaitAndClear(int)>) at /learndata4/maxDev/horovod-dev-venv/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2042
#...

Analysis:

Rank 0: Thread 1 is blocked waiting for the ProcessSetTable mutex (see ProcessSetTable::Contains in the backtrace). Thread 18 (the Horovod background loop) holds that mutex and is itself blocked in a Gloo allgather, waiting for rank 1 to join.

Rank 1: Thread 1 holds the Python GIL and is busy-waiting (in synchronize() from mpi_ops.py, via WaitAndClear). Thread 18 (the background loop) is waiting to acquire the GIL, which it apparently needs to release a PyTorch tensor.

It's unclear if the behavior of PyTorch has changed here in a recent release.

=> Releasing the GIL in thread 1 while it yields in the wait loop should help: thread 18 on rank 1 could then release its tensors and proceed, which in turn would unblock rank 0.
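
A minimal sketch of that change, roughly in the shape of WaitAndClear from mpi_ops_v2.cc (poll_handle and clear_handle are hypothetical stand-ins for Horovod's handle-manager calls, not the real API; the actual fix is in PR #3353):

// Sketch of the proposed fix for the busy-wait in horovod/torch/mpi_ops_v2.cc.
// poll_handle() and clear_handle() are hypothetical stand-ins; the point is
// only where the GIL gets released around the spin loop.
#include <thread>
#include <pybind11/pybind11.h>

namespace py = pybind11;

bool poll_handle(int /*handle*/) { return true; }  // hypothetical: true once the op completed
void clear_handle(int /*handle*/) {}               // hypothetical: drop the finished handle

void WaitAndClearSketch(int handle) {
  {
    // Drop the GIL while spinning so that the background thread can acquire
    // it to release its torch::Tensor references (see rank 1's backtrace).
    py::gil_scoped_release release;
    while (!poll_handle(handle)) {
      std::this_thread::yield();
    }
  }
  // The GIL is re-acquired when `release` goes out of scope, before any
  // Python objects are touched again.
  clear_handle(handle);
}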
