
Horovod hangs on nccl/gloo network connection failure #3611

Closed · MrAta opened this issue on Jul 21, 2022 · 2 comments · Labels: bug

MrAta (Contributor) commented on Jul 21, 2022

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.4
  3. Horovod version: 0.23
  4. MPI version: 4.1.1
  5. CUDA version: 11.1
  6. NCCL version: 2.12.12
  7. Python version: 3.7
  8. OS and version: RHEL7
  9. GCC version: 8.3.1
  10. CMake version: 3.10.2

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes.
  2. If your question is about a hang, did you read this doc? Yes.
  3. If your question is about Docker, did you read this doc? Yes.
  4. Did you check if your question is answered in the troubleshooting guide? Yes.

Bug report:
When running Horovod with NCCL on a Kubernetes cluster of 32 pods, each with 6 V100 GPUs, the NCCL threads halt in persistentSocketThread because a recv call fails:

[15]worker-23:83:2526 [0] misc/socket.cc:503 NCCL WARN Net : Call to recv from 100.96.188.199<41468> failed : Connection timed out
[15]worker-23:83:2526 [0] NCCL INFO misc/socket.cc:520 -> 2
[15]23:83:2526 [0] transport/net_socket.cc:219 NCCL WARN NET/Socket : socket progress error
[15]worker-23:83:2520 [0] NCCL INFO include/net.h:32 -> 2
[15]worker-23:83:2520 [0] NCCL INFO transport/net.cc:870 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:494 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]

All of the ranks have this stack trace:

#0  0x00007f7a2620c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f0e179b in persistentSocketThread (args_=0x7f61ac040c10) at transport/net_socket.cc:231
#2  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Some Horovod threads are also stuck waiting for GPU events:

#0  0x00007f7a2580dd47 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007f795f07b8a2 in __gthread_yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:692
#2  std::this_thread::yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/thread:357
#3  horovod::common::GPUContext::impl::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=0x54baeb0, event_queue=std::queue wrapping: std::deque with 0 elements, entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=true) at /horovod/horovod/common/ops/cuda_operations.cc:131
#4  0x00007f795f07a220 in horovod::common::GPUContext::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=<optimized out>, event_queue=..., entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=<optimized out>) at /horovod/horovod/common/ops/gpu_context_impl.cc:27
#5  0x00007f795f07c3fc in horovod::common::GPUOpContext::<lambda()>::operator() (__closure=0x7f61a8000d00) at /horovod/horovod/common/ops/gpu_operations.cc:80
#6  std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(std::vector<horovod::common::TensorTableEntry>&, bool, const std::function<void()>&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#7  0x00007f795f03b72f in std::function<void ()>::operator()() const (this=0x7f795d739ea0) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#8  horovod::common::ThreadPool::loop (this=0x7f7961816258 <horovod::common::(anonymous namespace)::gpu_context+24>) at /horovod/horovod/common/thread_pool.cc:62
#9  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#10 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Because of the same connection failure, the Gloo threads are also deadlocked:

#0  0x00007f7a2620cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f1682f9 in __gthread_cond_timedwait (__abs_timeout=0x7f795ef3bee0, __mutex=<optimized out>, __cond=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:871
#2  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:178
#3  std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:106
#4  std::condition_variable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:129
#5  std::condition_variable::wait_for<long int, std::ratio<1, 1000>, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __rtime=<synthetic pointer>..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:156
#6  gloo::transport::tcp::UnboundBuffer::waitRecv (this=0x7f79584f2410, rank=0x0, timeout=...) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/unbound_buffer.cc:61
#7  0x00007f795f1400ec in gloo::transport::UnboundBuffer::waitRecv (timeout=..., this=<optimized out>) at /horovod/third_party/gloo/gloo/transport/unbound_buffer.h:76
#8  gloo::(anonymous namespace)::ring (opts=..., reduceInputs=..., broadcastOutputs=...) at /horovod/third_party/compatible_gloo/gloo/allreduce.cc:366
#9  0x00007f795f141623 in gloo::(anonymous namespace)::allreduce (opts=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#10 0x00007f795f09469f in horovod::common::GlooController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=..., count=<optimized out>) at /horovod/horovod/common/gloo/gloo_controller.cc:122
#11 0x00007f795f03594a in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f795ef3c7f0, controller=std::shared_ptr<horovod::common::Controller> (use count 3, weak count 1) = {...}, timeline_enabled=<optimized out>) at /horovod/horovod/common/response_cache.cc:425
#12 0x00007f795eff9f07 in horovod::common::Controller::CoordinateCacheAndState (this=0x5546b20, cache_coordinator=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/shared_ptr_base.h:246
#13 0x00007f795effe37d in horovod::common::Controller::ComputeResponseList (this=0x5546b20, this_process_requested_shutdown=this_process_requested_shutdown@entry=false, state=..., process_set=...) at /horovod/horovod/common/controller.cc:158
#14 0x00007f795f01d55e in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /horovod/horovod/common/operations.cc:778
#15 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /horovod/horovod/common/operations.cc:660
#16 0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

and

#0  0x00007f7a25829483 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f795f152a3b in gloo::transport::tcp::Loop::run (this=0x7f7958000a00) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/loop.cc:72
#2  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#3  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Meanwhile, GPU utilization and GPU memory usage stay close to 100% even though no training is actually progressing:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   48C    P0    47W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:60:00.0 Off |                    0 |
| N/A   38C    P0    40W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   40C    P0    41W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   52C    P0    43W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   42C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:DB:00.0 Off |                    0 |
| N/A   45C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This is very similar to NCCL issue #193, where the NCCL authors argue that if NCCL returns an error, it is the responsibility of the framework on top to abort the process.
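
For context, at the NCCL API level that recommendation would look roughly like the sketch below: poll ncclCommGetAsyncError while waiting, and on failure abort the communicator and the process. This is only an illustration of the approach from NCCL #193, not Horovod's actual code.

// Sketch only: turning an asynchronous NCCL failure (like the recv timeout
// above) into a hard abort instead of an indefinite hang.
#include <nccl.h>
#include <cstdio>
#include <cstdlib>

void PollNcclOrDie(ncclComm_t comm) {
  ncclResult_t async_err = ncclSuccess;
  if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess ||
      async_err != ncclSuccess) {
    std::fprintf(stderr, "NCCL communicator failed: %s, aborting\n",
                 ncclGetErrorString(async_err));
    ncclCommAbort(comm);  // unblocks ranks stuck inside NCCL calls
    std::abort();         // let the job scheduler restart the worker
  }
}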

MrAta added the bug label on Jul 21, 2022
maxhgerlach (Collaborator) commented on Jul 27, 2022

Hi @MrAta, are you able to reproduce the problem on the latest Horovod release? Since 0.23, improvements to NCCL error handling have been contributed: #3112
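
Broadly, that kind of error handling means the background wait loops check whether the communicator has failed instead of yielding forever. A generic sketch of the idea (not the code from #3112) is below; error_check_callback mirrors the parameter visible in the WaitForEvents frame above and is assumed here to throw once the transport has failed.

// Illustrative sketch only: a wait loop that periodically runs an error
// check so a dead communicator surfaces as an exception rather than a hang.
#include <chrono>
#include <functional>
#include <thread>

void WaitWithErrorCheck(const std::function<bool()>& done,
                        const std::function<void()>& error_check_callback) {
  while (!done()) {
    if (error_check_callback) {
      error_check_callback();  // assumed to throw if NCCL/Gloo has failed
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}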

MrAta closed this as completed on Jul 28, 2022
Lifann commented on Oct 10, 2022

I'm hitting the same issue with version 0.23.
