
Horovod hangs on nccl/gloo network connection failure #3611

Closed · MrAta opened this issue on Jul 21, 2022 · 2 comments · Labels: bug

MrAta (Contributor) commented on Jul 21, 2022

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.4
  3. Horovod version: 0.23
  4. MPI version: 4.1.1
  5. CUDA version: 11.1
  6. NCCL version: 2.12.12
  7. Python version: 3.7
  8. OS and version: RHEL7
  9. GCC version: 8.3.1
  10. CMake version: 3.10.2

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes.
  2. If your question is about a hang, did you read this doc? Yes.
  3. If your question is about Docker, did you read this doc? Yes.
  4. Did you check if your question is answered in the troubleshooting guide? Yes.

Bug report:
When running Horovod with NCCL on a Kubernetes cluster of 32 pods, each with 6 V100 GPUs, the NCCL threads halt in persistentSocketThread because a recv call fails:

[15]worker-23:83:2526 [0] misc/socket.cc:503 NCCL WARN Net : Call to recv from 100.96.188.199<41468> failed : Connection timed out
[15]worker-23:83:2526 [0] NCCL INFO misc/socket.cc:520 -> 2
[15]23:83:2526 [0] transport/net_socket.cc:219 NCCL WARN NET/Socket : socket progress error
[15]worker-23:83:2520 [0] NCCL INFO include/net.h:32 -> 2
[15]worker-23:83:2520 [0] NCCL INFO transport/net.cc:870 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:494 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]

All of the ranks have this stack trace:

#0  0x00007f7a2620c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f0e179b in persistentSocketThread (args_=0x7f61ac040c10) at transport/net_socket.cc:231
#2  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Some Horovod threads are also stuck waiting for GPU events:

#0  0x00007f7a2580dd47 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007f795f07b8a2 in __gthread_yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:692
#2  std::this_thread::yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/thread:357
#3  horovod::common::GPUContext::impl::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=0x54baeb0, event_queue=std::queue wrapping: std::deque with 0 elements, entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=true) at /horovod/horovod/common/ops/cuda_operations.cc:131
#4  0x00007f795f07a220 in horovod::common::GPUContext::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=<optimized out>, event_queue=..., entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=<optimized out>) at /horovod/horovod/common/ops/gpu_context_impl.cc:27
#5  0x00007f795f07c3fc in horovod::common::GPUOpContext::<lambda()>::operator() (__closure=0x7f61a8000d00) at /horovod/horovod/common/ops/gpu_operations.cc:80
#6  std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(std::vector<horovod::common::TensorTableEntry>&, bool, const std::function<void()>&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#7  0x00007f795f03b72f in std::function<void ()>::operator()() const (this=0x7f795d739ea0) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#8  horovod::common::ThreadPool::loop (this=0x7f7961816258 <horovod::common::(anonymous namespace)::gpu_context+24>) at /horovod/horovod/common/thread_pool.cc:62
#9  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#10 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Because of the same connection failure, the Gloo threads are also deadlocked:

#0  0x00007f7a2620cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f1682f9 in __gthread_cond_timedwait (__abs_timeout=0x7f795ef3bee0, __mutex=<optimized out>, __cond=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:871
#2  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:178
#3  std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:106
#4  std::condition_variable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:129
#5  std::condition_variable::wait_for<long int, std::ratio<1, 1000>, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __rtime=<synthetic pointer>..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:156
#6  gloo::transport::tcp::UnboundBuffer::waitRecv (this=0x7f79584f2410, rank=0x0, timeout=...) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/unbound_buffer.cc:61
#7  0x00007f795f1400ec in gloo::transport::UnboundBuffer::waitRecv (timeout=..., this=<optimized out>) at /horovod/third_party/gloo/gloo/transport/unbound_buffer.h:76
#8  gloo::(anonymous namespace)::ring (opts=..., reduceInputs=..., broadcastOutputs=...) at /horovod/third_party/compatible_gloo/gloo/allreduce.cc:366
#9  0x00007f795f141623 in gloo::(anonymous namespace)::allreduce (opts=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#10 0x00007f795f09469f in horovod::common::GlooController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=..., count=<optimized out>) at /horovod/horovod/common/gloo/gloo_controller.cc:122
#11 0x00007f795f03594a in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f795ef3c7f0, controller=std::shared_ptr<horovod::common::Controller> (use count 3, weak count 1) = {...}, timeline_enabled=<optimized out>) at /horovod/horovod/common/response_cache.cc:425
#12 0x00007f795eff9f07 in horovod::common::Controller::CoordinateCacheAndState (this=0x5546b20, cache_coordinator=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/shared_ptr_base.h:246
#13 0x00007f795effe37d in horovod::common::Controller::ComputeResponseList (this=0x5546b20, this_process_requested_shutdown=this_process_requested_shutdown@entry=false, state=..., process_set=...) at /horovod/horovod/common/controller.cc:158
#14 0x00007f795f01d55e in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /horovod/horovod/common/operations.cc:778
#15 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /horovod/horovod/common/operations.cc:660
#16 0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

and

#0  0x00007f7a25829483 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f795f152a3b in gloo::transport::tcp::Loop::run (this=0x7f7958000a00) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/loop.cc:72
#2  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#3  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6

Meanwhile, GPU utilization and GPU memory usage stay close to 100% even though no training is actually progressing:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   48C    P0    47W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:60:00.0 Off |                    0 |
| N/A   38C    P0    40W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   40C    P0    41W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   52C    P0    43W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   42C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:DB:00.0 Off |                    0 |
| N/A   45C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This is very similar to NCCL issue #193, where the NCCL authors argue that if NCCL returns an error, it is the responsibility of the framework on top to abort the process.
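
For context, at the NCCL API level that recommendation would look roughly like the sketch below: poll ncclCommGetAsyncError while waiting, and on failure abort the communicator and the process. This is only an illustration of the approach from NCCL #193, not Horovod's actual code.

// Sketch only: turning an asynchronous NCCL failure (like the recv timeout
// above) into a hard abort instead of an indefinite hang.
#include <nccl.h>
#include <cstdio>
#include <cstdlib>

void PollNcclOrDie(ncclComm_t comm) {
  ncclResult_t async_err = ncclSuccess;
  if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess ||
      async_err != ncclSuccess) {
    std::fprintf(stderr, "NCCL communicator failed: %s, aborting\n",
                 ncclGetErrorString(async_err));
    ncclCommAbort(comm);  // unblocks ranks stuck inside NCCL calls
    std::abort();         // let the job scheduler restart the worker
  }
}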

MrAta added the bug label on Jul 21, 2022
maxhgerlach (Collaborator) commented on Jul 27, 2022

Hi @MrAta, are you able to reproduce the problem on the latest Horovod release? Since 0.23, improvements to NCCL error handling have been contributed: #3112
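
Broadly, that kind of error handling means the background wait loops check whether the communicator has failed instead of yielding forever. A generic sketch of the idea (not the code from #3112) is below; error_check_callback mirrors the parameter visible in the WaitForEvents frame above and is assumed here to throw once the transport has failed.

// Illustrative sketch only: a wait loop that periodically runs an error
// check so a dead communicator surfaces as an exception rather than a hang.
#include <chrono>
#include <functional>
#include <thread>

void WaitWithErrorCheck(const std::function<bool()>& done,
                        const std::function<void()>& error_check_callback) {
  while (!done()) {
    if (error_check_callback) {
      error_check_callback();  // assumed to throw if NCCL/Gloo has failed
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}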

MrAta closed this as completed on Jul 28, 2022
Lifann commented on Oct 10, 2022

I'm hitting the same issue with version 0.23.
