Bug report:
When running Horovod with NCCL on a Kubernetes cluster of 32 pods, each with 6 V100 GPUs, the NCCL threads halt in `persistentSocketThread` because a `recv` call fails:
```
[15]worker-23:83:2526 [0] misc/socket.cc:503 NCCL WARN Net : Call to recv from 100.96.188.199<41468> failed : Connection timed out
[15]worker-23:83:2526 [0] NCCL INFO misc/socket.cc:520 -> 2
[15]worker-23:83:2526 [0] transport/net_socket.cc:219 NCCL WARN NET/Socket : socket progress error
[15]worker-23:83:2520 [0] NCCL INFO include/net.h:32 -> 2
[15]worker-23:83:2520 [0] NCCL INFO transport/net.cc:870 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:494 -> 2
[15]worker-23:83:2520 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread]
```
All of the ranks have this stack trace:
```
#0  0x00007f7a2620c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f0e179b in persistentSocketThread (args_=0x7f61ac040c10) at transport/net_socket.cc:231
#2  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
```
Some Horovod threads are also waiting for GPU events:
```
#0  0x00007f7a2580dd47 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007f795f07b8a2 in __gthread_yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:692
#2  std::this_thread::yield () at /opt/rh/devtoolset-8/root/usr/include/c++/8/thread:357
#3  horovod::common::GPUContext::impl::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=0x54baeb0, event_queue=std::queue wrapping: std::deque with 0 elements, entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=true) at /horovod/horovod/common/ops/cuda_operations.cc:131
#4  0x00007f795f07a220 in horovod::common::GPUContext::WaitForEvents(std::queue<std::pair<std::string, horovod::common::Event>, std::deque<std::pair<std::string, horovod::common::Event>, std::allocator<std::pair<std::string, horovod::common::Event> > > >&, std::vector<horovod::common::TensorTableEntry, std::allocator<horovod::common::TensorTableEntry> > const&, horovod::common::Timeline&, std::function<void ()> const&, bool) (this=<optimized out>, event_queue=..., entries=std::vector of length 80, capacity 80 = {...}, timeline=..., error_check_callback=..., elastic=<optimized out>) at /horovod/horovod/common/ops/gpu_context_impl.cc:27
#5  0x00007f795f07c3fc in horovod::common::GPUOpContext::<lambda()>::operator() (__closure=0x7f61a8000d00) at /horovod/horovod/common/ops/gpu_operations.cc:80
#6  std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(std::vector<horovod::common::TensorTableEntry>&, bool, const std::function<void()>&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:297
#7  0x00007f795f03b72f in std::function<void ()>::operator()() const (this=0x7f795d739ea0) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#8  horovod::common::ThreadPool::loop (this=0x7f7961816258 <horovod::common::(anonymous namespace)::gpu_context+24>) at /horovod/horovod/common/thread_pool.cc:62
#9  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#10 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
```
Because of the same connection failure, the Gloo threads are deadlocked as well:
```
#0  0x00007f7a2620cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f795f1682f9 in __gthread_cond_timedwait (__abs_timeout=0x7f795ef3bee0, __mutex=<optimized out>, __cond=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:871
#2  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:178
#3  std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:106
#4  std::condition_variable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __atime=..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:129
#5  std::condition_variable::wait_for<long int, std::ratio<1, 1000>, gloo::transport::tcp::UnboundBuffer::waitRecv(int*, std::chrono::milliseconds)::<lambda()> > (__p=..., __rtime=<synthetic pointer>..., __lock=..., this=0x7f79584f2460) at /opt/rh/devtoolset-8/root/usr/include/c++/8/condition_variable:156
#6  gloo::transport::tcp::UnboundBuffer::waitRecv (this=0x7f79584f2410, rank=0x0, timeout=...) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/unbound_buffer.cc:61
#7  0x00007f795f1400ec in gloo::transport::UnboundBuffer::waitRecv (timeout=..., this=<optimized out>) at /horovod/third_party/gloo/gloo/transport/unbound_buffer.h:76
#8  gloo::(anonymous namespace)::ring (opts=..., reduceInputs=..., broadcastOutputs=...) at /horovod/third_party/compatible_gloo/gloo/allreduce.cc:366
#9  0x00007f795f141623 in gloo::(anonymous namespace)::allreduce (opts=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_function.h:260
#10 0x00007f795f09469f in horovod::common::GlooController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=..., count=<optimized out>) at /horovod/horovod/common/gloo/gloo_controller.cc:122
#11 0x00007f795f03594a in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f795ef3c7f0, controller=std::shared_ptr<horovod::common::Controller> (use count 3, weak count 1) = {...}, timeline_enabled=<optimized out>) at /horovod/horovod/common/response_cache.cc:425
#12 0x00007f795eff9f07 in horovod::common::Controller::CoordinateCacheAndState (this=0x5546b20, cache_coordinator=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/shared_ptr_base.h:246
#13 0x00007f795effe37d in horovod::common::Controller::ComputeResponseList (this=0x5546b20, this_process_requested_shutdown=this_process_requested_shutdown@entry=false, state=..., process_set=...) at /horovod/horovod/common/controller.cc:158
#14 0x00007f795f01d55e in horovod::common::(anonymous namespace)::RunLoopOnce (state=...) at /horovod/horovod/common/operations.cc:778
#15 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at /horovod/horovod/common/operations.cc:660
#16 0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
```
and
```
#0  0x00007f7a25829483 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f795f152a3b in gloo::transport::tcp::Loop::run (this=0x7f7958000a00) at /horovod/third_party/compatible_gloo/gloo/transport/tcp/loop.cc:72
#2  0x00007f79fc36278f in execute_native_thread_routine () from /opt/code-fetcher-system/tf-benchmark-azkaban_2d680d557648cbf50c785ee0639c5d54f5335585ecb5e8b7696a52c4db0d76b8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#3  0x00007f7a26208dd5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007f7a25828ead in clone () from /usr/lib64/libc.so.6
```
Meanwhile, GPU utilization and GPU memory usage stay at nearly 100% even though no training is taking place:
```
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   48C    P0    47W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:60:00.0 Off |                    0 |
| N/A   38C    P0    40W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   40C    P0    41W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   52C    P0    43W / 250W |  30765MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:DA:00.0 Off |                    0 |
| N/A   42C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:DB:00.0 Off |                    0 |
| N/A   45C    P0    43W / 250W |  30789MiB / 32510MiB |    100%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
This closely resembles NCCL issue #193, where the NCCL authors take the position that if NCCL returns an error, it is the job of the framework on top to abort the process.
Hi @MrAta,
Are you able to reproduce the problem on the latest release of Horovod? Since 0.23, improvements to NCCL error handling have been contributed: #3112