All workers failed on failure with Elastic Horovod #3264
Comments
FYI. We print some log when the worker catches the
@vycezhong sorry for the late reply, I will take a look at this as soon as possible.
It's probably because the background thread is doing finalization while the training program keeps enqueuing tensors; each enqueue calls `ProcessSet::IsCurrentProcessIncluded()`, which asserts that `process_set.initialization_done == true`.
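To make the suspected race easier to see, here is a minimal, self-contained Python sketch. The real check lives in C++ (`ProcessSet::IsCurrentProcessIncluded()`); the class and method names below are illustrative stand-ins, not Horovod's actual API, and the two "threads" are replayed sequentially in the problematic order.

```python
# Minimal sketch of the suspected failure mode: a background thread
# finalizes the process set (clearing initialization_done) while the
# training loop is still enqueuing tensors. Names are stand-ins for
# the C++ ProcessSet, not Horovod's actual Python API.

class ProcessSet:
    def __init__(self):
        self.initialization_done = True

    def is_current_process_included(self):
        # Mirrors the assert that aborts the process (exit code 134 = SIGABRT):
        assert self.initialization_done, "process set not initialized"
        return True

def enqueue_tensor(process_set):
    # Each enqueue checks membership first, as described in the comment above.
    return process_set.is_current_process_included()

ps = ProcessSet()
assert enqueue_tensor(ps)  # normal operation: enqueue succeeds

# A worker failure triggers finalization on the background thread ...
ps.initialization_done = False

# ... and an enqueue racing with it now trips the assert:
try:
    enqueue_tensor(ps)
    raced = False
except AssertionError:
    raced = True
```

The sketch shows why the ordering matters: as long as re-initialization has not flipped the flag back to true, any in-flight enqueue aborts the whole worker rather than waiting for recovery.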
@woodlgz I have tested your PR and it works. Thanks for your help!
Environment:
Bug report:
We found that PR #3112 solved some of the NCCL problems, so we tested with the latest code. However, we found that sometimes all workers failed on a failure.
We have two machines, each equipped with 2 P100 GPUs. We run the program on the master node (10.28.1.16) with the following command:
During the execution, we intentionally killed the workers on the host (10.28.1.17) with `pkill python`. The workers on that host died immediately. However, sometimes the workers on the master host also failed and exited with status code 134. From the log, it seems that the workers did not re-initialize, since `initialization_done` is `false`. This is weird because the surviving workers should re-init (https://github.com/horovod/horovod/blob/master/horovod/torch/elastic/__init__.py#L48).

Expected behaviour:

The program continues to execute normally on the master host. Sometimes we found it succeeded, as shown in the log below.
Other information:
We built Horovod from source. Here is our installation command:
HOROVOD_DEBUG=1 CXX=/usr/bin/g++ CC=/usr/bin/gcc HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_GLOO=1 HOROVOD_NCCL_HOME=/usr/local/cuda HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v -e .
```
~ horovodrun --check-build
Horovod v0.23.0:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [X] Gloo
```
cc: @woodlgz @tgaddair