DDP training timeout #19487
Comments
@pengzhangzhi Can you describe the steps to reproduce this? There are several notebooks in the examples folder https://github.com/NVIDIA/BioNeMo/tree/main/examples/service/notebooks but I doubt you are running these. Where is the code and config that you are running? |
Hi @awaelchli, the GitHub repo does not contain the full training code; I got it from their Docker container. If you want to reproduce it, the docs are here: https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html
|
If you insist on reproducing it, I am happy to help and give a detailed guide. For simplicity, it would be great if you could guide me on how to debug this error. Thanks!! |
Hey @pengzhangzhi I implemented a system check utility to help with such problems: |
Thanks! |
Yes, the docs will only be generated once the PR is ready. The easiest way for you to try it right now is to just copy this file |
Thanks!!
|
This output shows that distributed PyTorch won't work on your system. It can't synchronize at the barrier, which is a very basic requirement. There should be a folder |
Thanks!! Since the error is in process 6, I am showing the log of nccl-rank-6 below:
FYI, I am using a docker container. It can be reproduced by the following steps.
Pull the Bionemo container:
Run the container:
To reproduce my error: |
I won't have the bandwidth to help much here. Maybe try disabling plugins: |
Based on the log, I think the problem I have on nccl-rank-6 is just an OOM?
|
If you ran my system check, that's not possible. It allocates very little memory on the GPU:
If what you show me there is the output of another program, then yes it looks like one rank runs out of memory. If one rank dies, the others will wait and hang forever. |
Yeah. I think it is because some of the GPUs are already heavily utilized, which triggers the OOM shown in the log. I only ran your program, both in the container and on the host, and both logs show OOM on the two GPUs that were already in use. |
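One way to rule this out before launching is to check how much memory each GPU already has in use. The sketch below is not part of Lightning or BioNeMo. It parses the CSV output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`; the function names and the 10% threshold are assumptions chosen for illustration.

```python
import subprocess

# nvidia-smi query that prints one "index, used MiB, total MiB" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, used, total = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used), int(total)))
    return rows


def busy_gpus(rows, max_used_fraction=0.10):
    """Return indices of GPUs whose used memory exceeds the given fraction.

    The 10% default is an arbitrary illustrative threshold, not a recommendation.
    """
    return [idx for idx, used, total in rows if used / total > max_used_fraction]


def check_gpus():
    """Run nvidia-smi on this machine and report GPUs that are already in use."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return busy_gpus(parse_gpu_memory(output))
```

Calling `check_gpus()` inside the container before starting training would list the GPU indices that other processes are already occupying. If any show up, restricting the job to the idle devices (for example through `CUDA_VISIBLE_DEVICES`) is one option to try.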
Bug description
I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7][E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
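To make a log like this easier to read, it can help to group the watchdog lines by rank and sequence number. The following is a minimal sketch, not a PyTorch utility: the regular expression is written against the line format shown above, and `summarize_timeouts` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the "Watchdog caught collective operation timeout" lines shown above.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Map each rank to the set of (SeqNum, elapsed whole minutes) it reported."""
    summary = defaultdict(set)
    for match in WATCHDOG.finditer(log_text):
        elapsed_min = int(match["elapsed"]) // 60_000  # milliseconds to minutes
        summary[int(match["rank"])].add((int(match["seq"]), elapsed_min))
    return dict(summary)


# Two lines copied from the log above, used as a small example input.
sample = (
    "[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 25625251 milliseconds before timing out.\n"
    "[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1800741 milliseconds before timing out.\n"
)
print(summarize_timeouts(sample))
```

Applied to the full log above, the summary would separate the entries that ran for roughly 427 minutes (SeqNum 64088, 96117, 96120, 96125) from those that hit the 30-minute timeout almost exactly (SeqNum 64090).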
a03-zpeng@m3dgx01:~$ pip list
Package Version Location
absl-py 1.4.0
accessible-pygments 0.0.4
aiohttp 3.9.0
aiosignal 1.3.1
alabaster 0.7.13
aniso8601 9.0.1
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
apex 0.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrdict 2.0.1
attrs 23.1.0
audioread 3.0.0
awscli 1.29.67
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
bionemo 0.2.0.dev0 /workspace/bionemo
biopandas 0.4.1
biopython 1.79
black 23.1.0
bleach 6.0.0
blinker 1.6.2
blis 0.7.9
boto3 1.28.10
botocore 1.31.67
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.3.1
catalogue 2.0.8
cdifflib 1.2.6
certifi 2023.7.22
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.1.0
click 8.1.7
cloudpickle 2.2.1
cmake 3.24.1.1
colorama 0.4.4
coloredlogs 15.0.1
comm 0.1.3
commonmark 0.9.1
confection 0.0.4
contourpy 1.0.7
coverage 7.4.0
crc32c 2.3.post0
cubinlinker 0.3.0+2.g87b01ae
cuda-python 12.1.0rc5+1.g38940ef
cudf 23.4.0
cugraph 23.4.0
cugraph-dgl 23.4.0
cugraph-service-client 23.4.0
cugraph-service-server 23.4.0
cuml 23.4.0
cupy-cuda12x 12.0.0b3
cycler 0.11.0
cymem 2.0.7
Cython 0.29.35
dacite 1.8.1
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf 23.4.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dgl 1.1.3
dgllife 0.2.8
diffdock 0.0.5
dill 0.3.7
Distance 0.1.3
distlib 0.3.8
distributed 2023.3.2.1
DLLogger 1.0.0
docker-pycreds 0.4.0
docopt 0.6.2
docutils 0.16
e3nn 0.5.1
editdistance 0.6.2
einops 0.6.1
exceptiongroup 1.1.1
execnet 1.9.0
executing 1.2.0
expecttest 0.1.3
fair-esm 2.0.0
faiss-cpu 1.7.4
fastjsonschema 2.17.1
fastrlock 0.8.1
fasttext 0.9.2
filelock 3.12.2
fire 0.5.0
flash-attn 1.0.7
Flask 2.2.5
Flask-RESTful 0.3.10
flatbuffers 23.5.26
fonttools 4.47.2
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
g2p-en 2.1.0
gast 0.4.0
gdown 4.7.1
gevent 23.9.1
geventhttpclient 2.0.2
gitdb 4.0.10
GitPython 3.1.41
google-auth 2.20.0
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.6
graphviz 0.20.1
greenlet 3.0.3
grpcio 1.56.0
h5py 3.9.0
huggingface-hub 0.20.2
humanfriendly 10.0
hydra-core 1.2.0
hyperopt 0.2.7
hypothesis 5.35.1
identify 2.5.33
idna 3.4
ijson 3.2.3
imagesize 1.4.1
importlib-metadata 6.6.0
inflect 7.0.0
iniconfig 2.0.0
intel-openmp 2021.4.0
ipadic 1.0.0
ipdb 0.13.11
ipykernel 6.23.3
ipython 8.14.0
ipython-genutils 0.2.0
ipywidgets 8.0.7
isort 5.12.0
itsdangerous 2.1.2
jedi 0.18.2
jieba 0.42.1
Jinja2 3.1.2
jiwer 2.5.2
jmespath 1.0.1
joblib 1.2.0
json5 0.9.14
jsonlines 4.0.0
jsonschema 4.17.3
jupyter_client 8.3.0
jupyter_core 5.3.1
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.2.2
jupyterlab-server 1.2.0
jupyterlab-widgets 3.0.8
jupytext 1.14.6
k2 1.24.3.dev20230725+cuda12.1.torch2.1.0a0
kaldi-python-io 1.2.2
kaldiio 2.18.0
kiwisolver 1.4.4
kornia 0.6.12
langcodes 3.3.0
latexcodec 2.0.1
Levenshtein 0.21.1
librosa 0.9.2
lightning-utilities 0.9.0
llvmlite 0.39.1
locket 1.0.0
loguru 0.7.0
lxml 4.9.3
Markdown 3.4.3
markdown-it-py 2.2.0
markdown2 2.4.9
MarkupSafe 2.1.3
marshmallow 3.20.1
matplotlib 3.4.3
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mecab-python3 1.0.5
megatron-core 0.2.0
mistune 3.0.1
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.0.2
more-itertools 10.1.0
mpmath 0.19
msgpack 1.0.5
multidict 6.0.4
murmurhash 1.0.9
mypy-extensions 1.0.0
nbclient 0.8.0
nbconvert 7.6.0
nbformat 5.9.0
nemo-text-processing 0.1.8rc0
nemo-toolkit 1.20.0
nest-asyncio 1.5.6
networkx 2.6.3
ninja 1.11.1
nltk 3.8.1
nodeenv 1.8.0
notebook 6.4.10
numba 0.56.4+1.g5f1bc7084
numpy 1.22.2
nvidia-dali-cuda120 1.26.0
nvidia-pyindex 1.0.9
nvidia-pytriton 0.4.0
nvtx 0.2.5
oauthlib 3.2.2
omegaconf 2.2.3
onnx 1.14.1
onnx-graphsurgeon 0.3.27
onnxruntime-gpu 1.16.3
onnxscript 0.1.0.dev20240113
OpenCC 1.1.6
opencv 4.6.0
opt-einsum 3.3.0
opt-einsum-fx 0.1.4
packaging 23.1
pandas 1.5.2
pandocfilters 1.5.0
pangu 4.0.6.1
parameterized 0.9.0
parso 0.8.3
partd 1.4.0
pathspec 0.11.1
pathtools 0.1.2
pathy 0.10.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.1
pip 21.2.4
pipdeptree 2.13.0
plac 1.3.5
platformdirs 4.1.0
pluggy 1.2.0
ply 3.11
polars 0.16.7
polygraphy 0.47.1
pooch 1.7.0
portalocker 2.7.0
POT 0.7.0
pre-commit 3.4.0
preshed 3.0.8
prettytable 3.8.0
progress 1.6
prometheus-client 0.17.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 5.9.4
ptxcompiler 0.8.1+1.gbe9fca5
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
py4j 0.10.9.7
pyannote.core 5.0.0
pyannote.database 5.0.1
pyannote.metrics 3.2.1
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.10.4
pybtex 0.24.0
pybtex-docutils 1.0.2
pycocotools 2.0+nv0.7.3
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
pydata-sphinx-theme 0.13.1
pydub 0.25.1
pyfaidx 0.7.2
pyfastx 1.1.0
Pygments 2.15.1
pylibcugraph 23.4.0
pylibcugraphops 23.4.0
pylibraft 23.4.0
Pympler 1.0.1
pynini 2.1.5
pynvml 11.4.1
pyparsing 3.0.9
pypinyin 0.49.0
pypinyin-dict 0.6.0
pyrsistent 0.19.3
PySocks 1.7.1
pytest 7.4.0
pytest-cov 4.1.0
pytest-dependency 0.5.1
pytest-forked 1.6.0
pytest-rerunfailures 11.1.2
pytest-runner 6.0.0
pytest-shard 0.1.2
pytest-timeout 2.2.0
pytest-xdist 3.3.1
python-dateutil 2.8.2
python-hostlist 1.23.0
python-rapidjson 1.14
python-slugify 8.0.1
pytorch-lightning 1.9.4
pytorch-quantization 2.1.2
pytz 2023.3
PyYAML 6.0
pyzmq 23.2.1
raft-dask 23.4.0
rapidfuzz 2.13.7
rdkit 2023.9.1
rdkit-pypi 2022.9.5
regex 2023.6.3
requests 2.31.0
requests-mock 1.11.0
requests-oauthlib 1.3.1
resampy 0.4.2
rich 12.6.0
rmm 23.4.0
rouge-score 0.1.2
rsa 4.7.2
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
ruff 0.0.292
s3transfer 0.7.0
sacrebleu 2.3.1
sacremoses 0.0.53
safetensors 0.3.1
scikit-learn 1.2.0
scipy 1.10.1
seaborn 0.12.2
Send2Trash 1.8.2
sentence-transformers 2.2.2
sentencepiece 0.1.99
sentry-sdk 1.28.1
setproctitle 1.3.2
setuptools 65.5.1
sh 1.14.3
shellingham 1.5.0.post1
six 1.16.0
smart-open 6.3.0
smmap 5.0.0
snowballstemmer 2.2.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.4.1
sox 1.4.1
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
Sphinx 5.3.0
sphinx-book-theme 1.0.0
sphinx-copybutton 0.5.2
sphinx-glpi-theme 0.3
sphinxcontrib-applehelp 1.0.4
sphinxcontrib-bibtex 2.5.0
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.1
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sphinxext-opengraph 0.8.2
spyrmsd 0.5.2
srsly 2.4.6
stack-data 0.6.2
sympy 1.12
tabulate 0.9.0
tbb 2021.9.0
tblib 1.7.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 8.6.1
termcolor 2.3.0
terminado 0.17.1
testbook 0.4.2
text-unidecode 1.3
textdistance 4.5.0
texterrors 0.4.4
tfrecord 1.14.1
thinc 8.1.10
threadpoolctl 3.1.0
thriftpy2 0.4.16
tinycss2 1.2.1
tokenizers 0.15.0
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.1.0a0+4136153
torch-cluster 1.6.1
torch-geometric 2.3.0
torch-scatter 2.0.9
torch-sparse 0.6.17
torch-tensorrt 1.5.0.dev0
torchaudio 2.1.0
torchdata 0.7.0a0
torchmetrics 1.0.1
torchvision 0.16.0a0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformer-engine 0.9.0
transformers 4.36.0
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.0.0.dev20221202
triton-model-navigator 0.7.4
tritonclient 2.41.1
typed-ast 1.5.5
typer 0.7.0
types-dataclasses 0.6.6
typing_extensions 4.6.3
typing-inspect 0.6.0
ucx-py 0.31.0
uff 0.6.9
urllib3 1.26.16
virtualenv 20.25.0
wandb 0.15.6
wasabi 1.1.2
wcwidth 0.2.6
webdataset 0.2.33
webencodings 0.5.1
Werkzeug 2.3.6
wget 3.2
wheel 0.40.0
widgetsnbextension 4.0.8
wrapt 1.14.1
xdoctest 1.0.2
xgboost 1.7.5
yarl 1.9.2
youtokentome 1.0.6
zict 3.0.0
zipp 3.15.0
zope.event 5.0
zope.interface 6.1