DDP training timeout #19487

Open
pengzhangzhi opened this issue Feb 16, 2024 · 14 comments
Labels: 3rd party (Related to a 3rd-party), question (Further information is requested)

Comments

@pengzhangzhi

Bug description

I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

The configs that might be relevant to the training:

trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1 
  accelerator: gpu #gpu or cpu
  precision: 16 #16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10  # number of iterations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000
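
For reference, the options above map roughly onto a plain PyTorch Lightning 1.x Trainer (1.9.4 is in the environment below) as in the sketch that follows. This is only an illustration, not the BioNeMo training script; the model and datamodule are hypothetical placeholders.

# Rough sketch only: the real run is driven by NeMo/BioNeMo, not this snippet.
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=8,
    num_nodes=1,
    accelerator="gpu",
    precision=16,
    logger=False,                 # logging handled by NeMo exp_manager
    enable_checkpointing=False,   # checkpointing handled by NeMo exp_manager
    replace_sampler_ddp=False,    # NeMo Megatron provides its own samplers
    max_epochs=None,
    log_every_n_steps=10,
    val_check_interval=int(15e4),
    limit_val_batches=50,
    limit_test_batches=500,
    accumulate_grad_batches=1,
    gradient_clip_val=1.0,
    benchmark=False,
    max_steps=500_000,
    strategy="ddp",
)
# trainer.fit(model, datamodule=datamodule)  # `model` and `datamodule` are placeholders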


Error messages and logs


Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7][E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.


Environment

a03-zpeng@m3dgx01:~$ pip list
Package Version Location


absl-py 1.4.0
accessible-pygments 0.0.4
aiohttp 3.9.0
aiosignal 1.3.1
alabaster 0.7.13
aniso8601 9.0.1
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
apex 0.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrdict 2.0.1
attrs 23.1.0
audioread 3.0.0
awscli 1.29.67
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
bionemo 0.2.0.dev0 /workspace/bionemo
biopandas 0.4.1
biopython 1.79
black 23.1.0
bleach 6.0.0
blinker 1.6.2
blis 0.7.9
boto3 1.28.10
botocore 1.31.67
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.3.1
catalogue 2.0.8
cdifflib 1.2.6
certifi 2023.7.22
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.1.0
click 8.1.7
cloudpickle 2.2.1
cmake 3.24.1.1
colorama 0.4.4
coloredlogs 15.0.1
comm 0.1.3
commonmark 0.9.1
confection 0.0.4
contourpy 1.0.7
coverage 7.4.0
crc32c 2.3.post0
cubinlinker 0.3.0+2.g87b01ae
cuda-python 12.1.0rc5+1.g38940ef
cudf 23.4.0
cugraph 23.4.0
cugraph-dgl 23.4.0
cugraph-service-client 23.4.0
cugraph-service-server 23.4.0
cuml 23.4.0
cupy-cuda12x 12.0.0b3
cycler 0.11.0
cymem 2.0.7
Cython 0.29.35
dacite 1.8.1
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf 23.4.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dgl 1.1.3
dgllife 0.2.8
diffdock 0.0.5
dill 0.3.7
Distance 0.1.3
distlib 0.3.8
distributed 2023.3.2.1
DLLogger 1.0.0
docker-pycreds 0.4.0
docopt 0.6.2
docutils 0.16
e3nn 0.5.1
editdistance 0.6.2
einops 0.6.1
exceptiongroup 1.1.1
execnet 1.9.0
executing 1.2.0
expecttest 0.1.3
fair-esm 2.0.0
faiss-cpu 1.7.4
fastjsonschema 2.17.1
fastrlock 0.8.1
fasttext 0.9.2
filelock 3.12.2
fire 0.5.0
flash-attn 1.0.7
Flask 2.2.5
Flask-RESTful 0.3.10
flatbuffers 23.5.26
fonttools 4.47.2
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
g2p-en 2.1.0
gast 0.4.0
gdown 4.7.1
gevent 23.9.1
geventhttpclient 2.0.2
gitdb 4.0.10
GitPython 3.1.41
google-auth 2.20.0
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.6
graphviz 0.20.1
greenlet 3.0.3
grpcio 1.56.0
h5py 3.9.0
huggingface-hub 0.20.2
humanfriendly 10.0
hydra-core 1.2.0
hyperopt 0.2.7
hypothesis 5.35.1
identify 2.5.33
idna 3.4
ijson 3.2.3
imagesize 1.4.1
importlib-metadata 6.6.0
inflect 7.0.0
iniconfig 2.0.0
intel-openmp 2021.4.0
ipadic 1.0.0
ipdb 0.13.11
ipykernel 6.23.3
ipython 8.14.0
ipython-genutils 0.2.0
ipywidgets 8.0.7
isort 5.12.0
itsdangerous 2.1.2
jedi 0.18.2
jieba 0.42.1
Jinja2 3.1.2
jiwer 2.5.2
jmespath 1.0.1
joblib 1.2.0
json5 0.9.14
jsonlines 4.0.0
jsonschema 4.17.3
jupyter_client 8.3.0
jupyter_core 5.3.1
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.2.2
jupyterlab-server 1.2.0
jupyterlab-widgets 3.0.8
jupytext 1.14.6
k2 1.24.3.dev20230725+cuda12.1.torch2.1.0a0
kaldi-python-io 1.2.2
kaldiio 2.18.0
kiwisolver 1.4.4
kornia 0.6.12
langcodes 3.3.0
latexcodec 2.0.1
Levenshtein 0.21.1
librosa 0.9.2
lightning-utilities 0.9.0
llvmlite 0.39.1
locket 1.0.0
loguru 0.7.0
lxml 4.9.3
Markdown 3.4.3
markdown-it-py 2.2.0
markdown2 2.4.9
MarkupSafe 2.1.3
marshmallow 3.20.1
matplotlib 3.4.3
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mecab-python3 1.0.5
megatron-core 0.2.0
mistune 3.0.1
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.0.2
more-itertools 10.1.0
mpmath 0.19
msgpack 1.0.5
multidict 6.0.4
murmurhash 1.0.9
mypy-extensions 1.0.0
nbclient 0.8.0
nbconvert 7.6.0
nbformat 5.9.0
nemo-text-processing 0.1.8rc0
nemo-toolkit 1.20.0
nest-asyncio 1.5.6
networkx 2.6.3
ninja 1.11.1
nltk 3.8.1
nodeenv 1.8.0
notebook 6.4.10
numba 0.56.4+1.g5f1bc7084
numpy 1.22.2
nvidia-dali-cuda120 1.26.0
nvidia-pyindex 1.0.9
nvidia-pytriton 0.4.0
nvtx 0.2.5
oauthlib 3.2.2
omegaconf 2.2.3
onnx 1.14.1
onnx-graphsurgeon 0.3.27
onnxruntime-gpu 1.16.3
onnxscript 0.1.0.dev20240113
OpenCC 1.1.6
opencv 4.6.0
opt-einsum 3.3.0
opt-einsum-fx 0.1.4
packaging 23.1
pandas 1.5.2
pandocfilters 1.5.0
pangu 4.0.6.1
parameterized 0.9.0
parso 0.8.3
partd 1.4.0
pathspec 0.11.1
pathtools 0.1.2
pathy 0.10.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.1
pip 21.2.4
pipdeptree 2.13.0
plac 1.3.5
platformdirs 4.1.0
pluggy 1.2.0
ply 3.11
polars 0.16.7
polygraphy 0.47.1
pooch 1.7.0
portalocker 2.7.0
POT 0.7.0
pre-commit 3.4.0
preshed 3.0.8
prettytable 3.8.0
progress 1.6
prometheus-client 0.17.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 5.9.4
ptxcompiler 0.8.1+1.gbe9fca5
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
py4j 0.10.9.7
pyannote.core 5.0.0
pyannote.database 5.0.1
pyannote.metrics 3.2.1
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.10.4
pybtex 0.24.0
pybtex-docutils 1.0.2
pycocotools 2.0+nv0.7.3
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
pydata-sphinx-theme 0.13.1
pydub 0.25.1
pyfaidx 0.7.2
pyfastx 1.1.0
Pygments 2.15.1
pylibcugraph 23.4.0
pylibcugraphops 23.4.0
pylibraft 23.4.0
Pympler 1.0.1
pynini 2.1.5
pynvml 11.4.1
pyparsing 3.0.9
pypinyin 0.49.0
pypinyin-dict 0.6.0
pyrsistent 0.19.3
PySocks 1.7.1
pytest 7.4.0
pytest-cov 4.1.0
pytest-dependency 0.5.1
pytest-forked 1.6.0
pytest-rerunfailures 11.1.2
pytest-runner 6.0.0
pytest-shard 0.1.2
pytest-timeout 2.2.0
pytest-xdist 3.3.1
python-dateutil 2.8.2
python-hostlist 1.23.0
python-rapidjson 1.14
python-slugify 8.0.1
pytorch-lightning 1.9.4
pytorch-quantization 2.1.2
pytz 2023.3
PyYAML 6.0
pyzmq 23.2.1
raft-dask 23.4.0
rapidfuzz 2.13.7
rdkit 2023.9.1
rdkit-pypi 2022.9.5
regex 2023.6.3
requests 2.31.0
requests-mock 1.11.0
requests-oauthlib 1.3.1
resampy 0.4.2
rich 12.6.0
rmm 23.4.0
rouge-score 0.1.2
rsa 4.7.2
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
ruff 0.0.292
s3transfer 0.7.0
sacrebleu 2.3.1
sacremoses 0.0.53
safetensors 0.3.1
scikit-learn 1.2.0
scipy 1.10.1
seaborn 0.12.2
Send2Trash 1.8.2
sentence-transformers 2.2.2
sentencepiece 0.1.99
sentry-sdk 1.28.1
setproctitle 1.3.2
setuptools 65.5.1
sh 1.14.3
shellingham 1.5.0.post1
six 1.16.0
smart-open 6.3.0
smmap 5.0.0
snowballstemmer 2.2.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.4.1
sox 1.4.1
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
Sphinx 5.3.0
sphinx-book-theme 1.0.0
sphinx-copybutton 0.5.2
sphinx-glpi-theme 0.3
sphinxcontrib-applehelp 1.0.4
sphinxcontrib-bibtex 2.5.0
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.1
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sphinxext-opengraph 0.8.2
spyrmsd 0.5.2
srsly 2.4.6
stack-data 0.6.2
sympy 1.12
tabulate 0.9.0
tbb 2021.9.0
tblib 1.7.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 8.6.1
termcolor 2.3.0
terminado 0.17.1
testbook 0.4.2
text-unidecode 1.3
textdistance 4.5.0
texterrors 0.4.4
tfrecord 1.14.1
thinc 8.1.10
threadpoolctl 3.1.0
thriftpy2 0.4.16
tinycss2 1.2.1
tokenizers 0.15.0
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.1.0a0+4136153
torch-cluster 1.6.1
torch-geometric 2.3.0
torch-scatter 2.0.9
torch-sparse 0.6.17
torch-tensorrt 1.5.0.dev0
torchaudio 2.1.0
torchdata 0.7.0a0
torchmetrics 1.0.1
torchvision 0.16.0a0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformer-engine 0.9.0
transformers 4.36.0
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.0.0.dev20221202
triton-model-navigator 0.7.4
tritonclient 2.41.1
typed-ast 1.5.5
typer 0.7.0
types-dataclasses 0.6.6
typing_extensions 4.6.3
typing-inspect 0.6.0
ucx-py 0.31.0
uff 0.6.9
urllib3 1.26.16
virtualenv 20.25.0
wandb 0.15.6
wasabi 1.1.2
wcwidth 0.2.6
webdataset 0.2.33
webencodings 0.5.1
Werkzeug 2.3.6
wget 3.2
wheel 0.40.0
widgetsnbextension 4.0.8
wrapt 1.14.1
xdoctest 1.0.2
xgboost 1.7.5
yarl 1.9.2
youtokentome 1.0.6
zict 3.0.0
zipp 3.15.0
zope.event 5.0
zope.interface 6.1


More info

I am using the NVIDIA BioNeMo framework.
@pengzhangzhi added the labels bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) on Feb 16, 2024
@awaelchli added the labels 3rd party (Related to a 3rd-party) and repro needed (The issue is missing a reproducible example), and removed needs triage (Waiting to be triaged by maintainers) on Feb 18, 2024
@awaelchli
Member

@pengzhangzhi Can you describe the steps to reproduce this? There are several notebooks in the examples folder https://github.com/NVIDIA/BioNeMo/tree/main/examples/service/notebooks but I doubt you are running these. Where is the code and config that you are running?

@pengzhangzhi
Author

Hi @awaelchli, the GitHub repo does not have the whole training code; I got it from their Docker containers. If you want to reproduce it, here is the doc: https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html
I am afraid it is too much work for you to reproduce, since you have to download and prepare all the data. My problem is tricky in that the NCCL timeout happens during training, sometimes earlier, sometimes later, seemingly out of nowhere. I would like to know how to track down the error, because I have no clue from the error log. I have tried many solutions with no luck, such as:

  1. increasing the timeout to a year (see the sketch after the variables below)
  2. setting a bunch of NCCL variables:

export NCCL_DEBUG=INFO

export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
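
An alternative to raising the NCCL timeout through the environment is to raise the process-group timeout where the trainer is built. A minimal sketch, assuming the installed pytorch-lightning version exposes a timeout argument on DDPStrategy (with raw torch.distributed, the equivalent is init_process_group(timeout=...)):

# Sketch: raise the collective timeout programmatically instead of via env vars.
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

strategy = DDPStrategy(timeout=timedelta(hours=4))  # NCCL default is 30 minutes
trainer = pl.Trainer(accelerator="gpu", devices=8, strategy=strategy)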

@pengzhangzhi
Author

If you still want to reproduce it, I am happy to help and provide a detailed guide. Alternatively, it would be great if you could guide me on how to debug this error. Thanks!!

@pengzhangzhi
Author

pengzhangzhi commented Mar 14, 2024

Running into the same problem. I think it is hardware-independent. The code here uses the pytorch-lightning and NeMo frameworks. It happens after 8 hours of training.

[screenshot of the error attached]

@awaelchli
Member

Hey @pengzhangzhi
Sorry for the missed or delayed replies; there is a lot going on recently and I am trying to balance priorities.

I implemented a system check utility to help with such problems; feel free to test it out if you have the time: #19609
The idea of the system check is that it is implemented in raw PyTorch, so if issues arise we know whether the problem is in Lightning or not.
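
The core of such a check is, in spirit, a small raw-PyTorch script like the sketch below (this is not the actual utility from #19609, just an illustration): spawn one process per GPU, run an all-reduce and a barrier, and see whether they complete.

# Minimal raw-PyTorch collective sanity check (a sketch, not the #19609 utility):
# one process per GPU, a single all-reduce, then a barrier.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    payload = torch.rand(100, 100, device=f"cuda:{rank}")
    dist.all_reduce(payload)  # hangs or raises if NCCL is broken
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)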

@pengzhangzhi
Author

Thanks!
Exactly what I need for debugging! Do you have documentation for how to use this tool? The link in that PR isn't valid :(
📚 Documentation preview 📚: pytorch-lightning--19609.org.readthedocs.build/en/19609

@awaelchli
Member

Yes, the docs will only be generated once the PR is ready.

The easiest way to try it right now is to copy this file
https://github.com/Lightning-AI/pytorch-lightning/blob/feature/system-check/src/lightning/fabric/utilities/system_check.py
locally and run it as python system_check.py. It doesn't require any dependencies other than torch and psutil.

@pengzhangzhi
Author

pengzhangzhi commented Mar 14, 2024

Thanks!!
Here is the log... I can't make sense of it.
FYI, the code I am having the problem with has been run on two systems, and both of them hit the same timeout. Additionally, I have another pytorch-lightning / torch DDP project that works well on the current system without any timeout error.

Below is the output of `nvidia-smi`. It shows information about the GPUs that are installed on this machine, the driver version, and the maximum supported CUDA version it can run.

Thu Mar 14 17:08:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              55W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:47:00.0 Off |                    0 |
| N/A   22C    P0              58W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:4E:00.0 Off |                    0 |
| N/A   22C    P0              58W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off | 00000000:90:00.0 Off |                    0 |
| N/A   27C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off | 00000000:B7:00.0 Off |                    0 |
| N/A   51C    P0             259W / 400W |  80363MiB / 81920MiB |     90%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off | 00000000:BD:00.0 Off |                    0 |
| N/A   50C    P0             213W / 400W |  76741MiB / 81920MiB |     71%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

The matrix below shows how the GPUs in this machine are connected. NVLink (NV) is the fastest connection, and is only available on high-end systems like V100, A100, etc.

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	NIC10	NIC11	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5		N/A
NIC0	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC1	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC2	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC3	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC4	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS				
NIC5	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS				
NIC6	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS				
NIC7	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS				
NIC8	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS				
NIC9	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS				
NIC10	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX				
NIC11	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11


NCCL version 2.18.1+cuda12.1
Traceback (most recent call last):
  File "/workspace/bionemo/debug.py", line 179, in <module>
    main()
  File "/workspace/bionemo/debug.py", line 48, in main
    success = _check_cuda_distributed(timeout)
  File "/workspace/bionemo/debug.py", line 84, in _check_cuda_distributed
    success = context.join(timeout=5)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/bionemo/debug.py", line 116, in _run_all_reduce_test
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 145, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3553, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1164, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7f8be9d30b00


@awaelchli
Member

This output shows that distributed PyTorch won't work on your system. It can't synchronize at the barrier, which is a very basic requirement.

There should be a system_check folder; it might have additional logs from NCCL with warnings. On rare occasions, a driver update or downgrade can help, or reinstalling PyTorch in a fresh environment.
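
If the extra NCCL output is hard to find in the interleaved stdout, one option (a sketch, using standard NCCL environment variables) is to send it to one file per process:

# Sketch: write NCCL debug output to one file per process.
# %h and %p are expanded by NCCL to the hostname and PID.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_%h_%p.log"  # set before NCCL initializes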

@awaelchli added the label question (Further information is requested), and removed bug (Something isn't working) and repro needed (The issue is missing a reproducible example) on Mar 14, 2024
@pengzhangzhi
Author

Thanks!!

Since the error is in process 6, I am showing the log of nccl-rank-6 below:

pbg-dgx-1:1243335:1243335 [6] NCCL INFO cudaDriverVersion 12020
pbg-dgx-1:1243335:1243335 [6] NCCL INFO Bootstrap : Using enp226s0:10.148.54.242<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P plugin IBext
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/Socket : Using [0]enp226s0:10.148.54.242<0> [1]vethe42dbfb:fe80::a0ec:26ff:fe84:f001%vethe42dbfb<0> [2]vethd8013ee:fe80::f8e4:5dff:febd:1409%vethd8013ee<0> [3]vethfa79788:fe80::6813:94ff:fe9a:17e4%vethfa79788<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Using network Socket
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NVLS multicast support is not available on dev 6
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P Chunksize set to 524288
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:204 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:584 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1303 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1377 -> 1

pbg-dgx-1:1243335:1244302 [6] proxy.cc:1518 NCCL WARN [Proxy Service 6] Failed to execute operation Setup from rank 6, retcode 1

pbg-dgx-1:1243335:1244249 [6] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer pbg-dgx-1.egr.duke.edu<50889>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO misc/socket.cc:746 -> 6

pbg-dgx-1:1243335:1244249 [6] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f8be9d30b00
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport/p2p.cc:386 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:33 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:106 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1032 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1309 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO group.cc:64 -> 3 [Async thread]
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:422 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:106 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO comm 0x55a0a637e960 rank 6 nranks 8 cudaDev 6 busId b7000 - Abort COMPLETE

FYI, I am using a Docker container. The setup can be reproduced with the following steps.

docker login nvcr.io
Username: $oauthtoken
Password NGc3bWIxM21mbTI0dTBraHE5N2U0NG1saWg6ZTY4MzlhZmUtYTJlZC00NDVmLThjYmEtNjA2ZTMzMzRkZTYy

Pull the Bionemo container:

docker pull nvcr.io/nvidia/clara/bionemo-framework:1.2

Run the container:

CONTAINER="nvcr.io/nvidia/clara/bionemo-framework:1.2"
DEST_PATH="."
CONTAINER_NAME=bionemo
docker run --name $CONTAINER_NAME -itd --rm $CONTAINER bash

To reproduce my error:
Copy the file feature/system-check/src/lightning/fabric/utilities/system_check.py into the container and run it.

@awaelchli
Member

I won't have the bandwidth to help much here. Maybe try disabling plugins: NCCL_NET_PLUGIN=none. And if you are running inside Docker, please make sure NCCL is picking the correct network interface. Run the system check outside the container in a clean environment to see whether the problem is related to the container or not.
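
For reference, a sketch of setting those variables before any process group is created; NCCL_SOCKET_IFNAME is the standard NCCL variable for pinning the interface, and enp226s0 is only an example taken from the bootstrap line in the log above:

# Sketch: disable the NCCL net plugin and pin the network interface.
import os

os.environ["NCCL_NET_PLUGIN"] = "none"
os.environ["NCCL_SOCKET_IFNAME"] = "enp226s0"  # example; use the host's real interface

The same can be done with shell exports before launching the training script.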

@pengzhangzhi
Author

I think the problem I have on NCCL rank 6 is just OOM, based on the log?

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'

@awaelchli
Member

If you ran my system check, that's not possible. It allocates very little memory on the GPU:

payload = torch.rand(100, 100, device=device)

If what you showed me there is the output of another program, then yes, it looks like one rank runs out of memory. If one rank dies, the others will wait and hang forever.

@pengzhangzhi
Author

Yeah, I think it is because some of the GPUs are already heavily utilized, which triggers the OOM problem shown in the log. I ran your program both in the container and on the host, and both logs show OOM on the two utilized GPUs.
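
A quick way (a sketch) to confirm this before launching is to check per-GPU free memory and then exclude the busy devices, e.g. GPUs 6 and 7 in the nvidia-smi output above, via CUDA_VISIBLE_DEVICES:

# Sketch: report free memory per GPU so busy devices can be excluded.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")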
