[WIP] Basic system check for troubleshooting multi-GPU issues #19609

Draft · wants to merge 25 commits into base: master
11 changes: 9 additions & 2 deletions .azure/gpu-tests-fabric.yml
@@ -134,14 +134,21 @@ jobs:

- bash: python -m coverage run --source ${COVERAGE_SOURCE} -m pytest . -v --durations=50
workingDirectory: tests/tests_fabric/
displayName: "Testing: fabric standard"
displayName: "Testing: Fabric standard"
timeoutInMinutes: "10"

- bash: bash ../run_standalone_tests.sh "."
workingDirectory: tests/tests_fabric/
env:
PL_STANDALONE_TESTS_SOURCE: $(COVERAGE_SOURCE)
displayName: "Testing: fabric standalone"
displayName: "Testing: Fabric standalone tests"
timeoutInMinutes: "10"

- bash: bash run_standalone_tasks.sh
workingDirectory: tests/tests_fabric
env:
PL_USE_MOCKED_MNIST: "1"
displayName: "Testing: Fabric standalone tasks"
timeoutInMinutes: "10"

- bash: |
1 change: 1 addition & 0 deletions .azure/gpu-tests-pytorch.yml
@@ -24,6 +24,7 @@ pr:
- "examples/run_pl_examples.sh"
- "examples/pytorch/basics/backbone_image_classifier.py"
- "examples/pytorch/basics/autoencoder.py"
- "tests/run_standalone_*.sh"
- "requirements/pytorch/**"
- "src/lightning/__init__.py"
- "src/lightning/__setup__.py"
1 change: 1 addition & 0 deletions .gitignore
@@ -175,6 +175,7 @@ wandb
*.prof
*.tar.gz
.neptune/
system_check/

# dataset generated from bolts in examples.
cifar-10-batches-py
9 changes: 9 additions & 0 deletions docs/source-fabric/fundamentals/launch.rst
@@ -237,6 +237,15 @@ Next steps
:height: 160
:tag: advanced

.. displayitem::
:header: Troubleshooting
:description: Learn how to troubleshoot common multi-GPU issues
:button_link: ../guide/troubleshooting.html
:col_css: col-md-4
:height: 160
:tag: advanced


.. raw:: html

</div>
6 changes: 6 additions & 0 deletions docs/source-fabric/glossary/index.rst
@@ -8,6 +8,7 @@ Glossary

Checkpoint <../guide/checkpoint/index>
Weights and Biases <../guide/loggers/wandb>
Troubleshooting <../guide/troubleshooting>


.. raw:: html
@@ -150,6 +151,11 @@ Glossary
:button_link: ../fundamentals/launch.html
:col_css: col-md-4

.. displayitem::
:header: NCCL
:button_link: ../guide/troubleshooting.html
:col_css: col-md-4

.. displayitem::
:header: Notebook
:button_link: ../launch/notebook.html
50 changes: 2 additions & 48 deletions docs/source-fabric/guide/multi_node/barebones.rst
@@ -110,52 +110,6 @@ After executing these commands, you should immediately see an output like this:
Troubleshooting
***************


**My program is stuck initializing at startup. What is causing this?**

You are seeing a message like this in the logs, but nothing happens:

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

The most likely reasons and how to fix it:

- **Wrong network interface:** Some servers have multiple network interfaces.
There is usually only one that can send and receive traffic from the network of the other nodes, but sometimes it is not set as the default.
In this case, you need to set it manually:

.. code-block:: bash

export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
fabric run ...

You can find the interface name by parsing the output of the ``ifconfig`` command.
The name of this interface **may differ on each node**.

- **NCCL can't communicate between the nodes:**

Follow the steps in the `NCCL troubleshooting guide <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html>`_.
In particular, take note of the network section that describes restricting the port range and firewall rules.

.. code-block:: bash

echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
sysctl --system
ufw allow 50000:51000/tcp


**My program crashes with an NCCL error, but it is not helpful**

Launch your command by prepending ``NCCL_DEBUG=INFO`` to get more info.

.. code-block:: bash

NCCL_DEBUG=INFO fabric run ...


----

If you are sick of troubleshooting cluster problems, give :doc:`Lightning cloud <./cloud>` a try!
Please refer to the :doc:`troubleshooting guide <../troubleshooting>` if you are experiencing issues related to multi-node training hanging or crashing.
If you are sick of troubleshooting cluster problems, give :doc:`Lightning Studios <./cloud>` a try!
For other questions, please don't hesitate to join the `Discord <https://discord.gg/VptPCZkGNa>`_.
87 changes: 87 additions & 0 deletions docs/source-fabric/guide/troubleshooting.rst
@@ -0,0 +1,87 @@
###############
Troubleshooting
###############

Learn how to troubleshoot common issues related to CUDA, NCCL, and distributed training.


----


*********
Multi-GPU
*********

If your program is stuck at

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

it indicates that PyTorch can't set up the communication between GPUs, and that your system is not configured correctly.
Run the ``diagnose`` command from the Fabric CLI to investigate:

.. code-block:: bash

fabric diagnose

This tool will run basic multi-GPU tests using only PyTorch.
Any issues raised here will confirm that the problem is with your system and not with Lightning.
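
If you want to reproduce such a check by hand, the sketch below runs an all-reduce across all local GPUs using plain PyTorch and the NCCL backend.
This is a minimal, hypothetical example (it is **not** the implementation behind ``fabric diagnose``) and assumes at least one CUDA GPU:

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def _worker(rank: int, world_size: int) -> None:
        # Rendezvous settings for a single-node run; the port is an arbitrary free port
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # Every rank contributes a tensor of ones; after the all-reduce the value must equal the world size
        tensor = torch.ones(1, device=f"cuda:{rank}")
        dist.all_reduce(tensor)
        assert tensor.item() == world_size, f"all-reduce returned {tensor.item()} on rank {rank}"
        dist.destroy_process_group()


    if __name__ == "__main__":
        num_gpus = torch.cuda.device_count()
        mp.spawn(_worker, args=(num_gpus,), nprocs=num_gpus)
        print(f"All-reduce across {num_gpus} GPUs succeeded")

If this script hangs or crashes, the problem lies in your CUDA/NCCL setup rather than in your training code.
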
Common solutions:

- **Wrong driver version:** The NVIDIA driver for your GPU is too old or too new.
You can check the version of the driver by running

.. code-block:: bash

nvidia-smi --id=0 --query-gpu=driver_version --format=csv,noheader

*Solution*: Install a recent driver.
Search online for instructions on how to update the driver on your platform.

- **Peer-to-peer connection is broken:** The GPUs can't communicate with each other.
*Solution*: Try to set the environment variable ``NCCL_P2P_DISABLE=1``.
If you rerun your script and it fixes the problem, this means that peer-to-peer transport is not working properly (your training will run, but it will be slow).
This is likely because of driver compatibility issues (see above) or because your GPU does not support peer-to-peer (e.g., certain RTX cards).
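
For a quick check, prepend the variable to however you normally launch your script (``python train.py`` below is just a placeholder) and inspect the GPU interconnect topology:

.. code-block:: bash

    # Show how the GPUs are connected to each other (NVLink, PCIe, ...)
    nvidia-smi topo -m

    # Disable NCCL peer-to-peer transport for this run only
    NCCL_P2P_DISABLE=1 python train.py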


----


**********
Multi-node
**********

Before troubleshooting multi-node connectivity issues, first ensure that multi-GPU within a single machine is working correctly by following the steps above.
If single-node execution works, but multi-node hangs at

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

it indicates that there is a connection issue between the nodes.
Common solutions:

- **Wrong network interface:** Some servers have multiple network interfaces.
There is usually only one that can send and receive traffic from the network of the other nodes, but sometimes it is not set as the default.
In this case, you need to set it manually:

.. code-block:: bash

export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
fabric run ...

You can find the interface name by parsing the output of the ``ifconfig`` command (see the example at the end of this section).
The name of this interface **may differ on each node**.

- **NCCL can't communicate between the nodes:**

Follow the steps in the `NCCL troubleshooting guide <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html>`_.
In particular, take note of the network section that describes restricting the port range and firewall rules.

.. code-block:: bash

echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
sysctl --system
ufw allow 50000:51000/tcp
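
To find candidate interface names for ``GLOO_SOCKET_IFNAME`` and ``NCCL_SOCKET_IFNAME`` (see the first item above), list the interfaces and their IPv4 addresses on every node.
The snippet below uses the standard ``ip`` tool, with ``ifconfig`` as an alternative:

.. code-block:: bash

    # Interface name and IPv4 address, one pair per line
    ip -o -4 addr show | awk '{print $2, $4}'

    # Or, if ifconfig is installed
    ifconfig -a

Pick the interface whose address is reachable from the other nodes; the loopback interface (``lo``) and container or bridge interfaces are usually the wrong choice.
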
6 changes: 6 additions & 0 deletions src/lightning/fabric/cli.py
@@ -26,6 +26,7 @@
from lightning.fabric.accelerators import CPUAccelerator, CUDAAccelerator, MPSAccelerator
from lightning.fabric.plugins.precision.precision import _PRECISION_INPUT_STR, _PRECISION_INPUT_STR_ALIAS
from lightning.fabric.strategies import STRATEGY_REGISTRY
from lightning.fabric.utilities import system_check
from lightning.fabric.utilities.consolidate_checkpoint import _process_cli_args
from lightning.fabric.utilities.device_parser import _parse_gpu_ids
from lightning.fabric.utilities.distributed import _suggested_max_num_threads
@@ -188,6 +189,11 @@ def _consolidate(checkpoint_folder: str, output_file: Optional[str]) -> None:
checkpoint = _load_distributed_checkpoint(config.checkpoint_folder)
torch.save(checkpoint, config.output_file)

@_main.command("diagnose")
def _diagnose() -> None:
"""Diagnose issues with your multi-GPU setup."""
system_check.main()


def _set_env_variables(args: Namespace) -> None:
"""Set the environment variables for the new processes.
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/consolidate_checkpoint.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from argparse import ArgumentParser, Namespace
from pathlib import Path
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/distributed.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import logging
import os
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/seed.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import random
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/spike.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import operator
import os