[WIP] Basic system check for troubleshooting multi-GPU issues #19609

Draft · wants to merge 25 commits into base: master
11 changes: 9 additions & 2 deletions .azure/gpu-tests-fabric.yml
@@ -134,14 +134,21 @@ jobs:

- bash: python -m coverage run --source ${COVERAGE_SOURCE} -m pytest . -v --durations=50
workingDirectory: tests/tests_fabric/
displayName: "Testing: fabric standard"
displayName: "Testing: Fabric standard"
timeoutInMinutes: "10"

- bash: bash ../run_standalone_tests.sh "."
workingDirectory: tests/tests_fabric/
env:
PL_STANDALONE_TESTS_SOURCE: $(COVERAGE_SOURCE)
displayName: "Testing: fabric standalone"
displayName: "Testing: Fabric standalone tests"
timeoutInMinutes: "10"

- bash: bash run_standalone_tasks.sh
workingDirectory: tests/tests_fabric
env:
PL_USE_MOCKED_MNIST: "1"
displayName: "Testing: Fabric standalone tasks"
timeoutInMinutes: "10"

- bash: |
1 change: 1 addition & 0 deletions .azure/gpu-tests-pytorch.yml
@@ -24,6 +24,7 @@ pr:
- "examples/run_pl_examples.sh"
- "examples/pytorch/basics/backbone_image_classifier.py"
- "examples/pytorch/basics/autoencoder.py"
- "tests/run_standalone_*.sh"
- "requirements/pytorch/**"
- "src/lightning/__init__.py"
- "src/lightning/__setup__.py"
1 change: 1 addition & 0 deletions .gitignore
@@ -175,6 +175,7 @@ wandb
*.prof
*.tar.gz
.neptune/
system_check/

# dataset generated from bolts in examples.
cifar-10-batches-py
9 changes: 9 additions & 0 deletions docs/source-fabric/fundamentals/launch.rst
@@ -237,6 +237,15 @@ Next steps
:height: 160
:tag: advanced

.. displayitem::
:header: Troubleshooting
:description: Learn how to troubleshoot common multi-GPU issues
:button_link: ../guide/troubleshooting.html
:col_css: col-md-4
:height: 160
:tag: advanced


.. raw:: html

</div>
6 changes: 6 additions & 0 deletions docs/source-fabric/glossary/index.rst
@@ -8,6 +8,7 @@ Glossary

Checkpoint <../guide/checkpoint/index>
Weights and Biases <../guide/loggers/wandb>
Troubleshooting <../guide/troubleshooting>


.. raw:: html
@@ -150,6 +151,11 @@ Glossary
:button_link: ../fundamentals/launch.html
:col_css: col-md-4

.. displayitem::
:header: NCCL
:button_link: ../guide/troubleshooting.html
:col_css: col-md-4

.. displayitem::
:header: Notebook
:button_link: ../launch/notebook.html
50 changes: 2 additions & 48 deletions docs/source-fabric/guide/multi_node/barebones.rst
@@ -110,52 +110,6 @@ After executing these commands, you should immediately see an output like this:
Troubleshooting
***************


**My program is stuck initializing at startup. What is causing this?**

You are seeing a message like this in the logs, but nothing happens:

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

The most likely reasons and how to fix it:

- **Wrong network interface:** Some servers have multiple network interfaces.
There is usually only one that can send and receive traffic from the network of the other nodes, but sometimes it is not set as the default.
In this case, you need to set it manually:

.. code-block:: bash

export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
fabric run ...

You can find the interface name by parsing the output of the ``ifconfig`` command.
The name of this interface **may differ on each node**.

- **NCCL can't communicate between the nodes:**

Follow the steps in the `NCCL troubleshooting guide <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html>`_.
In particular, take note of the network section that describes restricting the port range and firewall rules.

.. code-block:: bash

echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
sysctl --system
ufw allow 50000:51000/tcp


**My program crashes with an NCCL error, but it is not helpful**

Launch your command by prepending ``NCCL_DEBUG=INFO`` to get more info.

.. code-block:: bash

NCCL_DEBUG=INFO fabric run ...


----

If you are sick of troubleshooting cluster problems, give :doc:`Lightning cloud <./cloud>` a try!
Please refer to the :doc:`troubleshooting guide <../troubleshooting>` if you are experiencing issues related to multi-node training hanging or crashing.
If you are sick of troubleshooting cluster problems, give :doc:`Lightning Studios <./cloud>` a try!
For other questions, please don't hesitate to join the `Discord <https://discord.gg/VptPCZkGNa>`_.
87 changes: 87 additions & 0 deletions docs/source-fabric/guide/troubleshooting.rst
@@ -0,0 +1,87 @@
###############
Troubleshooting
###############

Learn how to troubleshoot common issues related to CUDA, NCCL, and distributed training.


----


*********
Multi-GPU
*********

If your program is stuck at

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

it indicates that PyTorch can't set up the communication between GPUs, and that your system is not configured correctly.
Run the ``diagnose`` command from the Fabric CLI to investigate:

.. code-block:: bash

fabric diagnose

This tool will run basic multi-GPU tests using only PyTorch.
Any issues raised here will confirm that the problem is with your system and not with Lightning.
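
If you want to reproduce such a check by hand, the sketch below runs an all-reduce across all local GPUs using plain PyTorch and the NCCL backend.
This is a minimal, hypothetical example (it is **not** the implementation behind ``fabric diagnose``) and assumes at least one CUDA GPU:

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def _worker(rank: int, world_size: int) -> None:
        # Rendezvous settings for a single-node run; the port is an arbitrary free port
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # Every rank contributes a tensor of ones; after the all-reduce the value must equal the world size
        tensor = torch.ones(1, device=f"cuda:{rank}")
        dist.all_reduce(tensor)
        assert tensor.item() == world_size, f"all-reduce returned {tensor.item()} on rank {rank}"
        dist.destroy_process_group()


    if __name__ == "__main__":
        num_gpus = torch.cuda.device_count()
        mp.spawn(_worker, args=(num_gpus,), nprocs=num_gpus)
        print(f"All-reduce across {num_gpus} GPUs succeeded")

If this script hangs or crashes, the problem lies in your CUDA/NCCL setup rather than in your training code.
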
Common solutions:

- **Wrong driver version:** The NVIDIA driver for your GPU is too old or too new.
You can check the version of the driver by running

.. code-block:: bash

nvidia-smi --id=0 --query-gpu=driver_version --format=csv,noheader

*Solution*: Install a recent driver.
Search online for instructions on how to update the driver on your platform.

- **Peer-to-peer connection is broken:** The GPUs can't communicate with each other.
*Solution*: Try to set the environment variable ``NCCL_P2P_DISABLE=1``.
If you rerun your script and it fixes the problem, this means that peer-to-peer transport is not working properly (your training will run, but it will be slow).
This is likely because of driver compatibility issues (see above) or because your GPU does not support peer-to-peer (e.g., certain RTX cards).
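
For a quick check, prepend the variable to however you normally launch your script (``python train.py`` below is just a placeholder) and inspect the GPU interconnect topology:

.. code-block:: bash

    # Show how the GPUs are connected to each other (NVLink, PCIe, ...)
    nvidia-smi topo -m

    # Disable NCCL peer-to-peer transport for this run only
    NCCL_P2P_DISABLE=1 python train.py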


----


**********
Multi-node
**********

Before troubleshooting multi-node connectivity issues, first ensure that multi-GPU within a single machine is working correctly by following the steps above.
If single-node execution works, but multi-node hangs at

.. code-block::

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

it indicates that there is a connection issue between the nodes.
Common solutions:

- **Wrong network interface:** Some servers have multiple network interfaces.
There is usually only one that can send and receive traffic from the network of the other nodes, but sometimes it is not set as the default.
In this case, you need to set it manually:

.. code-block:: bash

export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
fabric run ...

You can find the interface name by parsing the output of the ``ifconfig`` command (see the example at the end of this section).
The name of this interface **may differ on each node**.

- **NCCL can't communicate between the nodes:**

Follow the steps in the `NCCL troubleshooting guide <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html>`_.
In particular, take note of the network section that describes restricting the port range and firewall rules.

.. code-block:: bash

echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
sysctl --system
ufw allow 50000:51000/tcp
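
To find candidate interface names for ``GLOO_SOCKET_IFNAME`` and ``NCCL_SOCKET_IFNAME`` (see the first item above), list the interfaces and their IPv4 addresses on every node.
The snippet below uses the standard ``ip`` tool, with ``ifconfig`` as an alternative:

.. code-block:: bash

    # Interface name and IPv4 address, one pair per line
    ip -o -4 addr show | awk '{print $2, $4}'

    # Or, if ifconfig is installed
    ifconfig -a

Pick the interface whose address is reachable from the other nodes; the loopback interface (``lo``) and container or bridge interfaces are usually the wrong choice.
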
6 changes: 6 additions & 0 deletions src/lightning/fabric/cli.py
@@ -26,6 +26,7 @@
from lightning.fabric.accelerators import CPUAccelerator, CUDAAccelerator, MPSAccelerator
from lightning.fabric.plugins.precision.precision import _PRECISION_INPUT_STR, _PRECISION_INPUT_STR_ALIAS
from lightning.fabric.strategies import STRATEGY_REGISTRY
from lightning.fabric.utilities import system_check
from lightning.fabric.utilities.consolidate_checkpoint import _process_cli_args
from lightning.fabric.utilities.device_parser import _parse_gpu_ids
from lightning.fabric.utilities.distributed import _suggested_max_num_threads
@@ -188,6 +189,11 @@ def _consolidate(checkpoint_folder: str, output_file: Optional[str]) -> None:
checkpoint = _load_distributed_checkpoint(config.checkpoint_folder)
torch.save(checkpoint, config.output_file)

@_main.command("diagnose")
def _diagnose() -> None:
"""Diagnose issues with your multi-GPU setup."""
system_check.main()


def _set_env_variables(args: Namespace) -> None:
"""Set the environment variables for the new processes.
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/consolidate_checkpoint.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from argparse import ArgumentParser, Namespace
from pathlib import Path
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/distributed.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import logging
import os
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/seed.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import random
13 changes: 13 additions & 0 deletions src/lightning/fabric/utilities/spike.py
@@ -1,3 +1,16 @@
# Copyright The Lightning AI team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import operator
import os