[WIP] Basic system check for troubleshooting multi-GPU issues #19609

Draft
wants to merge 25 commits into
base: master

awaelchli
Member

@awaelchli awaelchli commented Mar 10, 2024

What does this PR do?

WIP

Users frequently contact us about multi-GPU training that does not work (it shows up as "hanging" or "stuck"). In most cases, the cause is a bad multi-GPU setup. This PR adds a PyTorch-only test script that runs diagnostic checks to help troubleshoot such issues. Right now, it only runs a simple all-reduce test and checks whether P2P (peer-to-peer GPU communication) is an issue. Future iterations can add more checks. Run it like so:

fabric diagnose
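
For readers who want a feel for what such a check involves, here is a minimal sketch of the two checks described above (a single all-reduce and a P2P query). It is not the script added in this PR. The function names `expected_sum`, `check_all_reduce`, and `p2p_report` are hypothetical, and the sketch assumes it is launched with `torchrun`, which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`.

```python
import os


def expected_sum(world_size: int) -> int:
    """Value every rank should hold after a SUM all-reduce of (rank + 1)."""
    return world_size * (world_size + 1) // 2


def check_all_reduce() -> bool:
    """Run one SUM all-reduce and verify the result on this rank (hypothetical helper)."""
    # Imported lazily so expected_sum() stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Assumption: one process per GPU, as set up by torchrun.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    # Default env:// initialization reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    try:
        rank, world_size = dist.get_rank(), dist.get_world_size()
        value = torch.tensor([float(rank + 1)], device=device)
        # A broken setup typically hangs here instead of returning.
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        return value.item() == expected_sum(world_size)
    finally:
        dist.destroy_process_group()


def p2p_report() -> dict:
    """Map each ordered GPU pair (i, j) to whether peer access is reported as possible."""
    import torch

    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    }


if __name__ == "__main__":
    if "RANK" not in os.environ:
        print("Launch with torchrun, e.g. torchrun --nproc_per_node=2 <this_file>.py")
    else:
        print("P2P access:", p2p_report())
        print("all-reduce OK:", check_all_reduce())
```

If the all-reduce never returns while the P2P report looks suspicious, the setup rather than the training code is the likely culprit, which is the situation this PR aims to surface.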

For future bug reports or channel questions, we can ask users to run this check. For example:
https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1712878650285279

Issues this might close:
#17389
#18490
#19606
#19487


📚 Documentation preview 📚: https://pytorch-lightning--19609.org.readthedocs.build/en/19609/

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Mar 10, 2024
@github-actions github-actions bot added the ci Continuous Integration label Mar 10, 2024
@github-actions github-actions bot added the docs Documentation related label Mar 11, 2024
@awaelchli awaelchli added the fun Staff contributions outside working hours - to differentiate from the "community" label label Mar 15, 2024
@awaelchli awaelchli added the strategy: ddp DistributedDataParallel label Mar 15, 2024
@awaelchli awaelchli added this to the 2.3 milestone Mar 15, 2024