Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hydra.run.dir (not os.getcwd) for DDP subprocesses' run dir #18145

Merged
merged 11 commits into from Jul 30, 2023

Conversation

nisheethlahoti
Copy link
Contributor

@nisheethlahoti nisheethlahoti commented Jul 24, 2023

What does this PR do?

Fixes #16694

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the fabric lightning.fabric.Fabric label Jul 24, 2023
@codecov
Copy link

codecov bot commented Jul 24, 2023

Codecov Report

Merging #18145 (a4457f1) into master (3d573d5) will decrease coverage by 22%.
The diff coverage is 0%.

❗ Current head a4457f1 differs from pull request most recent head 65caaa5. Consider uploading reports for the commit 65caaa5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #18145     +/-   ##
=========================================
- Coverage      83%      61%    -22%     
=========================================
  Files         431      426      -5     
  Lines       32799    32710     -89     
=========================================
- Hits        27364    19936   -7428     
- Misses       5435    12774   +7339     

@nisheethlahoti nisheethlahoti marked this pull request as draft July 24, 2023 08:33
@nisheethlahoti nisheethlahoti marked this pull request as ready for review July 25, 2023 04:05
@nisheethlahoti
Copy link
Contributor Author

nisheethlahoti commented Jul 25, 2023

All the failing tests are failing with "ModuleNotFound: torch", which cannot possibly be caused by this PR.
Locally, the PR is running successfully.

@ishandutta0098
Copy link
Contributor

@aniketmaurya can you look into this once? We discussed this in our call yesterday.

@awaelchli
Copy link
Member

@nisheethlahoti Thanks for the PR. The change here looks good overall, but we need to make sure we don't regress on the issue here: #15689
There was a PR a while ago that change the behavior of the launcher wrt. Hydra too (#11617) but we had to revert it (see discussion here #15689).

If you would be so kind to go over these two issues to check whether there are any conflicts between your approach and the concerns raised there. That would be very helpful (forgive me, I'm a bit unfamiliar with hydra).

All the failing tests are failing with "ModuleNotFound: torch", which cannot possibly be caused by this PR.

Don't worry about this, we will take care of it once the PR can be merged :)

@awaelchli awaelchli added community This PR is from the community 3rd party Related to a 3rd-party feature Is an improvement or enhancement labels Jul 26, 2023
@nisheethlahoti
Copy link
Contributor Author

nisheethlahoti commented Jul 27, 2023

If you would be so kind to go over these two issues to check whether there are any conflicts between your approach and the concerns raised there.

The behaviour defined in #11617 actually seems better than what's proposed here (since it handles multi-run properly and has separate subdirectories for all the different PL processes), with 1 exception: that it actually tries to read the config from the .hydra directory (the problem raised in #15689). This PR doesn't have that issue since it doesn't try to read any stored config (It accesses run.dir through HydraConfig.get(), which is present whether or not a config is stored).

However, this could also be fixed by reinstating the code in #11617, removing the line command += ["-cp", hydra_output, "-cn", "config.yaml"], and running the new commands with cwd=get_original_cwd() like in #15737. This would add multirun support again, while not breaking the NeMo use-case, IMO. Should I modify this PR (or send a new one) to do that instead?

@nisheethlahoti
Copy link
Contributor Author

Update to previous comment: Removing the -cp hydra_output line will actually disable multi-run support, but we will still have:

  1. Creation of different subdirectories for all spawned processes.
  2. Support for multi-run if user has enabled hydra's experimental-rerun callback

@awaelchli
Copy link
Member

Thanks @nisheethlahoti for your helpful analysis! Based on this I suggest we move forward with your PR as is.

However, this could also be fixed by reinstating the code in #11617 [...] This would add multirun support again, while not breaking the NeMo use-case, IMO. Should I modify this PR (or send a new one) to do that instead?

This is good news. I would suggest that we break it in two parts, this PR + multi-run in a different PR.

@awaelchli
Copy link
Member

Would it be possible to add a small unit test in tests/tests_fabric/strategies/launchers/test_subprocess_script.py? Happy to help out if needed.

@awaelchli awaelchli added this to the 2.1 milestone Jul 27, 2023
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jul 28, 2023
@nisheethlahoti nisheethlahoti marked this pull request as draft July 28, 2023 02:02
@nisheethlahoti nisheethlahoti marked this pull request as ready for review July 28, 2023 02:20
@nisheethlahoti
Copy link
Contributor Author

Would it be possible to add a small unit test in tests/tests_fabric/strategies/launchers/test_subprocess_script.py? Happy to help out if needed.

Done

@mergify mergify bot added the ready PRs ready to be merged label Jul 28, 2023
@awaelchli awaelchli merged commit ebbd538 into Lightning-AI:master Jul 30, 2023
102 checks passed
@awaelchli
Copy link
Member

Awesome! Thank you @nisheethlahoti 🎉

Borda pushed a commit that referenced this pull request Aug 14, 2023
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

(cherry picked from commit ebbd538)
lexierule pushed a commit that referenced this pull request Aug 14, 2023
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

(cherry picked from commit ebbd538)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3rd party Related to a 3rd-party community This PR is from the community fabric lightning.fabric.Fabric feature Is an improvement or enhancement pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change the command to initialize DDP subprocess with Hydra
4 participants