Hydra+DDP causing infinite hang #15545

Closed
lminer opened this issue Nov 4, 2022 · 7 comments
Labels
bug (Something isn't working), strategy: ddp (DistributedDataParallel)

@lminer

lminer commented Nov 4, 2022

Bug description

I use Hydra with DDP. After upgrading to Lightning 1.8, every run I start with DDP hangs after printing the following error:

sys:1: UserWarning:
'config.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
[rank: 0] Global seed set to 43
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
/home/lminer/miniforge3/envs/separate_torch/lib/python3.9/site-packages/hydra/main.py:90: UserWarning:
'config.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
Error merging 'config.yaml' with schema
Key 'augs' not in 'Config'
    full_key: augs
    object_type=Config

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
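
The message "Key 'augs' not in 'Config'" is the kind of error a strict schema merge raises when the loaded YAML contains a key that the schema dataclass does not declare. A rough stdlib-only illustration of that check follows. It is not Hydra's or OmegaConf's actual code, and `Config`, its fields, and `merge_with_schema` are hypothetical names, since the real schema is not shown in this report.

```python
from dataclasses import dataclass, fields, replace


@dataclass
class Config:
    # Hypothetical schema; the real one from the report is not shown.
    seed: int = 43
    batch_size: int = 8


def merge_with_schema(schema: Config, loaded: dict) -> Config:
    """Reject keys the schema does not declare, then overlay the rest."""
    allowed = {f.name for f in fields(schema)}
    for key in loaded:
        if key not in allowed:
            raise KeyError(f"Key '{key}' not in '{type(schema).__name__}'")
    return replace(schema, **loaded)


if __name__ == "__main__":
    # A declared key merges cleanly.
    print(merge_with_schema(Config(), {"seed": 1}))
    # An undeclared key such as 'augs' is rejected.
    try:
        merge_with_schema(Config(), {"augs": {"gain": 0.5}})
    except KeyError as err:
        print("merge failed:", err)
```

Under this reading, either the schema needs an `augs` field or the config file should not be matched against a schema of the same name.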

I used git bisect and have traced the problem to this commit:

45ca78167efaa98f5e78ca73d79d4e71946db253 is the first bad commit
commit 45ca78167efaa98f5e78ca73d79d4e71946db253
Author: Justin Goodwin <jgoodwin@ll.mit.edu>
Date:   Thu Sep 22 12:03:13 2022 -0400

    Improving Hydra+DDP support (#11617)

    Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
    Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
    Co-authored-by: Jirka <jirka.borovec@seznam.cz>

 .../strategies/launchers/subprocess_script.py      |  89 +++++++-----
 .../strategies/launchers/test_subprocess_script.py | 161 +++++++++++++++++++++
 2 files changed, 213 insertions(+), 37 deletions(-)
 create mode 100644 tests/tests_pytorch/strategies/launchers/test_subprocess_script.py

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

@lminer added the "needs triage" label (Waiting to be triaged by maintainers) Nov 4, 2022
@awaelchli
Member

@lminer Thanks for the report and the effort of doing the bisect! As you found, this was introduced in #11617.
@jgbos would you mind taking a look here?

@awaelchli added the "strategy: ddp" (DistributedDataParallel) and "bug" (Something isn't working) labels and removed the "needs triage" label (Waiting to be triaged by maintainers) Nov 4, 2022
@awaelchli added this to the v1.8.x milestone Nov 4, 2022
@awaelchli changed the title from "lightning 1.8 causing infinite hang with DDP" to "Hydra+DDP causing infinite hang" Nov 5, 2022
@jgbos
Contributor

jgbos commented Nov 7, 2022

@lminer Could you provide a config to test with, along with the version of Hydra you are using?

@lminer
Author

lminer commented Nov 7, 2022

Is there a way to send it to you privately, @jgbos?

@jgbos
Contributor

jgbos commented Nov 8, 2022

@lminer You can try sending it to jgoodwin314@gmail.com if you want to do that. FYI, I have very limited time to debug, so here are a couple of things you could try:

  1. Make a small example config that causes the same error. What is special about augs that causes this error?
  2. Can you try running python <fn with task>.py -cp <experiment path to .hydra> -cn config.yaml
  3. You can try setting up Re-run (which Lightning supports)

Step 2 is essentially what Lightning does under the hood (see here). If you get errors executing step 2, there may be something wrong with your configs.
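
For reference, the relaunch in step 2 can be sketched like this. It is a minimal sketch, assuming the run's output directory contains a `.hydra` folder holding the saved `config.yaml`. In Hydra's CLI, `-cp` is short for `--config-path` and `-cn` for `--config-name`. The helper `build_relaunch_command`, the script name, and the output path are hypothetical, and this is not Lightning's actual launcher code.

```python
import sys
from pathlib import Path


def build_relaunch_command(script: str, run_dir: str) -> list[str]:
    """Build a command that re-runs `script` against a saved Hydra config.

    -cp (--config-path) points at the run's .hydra directory and
    -cn (--config-name) names the config file stored inside it.
    """
    hydra_dir = Path(run_dir) / ".hydra"
    return [sys.executable, script, "-cp", str(hydra_dir), "-cn", "config.yaml"]


if __name__ == "__main__":
    # Hypothetical script name and Hydra output directory.
    cmd = build_relaunch_command("train.py", "outputs/2022-11-08/10-00-00")
    print(" ".join(cmd))
    # To actually execute it: subprocess.run(cmd, check=True)
```

If running the printed command by hand reproduces the merge error, the problem is likely in the saved config rather than in the DDP launch itself.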

@awaelchli
Member

@lminer The pre-1.8 behavior was restored in #15737.
Lightning 1.8.3 will include this change.

@lminer
Author

lminer commented Nov 26, 2022

@awaelchli I actually got it to work. It was indeed a problem with my configuration!

@awaelchli
Member

Thank you @lminer for confirming!

3 participants