
Unable to extend FSDPStrategy to HPU accelerator #19753

Closed
jyothisambolu opened this issue Apr 10, 2024 · 7 comments · Fixed by #19781 · May be fixed by #19704
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

@jyothisambolu commented Apr 10, 2024

Bug description

We are trying to support the FSDP strategy on HPU accelerators, but a restriction in the accelerator connector prevents us from using FSDPStrategy with HPU. The blocking check is in `_AcceleratorConnector._check_strategy_and_fallback`:

) and self._accelerator_flag not in ("cuda", "gpu"):
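
For reference, here is a minimal sketch of the kind of setup that hits this check, reconstructed from the traceback below (our real script is the `language_model.py` example; `FSDPStrategy()` here stands in for the `_strategy` object in the traceback):

```python
# Reconstructed repro sketch, not the exact example script.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy
from lightning_habana import HPUAccelerator  # third-party accelerator plugin

trainer = Trainer(
    accelerator=HPUAccelerator(),
    strategy=FSDPStrategy(),  # stand-in for our HPU FSDP strategy draft
    fast_dev_run=10,
    enable_model_summary=True,
)
# Raises MisconfigurationException: You selected strategy to be `fsdp`,
# but GPU accelerator is not used.
```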

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs
Traceback (most recent call last):
  File "/home/jsambolu/github/fsdp_integration/lightning-Habana/tests/test_pytorch/../../examples/pytorch/language_model.py", line 92, in <module>
    trainer = Trainer(accelerator=HPUAccelerator(), strategy=_strategy, fast_dev_run=10, enable_model_summary=True)
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 159, in __init__
    self._check_strategy_and_fallback()
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 487, in _check_strategy_and_fallback
    raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: You selected strategy to be `fsdp`, but GPU accelerator is not used.

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

jyothisambolu added the bug and needs triage labels on Apr 10, 2024
@jerome-habana (Contributor) commented:

@Borda FYA

@awaelchli (Member) commented:

@jyothisambolu Isn't PyTorch's FSDP only compatible with CUDA? Do you have a custom FSDPStrategy? I can't find one in lightning-habana.

@jerome-habana (Contributor) commented:

> @jyothisambolu Isn't PyTorch's FSDP only compatible with CUDA? Do you have a custom FSDPStrategy? I can't find one in lightning-habana.

@awaelchli - to add an HPUFSDPStrategy, we need this change to be in place; otherwise we get the error mentioned in the issue above. Once it is merged, we'll open the FSDP PR against lightning-habana. I can share a draft if it helps.

@awaelchli (Member) commented:

@jerome-habana Yes please, that would be very insightful. I'd like to know whether your FSDP strategy subclasses ours and, if so, to what extent.

@jerome-habana (Contributor) commented:

> @jerome-habana Yes please, that would be very insightful. I'd like to know whether your FSDP strategy subclasses ours and, if so, to what extent.

@awaelchli check out Lightning-AI/lightning-Habana#174. It mostly includes changes w.r.t. initialization and using the device rather than a device index.
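
Roughly, the kind of subclass it contains looks like the following hypothetical sketch (class name and overrides are illustrative only, not the actual Lightning-AI/lightning-Habana#174 code):

```python
# Illustrative only; not the lightning-Habana implementation.
import torch
from lightning.pytorch.strategies import FSDPStrategy


class HPUFSDPStrategy(FSDPStrategy):  # hypothetical name
    strategy_name = "hpu_fsdp"

    @property
    def root_device(self) -> torch.device:
        # HPUs are addressed as a plain "hpu" device rather than an
        # indexed "cuda:<local_rank>" device.
        return torch.device("hpu")

    def setup_environment(self) -> None:
        # HPU-specific process-group/backend initialization (e.g. HCCL
        # instead of NCCL) would go here before deferring to the base class.
        super().setup_environment()
```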

@awaelchli (Member) commented:

@jerome-habana Thanks for sharing. I opened an alternative approach in #19781 that would unblock you. #19781 is closer to the original intent of the error checking and doesn't require a check against the integration. After the FSDP strategy in lightning-habana is released, we can still add more error checking if we want to.

Regarding the HPU FSDP implementation: I see there are several overrides of internals. Please note that we don't provide backward compatibility for internals of strategies, especially the protected methods and members. We will make changes to the FSDP strategy that may break your overrides. If this is undesired, I recommend not subclassing FSDPStrategy and instead subclassing ParallelStrategy.
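
For illustration, the kind of relaxed validation meant by the first paragraph could look roughly like this (a hypothetical sketch, not the literal diff in #19781): reject FSDP only for the built-in non-GPU accelerator shorthands and let a custom `Accelerator` instance from a third-party plugin pass through.

```python
# Hypothetical illustration; not the actual change in #19781.
from lightning.fabric.utilities.exceptions import MisconfigurationException
from lightning.pytorch.accelerators import Accelerator


def check_fsdp_accelerator(strategy_flag, accelerator_flag) -> None:
    if strategy_flag != "fsdp":
        return
    if isinstance(accelerator_flag, Accelerator):
        # A custom accelerator instance (e.g. HPUAccelerator from
        # lightning-habana) is responsible for its own FSDP support.
        return
    if accelerator_flag not in ("cuda", "gpu", "auto"):
        raise MisconfigurationException(
            "You selected strategy to be `fsdp`, but GPU accelerator is not used."
        )
```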

@jerome-habana (Contributor) commented:

> @jerome-habana Thanks for sharing. I opened an alternative approach in #19781 that would unblock you. #19781 is closer to the original intent of the error checking and doesn't require a check against the integration. After the FSDP strategy in lightning-habana is released, we can still add more error checking if we want to.
>
> Regarding the HPU FSDP implementation: I see there are several overrides of internals. Please note that we don't provide backward compatibility for internals of strategies, especially the protected methods and members. We will make changes to the FSDP strategy that may break your overrides. If this is undesired, I recommend not subclassing FSDPStrategy and instead subclassing ParallelStrategy.

@awaelchli in the long run, we want to move away from inheriting the native strategies. For now, though, we depend on the base FSDPStrategy class with the GPU checks removed in order to enable FSDP on HPU. This avoids code duplication and keeps only the accelerator-specific changes inside HPUFSDPStrategy.
