
Unable to extend FSDPStrategy to HPU accelerator #19753

Closed
jyothisambolu opened this issue Apr 10, 2024 · 7 comments · Fixed by #19781 · May be fixed by #19704
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

@jyothisambolu commented Apr 10, 2024

Bug description

We are trying to support the FSDP strategy on HPU accelerators, but a restriction in the accelerator connector prevents us from using FSDPStrategy with HPU. The blocking check is in `_AcceleratorConnector._check_strategy_and_fallback`:

) and self._accelerator_flag not in ("cuda", "gpu"):
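
For reference, here is a minimal sketch of the kind of setup that hits this check, reconstructed from the traceback below (our real script is the `language_model.py` example; `FSDPStrategy()` here stands in for the `_strategy` object in the traceback):

```python
# Reconstructed repro sketch, not the exact example script.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy
from lightning_habana import HPUAccelerator  # third-party accelerator plugin

trainer = Trainer(
    accelerator=HPUAccelerator(),
    strategy=FSDPStrategy(),  # stand-in for our HPU FSDP strategy draft
    fast_dev_run=10,
    enable_model_summary=True,
)
# Raises MisconfigurationException: You selected strategy to be `fsdp`,
# but GPU accelerator is not used.
```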

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs
Traceback (most recent call last):
  File "/home/jsambolu/github/fsdp_integration/lightning-Habana/tests/test_pytorch/../../examples/pytorch/language_model.py", line 92, in <module>
    trainer = Trainer(accelerator=HPUAccelerator(), strategy=_strategy, fast_dev_run=10, enable_model_summary=True)
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 159, in __init__
    self._check_strategy_and_fallback()
  File "/home/jsambolu/venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 487, in _check_strategy_and_fallback
    raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: You selected strategy to be `fsdp`, but GPU accelerator is not used.

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

jyothisambolu added the bug and needs triage labels on Apr 10, 2024
@jerome-habana (Contributor) commented:

@Borda FYA

@awaelchli (Member) commented:

@jyothisambolu Isn't PyTorch's FSDP only compatible with CUDA? Do you have a custom FSDPStrategy? I can't find one in lightning-habana.

@jerome-habana (Contributor) commented:

> @jyothisambolu Isn't PyTorch's FSDP only compatible with CUDA? Do you have a custom FSDPStrategy? I can't find one in lightning-habana.

@awaelchli - to add an HPUFSDPStrategy, we need this change to be in place; otherwise we get the error mentioned in the issue above. Once it is merged, we'll open the FSDP PR against lightning-habana. I can share a draft if it helps.

@awaelchli (Member) commented:

@jerome-habana Yes please, that would be very insightful. I'd like to know whether your FSDP strategy subclasses ours and, if so, to what extent.

@jerome-habana (Contributor) commented:

> @jerome-habana Yes please, that would be very insightful. I'd like to know whether your FSDP strategy subclasses ours and, if so, to what extent.

@awaelchli check out Lightning-AI/lightning-Habana#174. It mostly includes changes w.r.t. initialization and using the device rather than a device index.
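
Roughly, the kind of subclass it contains looks like the following hypothetical sketch (class name and overrides are illustrative only, not the actual Lightning-AI/lightning-Habana#174 code):

```python
# Illustrative only; not the lightning-Habana implementation.
import torch
from lightning.pytorch.strategies import FSDPStrategy


class HPUFSDPStrategy(FSDPStrategy):  # hypothetical name
    strategy_name = "hpu_fsdp"

    @property
    def root_device(self) -> torch.device:
        # HPUs are addressed as a plain "hpu" device rather than an
        # indexed "cuda:<local_rank>" device.
        return torch.device("hpu")

    def setup_environment(self) -> None:
        # HPU-specific process-group/backend initialization (e.g. HCCL
        # instead of NCCL) would go here before deferring to the base class.
        super().setup_environment()
```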

@awaelchli (Member) commented:

@jerome-habana Thanks for sharing. I opened an alternative approach in #19781 that would unblock you. #19781 is closer to the original intent of the error checking and doesn't require a check against the integration. After the FSDP strategy in lightning-habana is released, we can still add more error checking if we want to.

Regarding the HPU FSDP implementation: I see there are several overrides of internals. Please note that we don't provide backward compatibility for internals of strategies, especially the protected methods and members. We will make changes to the FSDP strategy that may break your overrides. If this is undesired, I recommend not subclassing FSDPStrategy and instead subclassing ParallelStrategy.
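
For illustration, the kind of relaxed validation meant by the first paragraph could look roughly like this (a hypothetical sketch, not the literal diff in #19781): reject FSDP only for the built-in non-GPU accelerator shorthands and let a custom `Accelerator` instance from a third-party plugin pass through.

```python
# Hypothetical illustration; not the actual change in #19781.
from lightning.fabric.utilities.exceptions import MisconfigurationException
from lightning.pytorch.accelerators import Accelerator


def check_fsdp_accelerator(strategy_flag, accelerator_flag) -> None:
    if strategy_flag != "fsdp":
        return
    if isinstance(accelerator_flag, Accelerator):
        # A custom accelerator instance (e.g. HPUAccelerator from
        # lightning-habana) is responsible for its own FSDP support.
        return
    if accelerator_flag not in ("cuda", "gpu", "auto"):
        raise MisconfigurationException(
            "You selected strategy to be `fsdp`, but GPU accelerator is not used."
        )
```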

@jerome-habana (Contributor) commented:

> @jerome-habana Thanks for sharing. I opened an alternative approach in #19781 that would unblock you. #19781 is closer to the original intent of the error checking and doesn't require a check against the integration. After the FSDP strategy in lightning-habana is released, we can still add more error checking if we want to.
>
> Regarding the HPU FSDP implementation: I see there are several overrides of internals. Please note that we don't provide backward compatibility for internals of strategies, especially the protected methods and members. We will make changes to the FSDP strategy that may break your overrides. If this is undesired, I recommend not subclassing FSDPStrategy and instead subclassing ParallelStrategy.

@awaelchli in the long run, we want to move away from inheriting the native strategies. For now, though, we depend on the base FSDPStrategy class with the GPU checks removed in order to enable FSDP on HPU. This avoids code duplication and keeps only the accelerator-specific changes inside HPUFSDPStrategy.
