Unable to extend FSDPStrategy to HPU accelerator #19753
Comments
@Borda FYA
@jyothisambolu Isn't PyTorch's FSDP only compatible with CUDA? Do you have a custom FSDPStrategy? I can't find one in lightning-habana.
@awaelchli - to add HPUFSDPStrategy, we need this change to be in place; otherwise we get the error mentioned in this issue. Once it is merged, we'll open the FSDP PR against lightning-habana. I can share a draft if it helps.
@jerome-habana Yes please, that would be very insightful. I'd like to know whether your FSDP strategy subclasses ours and, if so, to what extent.
@awaelchli check out Lightning-AI/lightning-Habana#174. It mostly includes changes to initialization and to device usage (using the device directly instead of a device index).
@jerome-habana Thanks for sharing. I opened an alternative approach in #19781 that would unblock you. #19781 is closer to the original intent of the error checking and doesn't require a check against the integration. After the FSDP strategy in lightning-habana is released, we can still add more error checking if we want to.

Regarding the HPU FSDP implementation: I see there are several overrides of internals. Please note that we don't provide backward compatibility for the internals of strategies, especially the protected methods and members. We will make changes to the FSDP strategy that may break your overrides. If this is undesired, I recommend not subclassing FSDPStrategy and instead subclassing ParallelStrategy.
@awaelchli in the long run, we want to move away from inheriting native strategies. For now, though, we depend on the base FSDPStrategy with the GPU checks removed in order to enable FSDP on HPU. This avoids code duplication and keeps only the accelerator-specific changes inside HPUFSDPStrategy, along the lines of the sketch below.
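To make the discussion concrete, here is a heavily simplified, hypothetical sketch of what such a subclass could look like. The real implementation lives in Lightning-AI/lightning-Habana#174; the class name and the single override shown here are illustrative assumptions, not that PR's code:

```python
import torch

from lightning.pytorch.strategies import FSDPStrategy


class HPUFSDPStrategy(FSDPStrategy):
    """Hypothetical sketch: reuse FSDPStrategy, changing only device handling."""

    strategy_name = "hpu_fsdp"

    @property
    def root_device(self) -> torch.device:
        # HPU devices are addressed by type alone, not by a local index
        # the way CUDA devices are (e.g. torch.device("cuda", local_rank)).
        return torch.device("hpu")
```

Because `root_device` and similar hooks are internals of the strategy, any such subclass is exposed to the compatibility caveat raised above.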
Bug description
We are trying to support the FSDP strategy on HPU accelerators, but a restriction in the accelerator connector prevents FSDPStrategy from being used with any accelerator other than CUDA/GPU:
pytorch-lightning/src/lightning/pytorch/trainer/connectors/accelerator_connector.py, line 459 (at commit 76b691d)
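For context, the check at that location looks roughly like the following (a paraphrased sketch of the 2.2-era source, not an exact copy; the precise condition and message may differ):

```python
# Paraphrased sketch of the guard in accelerator_connector.py: FSDP is
# rejected outright whenever the accelerator flag is not CUDA/GPU.
if (
    strategy_flag in FSDPStrategy.get_registered_strategies()
    or isinstance(self._strategy_flag, FSDPStrategy)
) and self._accelerator_flag not in ("cuda", "gpu"):
    raise ValueError(
        f"The strategy `{FSDPStrategy.strategy_name}` requires a GPU accelerator, "
        f"but got: {self._accelerator_flag}"
    )
```

Because the `isinstance` check also matches subclasses, an HPU-specific strategy derived from FSDPStrategy hits the same error.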
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
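The reporter did not provide a snippet, but based on the description, something along these lines should presumably trigger the error (a sketch, assuming lightning-habana is installed so that the `hpu` accelerator is available):

```python
# Sketch: assumes an environment where lightning-habana registers the "hpu"
# accelerator. Combining it with FSDP trips the CUDA/GPU-only check in the
# accelerator connector during Trainer construction.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

trainer = Trainer(accelerator="hpu", devices=1, strategy=FSDPStrategy())
# Expected: ValueError raised by the accelerator connector's FSDP/GPU check.
```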
Error messages and logs
Environment
More info
No response