Allow flexible and easy-to-configure HSDP #19502
Comments
I agree we need a better way to specify this. PyTorch 2.2 introduced the device mesh, so we should probably use that to specify the size, rather than having the user construct the process group matrix themselves. Having an argument like you suggest could work, but it might be confusing to have this for anything other than hybrid sharding.
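For context, constructing that process-group pair by hand looks roughly like the sketch below; the world size of 8 and the 4-way-shard / 2-way-replicate split are illustrative assumptions, not something stated in the discussion.

```python
import torch.distributed as dist

# Assumes torch.distributed is already initialized (Lightning does this in the
# Trainer / Fabric launcher, which is exactly why inserting this code is awkward).
# Illustrative layout: 8 ranks total, shard within groups of 4, replicate across 2 groups.
shard_size, num_replicas = 4, 2
rank = dist.get_rank()

# Every rank must participate in every new_group() call, even for groups it does not join.
shard_group = None
for i in range(num_replicas):
    ranks = list(range(i * shard_size, (i + 1) * shard_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        shard_group = group

replicate_group = None
for i in range(shard_size):
    ranks = list(range(i, num_replicas * shard_size, shard_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        replicate_group = group

# For hybrid sharding, PyTorch FSDP accepts the pair as
# process_group=(shard_group, replicate_group).
```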
Since passing in a device mesh already works:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4))
strategy = FSDPStrategy(..., device_mesh=mesh)
```

I suggest that we simplify this by allowing the user to set a tuple.
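A minimal sketch of what the tuple form could look like from the user's side, assuming `FSDPStrategy` would call `init_device_mesh` internally when given a plain shape (the tuple handling is the proposal here, not a documented feature):

```python
from lightning.fabric.strategies import FSDPStrategy

# Proposed: pass the 2D mesh shape (replicate_size, shard_size) directly and let the
# strategy build the DeviceMesh itself, e.g. via init_device_mesh("cuda", (2, 4)).
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD", device_mesh=(2, 4))
```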
This seems reasonable as well, and simpler. It is already supported in the PyTorch code, but the PyTorch FSDP documentation does not cover the `device_mesh` argument.
Great to hear you like it. Would you be interested in drafting it? It would be relatively straightforward.
Yes, agreed. This is typical for PyTorch; their distributed features are always very short on docs. I think we would want to document this well on our side (both the API reference and the user guide). We have a relatively thorough guide already: https://lightning.ai/docs/fabric/stable/advanced/model_parallel/fsdp.html
There's a device_mesh recipe available at https://pytorch.org/tutorials/recipes/distributed_device_mesh.html
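Following the recipe, a 2D mesh maps directly onto HSDP with plain PyTorch FSDP; the 2×4 shape, dim names, and toy model below are illustrative (this assumes 8 GPUs launched via torchrun):

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Pin each process to its local GPU before building the mesh.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh over the 8 ranks: replicate across 2 groups, shard within each group of 4 GPUs.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

# The mesh replaces manual process-group wiring when wrapping the model.
model = FSDP(
    nn.Linear(8, 8).cuda(),  # stand-in for a real model
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```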
Sure. I will iterate on the draft PR above when I have some bandwidth.
I updated the PR #19504 as suggested.
Description & Motivation
The `FSDPStrategy` can use a hybrid sharding strategy to shard across smaller sets of ranks within the global distributed group. However, it is not flexible enough to let the user easily specify the sharding scale.

Pitch
The `FSDPStrategy` can use a hybrid sharding strategy to shard across smaller sets of ranks within the global distributed group. Currently there are two ways to use it in Lightning:

1. Set `sharding_strategy` to one of the hybrid sharding strategies. This will shard within one node and replicate across nodes.
2. Set `sharding_strategy` to one of the hybrid sharding strategies and provide `process_group` as a kwarg to `FSDPStrategy`. This lets the user specify how large the sharding scale is. However, it is not easy for the user to insert the torch distributed group creation code and prepare the `process_group` ahead of time, because Lightning handles `init_process_group` automatically in the Trainer or the Fabric launcher.

So I'm looking forward to an easier way to use HSDP within Lightning, like:
```python
FSDPStrategy(..., sharding_strategy="HYBRID_SHARD", fsdp_size=16)
```
to easily shard at the specified scale and let Lightning handle the `process_group` preparation for the PyTorch FSDP wrapper.

Alternatives
No response
Additional context
No response
cc @Borda @awaelchli @carmocca