Allow flexible and easy to configure HSDP #19502

Open
Liyang90 opened this issue Feb 20, 2024 · 8 comments · May be fixed by #19504
Labels
discussion In a discussion stage feature Is an improvement or enhancement strategy: fsdp Fully Sharded Data Parallel

Comments

@Liyang90
Contributor

Liyang90 commented Feb 20, 2024

Description & Motivation

The FSDPStrategy can use a hybrid sharding strategy to shard across smaller sets of ranks within the global dist group. However, it is not flexible enough to let users easily specify the sharding scale.

Pitch

The FSDPStrategy can use a hybrid sharding strategy to shard across smaller sets of ranks within the global dist group. Currently there are two paths to use it in Lightning:

  1. Specify sharding_strategy as one of the hybrid sharding strategies. This shards within one node and replicates across nodes.
  2. Specify sharding_strategy as one of the hybrid sharding strategies, and provide process_group as a kwarg to FSDPStrategy. This lets the user specify how large the sharding scale is. However, it is not easy for users to insert torch dist group creation code and prepare the process_group ahead of time, because Lightning handles torch dist init_process_group automatically in the Trainer or the Fabric launcher (see the sketch at the end of this pitch for roughly what that manual preparation involves).

So I'm looking forward to an easier way to use HSDP within Lightning, like:
FSDPStrategy(..., sharding_strategy="HYBRID_SHARD", fsdp_size=16)
to easily shard at the specified scale and let Lightning handle the process_group preparation for the PyTorch FSDP wrapper.
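
For reference, here is a hedged sketch of roughly what the manual process_group preparation in path 2 involves today (the helper name and layout are illustrative, not a Lightning or PyTorch API). FSDP's hybrid strategies accept process_group as a (sharding group, replication group) pair, and all of this has to run after the default process group has been initialized, which Lightning does on the user's behalf:

import torch.distributed as dist

def build_hybrid_process_groups(fsdp_size: int):
    # Build the (sharding, replication) process-group pair for HYBRID_SHARD.
    # Must be called on every rank, after init_process_group has run.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % fsdp_size == 0, "world size must be divisible by fsdp_size"

    shard_group = None
    for start in range(0, world_size, fsdp_size):
        ranks = list(range(start, start + fsdp_size))
        group = dist.new_group(ranks)  # collective: every rank calls it
        if rank in ranks:
            shard_group = group

    replicate_group = None
    for offset in range(fsdp_size):
        ranks = list(range(offset, world_size, fsdp_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            replicate_group = group

    return shard_group, replicate_group

# strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD",
#                         process_group=build_hybrid_process_groups(16))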

Alternatives

No response

Additional context

No response

cc @Borda @awaelchli @carmocca

@Liyang90 Liyang90 added feature Is an improvement or enhancement needs triage Waiting to be triaged by maintainers labels Feb 20, 2024
@Liyang90 Liyang90 linked a pull request Feb 20, 2024 that will close this issue
@awaelchli
Member

awaelchli commented Feb 21, 2024

I agree we need a better way to specify this. PyTorch 2.2 introduced the device mesh, so we should probably use that to specify the size, rather than having the user construct the process group matrix themselves.

Having an argument like you suggest could work, but it might be confusing to have this for anything other than hybrid sharding.

@awaelchli awaelchli added discussion In a discussion stage strategy: fsdp Fully Sharded Data Parallel and removed needs triage Waiting to be triaged by maintainers labels Feb 21, 2024
@awaelchli awaelchli added this to the 2.3 milestone Feb 21, 2024
@awaelchli
Member

awaelchli commented Feb 21, 2024

Since passing in a device mesh already works

from torch.distributed.device_mesh import init_device_mesh
mesh = init_device_mesh("cuda", (2, 4))

strategy = FSDPStrategy(..., device_mesh=mesh)

I suggest that we simplify this by allowing the user to set a tuple device_mesh=(2, 4) (in addition to DeviceMesh) and internally we initialize the device mesh for them if it's a tuple. Then we don't need to introduce a new argument.
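
As a hedged illustration of the suggested call (the tuple form is the proposal here, not the current API):

from lightning.pytorch.strategies import FSDPStrategy

# Proposed: pass a plain shape and let Lightning call init_device_mesh internally.
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD", device_mesh=(2, 4))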

@Liyang90
Contributor Author

Liyang90 commented Feb 22, 2024

Since passing in a device mesh already works

from torch.distributed.device_mesh import init_device_mesh
mesh = init_device_mesh("cuda", (2, 4))

strategy = FSDPStrategy(..., device_mesh=mesh)

I suggest that we simplify this by allowing the user to set a tuple device_mesh=(2, 4) (in addition to DeviceMesh) and internally we initialize the device mesh for them if it's a tuple. Then we don't need to introduce a new argument.

This seems reasonable as well, and simpler. In the PyTorch code, process_group and device_mesh end up being handled in the same function: https://github.com/pytorch/pytorch/blob/1d14adfa66e2ae437253eebe223710588648eee7/torch/distributed/fsdp/_init_utils.py#L152C5-L152C47

But the PyTorch FSDP docs do not document the device_mesh argument very well, and users would need to know what the numbers in the device_mesh tuple mean (which one is the FSDP size and which one is the DDP size).
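
One way to make the tuple semantics explicit is to name the mesh dimensions; in the _init_utils code linked above, mesh dim 0 appears to map to the replication (DDP) group and dim 1 to the sharding (FSDP) group, so a hedged example would be:

from torch.distributed.device_mesh import init_device_mesh

# 2 replicas, each sharding across 4 ranks (8 ranks in total).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))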

@awaelchli
Member

Great to hear you like it. Would you be interested in drafting it? It would be relatively straightforward:

  1. Store the device_mesh argument as an attribute in the strategy
  2. In the strategy's setup, initialize it and pass it to the FSDP wrapper (a rough sketch follows below):
    module = FullyShardedDataParallel(

But the PyTorch FSDP docs do not document the device_mesh argument very well, and users would need to know what the numbers in the device_mesh tuple mean (which one is the FSDP size and which one is the DDP size).

Yes, agreed. This is typical for PyTorch; their distributed features are always very short on docs. I think we would want to document this well on our side (both the API and the user guide). We already have a relatively thorough guide: https://lightning.ai/docs/fabric/stable/advanced/model_parallel/fsdp.html
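
A minimal sketch of the two steps above, assuming a stand-in strategy class (attribute and method names are illustrative, not the actual Lightning internals):

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel

class HypotheticalFSDPStrategy:
    def __init__(self, device_mesh=None, **fsdp_kwargs):
        # 1. Store the device_mesh argument (DeviceMesh or tuple) as an attribute.
        self._device_mesh = device_mesh
        self._fsdp_kwargs = fsdp_kwargs

    def setup_module(self, module):
        # 2. By this point the process group exists, so a plain tuple can be
        #    turned into a DeviceMesh before handing it to the FSDP wrapper.
        if isinstance(self._device_mesh, tuple):
            self._device_mesh = init_device_mesh("cuda", self._device_mesh)
        return FullyShardedDataParallel(
            module, device_mesh=self._device_mesh, **self._fsdp_kwargs
        )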

@carmocca
Member

There's a device_mesh recipe available at https://pytorch.org/tutorials/recipes/distributed_device_mesh.html

@Liyang90
Contributor Author

Great to hear you like it. Would you be interested in drafting it? It would be relatively straightforward:

  1. Store the device_mesh argument as an attribute in the strategy
  2. In the strategy's setup, initialize it and pass it to the FSDP wrapper:
    module = FullyShardedDataParallel(

But the PyTorch FSDP docs do not document the device_mesh argument very well, and users would need to know what the numbers in the device_mesh tuple mean (which one is the FSDP size and which one is the DDP size).

Yes, agreed. This is typical for PyTorch; their distributed features are always very short on docs. I think we would want to document this well on our side (both the API and the user guide). We already have a relatively thorough guide: https://lightning.ai/docs/fabric/stable/advanced/model_parallel/fsdp.html

Sure. I will iterate on the draft PR above when I have some bandwidth.

@Liyang90
Contributor Author

Liyang90 commented Mar 5, 2024

I updated the PR #19504 as suggested.

