FSDP checkpoint saving raises internal deprecation warnings #119802

Open · carmocca opened this issue Feb 13, 2024 · 2 comments

Labels: module: distributed_checkpoint, oncall: distributed, triaged

Comments

carmocca (Contributor) commented Feb 13, 2024

🐛 Describe the bug

Saving an FSDP sharded checkpoint with torch.distributed.checkpoint.save raises two internal deprecation warnings. The messages are:

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1132: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  warnings.warn(DEPRECATE_MSG)

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py:148: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if tensor.storage().size() != tensor.numel():
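
For the second warning, the deprecation points at the tensor.storage() call inside torch/distributed/checkpoint/filesystem.py. Below is a minimal sketch of what an untyped-storage-based check could look like; this is an assumption for illustration, not the actual upstream fix, and it relies on UntypedStorage.size() reporting bytes rather than elements:

# Hypothetical untyped-storage version of the check in filesystem.py
# (sketch only, not the actual PyTorch change).
# UntypedStorage.size() is in bytes, so compare against numel() * element_size().
if tensor.untyped_storage().size() != tensor.numel() * tensor.element_size():
    ...  # same handling as in the current code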

Minimal repro:

import os
import torch.cuda
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel

def get_sharded_state_dict_context(module):
    from torch.distributed.fsdp.api import ShardedOptimStateDictConfig, ShardedStateDictConfig, StateDictType

    # Configure FSDP to return a sharded (per-rank) state dict, offloaded to CPU.
    state_dict_config = ShardedStateDictConfig(offload_to_cpu=True)
    optim_state_dict_config = ShardedOptimStateDictConfig(offload_to_cpu=True)
    state_dict_type_context = FullyShardedDataParallel.state_dict_type(
        module=module,
        state_dict_type=StateDictType.SHARDED_STATE_DICT,
        state_dict_config=state_dict_config,
        optim_state_dict_config=optim_state_dict_config,
    )
    return state_dict_type_context  # type: ignore[return-value]

def work(rank):
    # Each spawned process joins a 2-rank NCCL process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "1234"
    dist.init_process_group("nccl", world_size=2, rank=rank)
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    model = nn.Linear(100, 50).to(device)
    model = FullyShardedDataParallel(model)
    x = torch.rand(2, 100, device=device)

    y = model(x)

    from torch.distributed.checkpoint import save
    with get_sharded_state_dict_context(model):
        state = {"model": model.state_dict()}
    # Saving the sharded state dict with torch.distributed.checkpoint triggers the warnings above.
    save(state, checkpoint_id="fsdp_model.pt")

def run():
    mp.spawn(work, nprocs=2)

if __name__ == "__main__":
    run()

First reported in Lightning-AI/pytorch-lightning#19462 (comment)

Versions

2.3.0.dev20240212+cu121

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @LucasLLC

colesbury added the oncall: distributed label Feb 14, 2024
fegin added the triaged and module: distributed_checkpoint labels Feb 15, 2024
fegin (Contributor) commented Feb 15, 2024

@LucasLLC We should fix the filesystem.py warning. @carmocca We are switching to DTensor and would encourage users to move to DTensor as well; init_device_mesh was released in beta in 2.2. cc @wz337

fegin assigned wz337 and fegin Feb 15, 2024
wz337 (Contributor) commented Feb 15, 2024

@carmocca If you are interested in finding out more about DTensor, here is a getting-started page: https://pytorch.org/tutorials/recipes/distributed_device_mesh.html
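
For context, here is a minimal sketch of adapting the repro above to a DeviceMesh, based on that tutorial; this is my own illustration (assuming 2 GPUs), not a change confirmed in this thread. As I understand it, passing a device_mesh to FSDP should make the sharded state dict use DTensor instead of the deprecated ShardedTensor:

# Sketch only: build a 1-D DeviceMesh over 2 ranks and hand it to FSDP
# so sharded state dicts are DTensor-based rather than ShardedTensor-based.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel

mesh = init_device_mesh("cuda", (2,))
model = FullyShardedDataParallel(model, device_mesh=mesh)  # instead of FullyShardedDataParallel(model)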
