
FSDP checkpointing uses deprecated APIs with PyTorch 2.2 #19462

Open
carmocca opened this issue Feb 13, 2024 · 6 comments · May be fixed by #19497
Labels: bug (Something isn't working) · checkpointing (Related to checkpointing) · strategy: fsdp (Fully Sharded Data Parallel)
Milestone: 2.2.x

Comments

carmocca (Member) commented Feb 13, 2024

Bug description

See the deprecation warnings added in pytorch/pytorch#113867.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Originated from

save_state_dict(converted_state, writer)

We already use the newer API for loading:

if _TORCH_GREATER_EQUAL_2_2:
    from torch.distributed.checkpoint import load
else:
    from torch.distributed.checkpoint import load_state_dict as load  # deprecated
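
A sketch of the analogous gate for saving (assuming the same _TORCH_GREATER_EQUAL_2_2 flag is in scope; the real call site in fsdp.py may look slightly different):

if _TORCH_GREATER_EQUAL_2_2:
    from torch.distributed.checkpoint import save
else:
    from torch.distributed.checkpoint import save_state_dict as save  # deprecated

# at the save_checkpoint call site, passing the writer by keyword keeps the call
# independent of any reordering of the positional parameters
save(converted_state, storage_writer=writer)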

Error messages and logs

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py:31: UserWarning: 'save_state_dict' is deprecated and will be removed in future versions.Please use 'save' instead.
  warnings.warn(

Environment

No response

More info

No response

cc @awaelchli @carmocca

carmocca added the bug and strategy: fsdp labels on Feb 13, 2024
carmocca added this to the 2.2.x milestone on Feb 13, 2024
carmocca (Member, Author) commented Feb 13, 2024

Two more warnings, which probably need to be fixed in PyTorch:

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1132: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  warnings.warn(DEPRECATE_MSG)

From (print_stack added by me):

  File "/home/carlos/stuff.py", line 29, in <module>
    fabric.save(f"{compile}_before_fwd", {"model": fmodel})
  File "/home/carlos/lightning/src/lightning/fabric/fabric.py", line 770, in save
    self._strategy.save_checkpoint(path=path, state=_unwrap_objects(state), filter=filter)
  File "/home/carlos/lightning/src/lightning/fabric/strategies/fsdp.py", line 484, in save_checkpoint
    converted = obj.state_dict()
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1922, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 737, in _post_state_dict_hook
    local_shape = tensor.shape
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in __torch_function__
    traceback.print_stack()

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py:151: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if tensor.storage().size() != tensor.numel():

From (print_stack added by me):

  File "/home/carlos/stuff.py", line 29, in <module>
    fabric.save(f"{compile}_before_fwd", {"model": fmodel})
  File "/home/carlos/lightning/src/lightning/fabric/fabric.py", line 770, in save
    self._strategy.save_checkpoint(path=path, state=_unwrap_objects(state), filter=filter)
  File "/home/carlos/lightning/src/lightning/fabric/strategies/fsdp.py", line 496, in save_checkpoint
    save_state_dict(converted_state, writer)
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 40, in save_state_dict
    return _save_state_dict(
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 280, in _save_state_dict
    return distW.all_reduce("write", write_data, finish_checkpoint)
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/utils.py", line 210, in all_reduce
    local_data = map_fun()
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 270, in write_data
    all_writes = storage_writer.write_data(final_local_plan, planner)
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py", line 470, in write_data
    _write_files_from_queue(
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py", line 284, in _write_files_from_queue
    loader.start_loading()
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py", line 179, in start_loading
    self._refill()
  File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/filesystem.py", line 150, in _refill
    traceback.print_stack()

carmocca (Member, Author) commented:

If the newer save is used, the argument order seems to have changed in pytorch/pytorch#117772

/home/carlos/nightly-env/lib/python3.10/site-packages/torch/distributed/checkpoint/utils.py:409: UserWarning: The argument order of save has been changed. Please check the document to avoid future breakages.
  warnings.warn(

This probably applies to load too; I haven't tried it.
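
A sketch of how the call sites could be made robust to the reordering (variable names here are illustrative, not the exact ones in fsdp.py): pass everything after the state dict by keyword.

from torch.distributed.checkpoint import load, save

# keyword arguments do not depend on the positional order
save(converted_state, storage_writer=writer)
# checkpoint_state and reader are placeholders for whatever the load path uses
load(checkpoint_state, storage_reader=reader)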

awaelchli (Member) commented:

I agree we need to update these imports.
The change in argument order is only in nightly, but since lit-gpt relies on nightly, we should start incorporating this ASAP.

carmocca (Member, Author) commented:

Technically, lit-gpt hasn't relied on nightly since the 2.2 release.

I opened #19463

carmocca (Member, Author) commented:

Also opened pytorch/pytorch#119802 upstream. We might want to silence these after this is resolved.
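
If we do end up silencing them, something along these lines should work (a sketch; the message patterns are copied from the warnings above):

import warnings

# filter the upstream deprecation warnings discussed in pytorch/pytorch#119802
warnings.filterwarnings("ignore", message="Please use DTensor instead", category=UserWarning)
warnings.filterwarnings("ignore", message="TypedStorage is deprecated", category=UserWarning)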

carmocca (Member, Author) commented:

pytorch/pytorch#119800 (comment) suggests that we should replace (in 2.2+) most of what we have with {get,set}_{model,optimizer}_state_dict functions in https://github.com/pytorch/pytorch/blob/v2.2.0/torch/distributed/checkpoint/state_dict.py
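
A rough sketch of what that could look like on 2.2 (untested; model and optimizer stand in for whatever the strategy holds, and the options shown are just one possible configuration):

from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_model_state_dict,
    get_optimizer_state_dict,
    set_model_state_dict,
    set_optimizer_state_dict,
)

# gather sharded (non-full) state dicts without the FSDP state_dict_type context managers
options = StateDictOptions(full_state_dict=False, cpu_offload=False)
model_state = get_model_state_dict(model, options=options)
optim_state = get_optimizer_state_dict(model, optimizer, options=options)

# and the mirror image when restoring
set_model_state_dict(model, model_state, options=options)
set_optimizer_state_dict(model, optimizer, optim_state_dict=optim_state, options=options)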

carmocca added the checkpointing label on Feb 17, 2024