
No clear way to load models #78

Open · stephenroller opened this issue May 10, 2022 · 6 comments
Labels: better-eng (Things that can help make things sane), bug (Something isn't working)

stephenroller commented May 10, 2022

🚀 Feature Request

Loading models is a bit of a pain right now. It's done differently across multiple scripts (including our internal eval scripts), and not every method is compatible with every checkpoint format.

Loading typically requires setting a TON of command line args based on what the model checkpoint needs (--model-parallel, --ddp-backend fully_sharded, --distributed-port, etc.). Many of these args could be picked up just by inspecting the checkpoint files.

Afterwards, we should refactor the various scripts to use this One True Method.
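
As a sketch of what that could look like (hypothetical names only, not an existing metaseq API):

```python
# Hypothetical "One True Method" sketch; it only illustrates the desired
# interface shape. Nothing here exists in metaseq today.

def load_model(checkpoint_path: str):
    """Load a model from any checkpoint layout without extra CLI args.

    Should handle singleton, model-parallel, and FSDP-sharded checkpoints
    alike, inferring settings such as model-parallel size and the FSDP
    backend from the files on disk instead of from flags.
    """
    raise NotImplementedError  # sketch only

# Desired call sites, identical regardless of checkpoint layout:
# model = load_model("/checkpoints/355m/reshard.pt")
# model = load_model("/checkpoints/175b/reshard-model_part-0-shard0.pt")
```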

stephenroller added the bug label May 10, 2022
patrickvonplaten commented

Any chance you could share the different eval scripts? :-)

Is this related to #73?

nickums commented May 12, 2022

I cannot find metaseq-api-local.py anywhere in OPT/

punitkoura commented

From #277:

We should make model loading "just work". I shouldn't need to pass so many args to get it to find the right checkpoint.
I should be able to specify sharded checkpoints by pointing to the shard0-rank0 .pt file.

punitkoura self-assigned this Aug 1, 2022
punitkoura commented Aug 4, 2022

Types of model checkpoints

We currently have three types of model checkpoints:
1. Singleton checkpoint - for example, the 355M checkpoint. The file name looks like reshard.pt.
2. Unsharded model parallel checkpoint - the file names look like reshard-model_part-*.pt, where * goes from 0 to number_of_model_parts - 1.
3. Sharded model parallel checkpoint - the file names look like reshard-model_part-0-shard0.pt, where the model part and shard numbers range over the number of model parallel parts and fully sharded data parallel shards respectively.

Here, the name "reshard" is just a convention; it can be any prefix, for example "125m-model_part-0-shard0.pt".
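
Given these conventions, the checkpoint type can plausibly be inferred from the filenames alone. A minimal sketch (hypothetical helper, not metaseq code; assumes exactly the naming patterns above):

```python
import glob
import os
import re

def detect_checkpoint_type(checkpoint_path: str) -> str:
    """Guess the checkpoint layout from sibling .pt filenames.

    Hypothetical helper: since the "reshard" prefix is just a convention,
    we key off the -model_part-*/-shard* suffixes instead.
    """
    directory = os.path.dirname(checkpoint_path) or "."
    names = [os.path.basename(p) for p in glob.glob(os.path.join(directory, "*.pt"))]
    if any(re.search(r"-model_part-\d+-shard\d+\.pt$", n) for n in names):
        return "sharded_model_parallel"    # e.g. reshard-model_part-0-shard0.pt
    if any(re.search(r"-model_part-\d+\.pt$", n) for n in names):
        return "unsharded_model_parallel"  # e.g. reshard-model_part-1.pt
    return "singleton"                     # e.g. reshard.pt
```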

punitkoura commented Aug 4, 2022

How do we determine the type of model checkpoint?

cfg.common.model_parallel_size - determines the model parallel size. If this is 1, we can infer that the model is not model parallel; however, it might still be sharded through FSDP.

cfg.checkpoint.checkpoint_shard_count - determines the number of FSDP shards for the model. For model parallel models, each model part has this many shards.

If both of these parameters are 1, we have a singleton model.

Both of these config values can be read from the model checkpoint itself.
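
Since both values are serialized inside the checkpoint, a loader could read them up front. A rough sketch (assumes a metaseq-style checkpoint dict containing a "cfg" entry whose fields match the names above):

```python
import torch

def infer_parallelism(checkpoint_path: str):
    """Read model-parallel size and FSDP shard count from a checkpoint.

    Sketch only: assumes torch.load returns a dict holding an
    OmegaConf-style "cfg" with the two fields quoted above.
    """
    state = torch.load(checkpoint_path, map_location="cpu")
    cfg = state["cfg"]
    mp_size = cfg.common.model_parallel_size
    shard_count = cfg.checkpoint.checkpoint_shard_count
    if mp_size == 1 and shard_count == 1:
        return "singleton", mp_size, shard_count
    if shard_count == 1:
        return "unsharded_model_parallel", mp_size, shard_count
    return "sharded_model_parallel", mp_size, shard_count
```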

punitkoura commented

cfg.distributed_training.use_sharded_state - if True, then state_dict will return FSDP.local_state_dict and load_state_dict will call FSDP.load_local_state_dict. Otherwise, state_dict will return the full model weights on data parallel rank 0 (empty on other ranks) and load_state_dict will broadcast model weights from rank 0 to other ranks.

From metaseq/distributed/fully_sharded_data_parallel.py
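
In code, the toggle amounts to the following branch (a minimal sketch following the docstring above; assumes a fairscale-style FSDP-wrapped model):

```python
def get_model_state(fsdp_model, use_sharded_state: bool):
    # Sketch of the behavior described in the docstring above.
    if use_sharded_state:
        # Each data parallel rank returns only its own shard of the
        # flattened parameters.
        return fsdp_model.local_state_dict()
    # Full weights are gathered on data parallel rank 0; the returned
    # dict is empty on every other rank.
    return fsdp_model.state_dict()

def set_model_state(fsdp_model, state, use_sharded_state: bool):
    if use_sharded_state:
        fsdp_model.load_local_state_dict(state)
    else:
        # Weights are broadcast from rank 0 to all other ranks.
        fsdp_model.load_state_dict(state)
```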
