
[scripts] Convert resharded MP checkpoints to unflattened. #60

Merged
stephenroller merged 11 commits into main from convert_to_singleton on May 11, 2022

Conversation

stephenroller
Contributor

Patch Description
Adds a new script which meets the requirements of #31.

Testing steps
Ran on 125m. The usage example in the docstring shows real output.

@stephenroller
Contributor Author

stephenroller commented May 8, 2022

Alright I take it back, this script doesn't work at all. Going back to draft.

@stephenroller stephenroller marked this pull request as draft May 8, 2022 02:00
@stephenroller stephenroller marked this pull request as ready for review May 8, 2022 02:48
@stephenroller
Contributor Author

Okay, I think I got it now. @patrickvonplaten, can you try this with the approach used for the 350M?

@DGideas

DGideas commented May 8, 2022

If you get the RuntimeError: Ninja is required to load C++ extensions error message while running this script, please consider using this script to install a newer ninja-build.
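
As a quick sanity check (a sketch, not from this thread), you can also confirm that torch's extension loader actually sees a ninja binary:

    # Sanity check sketch: the "Ninja is required to load C++ extensions" error is
    # raised by torch's C++ extension loader when no usable ninja binary is on PATH.
    import subprocess
    from torch.utils import cpp_extension

    print(subprocess.check_output(["ninja", "--version"], text=True).strip())
    cpp_extension.verify_ninja_availability()  # raises RuntimeError if ninja is missing
    print("ninja is visible to torch.utils.cpp_extension")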

@DGideas

DGideas commented May 8, 2022

I failed to run this script with the 125M model and got this error:

...
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/dgideas/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/dgideas/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 115, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 111, in main
    dist_utils.call_main(cfg, worker_main)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 256, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 234, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 203, in distributed_main
    main(cfg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 55, in worker_main
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/checkpoint_utils.py", line 507, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 49, in _build_model
    return fsdp_wrap(model)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/fully_sharded_data_parallel.py", line 146, in fsdp_wrap
    return wrap(module, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fairscale/nn/wrap/auto_wrap.py", line 187, in wrap
    return ConfigAutoWrap.wrapper_cls(module, **wrap_overrides)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/fully_sharded_data_parallel.py", line 49, in __init__
    super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'gradient_predivide_factor'

Has anyone run into this problem?

@stephenroller
Contributor Author

> (quoting @DGideas's traceback above)

This is a fairscale version mismatch.
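
One way to confirm the mismatch locally (a sketch; it only checks that the installed fairscale FSDP still accepts the keyword metaseq forwards to it):

    # Diagnostic sketch: check whether the installed fairscale version's FSDP
    # accepts the gradient_predivide_factor keyword that metaseq's wrapper passes.
    import inspect

    import fairscale
    from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

    sig = inspect.signature(FSDP.__init__)
    print("fairscale", fairscale.__version__)
    print("gradient_predivide_factor" in sig.parameters)  # False -> version mismatch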

@thomasw21
Contributor

thomasw21 commented May 8, 2022

Hey @stephenroller , thanks for the script. I have a few questions:

  • what is dict.txt? I currently replace it with an empty file and it seems to run fine (until the next error).
  • when running the script I get:
RuntimeError: Error(s) in loading state_dict for FlattenParamsWrapper:
   size mismatch for flat_param_0: copying a param with shape torch.Size([63435264]) from checkpoint, the shape in current model is torch.Size([44133888]).

I suppose the config isn't loaded correctly, since the model size is not the expected one?

EDIT: Okay, I think I found that dict.txt actually needs to store all the tokens in order to correctly initialize the vocabulary layer:

import json

with open("gpt2-vocab.json", "r") as fi:
    vocab = json.load(fi)

inv_vocab = {v: k for k, v in vocab.items()}
# the first 4 tokens are special tokens: <s>, <pad>, </s>, <unk>
indices = range(4, len(vocab))
with open("dict.txt", "w") as fo:
    for i in indices:
        # "1" is a count; I'm guessing it's the one used when building the tokenizer,
        # so defaulting to 1 here should be fine
        fo.write(f"{inv_vocab[i]} 1\n")

@suchenzang
Contributor

@thomasw21 Opened an issue to remove this dict logic for now: #64

Currently it's just a file that looks like:

(base) susanz@ip-<redacted>:/<redacted>$ cat dict.txt
4 1
5 1
6 1
7 1
...
50268 1
50269 1
50270 1
50271 1
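
If you just need a placeholder, a minimal sketch that writes a file of exactly this shape (ids 4 through 50271, each with a dummy count of 1):

    # Sketch: write a placeholder dict.txt of the shape shown above
    # (ids 4..50271, each paired with a dummy count of 1).
    with open("dict.txt", "w") as fo:
        for token_id in range(4, 50272):
            fo.write(f"{token_id} 1\n")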

@thomasw21
Contributor

Thanks @suchenzang!
This script does fail with 1.3B with:

   File "fairscale/fairscale/nn/misc/flatten_params_wrapper.py", line 276, in _init_flatten_params
    assert len(set(p.dtype for p in params)) == 1, "expects all parameters to have same dtype"
AssertionError: expects all parameters to have same dtype

Looking at the init, the layer-norm weights/biases are stored in fp32 while the rest is in fp16. The fairscale branch prefetch_fsdp_params_simple has that assert: https://github.com/facebookresearch/fairscale/blob/8820049331331c773077c257667aa81baf4cc9f9/fairscale/nn/misc/flatten_params_wrapper.py#L276. While looking around I saw that ngoyal2707/Megatron-LM@06bd10e changed the way parameters are loaded; reverting to ae0b844c1f6725c3433a95e42cac760b3885170b seems to have fixed the issue (looking at the actual code, I'm unclear why, though).
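
For anyone debugging the same assert, a small sketch to count which dtypes actually end up in a shard (the filename and the top-level "model" key are assumptions about the checkpoint layout, adjust as needed):

    # Sketch: count parameter dtypes inside one resharded checkpoint file.
    # The filename and the "model" key are assumptions about the shard layout.
    from collections import Counter

    import torch

    state = torch.load("checkpoint_last-model_part-0-shard0.pt", map_location="cpu")
    params = state["model"]
    print(Counter(t.dtype for t in params.values() if torch.is_tensor(t)))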

@M-B-Lee

M-B-Lee commented May 8, 2022

> (quoting @thomasw21's size-mismatch report and dict.txt workaround above)

I also hit the same size-mismatch error when loading the model state dict.

@patrickvonplaten
Contributor

Hey @stephenroller,

Thanks a lot, the script works very well to merge the sharded checkpoints into a single checkpoint.
However, when testing the checkpoint on a generation task, the model only gives gibberish - see: #73

This script works well with the 350m checkpoint. Any ideas what could be the problem here?

@stephenroller
Contributor Author

Double-checking this work since people are reporting gibberish.

thomasw21 and others added 5 commits May 9, 2022 13:57

* Recursively unwrap fully sharded model
* Update metaseq/scripts/convert_to_singleton.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@ftamburin

Hi, thanks for this wonderful project!
I had a lot of problems using the script "convert_to_singleton.py" to join shards, because many model parameters (especially 'cfg') are not copied into the new model. Maybe I am doing something wrong, but the only way I found to join the shards and build a single .pt file that is structurally similar to the 350m model and works with @patrickvonplaten's run_model.py script was to add the following lines to the latest version of the script, just before saving the new model:

        # rebuild a 'cfg' entry so the merged file matches the 350m checkpoint layout
        _model_args['model'] = vars(_model_args['model'])
        _model_args['model']['_name'] = 'transformer_lm'
        _model_args['criterion'] = vars(_model_args['criterion'])
        glued = {'cfg': _model_args, 'model': glued}

This worked for me for both the 1.3B and 6.7B models.
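
To check that the rewrite took effect, a quick sketch (assuming the merged file is saved as restored.pt, the name used later in this thread):

    # Sketch: confirm the merged checkpoint now carries a 'cfg' entry next to the
    # glued state dict. "restored.pt" is the merged-file name referenced later in
    # this thread; adjust if your output path differs.
    import torch

    ckpt = torch.load("restored.pt", map_location="cpu")
    print(sorted(ckpt.keys()))            # expect something like ['cfg', 'model']
    print(ckpt["cfg"]["model"]["_name"])  # expect 'transformer_lm'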

@patrickvonplaten
Contributor

> (quoting @ftamburin's 'cfg' workaround above)

Thanks for the tip, @ftamburin! Are you able to generate sensible outputs when adding this with run_model.py? E.g., is the next-word prediction a sensible word?

@ftamburin

ftamburin commented May 10, 2022

> (quoting @patrickvonplaten's question above)

Yes, sorry, I forgot to say it explicitly. Nice outputs, as with the 350m model.

You also have to hack the function _upgrade_state_dict in metaseq/checkpoint_utils.py:

def _upgrade_state_dict(state):
    return state

so that it returns the state immediately without any upgrade. Not very elegant, but it works.

@ftamburin

Is there a simple way to generate longer sequences after the prompt for this kind of model?
Could you suggest a script and/or a function for doing that?
Thanks!

@patrickvonplaten
Contributor

Thanks @ftamburin and @stephenroller for all the help. The checkpoints up to 6.7B now work for us.

I've converted them and uploaded them to the Hub so that they're easy for everybody here to use: #88

Will try to tackle the larger ones tomorrow.

@ftamburin, it should be very easy with Hugging Face's Transformers. We're hoping to have the checkpoints ready by Thursday.
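
For reference, generation with Transformers then only takes a few lines (a sketch using facebook/opt-1.3b as an example; swap in whichever size you converted):

    # Sketch: greedy generation with the Hugging Face OPT checkpoints.
    # The model id is just an example.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=False)
    model = OPTForCausalLM.from_pretrained("facebook/opt-1.3b")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))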

@patrickvonplaten
Contributor

Sadly, I don't know how to run generation with metaseq.

@stephenroller
Contributor Author

Do you need any help with the larger models?

In the meantime, it seems like finalizing this PR and producing some generations at all scales to check correctness would be my best value-add right now.

@patrickvonplaten
Contributor

@stephenroller, all models up to 2.7GB seem to work well!

Tomorrow I'll check the big ones, up to 80GB. I think it should all work though :-)

@stephenroller stephenroller merged commit 1246d72 into main May 11, 2022
@stephenroller stephenroller deleted the convert_to_singleton branch May 11, 2022 16:32
@shijie-wu

Hi @patrickvonplaten and @stephenroller, I'm trying to do a regression test on the 1.3b model between HF and metaseq, since the HF unit tests only cover the 350m model.

https://github.com/huggingface/transformers/blob/58fb3c9f98877bf76efb03e376a5c92cf80f7952/tests/models/opt/test_modeling_opt.py#L269-L290

However, I don't have access to multiple GPUs to do the conversion with python -m metaseq.scripts.convert_to_singleton. Could you help confirm that the converted model would pass the regression test? Thanks!

@patrickvonplaten
Contributor

Hey @shijie-wu,

Sure, actually you can find the regression test that I've done to make sure the models work correctly here: https://huggingface.co/patrickvonplaten/opt_metaseq_350m/blob/main/run_model.py

@patrickvonplaten
Contributor

https://huggingface.co/patrickvonplaten/opt_metaseq_350m shows how one can set it up. As you can see, I'm checking that the metaseq and transformers logits match to a precision of 1e-3. I've tested this on both CPU and GPU (not on multi-GPU, though).
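
The core of that check boils down to something like the sketch below; metaseq_logits stands in for whatever the metaseq half of run_model.py produces, and the 350m model id is just the example from the link above:

    # Sketch of the logits comparison described above: only the Transformers half
    # is shown; `metaseq_logits` is a placeholder for the metaseq side.
    import torch
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
    model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    with torch.no_grad():
        hf_logits = model(**inputs).logits

    # metaseq_logits = ...  # obtained from the restored.pt checkpoint via metaseq
    # assert torch.allclose(hf_logits, metaseq_logits, atol=1e-3)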

@patrickvonplaten
Contributor

There are also analogous directories for the other sizes here: https://huggingface.co/models?other=opt_metasq

@shijie-wu

shijie-wu commented Jun 2, 2022

Thanks for clarifying! I missed the models under your repo 😅. I have checked that facebook/opt-1.3b indeed passes the regression test against the merged model in patrickvonplaten/opt_metaseq_1300m 🎊

To confirm: restored.pt in https://huggingface.co/patrickvonplaten/opt_metaseq_1300m/tree/main/model is the output of the merging step python -m metaseq.scripts.convert_to_singleton? However, as far as I can tell, this set of regression tests does not cover the merging step, although I might be missing something. It would be great if there were a test that also covers the merging step.

@patrickvonplaten
Contributor

You're exactly right, the regression test doesn't cover the merging step, but maybe you could try it yourself with #88 (comment)? :-)

@shijie-wu

I have tried this script, but unfortunately it requires multiple GPUs. I have opened a new issue (#136) for a full regression test between metaseq and Hugging Face, as this is outside the scope of this PR.

@patrickvonplaten
Contributor

See: #164
