
[scripts] Convert resharded MP checkpoints to unflattened. #60

Merged
stephenroller merged 11 commits into main from convert_to_singleton on May 11, 2022

Conversation

stephenroller
Contributor

Patch Description
Adds a new script which meets the requirements of #31.

Testing steps
Ran on 125m. The usage example in the docstring shows real output.

@stephenroller
Contributor Author

stephenroller commented May 8, 2022

Alright I take it back, this script doesn't work at all. Going back to draft.

@stephenroller stephenroller marked this pull request as draft May 8, 2022 02:00
@stephenroller stephenroller marked this pull request as ready for review May 8, 2022 02:48
@stephenroller
Contributor Author

Okay, I think I got it now. @patrickvonplaten, can you try this with the approach used for the 350M?

@DGideas

DGideas commented May 8, 2022

If you get the RuntimeError: Ninja is required to load C++ extensions error message while running this script, please consider using this script to install a newer ninja-build.
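
As a quick sanity check (a sketch, not from this thread), you can also confirm that torch's extension loader actually sees a ninja binary:

    # Sanity check sketch: the "Ninja is required to load C++ extensions" error is
    # raised by torch's C++ extension loader when no usable ninja binary is on PATH.
    import subprocess
    from torch.utils import cpp_extension

    print(subprocess.check_output(["ninja", "--version"], text=True).strip())
    cpp_extension.verify_ninja_availability()  # raises RuntimeError if ninja is missing
    print("ninja is visible to torch.utils.cpp_extension")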

@DGideas

DGideas commented May 8, 2022

I failed to run this script with the 125M model and got this error:

...
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/dgideas/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/dgideas/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 115, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 111, in main
    dist_utils.call_main(cfg, worker_main)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 256, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 234, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/utils.py", line 203, in distributed_main
    main(cfg, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 55, in worker_main
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/checkpoint_utils.py", line 507, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/scripts/convert_to_singleton.py", line 49, in _build_model
    return fsdp_wrap(model)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/fully_sharded_data_parallel.py", line 146, in fsdp_wrap
    return wrap(module, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fairscale/nn/wrap/auto_wrap.py", line 187, in wrap
    return ConfigAutoWrap.wrapper_cls(module, **wrap_overrides)
  File "/usr/local/lib/python3.8/dist-packages/metaseq-0.0.1-py3.8-linux-x86_64.egg/metaseq/distributed/fully_sharded_data_parallel.py", line 49, in __init__
    super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'gradient_predivide_factor'

Has anyone run into this problem?

@stephenroller
Contributor Author

> (quoting @DGideas's traceback above)

This is a fairscale version mismatch.
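
One way to confirm the mismatch locally (a sketch; it only checks that the installed fairscale FSDP still accepts the keyword metaseq forwards to it):

    # Diagnostic sketch: check whether the installed fairscale version's FSDP
    # accepts the gradient_predivide_factor keyword that metaseq's wrapper passes.
    import inspect

    import fairscale
    from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

    sig = inspect.signature(FSDP.__init__)
    print("fairscale", fairscale.__version__)
    print("gradient_predivide_factor" in sig.parameters)  # False -> version mismatch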

@thomasw21
Contributor

thomasw21 commented May 8, 2022

Hey @stephenroller , thanks for the script. I have a few questions:

  • what is dict.txt? I currently replace it with an empty file and it seems to run fine (until the next error).
  • when running the script I get:
RuntimeError: Error(s) in loading state_dict for FlattenParamsWrapper:
   size mismatch for flat_param_0: copying a param with shape torch.Size([63435264]) from checkpoint, the shape in current model is torch.Size([44133888]).

I suppose the config isn't loaded correctly, since the model size is not the expected one?

EDIT: Okay, I think I found that dict.txt actually needs to store all the tokens in order to correctly initialize the vocabulary layer:

import json

with open("gpt2-vocab.json", "r") as fi:
    vocab = json.load(fi)

inv_vocab = {v: k for k, v in vocab.items()}
# the first 4 tokens are special tokens: <s>, <pad>, </s>, <unk>
indices = range(4, len(vocab))
with open("dict.txt", "w") as fo:
    for i in indices:
        # "1" is a count; I'm guessing it's the one used when building the tokenizer,
        # so defaulting to 1 here should be fine
        fo.write(f"{inv_vocab[i]} 1\n")

@suchenzang
Contributor

@thomasw21 Opened an issue to remove this dict logic for now: #64

Currently it's just a file that looks like:

(base) susanz@ip-<redacted>:/<redacted>$ cat dict.txt
4 1
5 1
6 1
7 1
...
50268 1
50269 1
50270 1
50271 1
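
If you just need a placeholder, a minimal sketch that writes a file of exactly this shape (ids 4 through 50271, each with a dummy count of 1):

    # Sketch: write a placeholder dict.txt of the shape shown above
    # (ids 4..50271, each paired with a dummy count of 1).
    with open("dict.txt", "w") as fo:
        for token_id in range(4, 50272):
            fo.write(f"{token_id} 1\n")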

@thomasw21
Contributor

Thanks @suchenzang!
This script does fail with 1.3B with:

   File "fairscale/fairscale/nn/misc/flatten_params_wrapper.py", line 276, in _init_flatten_params
    assert len(set(p.dtype for p in params)) == 1, "expects all parameters to have same dtype"
AssertionError: expects all parameters to have same dtype

Looking at the init, the layer-norm weights/biases are stored in fp32 while the rest is in fp16. The fairscale branch prefetch_fsdp_params_simple has that assert: https://github.com/facebookresearch/fairscale/blob/8820049331331c773077c257667aa81baf4cc9f9/fairscale/nn/misc/flatten_params_wrapper.py#L276. While looking around I saw that ngoyal2707/Megatron-LM@06bd10e changed the way parameters are loaded; reverting to ae0b844c1f6725c3433a95e42cac760b3885170b seems to have fixed the issue (looking at the actual code, I'm unclear why, though).
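
For anyone debugging the same assert, a small sketch to count which dtypes actually end up in a shard (the filename and the top-level "model" key are assumptions about the checkpoint layout, adjust as needed):

    # Sketch: count parameter dtypes inside one resharded checkpoint file.
    # The filename and the "model" key are assumptions about the shard layout.
    from collections import Counter

    import torch

    state = torch.load("checkpoint_last-model_part-0-shard0.pt", map_location="cpu")
    params = state["model"]
    print(Counter(t.dtype for t in params.values() if torch.is_tensor(t)))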

@M-B-Lee

M-B-Lee commented May 8, 2022

> (quoting @thomasw21's size-mismatch report and dict.txt workaround above)

I also hit the same size-mismatch error when loading the model state dict.

@patrickvonplaten
Contributor

Hey @stephenroller,

Thanks a lot, the script works very well to merge the sharded checkpoints into a single checkpoint.
However, when testing the checkpoint on a generation task, the model only gives gibberish - see: #73

This script works well with the 350m checkpoint. Any ideas what could be the problem here?

@stephenroller
Contributor Author

Double-checking this work since people are reporting gibberish.

thomasw21 and others added 5 commits May 9, 2022 13:57

* Recursively unwrap fully sharded model
* Update metaseq/scripts/convert_to_singleton.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@ftamburin

Hi, thanks for this wonderful project!
I had a lot of problems using the script "convert_to_singleton.py" to join shards, because many model parameters (especially 'cfg') are not copied into the new model. Maybe I am doing something wrong, but the only way I found to join the shards and build a single .pt file that is structurally similar to the 350m model and works with @patrickvonplaten's run_model.py script was to add the following lines to the latest version of the script, just before saving the new model:

        # rebuild a 'cfg' entry so the merged file matches the 350m checkpoint layout
        _model_args['model'] = vars(_model_args['model'])
        _model_args['model']['_name'] = 'transformer_lm'
        _model_args['criterion'] = vars(_model_args['criterion'])
        glued = {'cfg': _model_args, 'model': glued}

This worked for me for both the 1.3B and 6.7B models.
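
To check that the rewrite took effect, a quick sketch (assuming the merged file is saved as restored.pt, the name used later in this thread):

    # Sketch: confirm the merged checkpoint now carries a 'cfg' entry next to the
    # glued state dict. "restored.pt" is the merged-file name referenced later in
    # this thread; adjust if your output path differs.
    import torch

    ckpt = torch.load("restored.pt", map_location="cpu")
    print(sorted(ckpt.keys()))            # expect something like ['cfg', 'model']
    print(ckpt["cfg"]["model"]["_name"])  # expect 'transformer_lm'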

@patrickvonplaten
Contributor

> (quoting @ftamburin's 'cfg' workaround above)

Thanks for the tip, @ftamburin! Are you able to generate sensible outputs when adding this with run_model.py? E.g., is the next-word prediction a sensible word?

@ftamburin

ftamburin commented May 10, 2022

> (quoting @patrickvonplaten's question above)

Yes, sorry, I forgot to say it explicitly. Nice outputs, as with the 350m model.

You also have to hack the function _upgrade_state_dict in metaseq/checkpoint_utils.py:

def _upgrade_state_dict(state):
    return state

so that it returns the state immediately without any upgrade. Not very elegant, but it works.

@ftamburin

Is there a simple way to generate longer sequences after the prompt for this kind of model?
Could you suggest a script and/or a function for doing that?
Thanks!

@patrickvonplaten
Contributor

Thanks @ftamburin and @stephenroller for all the help. The checkpoints up to 6.7B now work for us.

I've converted them and uploaded them to the Hub so that they're easy for everybody here to use: #88

Will try to tackle the larger ones tomorrow.

@ftamburin, it should be very easy with Hugging Face's Transformers. We're hoping to have the checkpoints ready by Thursday.
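
For reference, generation with Transformers then only takes a few lines (a sketch using facebook/opt-1.3b as an example; swap in whichever size you converted):

    # Sketch: greedy generation with the Hugging Face OPT checkpoints.
    # The model id is just an example.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", use_fast=False)
    model = OPTForCausalLM.from_pretrained("facebook/opt-1.3b")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))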

@patrickvonplaten
Contributor

Sadly, I don't know how to run generation with metaseq.

@stephenroller
Contributor Author

Do you need any help with the larger models?

In the meantime, it seems like finalizing this PR and producing some generations at all scales to check correctness would be my best value-add right now.

@patrickvonplaten
Contributor

@stephenroller, all models up to 2.7GB seem to work well!

Tomorrow I'll check the big ones, up to 80GB. I think it should all work though :-)

@stephenroller stephenroller merged commit 1246d72 into main May 11, 2022
@stephenroller stephenroller deleted the convert_to_singleton branch May 11, 2022 16:32
@shijie-wu

Hi @patrickvonplaten and @stephenroller, I'm trying to do a regression test on the 1.3b model between HF and metaseq, since the HF unit tests only cover the 350m model.

https://github.com/huggingface/transformers/blob/58fb3c9f98877bf76efb03e376a5c92cf80f7952/tests/models/opt/test_modeling_opt.py#L269-L290

However, I don't have access to multiple GPUs to do the conversion with python -m metaseq.scripts.convert_to_singleton. Could you help confirm that the converted model would pass the regression test? Thanks!

@patrickvonplaten
Contributor

Hey @shijie-wu,

Sure, actually you can find the regression test that I've done to make sure the models work correctly here: https://huggingface.co/patrickvonplaten/opt_metaseq_350m/blob/main/run_model.py

@patrickvonplaten
Contributor

https://huggingface.co/patrickvonplaten/opt_metaseq_350m shows how one can set it up. As you can see, I'm checking that the metaseq and transformers logits match to a precision of 1e-3. I've tested this on both CPU and GPU (not on multi-GPU, though).
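
The core of that check boils down to something like the sketch below; metaseq_logits stands in for whatever the metaseq half of run_model.py produces, and the 350m model id is just the example from the link above:

    # Sketch of the logits comparison described above: only the Transformers half
    # is shown; `metaseq_logits` is a placeholder for the metaseq side.
    import torch
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
    model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    with torch.no_grad():
        hf_logits = model(**inputs).logits

    # metaseq_logits = ...  # obtained from the restored.pt checkpoint via metaseq
    # assert torch.allclose(hf_logits, metaseq_logits, atol=1e-3)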

@patrickvonplaten
Contributor

There are also analogous directories for the other sizes here: https://huggingface.co/models?other=opt_metasq

@shijie-wu

shijie-wu commented Jun 2, 2022

Thanks for clarifying! I missed the models under your repo 😅. I have checked that facebook/opt-1.3b indeed passes the regression test against the merged model in patrickvonplaten/opt_metaseq_1300m 🎊

To confirm: restored.pt in https://huggingface.co/patrickvonplaten/opt_metaseq_1300m/tree/main/model is the output of the merging step python -m metaseq.scripts.convert_to_singleton? However, as far as I can tell, this set of regression tests does not cover the merging step, although I might be missing something. It would be great if there were a test that also covers the merging step.

@patrickvonplaten
Contributor

You're exactly right, the regression test doesn't cover the merging step, but maybe you could try it yourself with #88 (comment)? :-)

@shijie-wu

I have tried this script, but unfortunately it requires multiple GPUs. I have opened a new issue (#136) for a full regression test between metaseq and Hugging Face, as this is outside the scope of this PR.

@patrickvonplaten
Contributor

See: #164
