[scripts] Convert resharded MP checkpoints to unflattened. #60
Conversation
Alright, I take it back: this script doesn't work at all. Going back to draft. |
Okay, I think I got it now. @patrickvonplaten, can you try this with the approach used for the 350M? |
If you have |
I failed to run this script with the 125M model and got this error:
Has anyone run into this problem? |
This is a fairscale mismatch; installing the fairscale version pinned in metaseq's setup instructions should resolve it. |
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Hey @stephenroller, thanks for the script. I have a few questions:
I suppose the config isn't loaded correctly, since the model does not come out at the expected size? EDIT: Okay, I think I found it: dict.txt in fact needs to store all the tokens to correctly initialize the vocabulary layer:

```python
import json

with open("gpt2-vocab.json", "r") as fi:
    vocab = json.load(fi)

inv_vocab = {v: k for k, v in vocab.items()}
# The first 4 ids are reserved for the special tokens: <s>, <pad>, </s>, <unk>.
indices = [i for i in range(4, len(vocab))]

with open("dict.txt", "w") as fo:
    for i in indices:
        # "1" is a count; I'm guessing it's the one used when building the
        # tokenizer, so defaulting to 1 should be fine here.
        fo.write(f"{inv_vocab[i]} 1\n")
```
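For reference, each line of the generated dict.txt is just a token followed by a count. If I'm reading the standard GPT-2 encoder.json mapping right (ids 0 through 3 are `!`, `"`, `#`, `$`), the first few written lines would come out as:

```
% 1
& 1
' 1
```
|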
@thomasw21 Opened an issue to remove this dict logic for now: #64. It currently is just a file that looks like: |
Thanks @suchenzang !

```
File "fairscale/fairscale/nn/misc/flatten_params_wrapper.py", line 276, in _init_flatten_params
    assert len(set(p.dtype for p in params)) == 1, "expects all parameters to have same dtype"
AssertionError: expects all parameters to have same dtype
```

Looking at the init, the layer-norm weights/biases are stored in fp32 while the rest is in fp16, which is exactly what trips this assertion. One possible workaround is sketched below.
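This is only a minimal sketch, assuming `model` is the loaded module and that running the layer norms in fp16 is numerically acceptable (both assumptions, not something the repo prescribes):

```python
import torch

def cast_params_to_fp16(model: torch.nn.Module) -> None:
    # Cast every parameter, including the fp32 layer-norm weights and
    # biases, to fp16 so that all params share one dtype before flattening.
    for param in model.parameters():
        param.data = param.data.half()
```
|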
I also hit the same size-mismatch error when loading the model state dict. |
Hey @stephenroller, thanks a lot, the script works very well for merging the sharded checkpoints into a single checkpoint, and with the 350m checkpoint everything runs fine. Any ideas what could be the problem here? |
Double checking this work since people are reporting gibberish. |
Hi, thanks for this wonderful project!
This worked for me for both the 1.3B and 6.7B models. |
Thanks for the tip, @ftamburin! Are you able to generate sensible outputs when adding this with |
Yes, sorry, I forgot to write it explicitly. Nice outputs, just as for the 350m model. You also have to hack the procedure `_upgrade_state_dict` in metaseq/checkpoint_utils.py so that it returns the state immediately, without any upgrade. Not so elegant, but it works.
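A minimal sketch of that hack, assuming `_upgrade_state_dict` takes and returns the loaded checkpoint dict (the exact signature may differ across metaseq revisions):

```python
import metaseq.checkpoint_utils as checkpoint_utils

# Skip all legacy upgrade logic and return the loaded checkpoint as-is.
checkpoint_utils._upgrade_state_dict = lambda state: state
```
|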
Is there a simple way to generate longer sequences after the prompt with this kind of model? |
Thanks @ftamburin and @stephenroller for all the help. The checkpoints up to 6.7B now work for us. I've converted them and uploaded them to the Hub so that they are easy for everybody here to use: #88 Will try to tackle the larger ones tomorrow. @ftamburin, it should be very easy with Hugging Face's Transformers (rough sketch below). We're hoping to have the checkpoints ready by Thursday.
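For example, once a converted checkpoint is on the Hub, longer generation boils down to something like this (the `facebook/opt-1.3b` model id is an assumption here; substitute whichever converted checkpoint you are using):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # assumed id; any converted OPT checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
# max_new_tokens controls how much text is generated after the prompt.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|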
Sadly, I don't know how to run generation with metaseq. |
Do you need any help with the larger models? In the meantime, it seems like finalizing this PR and producing some generations at all scales to check correctness would be my best value-add right now. |
@stephenroller, all models up to 2.7B seem to work well! Tomorrow I'll check the big ones, up to 80GB. Think it should all work though :-) |
Hi @patrickvonplaten and @stephenroller, I'm trying to run a regression test on the 1.3b model between HF and metaseq, since the HF unit tests only cover the 350m model. However, I don't have access to multiple GPUs to do the conversion with |
Hey @shijie-wu, sure! You can actually find the regression test I ran to make sure the models work correctly here: https://huggingface.co/patrickvonplaten/opt_metaseq_350m/blob/main/run_model.py |
https://huggingface.co/patrickvonplaten/opt_metaseq_350m shows how one can set it up. As you can see, I'm checking that the metaseq and transformers logits match to a precision of 1e-3. I've tested this on both CPU and GPU (not on multi-GPU, though).
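The core of the check boils down to something like this sketch, with the tensor names assumed (run_model.py above is the actual reference):

```python
import torch

def check_logits_match(logits_metaseq: torch.Tensor,
                       logits_hf: torch.Tensor,
                       atol: float = 1e-3) -> None:
    # Both tensors are assumed to have shape (batch, seq_len, vocab_size)
    # and to be computed from the same input ids.
    max_diff = (logits_metaseq - logits_hf).abs().max().item()
    assert torch.allclose(logits_metaseq, logits_hf, atol=atol), \
        f"Logits diverge: max abs diff {max_diff:.2e} > {atol}"
```
|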
There are also the same directories for the other sizes here: https://huggingface.co/models?other=opt_metasq |
Thanks for clarifying! I missed the models under your repo 😅. I've checked the test. To confirm: the regression test doesn't cover the merging step itself, right? |
You're exactly right, the regression test doesn't cover the merging step, but maybe you could try it yourself with #88 (comment)? :-) |
I've tried this script, but unfortunately it requires multiple GPUs. I've opened a new issue (#136) for a full regression test between metaseq and huggingface, as this is outside the scope of this PR. |
See: #164 |
Patch Description
Adds a new script which meets the requirements of #31.
Testing steps
Ran on 125m. The usage example in the docstring shows real output.