GPT-NeoX-20B Integration #15642
Comments
Hey @sdtblck, this is great, we'd be super excited to support GPT-NeoX in the library. From our side, I believe @mrm8488 was excited about working on the implementation, and @patil-suraj and myself are happy to help out as well. Let us know how you'd like to collaborate.
So, I'd be happy to write a basic model class based around GPT-J, but I think we need to decide on how to address the model parallelism issue first. Should the model be written with mp=2 by default? Or would you prefer to write it as a merged model, and try to address the issues with merging somehow?
@stas00, would you have any recommendation regarding this question?
We have several WIP tensor parallelism (TP) projects happening that will eventually address this kind of challenge:
but the first one is not yet integrated and the second is still being worked on, and I'm pretty sure they still expect a single-file checkpoint. A single 20B model will currently work with Deepspeed-ZeRO over multi-gpu or with cpu/nvme-offload no problem, so I'd say the low-hanging fruit is to do mp=1. And as @sdtblck said, Tunji is working on merging the checkpoints - I think he is pretty close to completing at least the many-to-one path. Maybe try his branch and see if you get better results on checkpoint consolidation? Wrt TP, let's ask the project owners:
And ideally if you have code samples to run, that would be very helpful. Thank you, both!
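For context on the Deepspeed-ZeRO route, a minimal ZeRO-3 config with CPU offload might look roughly like this (a sketch only; the values are illustrative, not tuned):

```python
# Sketch of a ZeRO-3 config with parameter offload to CPU. In practice this dict
# would be written out as JSON and handed to the deepspeed launcher or to the
# transformers DeepSpeed integration.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
}
```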
@stas00 Fundamentally OSLO is designed for transformers, so you can easily parallelize gpt-neox by just adding 3 lines of mapping once it is integrated into transformers. But it will be a bit difficult to use before being integrated into transformers.
Didn't parallelformers work independently of HF Transformers?
Both oslo and parallelformers are designed for transformers. But if you want both to support gpt-neox, I can edit my code, so please tell me if you need it. The process of porting neox to transformers is very important from a user's point of view. If I can help with this process, I will be happy to participate.
To summarize the conversation: let's first write it as a merged model following @stas00's comment, leveraging the DeepSpeed integration. If the GPT-NeoX model is similar to GPT-J but contains a few differences, we recommend creating a new class, as you have highlighted above. We've also recently introduced a new tool that should help with adding the model. Let us know if you'd like us to help in any way, @sdtblck.
I am also trying to set up a conversion pipeline between models trained with GPT-NeoX and transformers. GPT-NeoX offers different settings for position embeddings and normalization, whereas GPT-J is only written for rotary embeddings. If we could create a general-purpose pipeline, it would be much easier to integrate new models into transformers.
If you are replying to me, Kevin, I was only asking whether either oslo or parallelformers can already support gpt-neox directly, before oslo is integrated, so that we then have more than one solution to give to users. If not, all is good; we have deepspeed-zero, and once oslo is integrated there will be at least 2 solutions (or 3 if it works with sagemaker as well).
I can help with creating a conversion pipeline.
@sdtblck, progress has been slow but thankfully consistent. Our strategy has been to provide various checkpoint manipulation routines inside deepspeed for client scripts to utilize. Below are links to the current status: the deepspeed branch, unit tests, and utils folder, plus the bigscience branch and client scripts. Do let me know if you find this useful.
Hi @stas00, sorry I did not see your message earlier. I will look into this and let you know if we can run inference through ds-inference.
Hi Stas, I checked the implementation, and we have all the kernels to run this through ds-inference. I will add a policy for this and send a PR on the DeepSpeed side.
Thanks a lot, Reza!
How's this going? I will actively participate in this if there are not enough people.
Hi, I have a version of the 20B model with TP=1 written up here: https://github.com/zphang/minimal-gpt-neox-20b/blob/main/minimal20b/model.py It uses the checkpoint merging script from EleutherAI/gpt-neox#466. As mentioned in the PR and by Sid above, there seems to be a slight performance regression when merging TP=2 to TP=1 weights, due to some small differences between the duplicated layer weights. If we're okay with working off a slightly worse TP=1 version of the model (while we investigate the issue), I am happy to submit a PR for adding GPT-NeoX-20B. (Should the model be uploaded under the EleutherAI org? Should we make it clear that this is slightly worse than the TP=2 version?)
@zphang is it possible the repo is private?
Sorry! Should be public now.
Great work to all the people involved in making this available. Since there has been no update for a while, I wanted to check whether there are plans to add this to the Hugging Face Hub in the near term, and whether help is needed with that.
I've found the issue with TP=1 checkpoints, and should be ready to write the 20B implementation in Transformers. Quick question: is it possible to upload multiple files for the layer weights to the Model Hub? Given how big the model is, I'm planning to split up the weights by layers.
Hey @zphang, since #16343, yes it is! Now, when saving the model with `save_pretrained`, the checkpoint can automatically be sharded into multiple files. We recommend staying under the upper bound of 20GB, as files above that limit won't get distributed by CloudFront, resulting in very slow downloads. We have put the default shard size well under that limit. This is a brand new feature that is only available on the `main` branch for now.
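For illustration, sharded saving and re-loading might look like this (a sketch; the paths and shard size are placeholders):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-merged")

# Split the checkpoint into shards of at most ~10GB each; from_pretrained
# picks the shards back up transparently via the generated index file.
model.save_pretrained("./gpt-neox-20b-sharded", max_shard_size="10GB")
reloaded = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-sharded")
```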
Follow-up question: I'm tweaking the implementation for 20B currently, and have the option of either using more optimized code similar to the Megatron implementation (e.g. adding biases at the last moment) or something that reads more like standard PyTorch (just using F.linear). Is there a preference for a more straightforward implementation or a more performant one? Also, given how large the weights are, even having the model be initialized in fp32 might be quite a burden. Since the model itself is natively fp16, would it make sense for me to call `half()` on the weights before uploading?
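As a rough sketch of the fp16 route being asked about (the paths are placeholders, and this assumes a Transformers-format fp32 checkpoint already exists locally):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-fp32")

# Cast the weights to fp16 before uploading, so users don't have to download
# and then convert a roughly twice-as-large fp32 checkpoint.
model = model.half()
model.save_pretrained("./gpt-neox-20b-fp16", max_shard_size="10GB")
```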
I'm working on an implementation here: https://github.com/zphang/transformers/tree/neox20b. Still WIP, but targeting a PR later today/tomorrow.
Exciting, thanks for working on it @zphang! Let us know if we can help.
It depends how much complexity is necessary for the performant implementation, and how platform-dependent the result is. The difference you mention doesn't look like it adds unnecessary complexity, so I would go with the performant approach.
For GPT-J, I believe we chose to go with two branches on the model repo, one which contains the fp16 weights and one that contains the fp32 weights, so that users may choose whichever weights they prefer.
Added a PR here: #16659 The model implementation should be fully working, based on quick tests with LM-eval-harness (3.680 vs 3.676 perplexity compared to the NeoX implementation). I'm a little less certain about the tokenizer implementation; so far I've been working off a [...]. The model is large, so unfortunately it takes:
First time adding a model, so there are likely things I've left out/mistakes I made. Please let me know!
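For anyone trying the PR branch, basic usage should mirror the other causal LMs in the library; a sketch (the class names follow the eventual merged implementation, and the local path is a placeholder):

```python
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("./gpt-neox-20b")
model = GPTNeoXForCausalLM.from_pretrained("./gpt-neox-20b")

inputs = tokenizer("GPT-NeoX-20B was trained on", return_tensors="pt")
output = model.generate(**inputs, max_length=30)
print(tokenizer.decode(output[0]))
```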
I get:
That's typically either cgroups or the oom-killer; you should normally see the details in the system logs. But regardless:
and switch to
Thanks for the tip. I think this model is already sharded into pieces.
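For reference, loading a sharded checkpoint without first materializing a full fp32 copy in host RAM can be done along these lines (a sketch; the flags require a recent transformers version and the path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./gpt-neox-20b",
    torch_dtype=torch.float16,   # keep the native fp16 weights
    low_cpu_mem_usage=True,      # load shard by shard instead of fp32-init + copy
)
```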
OK, got past the memory issue, it was an issue on my end (slurm job scheduler). This may be an issue:
```
File ..., in load_tokenizer(model_name_or_path='./gpt-neox-20b', **kwargs={'cache_dir': ...})
     16 def load_tokenizer(model_name_or_path: str = None, **kwargs) -> AutoTokenizer:
---> 17     return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
        model_name_or_path = './gpt-neox-20b'
        kwargs = {'cache_dir': ...}

File .../lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:525, in AutoTokenizer.from_pretrained(cls=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, pretrained_model_name_or_path='./gpt-neox-20b', *inputs=(), **kwargs={'_from_auto': True, 'cache_dir': ...})
    522     tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    524     if tokenizer_class is None:
--> 525         raise ValueError(
    526             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    527         )
    528     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    530 # Otherwise we have to be creative.
    531 # if model is an encoder decoder, the encoder tokenizer class is used by default

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.
```
I got this error while loading the model on a GPU (Quadro RTX 8000 with 48GB memory):
This is also the whole code I used:
Can anyone run the model on GPU?
@farzanehnakhaee70 when you use the generate function, there is some GPU memory overhead, especially since num_return_sequences=4 and max_length > 30. I know it doesn't say CUDA out of memory, but it seems most likely to me. Perhaps you could try something simpler first,
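for example (a sketch, assuming `model` and `tokenizer` are the objects already loaded; the prompt and settings are placeholders):

```python
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# A single returned sequence and a short max_length keep the extra memory
# used during generation as small as possible.
output = model.generate(**inputs, max_length=20, num_return_sequences=1)
print(tokenizer.decode(output[0]))
```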
to minimize memory usage and see if that works. Then maybe
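something with slightly larger settings (again only a sketch):

```python
# If the minimal call goes through, scale the settings back up step by step.
outputs = model.generate(**inputs, max_length=50, num_return_sequences=2, do_sample=True)
for seq in outputs:
    print(tokenizer.decode(seq))
```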
Hope that works.
Thanks @ViktorThink
I get the following error:
Is this no longer supported?
Zphang's PR hasn't been merged into huggingface transformers yet, so you need to install transformers like this:
Then it should be possible to download.
Hi everybody, just an update from my side: whenever I run the model in full precision, I get a memory error saying it cannot allocate the memory needed.
I'm trying to get this to run on multiple GPUs (8) using deepspeed, zero3, fp16. I'm hitting an out-of-memory error, I believe. The machine has ~400GiB RAM. The process is killed with exit code -9.
What is the current setup for loading a pretrained model directly into fp16 in Transformers? I think the issue may be that the weights are being loaded in fp32 before being cast down.
Is there an easy way to load the weights directly in fp16?
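One way to avoid the intermediate fp32 copy, assuming a transformers version that supports `torch_dtype` and `low_cpu_mem_usage` (a sketch):

```python
import torch
from transformers import AutoModelForCausalLM

# Instantiate the model directly with fp16 parameters instead of loading
# fp32 weights and casting afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```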
Hi everyone, the model runs on GPU with an A40 using the following code:
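Roughly along these lines (a sketch rather than the exact script; paths, prompt, and generation settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./gpt-neox-20b"  # placeholder for the local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The fp16 weights (~40GB) fit on a single 48GB A40.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model = model.to("cuda").eval()

inputs = tokenizer("GPT-NeoX-20B is", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_length=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0]))
```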
Hello! I was trying to run the model on smaller GPUs (T4) using accelerate. I added one config entry to get rid of the 'gpt-neox' key error. Script:
The error I get is the same as @farzanehnakhaee70's. Does somebody know how to get the model working with accelerate?
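One approach for spreading the model over several smaller GPUs is accelerate's big-model-inference utilities, roughly like this (a sketch; the checkpoint path and the `no_split_module_classes` value are assumptions):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint_path = "./gpt-neox-20b"  # placeholder for the sharded checkpoint folder

# Build the model skeleton without allocating real weights, then stream the
# shards onto whatever devices are available (GPUs first, CPU as overflow).
config = AutoConfig.from_pretrained(checkpoint_path)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint_path,
    device_map="auto",
    no_split_module_classes=["GPTNeoXLayer"],  # assumed name of the decoder block class
)
```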
Much appreciated @gaetanlop. That really works.
🚀 Feature request
Over at EleutherAI we've recently released a 20 billion parameter autoregressive gpt model (see gpt-neox for a link to the weights). It would be great to get this into transformers!
Motivation
The gpt-neox library is not quite as user-friendly as transformers, and is designed for efficient large-scale training rather than inference. So we think integrating it into transformers would be great for accessibility.
Your contribution
Personally, I can dedicate a bit of time to integrating this model. We already have a PR to convert the weights to HF format, although:
Difficulties