GPT-NeoX-20B Integration #15642

Closed
sdtblck opened this issue Feb 13, 2022 · 42 comments · Fixed by #16659

@sdtblck

sdtblck commented Feb 13, 2022

🚀 Feature request

Over at EleutherAI we've recently released a 20 billion parameter autoregressive GPT model (see the gpt-neox repo for a link to the weights). It would be great to get this into transformers!

Motivation

The gpt-neox library is not quite as user-friendly as transformers, and is designed for efficient large-scale training rather than inference. So we think integrating it into transformers would be great for accessibility.

Your contribution

Personally, I can dedicate a bit of time to integrating this model. We already have a PR to convert the weights to HF format, although:

  1. I suspect you would want to introduce a new model class for it.
  2. It is not merged / thoroughly tested with all possible configurations yet. GPT-NeoX has a bunch of configuration options, and it might be more straightforward to focus on just introducing a model class for GPT-NeoX-20B (which should largely be similar to GPT-J, with some caveats, see next section)

Difficulties

  • Whilst we do have a script to merge model-parallel checkpoints (to be merged soon), in larger models we see some performance loss when merging MP ranks, due to very slight differences in the replicated parameters between ranks (see the thread above for more details). If we want to integrate GPT-NeoX-20B as a model to be used on a single GPU, we need to figure out how to address this. I'm not sure if this bug is specific to neox, or was introduced in megatron or deepspeed, but I believe the bigscience team are also looking at merging MP models, so I guess we'll find out soon.
  • If we do integrate the model without model parallelism, it will be too large to run on most consumer GPUs. During inference it takes ~45GB of GPU memory to run, and during training much more.
@LysandreJik
Member

Hey @sdtblck, this is great, we'd be super excited to support GPT-NeoX in the library. From our side, I believe @mrm8488 was excited about working on the implementation, and @patil-suraj and myself are happy to help out as well.

Let us know how you'd like to collaborate.

@sdtblck
Author

sdtblck commented Feb 14, 2022

So, I'd be happy to write a basic model class based around GPT-J, but I think we need to decide how to address the model parallelism issue first. Should the model be written with mp=2 by default? Or would you prefer to write it as a merged model, and try to address the issues with merging somehow?

@LysandreJik
Member

@stas00, would you have any recommendation regarding this question?

@stas00
Contributor

stas00 commented Feb 22, 2022

We have several WIP tensor parallelism (TP) projects happening that will eventually address this kind of challenge:

  • oslo: TP
  • Deepspeed-Inference == TP

but the first one is not yet integrated and the second is still being worked on. I'm pretty sure both still expect a single-file checkpoint.

So a single 20B model will currently work with Deepspeed-ZeRO, either multi-GPU or with CPU/NVMe offload, no problem, so I'd say that the low-hanging fruit is to do mp=1 (a rough sketch of that route follows at the end of this comment).

And as @sdtblck said, Tunji is working on merging the checkpoints - I think he is pretty close to completing at least the many-to-one path. Maybe try his branch and see if you get better results on checkpoint consolidation?


Wrt TP, let's ask the project owners:

  • @hyunwoongko, could you please comment on the feasibility of running GPT-NeoX-20B on oslo (perhaps for now as a standalone use until it's integrated into transformers)
  • @RezaYazdaniAminabadi, could you please comment on the feasibility of running GPT-NeoX-20B on Deepspeed-Inference

And ideally if you have code samples to run that would be very helpful.

Thank you, both!
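[Editor's illustration] A rough sketch of the Deepspeed-ZeRO offload route mentioned above, assuming an already-consolidated mp=1 checkpoint at a hypothetical local path ./gpt-neox-20b-hf; the config values and generation settings are illustrative only, not a tested recipe:

# Rough sketch only: single-GPU inference with ZeRO-3 and CPU parameter offload.
# "./gpt-neox-20b-hf" is a hypothetical, already-consolidated (mp=1) checkpoint path.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created before from_pretrained so the weights are loaded straight into ZeRO-3 partitions.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-hf")
tokenizer = AutoTokenizer.from_pretrained("./gpt-neox-20b-hf")

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

inputs = tokenizer("EleutherAI is", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))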

@hyunwoongko
Contributor

hyunwoongko commented Feb 22, 2022

@stas00 Fundamentally OSLO is designed for transformers. So you can easily parallelize gpt-neox by just adding 3 lines of mapping once it is integrated into transformers. But it will be a bit difficult to use before being integrated into transformers.

@stas00
Contributor

stas00 commented Feb 22, 2022

Didn't parallelformers work independently of HF Transformers?

@hyunwoongko
Contributor

hyunwoongko commented Feb 23, 2022

Both oslo and parallelformers are designed for transformers. But if you want both to support gpt-neox, I can edit my code. So please tell me if you need it.

The process of porting neox to transformers is very important from a user's point of view. If I can help with this process, I will be happy to participate.

@LysandreJik
Member

To summarize the conversation: as a first step, let's write it as a merged model following @stas00's comment, leveraging the DeepSpeed integration?

If the GPT-NeoX model is similar to GPT-J but contains a few differences, we recommend creating a new class, as you have highlighted above. We've recently introduced a new tool, add-new-model-like, which should help create an exact replica of GPT-J for you to tweak as you wish.

Let us know if you'd like for us to help in any way @sdtblck.

@eghbalhosseini

I am also trying to set up a conversion pipeline between models that are trained with GPT-NeoX and transformers. GPT-NeoX offers different settings for position embeddings and normalization; however, GPT-J is only written for rotary embeddings. If we could create a general-purpose pipeline, it would be much easier to integrate new models into transformers.

@stas00
Contributor

stas00 commented Feb 23, 2022

Both oslo and parallelformers are designed for transformers. But if you want both to support gpt-neox, I can edit my code. So please tell me if you need it.

If you are replying to me, Kevin, I was only asking whether either oslo or parallelformers can already support gpt-neox directly, before oslo is integrated, so that we then have more than one solution to give to users. If not, all is good: we have deepspeed-zero, and once oslo is integrated there will be at least 2 solutions (or 3 if it works with sagemaker as well).

@eghbalhosseini

I can help with creating a conversion pipeline.

@tjruwase
Contributor

tjruwase commented Feb 23, 2022

And as @sdtblck said, Tunji is working on merging the checkpoints - I think he is pretty close to completing at least the many-to-one path. Maybe try his branch and see if you get better results on checkpoint consolidation?

@sdtblck, progress has been slow but thankfully consistent. Our strategy has been to provide various checkpoint manipulation routines inside deepspeed for client scripts to utilize. Below are links to the current status:

deepspeed branch, unit tests, and utils folder

bigscience branch and clients.

Do let me know if you find this useful.

@RezaYazdaniAminabadi
Contributor

Hi @stas00,

Sorry I did not see your message earlier. I will look into this and let you know if we can run inference through ds-inference.
Thanks,
Reza

@RezaYazdaniAminabadi
Contributor

Hi Stas, I checked the implementation, and we have all the kernels to run this through ds-inference. I will add a policy for this and send a PR on the DeepSpeed side.
Best,
Reza

@stas00
Contributor

stas00 commented Mar 4, 2022

Thanks a lot, Reza!

@hyunwoongko
Contributor

hyunwoongko commented Mar 14, 2022

How's this going? I will actively participate in this if there are not enough people.

@zphang
Contributor

zphang commented Mar 15, 2022

Hi,

I have a version of the 20B model with TP=1 written up here: https://github.com/zphang/minimal-gpt-neox-20b/blob/main/minimal20b/model.py

It uses the checkpoint merging script from EleutherAI/gpt-neox#466. As mentioned in the PR and by Sid above, there seems to be a slight performance regression from merging TP=2 to TP=1 weights, due to some small differences between the duplicated layer weights.

If we're okay with working off a slightly worse TP=1 version of the model (while we investigate the issue), I am happy to submit a PR for adding GPT-NeoX-20B. (Should the model be uploaded under the EleutherAI org? Should we make it clear that this is slightly worse than the TP=2 version?)

@oborchers

@zphang is it possible the repo is private?

@zphang
Contributor

zphang commented Mar 15, 2022

Sorry! Should be public now

@ViktorThink

Great work to all the people involved in making this available. Since there has been no update for a while, I wanted to check whether there are plans to add the model to the Hugging Face hub in the near term, and whether help is needed with that?

@zphang
Contributor

zphang commented Mar 30, 2022

I've found the issue with TP=1 checkpoints, and should be ready to write the 20B implementation in Transformers.

Quick question: Is it possible to upload multiple files for the layer weights to the Model Hub? Given how big the model is, I'm planning to split up the weights by layers.

@LysandreJik
Member

Hey @zphang, since #16343, yes it is! Now, when saving the model with save_pretrained, you can specify a max_shard_size parameter. It will try to split up the weights into files no larger than the size you specify.

We recommend staying under the upper bound of 20GB, as files above that limit won't get distributed by CloudFront, resulting in very slow downloads. We have set a default of 10GB for each shard.

This is a brand new feature that is only available on the main branch of the repository, so you'll need to install from source; and, as always, we're very eager to hear your feedback when using that brand new feature!
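[Editor's illustration] A minimal sketch of what that could look like; the local paths are hypothetical:

# Minimal sketch: re-save a local checkpoint in shards of at most 10GB each.
# "./gpt-neox-20b" and "./gpt-neox-20b-sharded" are hypothetical local paths.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b")
model.save_pretrained("./gpt-neox-20b-sharded", max_shard_size="10GB")

# from_pretrained later reassembles the shards transparently:
# model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-sharded")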

@zphang
Contributor

zphang commented Mar 31, 2022

Follow-up question: I'm currently tweaking the implementation for 20B, and have the option to either use more optimized code similar to the Megatron implementation (e.g. adding biases at the last moment) or something that reads more like standard PyTorch (just using F.linear).

Is there a preference for a more straightforward implementation or more performant implementation? To give an idea of the performance difference, in local testing it's something on the order of 1.5s vs. 2s/it. Edit: Ignore the previous comparison, I was looking at the wrong numbers.

Also, given how large the weights are, even having the model be initialized in fp32 might be quite a burden. Since the model itself is natively fp16, would it make sense for me to call .half() when instantiating the internal modules?

@zphang
Contributor

zphang commented Apr 5, 2022

I'm working on an implementation here: https://github.com/zphang/transformers/tree/neox20b. Still WIP, but targeting a PR later today/tomorrow.

@LysandreJik
Member

Exciting, thanks for working on it @zphang! Let us know if we can help.

Is there a preference for a more straightforward implementation or more performant implementation?

It depends how much complexity is necessary for the performant implementation, and how platform-dependent the result is. The difference you mention doesn't look like it adds unnecessary complexity, so I would go with the performant approach.

Also, given how large the weights are, even having the model be initialized in fp32 might be quite a burden. Since the model itself is natively fp16, would it make sense for me to call .half() when instantiating the internal modules?

For GPT-J, I believe we chose to go with two branches on the model repo, one which contains the fp16 weights and one that contains the fp32 weights, so that users may choose whichever weights they prefer.
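[Editor's illustration] As a concrete sketch of that GPT-J convention (whether GPT-NeoX-20B follows the same branch naming is up to the PR):

# Sketch of the GPT-J setup: fp32 weights on the main branch, fp16 on a "float16" revision.
# In practice you would load only one of the two.
import torch
from transformers import AutoModelForCausalLM

# fp32 weights (main branch)
model_fp32 = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# fp16 weights ("float16" branch), kept in half precision while loading
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16
)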

@zphang zphang mentioned this issue Apr 7, 2022
@zphang
Contributor

zphang commented Apr 7, 2022

Added a PR here: #16659

The model implementation should be fully working, based on quick tests with both LM-eval-harness (3.680 vs 3.676 perplexity compared to the NeoX implementation) and .generate. The model weights should be up on the model hub at `EleutherAI/gpt-neox-20b`.

I'm a little less certain about the tokenizer implementation. So far I've been working off a tokenizers Tokenizer object, so I'm not so sure about the slow tokenizer implementation.

Model is large so unfortunately it takes:

  • About 1 minute to initialize the model weights
  • About 1+ minutes to load the weights in (not including downloading them)

First time adding a model, so there are likely things I've left out/mistakes I made. Please let me know!

@aalok-sathe

I get Killed rather than a MemoryError trying to load the weights using AutoModelForCausalLM, maybe there is some kind of a memory leak? I was trying to load the model into CPU with >300GB of memory.

@stas00
Contributor

stas00 commented Apr 20, 2022

That's typically either cgroups or the oom-killer; you should see the details in /var/log/syslog and maybe in the output of dmesg. So it doesn't matter how much memory you have - what matters is how tightly these are configured to kill programs that consume more resident CPU memory than they are allowed.

But regardless:

  1. shard this large checkpoint first as I have shown here:
    bigscience/T0 multi-gpu inference exits with return code -9 #16616 (comment)
  2. switch to transformers@main, as just this morning memory usage for sharded loading got more efficient (#16844) - a sketch of this two-step recipe follows below
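[Editor's illustration] A sketch of that two-step recipe; the local paths are hypothetical, and low_cpu_mem_usage=True additionally skips allocating a second, randomly-initialized copy of the weights:

# Sketch: (1) re-shard the checkpoint once, (2) reload with a lower peak CPU memory footprint.
# "./gpt-neox-20b-single" and "./gpt-neox-20b-sharded" are hypothetical local paths.
import torch
from transformers import AutoModelForCausalLM

# One-time re-sharding into ~5GB pieces.
model = AutoModelForCausalLM.from_pretrained("./gpt-neox-20b-single", torch_dtype=torch.float16)
model.save_pretrained("./gpt-neox-20b-sharded", max_shard_size="5GB")
del model

# Subsequent loads hold roughly the model plus one shard in memory at a time.
model = AutoModelForCausalLM.from_pretrained(
    "./gpt-neox-20b-sharded", torch_dtype=torch.float16, low_cpu_mem_usage=True
)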

@aalok-sathe

Thanks for the tip on cgroups and oom-killer, I will inquire about these limits.

I think this model is already sharded into pieces <1GB each (see repo here: https://huggingface.co/EleutherAI/gpt-neox-20b/tree/main), so that seems less of an issue.
(more details here: #16659; #16659 (comment))

@aalok-sathe

OK, got past the memory issue, it was an issue on my end (slurm job scheduler).

This may be an issue:

File ..., in load_tokenizer(model_name_or_path='./gpt-neox-20b', **kwargs={'cache_dir': ...})
     16 def load_tokenizer(model_name_or_path: str = None, **kwargs) -> AutoTokenizer:
---> 17     return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
        model_name_or_path = './gpt-neox-20b'
        kwargs = {'cache_dir': ...}

File .../lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:525, in AutoTokenizer.from_pretrained(cls=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, pretrained_model_name_or_path='./gpt-neox-20b', *inputs=(), **kwargs={'_from_auto': True, 'cache_dir': ...})
    522         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    524     if tokenizer_class is None:
--> 525         raise ValueError(
    526             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    527         )
    528     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    530 # Otherwise we have to be creative.
    531 # if model is an encoder decoder, the encoder tokenizer class is used by default

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.
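[Editor's illustration] Since the PR only implements a fast (Rust-backed) tokenizer, loading that class directly sidesteps the missing slow GPTNeoXTokenizer; a small sketch, assuming the #16659 branch is installed and the same local path as above:

# Sketch: load the fast tokenizer class directly instead of going through AutoTokenizer.
from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained("./gpt-neox-20b")
print(tokenizer("Hello GPT-NeoX")["input_ids"])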

@farzanehnakhaee70

I got this error while loading the model in GPU (Quadro RTX 8000 with 48GB memory):

Traceback (most recent call last):
  File "infer.py", line 23, in <module>
    beam_output = model.generate(
  File "/gpt_neo/transformers/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/gpt_neo/transformers/src/transformers/generation_utils.py", line 1306, in generate
    return self.sample(
  File "/gpt_neo/transformers/src/transformers/generation_utils.py", line 1922, in sample
    outputs = self(
  File "/gpt_neo/transformers/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpt_neo/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 621, in forward
    outputs = self.gpt_neox(
  File "/gpt_neo/transformers/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpt_neo/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 513, in forward
    outputs = layer(
  File "/gpt_neo/transformers/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpt_neo/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 317, in forward
    attention_layer_outputs = self.attention(
  File "/gpt_neo/transformers/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpt_neo/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 155, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/gpt_neo/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 220, in _attn
    raise RuntimeError()
RuntimeError

This is also the whole code I used:

model = AutoModelForCausalLM.from_pretrained(PATH, local_files_only=True).half().cuda()
tokenizer = GPTNeoXTokenizerFast.from_pretrained(PATH, local_files_only=True)

input_ids=tokenizer.encode("This is the input text", return_tensors="pt",add_special_tokens=False).cuda()

beam_output = model.generate(
      input_ids=input_ids,
      max_length=input_ids.shape[1]+30,
      min_length=input_ids.shape[1]+5,
      early_stopping=True,
      num_return_sequences=4,
      do_sample=True
      )

Can anyone run the model on GPU?

@ViktorThink

@farzanehnakhaee70 when you use the generate function, there is some overhead GPU memory used, especially since num_return_sequences=4 and max_length > 30.

I know it doesn't say cuda out of memory, but it seems most likely to me.

Perhaps you could try something simpler like

input_ids=tokenizer.encode("text", return_tensors="pt",add_special_tokens=False).cuda()
model(input_ids)

To minimize memory usage and see if that works.

Then maybe

beam_output = model.generate(
      input_ids=input_ids,
      max_length=input_ids.shape[1]+5,
      min_length=input_ids.shape[1]+5,
      early_stopping=True,
      num_return_sequences=1,
      do_sample=True
      )

Hope that works.
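[Editor's illustration] One way to check whether memory really is the limit is to look at the allocator's high-water mark right after that minimal forward pass; a small sketch, where model and input_ids come from the earlier snippets:

# Sketch: measure the peak GPU memory used by the single forward pass suggested above.
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(input_ids)
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB of {total_gib:.1f} GiB")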

@farzanehnakhaee70

Thanks @ViktorThink
Unfortunately it doesn't solve the issue.

@zanussbaum
Contributor

zanussbaum commented May 12, 2022

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
      2 tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

File ~/Documents/GitHub/prompting/env/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:423, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    421 kwargs["_from_auto"] = True
    422 if not isinstance(config, PretrainedConfig):
--> 423     config, kwargs = AutoConfig.from_pretrained(
    424         pretrained_model_name_or_path, return_unused_kwargs=True, trust_remote_code=trust_remote_code, **kwargs
    425     )
    426 if hasattr(config, "auto_map") and cls.__name__ in config.auto_map:
    427     if not trust_remote_code:

File ~/Documents/GitHub/prompting/env/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py:672, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    670     return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
    671 elif "model_type" in config_dict:
--> 672     config_class = CONFIG_MAPPING[config_dict["model_type"]]
    673     return config_class.from_dict(config_dict, **kwargs)
    674 else:
    675     # Fallback: use pattern matching on the string.

File ~/Documents/GitHub/prompting/env/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py:387, in _LazyConfigMapping.__getitem__(self, key)
    385     return self._extra_content[key]
    386 if key not in self._mapping:
--> 387     raise KeyError(key)
    388 value = self._mapping[key]
    389 module_name = model_type_to_module_name(key)

KeyError: 'gpt_neox'

Is this no longer supported?

@ViktorThink

zphang's PR hasn't been merged into huggingface transformers yet, so you need to install transformers like this:

pip3 install git+https://github.com/zphang/transformers.git@neox20b

Then it should be possible to download.

@farzanehnakhaee70

Hi everybody,
Are there any updates on the problem I had? Could anyone run the model on GPU?

Just an update from my side: whenever I run the model in full precision, I get a memory error saying it cannot allocate the needed memory.

@mattf1n

mattf1n commented May 16, 2022

I'm trying to get this to run on multiple GPUs (8) using deepspeed, zero3, fp16. I'm hitting an out-of-memory error I believe. Machine has ~400GiB RAM. Process is killed with exit code -9.

@zphang
Contributor

zphang commented May 16, 2022

What is the current setup for loading a pretrained model directly in fp16 in Transformers? I think the issue may be that the weights are being loaded in fp32 before .half() is called.

@mattf1n

mattf1n commented May 19, 2022

Is there an easy way to call .half() before loading weights?

@gaetanlop

gaetanlop commented May 20, 2022

Hi everyone, the model runs on GPU with an A40 using the following code:

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", torch_dtype=torch.float16)
model = model.to("cuda:0")

.half() is leading to the same issue reported by @farzanehnakhaee70
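[Editor's illustration] For completeness, a short generation sketch building on that fp16 load; the prompt and sampling settings are arbitrary:

# Sketch: generate with the fp16-loaded model from the snippet above.
import torch
from transformers import AutoModelForCausalLM, GPTNeoXTokenizerFast

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", torch_dtype=torch.float16
).to("cuda:0")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")

input_ids = tokenizer("EleutherAI is", return_tensors="pt").input_ids.to("cuda:0")
with torch.no_grad():
    output = model.generate(input_ids, do_sample=True, max_length=input_ids.shape[1] + 30)
print(tokenizer.decode(output[0]))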

@thies1006

Hello! I was trying to run the model on smaller GPUs (T4) using accelerate. I added one config to get rid of the 'gpt-neox' key error.

Script:

import torch

from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast
from huggingface_hub import snapshot_download

weights_path = "EleutherAI/gpt-neox-20b"

from accelerate import init_empty_weights, dispatch_model, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(weights_path)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

tokenizer = AutoTokenizer.from_pretrained(weights_path)

#not sure if this is needed
#model.tie_weights()

device_map = infer_auto_device_map(model, no_split_module_classes=["GPTNeoXLayer"], max_memory={0:12000000000,1:12000000000,2:12000000000,3:12000000000,4:12000000000,5:12000000000,6:12000000000,7:12000000000}, dtype=torch.float16)

load_checkpoint_and_dispatch(
    model,
    weights_path,
    device_map=device_map,
    offload_folder=None,
    dtype=torch.float16,
    offload_state_dict=True
)

prompt = 'Huggingface is'
input_tokenized = tokenizer(prompt, return_tensors="pt")
output = model.generate(input_tokenized["input_ids"].to(0), do_sample=True)
output_text = tokenizer.decode(output[0].tolist())

The error I get is the same as @farzanehnakhaee70.

Somebody knows how to get the model working with accelerate?
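[Editor's illustration] One simpler route worth trying lets from_pretrained do the dispatching through accelerate itself; a sketch that assumes a recent enough transformers/accelerate with device_map support, and it does not address the fp16 RuntimeError above:

# Sketch: have from_pretrained split the fp16 weights across the visible GPUs via accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",   # requires accelerate; spreads layers over the available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("Huggingface is", return_tensors="pt").to(0)
output = model.generate(**inputs, do_sample=True, max_new_tokens=20)
print(tokenizer.decode(output[0]))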

@farzanehnakhaee70

Much appreciated, @gaetanlop. That really works.
The other problem is that the previous error still exists when using DeepSpeed and also accelerate (as @thies1006 mentioned). Do you have any solution for that?
