
Adding GPT-NeoX-20B #16659

Merged
merged 42 commits into huggingface:main on May 24, 2022

Conversation

zphang
Contributor

@zphang zphang commented Apr 7, 2022

What does this PR do?

Adds GPT-NeoX-20B model and tokenizers.
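
For reference, a minimal usage sketch of the classes added here (assuming the class names GPTNeoXForCausalLM and GPTNeoXTokenizerFast and the EleutherAI/gpt-neox-20b checkpoint on the Hub; loading the 20B weights needs a large amount of memory):

from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model_name = "EleutherAI/gpt-neox-20b"

# Load the tokenizer and the causal LM head model added in this PR.
tokenizer = GPTNeoXTokenizerFast.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Simple sampled generation from a short prompt.
inputs = tokenizer("GPT-NeoX-20B is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=30, do_sample=True)
print(tokenizer.decode(outputs[0]))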

Fixes #15642

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zphang zphang mentioned this pull request Apr 7, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 7, 2022

The documentation is not available anymore as the PR was closed or merged.

@ViktorThink

Incredible work!

I have tested the model and it seems to work as intended. I did discover one problem with the tokenizer, though:

Here is the full script:

!git clone https://github.com/zphang/transformers
%cd transformers
!git checkout neox20b
!pip install -e .
%cd ..

from transformers import AutoModelForCausalLM, GPTNeoXTokenizer


model_name = r"EleutherAI/gpt-neox-20b"

model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer = GPTNeoXTokenizer.from_pretrained(model_name)

input_ids = tokenizer.encode("This is the input text", return_tensors="pt", add_special_tokens=False)
beam_output = model.generate(
    input_ids=input_ids,
    max_length=input_ids.shape[1] + 30,
    min_length=input_ids.shape[1] + 5,
    early_stopping=True,
    num_return_sequences=4,
    do_sample=True,
)

for j in range(4):
    output = tokenizer.decode(beam_output[j][input_ids.shape[1]:], skip_special_tokens=False)

I got the following error:

File "testing/testDecoderOnly.py", line 104, in testModelSample
ran = tokenizer.decode(beam_output[j][input_ids.shape[1]:], skip_special_tokens=False)
File "/home/ec2-user/t5-regression3/transformers/src/transformers/tokenization_utils_base.py", line 3308, in decode
**kwargs,
File "/home/ec2-user/t5-regression3/transformers/src/transformers/tokenization_utils.py", line 946, in _decode
sub_texts.append(self.convert_tokens_to_string(current_sub_text))
File "/home/ec2-user/t5-regression3/transformers/src/transformers/models/gpt2/tokenization_gpt2.py", line 266, in convert_tokens_to_string
text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
File "/home/ec2-user/t5-regression3/transformers/src/transformers/models/gpt2/tokenization_gpt2.py", line 266, in
text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
KeyError: ' '

@zphang
Contributor Author

zphang commented Apr 10, 2022

Hm yea, I can replicate that issue too. I'm not too familiar with the tokenization code. The fast tokenizer seems to work just fine, but the Python one (which I'm basing off GPT-2's tokenizer) seems to have some issues.

Here's a minimal reproducible version:

import transformers
model_name = "EleutherAI/gpt-neox-20b"
tokenizer_slow = transformers.GPTNeoXTokenizer.from_pretrained(model_name)
tokenizer_fast = transformers.GPTNeoXTokenizerFast.from_pretrained(model_name)
print("Fast", repr(tokenizer_fast.decode([50274])))
print("Slow", repr(tokenizer_slow.decode([50274])))
Fast '    '
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [78], in <cell line: 2>()
      1 print("Fast", repr(tokenizer_fast.decode([50274])))
----> 2 print("Slow", repr(tokenizer_slow.decode([50274])))

File ~/code/transformers/src/transformers/tokenization_utils_base.py:3304, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3301 # Convert inputs to python lists
   3302 token_ids = to_py_obj(token_ids)
-> 3304 return self._decode(
   3305     token_ids=token_ids,
   3306     skip_special_tokens=skip_special_tokens,
   3307     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3308     **kwargs,
   3309 )

File ~/code/transformers/src/transformers/tokenization_utils.py:946, in PreTrainedTokenizer._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
    944         current_sub_text.append(token)
    945 if current_sub_text:
--> 946     sub_texts.append(self.convert_tokens_to_string(current_sub_text))
    948 if spaces_between_special_tokens:
    949     text = " ".join(sub_texts)

File ~/code/transformers/src/transformers/models/gpt2/tokenization_gpt2.py:266, in GPT2Tokenizer.convert_tokens_to_string(self, tokens)
    264 """Converts a sequence of tokens (string) in a single string."""
    265 text = "".join(tokens)
--> 266 text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    267 return text

File ~/code/transformers/src/transformers/models/gpt2/tokenization_gpt2.py:266, in <listcomp>(.0)
    264 """Converts a sequence of tokens (string) in a single string."""
    265 text = "".join(tokens)
--> 266 text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    267 return text

KeyError: ' '

I believe the NeoX tokenizer handles spaces a little differently (it has special tokens for single, double, triple spaces, etc). Do you know if someone who's more familiar with tokenization code might be able to chime in?
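
A small sketch of what I mean, using the fast tokenizer to inspect how runs of spaces map to single tokens (token id 50274 is just the one from the repro above, not guaranteed to be meaningful elsewhere):

import transformers

tokenizer_fast = transformers.GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")

# Runs of spaces encode to single tokens in the NeoX vocabulary.
for n in range(1, 5):
    print(n, tokenizer_fast.encode(" " * n))

# The fast tokenizer decodes the multi-space token without issue:
print(repr(tokenizer_fast.decode([50274])))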

@ViktorThink

Great that the fast version works. Gently pinging @SaulLu and @Narsil if they have any answers.

Member

@LysandreJik LysandreJik left a comment


Thanks for your PR, @zphang, this is great! There are a few tests failing, let me give you pointers on how to solve them:

  • The check_code_quality run fails because the quality checks weren't applied. I recommend doing the following from the root of your fork: pip install -e .[quality], followed by make fixup. This should fix most of the issues, and tell you which issues remain to be solved manually.
  • There's a missing mention of GPT-NeoX-20B in the index.mdx file of the doc. Running make fix-copies from the root of your clone should solve this issue.

The rest of the issues seem to be linked to you importing many different models, most of which do not exist, in both src/transformers/models/gpt_neox/__init__.py and src/transformers/models/auto/modeling_auto.py. Left some comments where that applies.

Did you use the add-new-model-like command to add this model? What was your experience like using the script? Thanks again for your contributions!

Review comments (outdated, resolved) on:
src/transformers/models/auto/modeling_auto.py
src/transformers/models/gpt_neox/__init__.py
@zphang
Contributor Author

zphang commented Apr 11, 2022

Hey @LysandreJik, thanks for taking a look! I'll look into getting the tests to pass today.

Re: the model script, I did use the new model templating script, but many parts of it seemed to make the assumption that the model would be an encoder-decoder model (e.g. mentioning cross attention). I removed most of the other model implementations aside from CausalLM, as that's the primary format that NeoX-20B would be used for, but it looks like I missed some other references to the other model implementations. Other than that, the script was very useful in setting up the boilerplate.

@aalok-sathe

aalok-sathe commented Apr 21, 2022

added a PR to the PR to support AutoTokenizer initialization from pretrained_model_name_or_path:
zphang#1
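
Concretely, the goal is for something like the following to resolve to the GPT-NeoX tokenizer (a sketch of the intended behavior, not the exact wiring in that PR):

from transformers import AutoTokenizer

# With the GPT-NeoX tokenizer registered in the auto mappings, this
# should return the fast tokenizer for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(type(tokenizer).__name__)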

@StellaAthena
Contributor

I have resolved the merge conflicts in the config files, but I am not confident in my understanding of how these various configs are supposed to work. I would appreciate it if someone double-checked that I didn't do anything stupid.

@zphang
Contributor Author

zphang commented May 21, 2022

Are there any further blockers to merging? It would be nice to have this merged in time for ACL next week :)

Collaborator

@sgugger sgugger left a comment


Hi @zphang. Many of the comments/suggestions on the previous reviews were just ignored. I have a few more suggestions on the style.
We will merge the model as soon as they are resolved; let us know if you need any help.

Review comments (outdated, resolved) on:
src/transformers/__init__.py
src/transformers/models/gpt_neox/modeling_gpt_neox.py
tests/models/gpt_neox/test_modeling_gpt_neox.py
@zphang
Contributor Author

zphang commented May 23, 2022

Apologies, I must have missed the previous comments. I've pushed an update with the desired changes.

@sgugger
Collaborator

sgugger commented May 23, 2022

There are still four open comments on the modeling file, if you could have a look.

@zphang
Contributor Author

zphang commented May 23, 2022

I think I got to all of them now (is there an easy way to check on the GitHub web interface?); let me know if I'm missing any.

@sgugger
Collaborator

sgugger commented May 23, 2022

I see they are closed but not addressed; maybe you forgot to push your commit?

@zphang
Contributor Author

zphang commented May 23, 2022

Terribly sorry! Pushed now.

@sgugger sgugger merged commit 71e6027 into huggingface:main May 24, 2022
@sgugger
Collaborator

sgugger commented May 24, 2022

Thanks again for all your work on this model!

@sgugger sgugger changed the title from [WIP] Adding GPT-NeoX-20B to Adding GPT-NeoX-20B on May 24, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* initial

* first try

* working 20B

* 20B tokenizers

* Docs

* Import fixes for missing classes

* Update docs, fixup

* black formatting

* isort

* flake

* dummy objects

* documentation

* Documentation yml

* more docs

* tweaks for tests

* tokenization auto

* fix neox tests

* test

* test

* einsum

* address PR feedback

* Documentation

* Update README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/gpt_neox/__init__.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/gpt_neox/configuration_gpt_neox.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove undefined LaTeX syntax

* Update to full url to avoid confusion about if that's supposed to refer to the Hub

* fix auto

* move tests

* documentation fix

* more doc fixes

* test refactor

* fix import

* fix import

* fix import

* fix import

* fix import

* style fixes

* More modeling fixes

Co-authored-by: Jason Phang <zp489@gr057.hpc.nyu.edu>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>