
TF: BART compatible with XLA generation #17479

Merged: 17 commits merged into main on Jun 20, 2022

Conversation

gante (Member) commented May 30, 2022

What does this PR do?

Adds position_ids to TFBart, so that we can do generation with a padded past -- a requirement for XLA generation.

This PR was built on top of #17426 (so it will contain its diff until it gets merged), and is a requirement for #17458.

🚨 Important notes:

  1. Review suggestion: check the Bart file, then its test file. The other changes are either cosmetic changes (e.g. correcting comments) or the result of make fix-copies (several files have copies from Bart).
  2. There are several failing tests, but this is intentional: some models' prepare_inputs_for_generation were copied from Bart, but those models do not have the position_ids input. If the PR gets a positive review, I will propagate the changes to the affected models.
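As a purely illustrative sketch of the new input (not an excerpt from this PR; the checkpoint name and call pattern are just examples), passing the decoder position ids explicitly to TFBartForConditionalGeneration could look like this:

    import tensorflow as tf
    from transformers import BartTokenizer, TFBartForConditionalGeneration

    # Hypothetical usage sketch -- assumes the decoder_position_ids argument discussed in this PR.
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-base")

    inputs = tokenizer("Hello world", return_tensors="tf")
    decoder_input_ids = tf.constant([[model.config.decoder_start_token_id]])
    decoder_position_ids = tf.constant([[0]])  # explicit position of the decoder start token

    outputs = model(
        input_ids=inputs["input_ids"],
        decoder_input_ids=decoder_input_ids,
        decoder_position_ids=decoder_position_ids,
    )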

HuggingFaceDocBuilderDev commented May 30, 2022

The documentation is not available anymore as the PR was closed or merged.

@gante gante mentioned this pull request May 30, 2022
@gante gante marked this pull request as ready for review May 31, 2022 13:07
gante (Member, Author) commented May 31, 2022

@ydshieh tagging you for TF review, as Matt is off and you are also familiar with generate :)

ydshieh (Collaborator) commented May 31, 2022

@ydshieh tagging you for TF review, as Matt is off and you are also familiar with generate :)

Actually not very familiar, but I would love to get more involved 😄. Thanks for tagging me!

patrickvonplaten (Contributor) left a comment

Looks very nice to me! All the changes to modeling_tf_bart.py are 100% OK/good for me. I'd maybe just add one slow XLA test to test_modeling_bart.py, and once this works we can proceed and make all the tests pass?

ydshieh (Collaborator) left a comment

Hi @gante, I left a few comments so far.

Questions:

  • From the change in prepare_inputs_for_generation, both in the TF-GPT2 PR and this PR, my understanding of the main change is that we need to use the (decoder) attention mask to calculate the correct position_ids for both left and right padding, and this is done using tf.math.cumsum. Do I understand these PRs correctly?

  • Why do we need decoder_position_ids when past_key_values is passed?

    if position_ids is None:
        if past_key_values is not None:
            raise ValueError("Make sure to provide `decoder_position_ids` when passing `past_key_values`.")

    I know you mentioned this guard is copied from Flax, but I am just wondering if it is a real necessity. (I feel if it is really necessary, the same guard should also exist in GPT-2.)

ydshieh (Collaborator) left a comment

I continued a bit, but I need to take a rest before looking at _update_model_kwargs_for_xla_generation.

src/transformers/models/bart/modeling_tf_bart.py
    decoder_position_ids = tf.broadcast_to(past[0][0].shape[2], (decoder_input_ids.shape[0], 1))
else:  # non xla + non past
    decoder_position_ids = tf.broadcast_to(tf.range(decoder_input_ids.shape[1]), decoder_input_ids.shape)

ydshieh (Collaborator) commented Jun 4, 2022

As far as I understand, this (else) case (i.e. when past is None) does NOT require decoder_position_ids. TFBartLearnedPositionalEmbedding will take care of creating it.

Also, we don't need to broadcast here (unless XLA requires the explicit shape).

gante (Member, Author) replied

That is correct, since the guard is only active when the past is passed (not the case here). However, the return statement would fail, because it is expecting a decoder_position_ids variable. I'd rather make it explicit than implicit :)

(removed the broadcast)

src/transformers/models/bart/modeling_tf_bart.py
# cut decoder_input_ids if past is used
if past is not None:
    decoder_input_ids = decoder_input_ids[:, -1:]

if decoder_attention_mask is not None:  # xla
ydshieh (Collaborator) replied

nit: From TF-GPT2's prepare_inputs_for_generation, maybe the following would be more consistent?

I would guess decoder_position_ids would never be passed to prepare_inputs_for_generation, so it will always be created here.
(But then the question becomes why we even have such a check in TF-GPT2?)

        decoder_position_ids = kwargs.get("decoder_position_ids", None)
        ...

        if decoder_position_ids is None:
            if decoder_attention_mask is not None:  # xla
                ...
            elif past is not None:  # non xla + past
                ...
            else:  # non xla + non past
                ...

ydshieh (Collaborator) replied

However, see my comment for the (merged) TF-GPT2 PR

https://github.com/huggingface/transformers/pull/17426/files#r889687584

gante (Member, Author) commented Jun 14, 2022

I agree with your concerns, but I'm not going to worry about consistency for now :) I haven't managed to get beam search to work, so there is a chance I will have to rewrite all these generate-related functions.

After beam search is working, then yes, I'd like to revisit these models and make a template for each kind (i.e. decoder-only or encoder-decoder models). Would that work for you, @ydshieh?

ydshieh (Collaborator) replied

Good for me, @gante :-)

ydshieh (Collaborator) left a comment

Left a few comments for _update_model_kwargs_for_xla_generation.

  • It's NP-hard for me to understand 😢, so I tried to add some comments in the code. Hopefully you will find them helpful and merge them.
  • Need some explanations on decoder_attention_mask 🙏

src/transformers/models/bart/modeling_tf_bart.py
Comment on lines +1461 to +1472
decoder_attention_mask = tf.concat(
    [
        tf.ones((batch_size, 1), dtype=tf.int32),
        tf.zeros((batch_size, num_padding_values), dtype=tf.int32),
        tf.ones((batch_size, 1), dtype=tf.int32),
    ],
    axis=1,
)
ydshieh (Collaborator) commented Jun 6, 2022

[Update]
It looks like this block is for the case where the generation of decoder_start_token_id is already done, so num_padding_values would be equal to max_length - 1 - 1.
It's still not clear to me why we put the zeros before the second block of ones.

In general, I think it would be great if we could put more comments along the code to explain things.
(You definitely know things much better, but it would be beneficial to other developers 😄)

[Original comment]
I am not able to understand this block so far.

The decoder_attention_mask normally has the same length as the current input sequence.
I guess maybe here you want to keep the shape fixed (i.e. with max_length steps), but this block gives a length of 2 + num_padding_values?

ydshieh (Collaborator) commented Jun 6, 2022

Also, it looks like this discards the decoder_attention_mask in model_kwargs (if provided). In TF-GPT2, this case is handled. But probably there is an assumption that decoder_attention_mask is never provided to generate for TF-Bart, and will only be added in _update_model_kwargs_for_xla_generation?

gante (Member, Author) replied

Yeah, this can definitely be improved (it was copied/pasted from T5). Will not touch it in this PR though, the beam search PR will rewrite it out of necessity 😭

gante (Member, Author) commented Jun 6, 2022

Hey @ydshieh 👋 answering your questions:

From the change in prepare_inputs_for_generation, both in the TF-GPT2 PR and this PR, my understanding of the main change is that we need to use the (decoder) attention mask to calculate the correct position_ids for both left and right padding, and this is done using tf.math.cumsum. Do I understand these PRs correctly?

Correct 👍
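As a minimal sketch of the idea confirmed above (not necessarily the exact code in the PR), the position ids can be derived from the attention mask with an exclusive cumulative sum, which works for both left and right padding:

    import tensorflow as tf

    decoder_attention_mask = tf.constant(
        [[0, 0, 1, 1, 1],   # left-padded sequence
         [1, 1, 1, 1, 0]]   # right-padded sequence
    )
    decoder_position_ids = tf.math.cumsum(decoder_attention_mask, axis=-1, exclusive=True)
    print(decoder_position_ids.numpy())
    # [[0 0 0 1 2]   -> the real tokens get positions 0, 1, 2
    #  [0 1 2 3 4]]  -> the trailing pad position is masked out anyway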

Why do we need decoder_position_ids when past_key_values is passed?

In the original PT code and in eager-execution TF, the position ids can be obtained by default (i.e. when not explicitly passed) from the past length, since the past length corresponds to the next position id when there is no left padding. In Flax and XLA TF, the past is zero-padded, so the past length is not the correct default position id. As such, it is dangerous to leave the default path active -- this path should only be used in generate anyway, and the updated generate passes the position ids. (GPT-2 should also get the same guard, to be safe!)
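To make this concrete, a small illustration with made-up shapes (not code from the PR): under XLA the cache keeps a fixed, zero-padded length, so its length no longer tells us the next position id.

    import tensorflow as tf

    batch, heads, head_dim, max_length = 1, 16, 64, 8
    real_length = 3  # tokens actually generated so far

    # XLA keeps the cache at a fixed size, zero-padded beyond real_length.
    padded_past_key = tf.zeros((batch, heads, max_length - 1, head_dim))

    inferred_position = padded_past_key.shape[2]  # 7 -- wrong whenever the cache is padded
    explicit_position = real_length               # what the updated generate passes in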

ydshieh (Collaborator) commented Jun 6, 2022

OK, I think I got it. The past sent to the model is the padded (on the right) version! (Which is required by XLA to have a fixed shape during the loop, right?)

Thank you @gante !

ydshieh (Collaborator) commented Jun 6, 2022

I haven't thought it through thoroughly, but in prepare_inputs_for_generation, when we return the actual inputs to the model, it seems to me that we could cut past to the actual (non-padded) version. And when the model returns past, in _update_model_kwargs_for_xla_generation, we just always pad on the right.

(Of course, we would need to pass the current length info to prepare_inputs_for_generation if we want to do so.)

  • this will keep model_kwargs["past"] compatible with XLA
  • the actual past passed to the model is the same as before
    • especially, it won't have max_length - 1 as its length, so we no longer have overhead due to the increasing length
  • it might make the logic a bit easier in _update_model_kwargs_for_xla_generation

@gante I don't want to make you too busy. I will let you judge if this is a good idea, and even if it is, if we should change it now, or we can do it later. I know we want to publish our work soon!

gante (Member, Author) commented Jun 6, 2022

it seems to me that we could cut past to the actual (non-padded) version.

I would love to do that, and it would be a great way to simplify the code, but sadly XLA does not allow dynamic-sized slices (i.e. cutting past based on the current length or based on its non-zero values). I had the same idea too, but then I came across this limitation (documented here) 😢 Sadly, we have to keep working with the full padded array everywhere when XLA is on.
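For illustration, a small sketch of that limitation (assumed behaviour, not code from the PR): slicing the past down to a runtime-dependent length produces a dynamic shape, which XLA compilation rejects.

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def cut_past(padded_past, current_length):
        # The output shape depends on current_length, a runtime tensor,
        # so XLA cannot assign it a static shape.
        return padded_past[:, :, :current_length, :]

    padded_past = tf.zeros((1, 16, 7, 64))
    # cut_past(padded_past, tf.constant(3))  # expected to fail to compile under XLA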

ydshieh (Collaborator) left a comment

I finally finished the review, sorry for taking so long.
Thank you for this awesome work, @gante! LGTM for the logic (thanks for the explanations!)

I would encourage explaining things a bit more along the code, as mentioned in a few of my comments.
(I only reviewed the Bart-related files.)

For example, this place

decoder_attention_mask = tf.concat(

seems to suggest the prompt would have seq length 1 (i.e. [decoder_start_token_id]), and I am totally fine with that if it is the only use case for Bart (I believe it is).

However, from the method itself, it looks like it can handle any prompt.

  • especially the treatment of past

A comment mentioning what (case/assumption) the block is dealing with would be great.

patrickvonplaten (Contributor) commented

Think we can move towards finishing this PR here :-)

gante (Member, Author) commented Jun 15, 2022

@patrickvonplaten it is ready to merge -- would you like to make a final review, or can I merge the PR? :)

patrickvonplaten (Contributor) left a comment

Looks clean - good to go for me!

Would indeed be nice to eventually replace BART's slow temporary test with an XLA beam search test.

@gante gante merged commit 132402d into huggingface:main Jun 20, 2022
@gante gante deleted the xla_bart branch June 20, 2022 10:07
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 25, 2022
* Also propagate changes to blenderbot, blenderbot_small, marian, mbart, and pegasus