
Zero-shot classification pipeline truncation support #13381

Closed
davidmezzetti opened this issue Sep 1, 2021 · 7 comments · Fixed by #13400

Comments

@davidmezzetti
Contributor

Transformers 4.10.0 changed the default truncation strategy for the ZeroShotClassificationPipeline to TruncationStrategy.DO_NOT_TRUNCATE.

That change uncovered an issue: the ZeroShotClassificationPipeline doesn't appear to pass kwargs through to the parent's __call__ method, so even calling the pipeline with truncation=True does not enable truncation.
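
As a rough illustration of the expected behavior (these are generic stand-in classes, not the library's actual code), forwarding the kwargs is what would let truncation=True reach the tokenizer:

class ParentPipeline:
    def __call__(self, inputs, **tokenizer_kwargs):
        # In the real pipeline, these kwargs would eventually reach the tokenizer call.
        print("received:", tokenizer_kwargs)
        return inputs

class ZeroShotLikePipeline(ParentPipeline):
    def __call__(self, inputs, candidate_labels, **kwargs):
        # Forwarding **kwargs is the missing piece; without it, truncation=True is dropped.
        return super().__call__(inputs, **kwargs)

ZeroShotLikePipeline()("some text", ["negative", "positive"], truncation=True)
# prints: received: {'truncation': True}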

Thank you in advance for the assistance; I appreciate all the work you do.

@davidmezzetti
Contributor Author

One additional related question: this warning now gets printed on each zero-shot pipeline call:

logger.warning("The tokenizer {self.tokenizer} does not have a pad token, we're not running it as a batch")

Would it make more sense to emit this on init of the object rather than on each call? Otherwise it gets pretty noisy when the same zero-shot pipeline is reused multiple times.
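
For illustration only, a minimal sketch of the suggestion (ExampleZeroShotPipeline is a made-up class, not the library's implementation): emit the warning once at construction time instead of on every call.

import logging

logger = logging.getLogger(__name__)

class ExampleZeroShotPipeline:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Warn once when the pipeline is built rather than on every __call__.
        if getattr(tokenizer, "pad_token", None) is None:
            logger.warning(
                "The tokenizer %s does not have a pad token, we're not running it as a batch",
                tokenizer,
            )

    def __call__(self, sequences, candidate_labels):
        # No per-call warning here; construction already reported the limitation.
        return sequences, candidate_labels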

@LysandreJik
Member

cc @Narsil

@Narsil
Contributor

Narsil commented Sep 2, 2021

What model/tokenizer were you using? Using ONLY_FIRST on a tokenizer that does not have a pad_token should have resulted in an error before as well, but I could be mistaken.
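
A quick way to check whether a tokenizer defines a pad token (the model choice here is mine for illustration, not from this issue; gpt2 is a well-known checkpoint that ships without one):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.pad_token)  # None -> the pipeline cannot pad, and therefore cannot batch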

As for the warning: it was meant to cover a new code path (not an existing one), so it is indeed a bit noisy; it should be cleaned up with the pipelines refactor in #13308.

You are saying this was an unexpected regression, so I would like to pin down exactly what went wrong, so we can have a real test for this case before fixing it.

@davidmezzetti
Contributor Author

Appreciate the quick response. Here is code that works in 4.9.2 but fails in 4.10.0:

from transformers import pipeline

nlp = pipeline("zero-shot-classification", model="roberta-large-mnli", tokenizer="roberta-large-mnli")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)

nlp = pipeline("zero-shot-classification")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)

I don't think the truncation param ever did anything, but changing the default tokenization strategy from ONLY_FIRST to DO_NOT_TRUNCATE seems to have exposed the issue.
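
For reference, a minimal sketch of what the default change means at the tokenizer level, assuming the roberta-large-mnli checkpoint from the snippet above (the hypothesis text is made up for illustration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
premise = "Very long text " * 1000
hypothesis = "This example is negative."

# Old pipeline default: truncate the first sequence (the premise) to fit the model.
enc = tok(premise, hypothesis, truncation="only_first", max_length=tok.model_max_length)
print(len(enc["input_ids"]))  # <= 512

# New default (DO_NOT_TRUNCATE): the encoding exceeds the model's max length,
# so the forward pass fails for long inputs.
enc = tok(premise, hypothesis, truncation=False)
print(len(enc["input_ids"]))  # > 512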

@Narsil
Contributor

Narsil commented Sep 2, 2021

OK.

I can confirm that truncation went from on by default to off by default.

@LysandreJik that was done to support all tokenizers that don't have a pad_token (and can't pad anyway). However, changing the default for tokenizers that can pad was an oversight on my part.

We can go two different routes:

  1. Revert the change of default, and override to DO_NOT_TRUNCATE only on the missing-pad_token path (I think this is what should have been done in the first place, my bad here; a rough sketch follows after this list).
  2. Simply roll with it as released, but fix the passing of the truncation argument (it does require a change to pass the kwargs on to __call__, and it might break existing code where some kwargs were previously just ignored and would now wind up in the tokenizer(...) call, triggering new errors).
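
A rough sketch of option 1 (pick_truncation is a hypothetical helper for illustration, not the actual patch):

from transformers.tokenization_utils_base import TruncationStrategy

def pick_truncation(tokenizer):
    # Only fall back to DO_NOT_TRUNCATE when the tokenizer has no pad token
    # (and therefore cannot pad/batch anyway); otherwise keep the pre-4.10.0
    # default of truncating the first sequence (the premise).
    if tokenizer.pad_token is None:
        return TruncationStrategy.DO_NOT_TRUNCATE
    return TruncationStrategy.ONLY_FIRST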

@LysandreJik
Member

I think we can go with option 1 and then release a patch (v4.10.1) to keep the previous behavior.

Narsil added a commit to Narsil/transformers that referenced this issue Sep 3, 2021
Narsil mentioned this issue Sep 3, 2021
@Narsil
Contributor

Narsil commented Sep 3, 2021

The PR is ready. It turns out the default TruncationStrategy.ONLY_FIRST can break (on LED with small tokenizers) when the input is not long enough (I am not really sure why that is an error in tokenizers).

So I made changes that should hopefully match the old behavior more closely.

LysandreJik pushed a commit that referenced this issue Sep 9, 2021
* Fixing #13381

* Enabling automatic LED models.
patrickvonplaten pushed a commit that referenced this issue Sep 10, 2021
* Fixing #13381

* Enabling automatic LED models.