
Zero-shot classification pipeline truncation support #13381

Closed
davidmezzetti opened this issue Sep 1, 2021 · 7 comments · Fixed by #13400

Comments

@davidmezzetti
Contributor

Transformers 4.10.0 changed the default truncation strategy for the ZeroShotClassificationPipeline to TruncationStrategy.DO_NOT_TRUNCATE.

That change uncovered an issue: the ZeroShotClassificationPipeline doesn't appear to pass kwargs through to the parent's __call__ method, so even calling the pipeline with truncation=True does not enable truncation.
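
As a rough illustration of the expected behavior (these are generic stand-in classes, not the library's actual code), forwarding the kwargs is what would let truncation=True reach the tokenizer:

class ParentPipeline:
    def __call__(self, inputs, **tokenizer_kwargs):
        # In the real pipeline, these kwargs would eventually reach the tokenizer call.
        print("received:", tokenizer_kwargs)
        return inputs

class ZeroShotLikePipeline(ParentPipeline):
    def __call__(self, inputs, candidate_labels, **kwargs):
        # Forwarding **kwargs is the missing piece; without it, truncation=True is dropped.
        return super().__call__(inputs, **kwargs)

ZeroShotLikePipeline()("some text", ["negative", "positive"], truncation=True)
# prints: received: {'truncation': True}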

Thank you in advance for the assistance; I appreciate all the work you do.

@davidmezzetti
Contributor Author

One additional related question: this warning now gets printed on each zero-shot pipeline call:

logger.warning("The tokenizer {self.tokenizer} does not have a pad token, we're not running it as a batch")

Would it make more sense to emit this on init of the object rather than on each call? Otherwise it gets pretty noisy when the same zero-shot pipeline is reused multiple times.
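
For illustration only, a minimal sketch of the suggestion (ExampleZeroShotPipeline is a made-up class, not the library's implementation): emit the warning once at construction time instead of on every call.

import logging

logger = logging.getLogger(__name__)

class ExampleZeroShotPipeline:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Warn once when the pipeline is built rather than on every __call__.
        if getattr(tokenizer, "pad_token", None) is None:
            logger.warning(
                "The tokenizer %s does not have a pad token, we're not running it as a batch",
                tokenizer,
            )

    def __call__(self, sequences, candidate_labels):
        # No per-call warning here; construction already reported the limitation.
        return sequences, candidate_labels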

@LysandreJik
Member

cc @Narsil

@Narsil
Contributor

Narsil commented Sep 2, 2021

What model/tokenizer were you using? Using ONLY_FIRST on a tokenizer that does not have a pad_token should have resulted in an error before as well, but I could be mistaken.
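
A quick way to check whether a tokenizer defines a pad token (the model choice here is mine for illustration, not from this issue; gpt2 is a well-known checkpoint that ships without one):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.pad_token)  # None -> the pipeline cannot pad, and therefore cannot batch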

As for the warning: it was meant to cover a new code path (not an existing one), so it is indeed a bit noisy; it should be cleaned up with the pipelines refactor in #13308.

You are saying this was an unexpected regression, so I would like to pin down exactly what went wrong, so we can have a real test for this case before fixing it.

@davidmezzetti
Contributor Author

Appreciate the quick response. Here is code that works in 4.9.2 but fails in 4.10.0:

from transformers import pipeline

nlp = pipeline("zero-shot-classification", model="roberta-large-mnli", tokenizer="roberta-large-mnli")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)

nlp = pipeline("zero-shot-classification")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)

I don't think the truncation param ever did anything, but changing the default tokenization strategy from ONLY_FIRST to DO_NOT_TRUNCATE seems to have exposed the issue.
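
For reference, a minimal sketch of what the default change means at the tokenizer level, assuming the roberta-large-mnli checkpoint from the snippet above (the hypothesis text is made up for illustration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
premise = "Very long text " * 1000
hypothesis = "This example is negative."

# Old pipeline default: truncate the first sequence (the premise) to fit the model.
enc = tok(premise, hypothesis, truncation="only_first", max_length=tok.model_max_length)
print(len(enc["input_ids"]))  # <= 512

# New default (DO_NOT_TRUNCATE): the encoding exceeds the model's max length,
# so the forward pass fails for long inputs.
enc = tok(premise, hypothesis, truncation=False)
print(len(enc["input_ids"]))  # > 512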

@Narsil
Contributor

Narsil commented Sep 2, 2021

OK.

I can confirm that truncation went from on by default to off by default.

@LysandreJik that was done to support all tokenizers that don't have a pad_token (and can't pad anyway). However, changing the default for tokenizers that can pad was an oversight on my part.

We can go two different routes:

  1. Revert the change of default, and override to DO_NOT_TRUNCATE only on the missing-pad_token path (I think this is what should have been done in the first place, my bad here; a rough sketch follows after this list).
  2. Simply roll with it as released, but fix the passing of the truncation argument (it does require a change to pass the kwargs on to __call__, and it might break existing code where some kwargs were previously just ignored and would now wind up in the tokenizer(...) call, triggering new errors).
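
A rough sketch of option 1 (pick_truncation is a hypothetical helper for illustration, not the actual patch):

from transformers.tokenization_utils_base import TruncationStrategy

def pick_truncation(tokenizer):
    # Only fall back to DO_NOT_TRUNCATE when the tokenizer has no pad token
    # (and therefore cannot pad/batch anyway); otherwise keep the pre-4.10.0
    # default of truncating the first sequence (the premise).
    if tokenizer.pad_token is None:
        return TruncationStrategy.DO_NOT_TRUNCATE
    return TruncationStrategy.ONLY_FIRST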

@LysandreJik
Member

I think we can go with option 1 and then release a patch (v4.10.1) to keep the previous behavior.

Narsil added a commit to Narsil/transformers that referenced this issue Sep 3, 2021
Narsil mentioned this issue Sep 3, 2021
@Narsil
Contributor

Narsil commented Sep 3, 2021

The PR is ready. It turns out the default TruncationStrategy.ONLY_FIRST can break (on LED with small tokenizers) when the input is not long enough (I am not really sure why that is an error in tokenizers).

So I made changes that should hopefully match the old behavior more closely.

LysandreJik pushed a commit that referenced this issue Sep 9, 2021
* Fixing #13381

* Enabling automatic LED models.
patrickvonplaten pushed a commit that referenced this issue Sep 10, 2021
* Fixing #13381

* Enabling automatic LED models.