Zero-shot classification pipeline truncation support #13381
One additional related question: this warning now gets printed on each zero-shot pipeline call: `logger.warning("The tokenizer {self.tokenizer} does not have a pad token, we're not running it as a batch")`. Would it make more sense to raise this once on init of the object rather than on each call? Otherwise it gets pretty noisy, even if you reuse the same zero-shot pipeline multiple times.
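The "warn on init instead of on every call" suggestion could look like this minimal sketch. Everything here is hypothetical (the class name, the toy `__call__` body); it only uses plain `logging` to illustrate moving the warning out of the hot path:

```python
import logging

logger = logging.getLogger(__name__)

class ZeroShotPipelineSketch:
    """Hypothetical stand-in for the pipeline that warns once, at init."""

    def __init__(self, pad_token=None):
        self.pad_token = pad_token
        if self.pad_token is None:
            # Warn here, exactly once, instead of inside __call__
            logger.warning(
                "The tokenizer does not have a pad token; "
                "inputs will not be run as a batch."
            )

    def __call__(self, texts):
        # No per-call warning; toy body just processes inputs one by one
        return [len(t) for t in texts]
```

Repeated calls on the same instance then stay silent after the initial warning.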
cc @Narsil
What model/tokenizer were you using? Using ONLY_FIRST on a tokenizer that does not have a pad_token should have resulted in an error before, but I could be mistaken. As for the warning: since this was supposed to be a new path (not an old one), it is indeed a bit noisy and should be cleaned up with the upcoming refactor of pipelines (#13308). What you are claiming is that this was an unexpected regression, so I would like to find out exactly what went wrong, so we can have a real test for this case before fixing it.
Appreciate the quick response. Here is code that works in 4.9.2 but fails in 4.10.0:

```python
from transformers import pipeline

nlp = pipeline("zero-shot-classification", model="roberta-large-mnli", tokenizer="roberta-large-mnli")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)

nlp = pipeline("zero-shot-classification")
nlp(["Very long text" * 1000, "Happy to hear"], ["negative", "positive"], truncation=True)
```

I don't think the `truncation` param ever did anything, but changing the default tokenization param from ONLY_FIRST to DO_NOT_TRUNCATE seems to have exposed the issue.
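The difference between the two defaults can be illustrated without transformers at all. This is a toy model of sequence-pair truncation over plain token lists (the function and its behavior are an illustration, not the library's actual implementation):

```python
def encode_pair(premise, hypothesis, max_len, truncation="only_first"):
    """Toy model of sequence-pair truncation for NLI-style inputs.

    "only_first": trim tokens from the premise (first sequence) until
    the pair fits within max_len.
    "do_not_truncate": refuse over-long pairs, as the 4.10.0 default
    effectively did for long inputs.
    """
    total = len(premise) + len(hypothesis)
    if total <= max_len:
        return premise + hypothesis
    if truncation == "do_not_truncate":
        raise ValueError(f"pair length {total} exceeds max_len={max_len}")
    # only_first: drop the overflow from the premise only
    keep = max_len - len(hypothesis)
    if keep < 0:
        raise ValueError("hypothesis alone exceeds max_len")
    return premise[:keep] + hypothesis
```

With "only_first" a very long premise is silently trimmed; with "do_not_truncate" the same input raises, which matches the regression described above.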
Ok. I can confirm that truncation went from being on by default to off by default. @LysandreJik, that change was made to enable all tokenizers that don't have a pad token. We can go different routes:
I think we can go with 1., and then release a patch (v4.10.1) to keep the previous behavior.
The PR is ready. It turns out the default `TruncationStrategy.ONLY_FIRST` can break (on LED with small tokenizers) when the input is not large enough (I am not really sure why it errors in that case). So I made changes that hopefully match the old behavior more closely.
Transformers 4.10.0 brought a change that modified the default truncation strategy to `TruncationStrategy.DO_NOT_TRUNCATE` for the `ZeroShotClassificationPipeline`.
That uncovered an issue: the `ZeroShotClassificationPipeline` doesn't appear to pass `**kwargs` through to the parent's `__call__` method. So even when calling the pipeline with `truncation=True`, truncation is not applied.
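Until a fix lands, one crude workaround is to cap input length yourself before handing texts to the pipeline, so the tokenizer never sees an over-long sequence. This helper is a hypothetical sketch (character-based, so `max_chars` must be chosen conservatively below the model's token limit):

```python
def pre_truncate(texts, max_chars=2000):
    """Crude pre-truncation workaround: cap each input at max_chars
    characters before passing it to the pipeline. Character counts are
    only a rough proxy for token counts, so leave generous headroom
    under the model's maximum sequence length."""
    return [t[:max_chars] for t in texts]
```

Usage would then look like `nlp(pre_truncate(docs), labels)` with the pipeline from the repro above.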
Thank you for the assistance in advance, appreciate all the work you guys do.