Accept pre-tokenized references & hypothesis for METEOR calculation #2822
@yutanakamura-tky Heya! We've started using https://pre-commit.com/ pretty recently. With this program, some tests are run before it lets you commit. With our configuration, it means that some commands are run to ensure some code quality. In particular, the My recommendation is to run Upon installing, you may run Let me know if you have any issues.
@tomaarsen |
I would prefer to require tokenised input (a breaking change) than to have |
@stevenbird
1. Enforcement of tokenized inputs
I have updated the function implementations, docstrings, and examples.
2. Modification of the test code
I have deleted the following code in
I have also changed the code as follows to examine better if the preprocessing works correctly:
I have confirmed locally that it passes |
I believe that all the other translation scores, i.e. BLEU, CHRF, GLEU, NIST and RIBES, also use pre-tokenized inputs (correct me if I'm wrong). As a result, I'm all in favor of modifying the METEOR score so that it also takes tokenised input.
I've applied tokenization on the hypotheses and references in |
…to modified_meteor
@tomaarsen
1. Addition of Type Hints
I have added type hints to all methods in
2. Minor Improvement of Local Variable Names
I have changed some local variable names to make their roles clearer.
I'm guessing this was leftover from when it was wrapped with print() for debugging?
Looks good to me! I'm a great fan of the modernisation, and the modified functionality is wonderful too, it makes the METEOR score calculation similar to the other translation score metrics.
@tomaarsen The ongoing docstring in the
The "3" derives from the power of penalty in the original METEOR article.
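For context on that exponent: in the original METEOR article the fragmentation penalty is 0.5 × (chunks / matched unigrams)^3. The sketch below only restates that formula. The function name and parameter names are illustrative assumptions, not necessarily the ones used in meteor_score.py, and returning 0.0 when nothing matches is a choice made for this sketch.

```python
def fragmentation_penalty(
    chunks: int, matches: int, beta: float = 3.0, gamma: float = 0.5
) -> float:
    """Penalty = gamma * (chunks / matches) ** beta.

    beta = 3 is the power discussed above; gamma = 0.5 is the weight
    given in the original METEOR article.
    """
    if matches == 0:
        # Assumption for this sketch: with no matched unigrams there is
        # nothing to fragment, so no penalty is applied.
        return 0.0
    return gamma * (chunks / matches) ** beta


# Four matched unigrams forming one contiguous chunk: a small penalty.
low = fragmentation_penalty(chunks=1, matches=4)

# Four matched unigrams in four separate chunks: the maximum penalty.
high = fragmentation_penalty(chunks=4, matches=4)
```

With the defaults, the penalty grows cubically with fragmentation, so a hypothesis whose matches are scattered is penalised far more heavily than one whose matches are contiguous.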
@yutanakamura-tky I think the best solution in that situation is simply to change |
… single_meteor_score()
@tomaarsen |
Wonderful, it all looks great to me. I'll leave the PR like this for now, so perhaps another team member can have a look at it too.
nice! i love the addition of type hints anywhere in NLTK
Thank you @yutanakamura-tky, @tomaarsen, @dannysepler.
NLTK 3.6.5 introduces a breaking change to the METEOR score. See <nltk/nltk#2822>.
@stevenbird Please consider incrementing at least the minor version the next time you introduce a breaking change.
@Witiko: noted, my apologies
I added an option to input pre-tokenized sentence(s) for METEOR calculation.
At present, references & hypothesis are always tokenized with str.split() (Lines 30-31 of meteor_score.py).
However, we have trouble calculating METEOR in a language that does not separate words with spaces (e.g., Japanese).
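The problem can be seen with plain Python, without NLTK. The two example sentences below are made up for illustration.

```python
english = "the cat sat on the mat"
japanese = "私は猫が好きです"  # "I like cats", written without spaces

# str.split() separates the English sentence into words...
english_tokens = english.split()

# ...but returns the whole Japanese sentence as a single "word",
# so unigram matching has nothing meaningful to align.
japanese_tokens = japanese.split()

print(len(english_tokens), len(japanese_tokens))  # prints: 6 1
```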
In this pull request, I have changed _generate_enums to accept pre-tokenized reference & hypothesis in addition to untokenized reference & hypothesis so that we can use a customized tokenizer.
Other functions have been changed accordingly.
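One way to picture the dual-input behaviour described here is a small normalisation step that splits raw strings but passes token sequences through untouched. This is a sketch only: to_tokens is a hypothetical helper, not the code in this pull request, and the character-level split merely stands in for a real Japanese tokenizer.

```python
from typing import Iterable, List, Union


def to_tokens(text_or_tokens: Union[str, Iterable[str]]) -> List[str]:
    """Hypothetical helper: split raw strings on whitespace,
    keep already-tokenized sequences exactly as given."""
    if isinstance(text_or_tokens, str):
        return text_or_tokens.split()
    return list(text_or_tokens)


# Untokenized input falls back to whitespace splitting.
english_tokens = to_tokens("the cat sat on the mat")

# Pre-tokenized input (here from a stand-in character-level tokenizer)
# is used unchanged, so the caller controls segmentation.
japanese_tokens = to_tokens(list("猫が好き"))
```

Accepting both forms keeps existing callers working, at the cost of an ambiguous signature. Requiring tokenized input, as the reviewers in this thread preferred, removes that ambiguity but is a breaking change.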