Accept pre-tokenized references & hypothesis for METEOR calculation #2822

yutanakamura-tky · 2021-09-27T16:07:19Z

I added an option to input pre-tokenized sentence(s) for METEOR calculation.

At present, references & hypothesis are always tokenized with str.split() (Lines 30-31 of meteor_score.py):

    hypothesis_list = list(enumerate(preprocess(hypothesis).split()))
    reference_list = list(enumerate(preprocess(reference).split()))

However, we have trouble calculating METEOR in a language that does not separate words with spaces (e.g., Japanese).

In this pull request, I have changed _generate_enums to accept pre-tokenized reference & hypothesis in addition to untokenized reference & hypothesis so that we can use a customized tokenizer.

Other functions have been changed accordingly.

tomaarsen · 2021-09-27T16:14:14Z

@yutanakamura-tky Heya! We've started using https://pre-commit.com/ pretty recently. With this program, some tests are ran before it lets you commit. With our configuration, it means that some commands are ran to ensure some code quality. In particular, the black formatter. As you can see in the test output, the black formatter complained that the changes you are trying to push are not formatted according to the black formatting rules.

My recommendation is to run pre-commit install (pre-commit should have been pip install-ed already, but otherwise you can use pip install pre-commit). This will install the pre-commit hooks, causing it to run all of those hooks (e.g. formatters, verifiers, etc.) when you make another commit. If you then try to push another commit, the black hook will reformat your changes, and you can push the correctly formatted changes. This might seem a little bit confusing, but it's more intuitive in practice. Oh, also, pre-commit only affects this repository, you don't have to worry about it affecting other projects.

Upon installing, you may run pre-commit run --all, which applies formatting and such on all files, and in particular on your changes. You may push these changes, so the Continuous Integration stops complaining.

Let me know if you have any issues.

yutanakamura-tky · 2021-09-27T16:22:48Z

@tomaarsen
Thank you! I have updated this pull request using pre-commit.

stevenbird · 2021-09-27T21:52:50Z

I would prefer to require tokenised input (a breaking change) than to have str or list(str) parameter

… calculation

yutanakamura-tky · 2021-09-28T04:22:01Z

@stevenbird
Thank you! I have made the following changes:

1. Enforcement of tokenized inputs

I have updated the function implementations, docstrings, and examples.

2. Modification of the test code

I have deleted the following code in nltk/test/unit/translate/test_meteor.py because it assumed that the input was given not in list(str) but str (Lines 12-13):

        # no `preprocess` argument
        self.assertRaises(TypeError, meteor_score, reference, candidate)

I have also changed the code as follows to examine better if the preprocessing works correctly:

        reference = [["this", "is", "a", "test"], ["this", "is" "test"]]
        candidate = ["THIS", "Is", "a", "tEST"]

        score = meteor_score(reference, candidate, preprocess=str.lower)

I have confirmed locally that it passes pytest nltk/test/unit/translate/test_meteor.py.

yutanakamura-tky · 2021-09-28T04:27:56Z

The force-push from 5db9c72 to ea66229 was made to amend the commit message of 3d0d272 from "accept pre-tokenized references & hypothesis" to "applied code formatter."
Sorry for any inconvenience.

nltk/translate/meteor_score.py

tomaarsen · 2021-09-28T12:34:59Z

I would prefer to require tokenised input (a breaking change) than to have str or list(str) parameter

I believe that all the other translation scores, i.e. BLEU, CHRF, GLEU, NIST and RIBES also use pre-tokenized inputs (Correct me if I'm wrong). As a result, I'm all in favor of modifying the METEOR score so that it also takes tokenised input.

tomaarsen · 2021-09-28T14:11:48Z

I've applied tokenization on the hypotheses and references in meteor.doctest, so the tests use tokenized inputs rather than the original string ones. The tests pass now! 🎉

…#2821)

…to modified_meteor

yutanakamura-tky · 2021-09-29T03:33:03Z

@tomaarsen
Thank you for your review comments & updating meteor.doctest!
I have made some changes as follows:

1. Addition of Type Hints

I have added type hints to all methods in meteor_score.py.
Accordingly, I have removed type descriptions in the docstrings.

2. Minor Improvement of Local Variable Names

I have changed some local variable names to make their roles clearer.

_generate_enums()
- hypothesis_list -> enum_hypothesis_list
- reference_list -> enum_reference_list
exact_match()
- hypothesis_list -> enum_hypothesis_list
- reference_list -> enum_reference_list
_enum_stem_match()
- stemmed_enum_list1 -> stemmed_enum_hypothesis_list
- stemmed_enum_list2 -> stemmed_enum_reference_list

I'm guessing this was leftover from when it was wrapped with print() for debugging?

tomaarsen

Looks good to me! I'm a great fan of the modernisation, and the modified functionality is wonderful too, it makes the METEOR score calculation similar to the other translation score metrics.

yutanakamura-tky · 2021-09-29T08:46:39Z

@tomaarsen
Thank you so much!
I want just one more piece of advice about variable typing.

The ongoing docstring in the develop branch says that meteor_score() takes float for the variable beta, while it defaults to 3:

def meteor_score(
    references: Iterable[Iterable[str]],
    hypothesis: Iterable[str],
    preprocess: Callable[[str], str] = str.lower,
    stemmer: StemmerI = PorterStemmer(),
    wordnet: WordNetCorpusReader = wordnet,
    alpha: float = 0.9,
    beta: float = 3,
    gamma: float = 0.5,
)

The "3" derives from the power of penalty in the original METEOR article.
I am wondering which is better, beta: Union[int, float]=3 or beta: float=3.0.

tomaarsen · 2021-09-29T08:49:10Z

@yutanakamura-tky I think the best solution in that situation is simply to change beta: float = 3 into beta: float = 3.0. That way it's extra clear that this value can be/is meant to be a float. That said, users are free to use e.g. 4, because Python automatically converts ints to floats whenever requires anyways.
Good catch, by the way!

… single_meteor_score()

yutanakamura-tky · 2021-09-30T00:05:21Z

@tomaarsen
Thank you so much! I have also made the same change to single_meteor_score().
There are no further changes that I am planning to make.

tomaarsen · 2021-09-30T07:05:59Z

Wonderful, it all looks great to me. I'll leave the PR like this for now, so perhaps another team member can have a look at it too.

dannysepler · 2021-10-09T16:41:09Z

nice! i love the addition of type hints anywhere in NLTK

~~given that this is a breaking change, should we have a changelog entry? after that, lgtm!~~ (EDIT: Realizing now that only Steven has edited that file in recent history. Sorry!)

stevenbird · 2021-10-10T00:00:34Z

Thank you @yutanakamura-tky, @tomaarsen, @dannysepler.

Nltk 3.6.5 introduces a breaking change to METEOR score. See <nltk/nltk#2822>.

Witiko · 2021-10-16T21:34:44Z

@stevenbird Please, consider incrementing at least the minor version the next time you introduce a breaking change.

stevenbird · 2021-10-18T09:58:23Z

@Witiko: noted, my apologies

accept pre-tokenized references & hypothesis

5e896d9

tomaarsen added enhancement metrics labels Sep 27, 2021

accept pre-tokenized references & hypothesis

3866a06

tomaarsen added the needs review label Sep 27, 2021

yutanakamura-tky added 5 commits September 28, 2021 09:27

applied code formatter

3d0d272

breaking changes: always require pre-tokenized sentence(s) for METEOR…

823eb9f

… calculation

modified test

ea66229

fixed typo & deleted unneeded "preprocess"

0942dde

deleted unneeded "preprocess" further

f451f1f

type check & error message of _generate_enums()

49740df