Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept pre-tokenized references & hypothesis for METEOR calculation #2822

Merged
merged 20 commits into from Oct 9, 2021
Merged

Accept pre-tokenized references & hypothesis for METEOR calculation #2822

merged 20 commits into from Oct 9, 2021

Conversation

yutanakamura-tky
Copy link
Contributor

I added an option to input pre-tokenized sentence(s) for METEOR calculation.

At present, references & hypothesis are always tokenized with str.split() (Lines 30-31 of meteor_score.py):

    hypothesis_list = list(enumerate(preprocess(hypothesis).split()))
    reference_list = list(enumerate(preprocess(reference).split()))

However, we have trouble calculating METEOR in a language that does not separate words with spaces (e.g., Japanese).

In this pull request, I have changed _generate_enums to accept pre-tokenized reference & hypothesis in addition to untokenized reference & hypothesis so that we can use a customized tokenizer.

Other functions have been changed accordingly.

@tomaarsen
Copy link
Member

@yutanakamura-tky Heya! We've started using https://pre-commit.com/ pretty recently. With this program, some tests are ran before it lets you commit. With our configuration, it means that some commands are ran to ensure some code quality. In particular, the black formatter. As you can see in the test output, the black formatter complained that the changes you are trying to push are not formatted according to the black formatting rules.

My recommendation is to run pre-commit install (pre-commit should have been pip install-ed already, but otherwise you can use pip install pre-commit). This will install the pre-commit hooks, causing it to run all of those hooks (e.g. formatters, verifiers, etc.) when you make another commit. If you then try to push another commit, the black hook will reformat your changes, and you can push the correctly formatted changes. This might seem a little bit confusing, but it's more intuitive in practice. Oh, also, pre-commit only affects this repository, you don't have to worry about it affecting other projects.

Upon installing, you may run pre-commit run --all, which applies formatting and such on all files, and in particular on your changes. You may push these changes, so the Continuous Integration stops complaining.

Let me know if you have any issues.

@yutanakamura-tky
Copy link
Contributor Author

@tomaarsen
Thank you! I have updated this pull request using pre-commit.

@stevenbird
Copy link
Member

I would prefer to require tokenised input (a breaking change) than to have str or list(str) parameter

@yutanakamura-tky
Copy link
Contributor Author

yutanakamura-tky commented Sep 28, 2021

@stevenbird
Thank you! I have made the following changes:

1. Enforcement of tokenized inputs

I have updated the function implementations, docstrings, and examples.

2. Modification of the test code

I have deleted the following code in nltk/test/unit/translate/test_meteor.py because it assumed that the input was given not in list(str) but str (Lines 12-13):

        # no `preprocess` argument
        self.assertRaises(TypeError, meteor_score, reference, candidate)

I have also changed the code as follows to examine better if the preprocessing works correctly:

        reference = [["this", "is", "a", "test"], ["this", "is" "test"]]
        candidate = ["THIS", "Is", "a", "tEST"]

        score = meteor_score(reference, candidate, preprocess=str.lower)

I have confirmed locally that it passes pytest nltk/test/unit/translate/test_meteor.py.

@yutanakamura-tky
Copy link
Contributor Author

yutanakamura-tky commented Sep 28, 2021

The force-push from 5db9c72 to ea66229 was made to amend the commit message of 3d0d272 from "accept pre-tokenized references & hypothesis" to "applied code formatter."
Sorry for any inconvenience.

@tomaarsen
Copy link
Member

I would prefer to require tokenised input (a breaking change) than to have str or list(str) parameter

I believe that all the other translation scores, i.e. BLEU, CHRF, GLEU, NIST and RIBES also use pre-tokenized inputs (Correct me if I'm wrong). As a result, I'm all in favor of modifying the METEOR score so that it also takes tokenised input.

@tomaarsen
Copy link
Member

tomaarsen commented Sep 28, 2021

I've applied tokenization on the hypotheses and references in meteor.doctest, so the tests use tokenized inputs rather than the original string ones. The tests pass now! 🎉

@yutanakamura-tky
Copy link
Contributor Author

yutanakamura-tky commented Sep 29, 2021

@tomaarsen
Thank you for your review comments & updating meteor.doctest!
I have made some changes as follows:

1. Addition of Type Hints

I have added type hints to all methods in meteor_score.py.
Accordingly, I have removed type descriptions in the docstrings.

2. Minor Improvement of Local Variable Names

I have changed some local variable names to make their roles clearer.

  • _generate_enums()
    • hypothesis_list -> enum_hypothesis_list
    • reference_list -> enum_reference_list
  • exact_match()
    • hypothesis_list -> enum_hypothesis_list
    • reference_list -> enum_reference_list
  • _enum_stem_match()
    • stemmed_enum_list1 -> stemmed_enum_hypothesis_list
    • stemmed_enum_list2 -> stemmed_enum_reference_list

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'm a great fan of the modernisation, and the modified functionality is wonderful too, it makes the METEOR score calculation similar to the other translation score metrics.

@yutanakamura-tky
Copy link
Contributor Author

yutanakamura-tky commented Sep 29, 2021

@tomaarsen
Thank you so much!
I want just one more piece of advice about variable typing.

The ongoing docstring in the develop branch says that meteor_score() takes float for the variable beta, while it defaults to 3:

def meteor_score(
    references: Iterable[Iterable[str]],
    hypothesis: Iterable[str],
    preprocess: Callable[[str], str] = str.lower,
    stemmer: StemmerI = PorterStemmer(),
    wordnet: WordNetCorpusReader = wordnet,
    alpha: float = 0.9,
    beta: float = 3,
    gamma: float = 0.5,
)

The "3" derives from the power of penalty in the original METEOR article.
I am wondering which is better, beta: Union[int, float]=3 or beta: float=3.0.

@tomaarsen
Copy link
Member

tomaarsen commented Sep 29, 2021

@yutanakamura-tky I think the best solution in that situation is simply to change beta: float = 3 into beta: float = 3.0. That way it's extra clear that this value can be/is meant to be a float. That said, users are free to use e.g. 4, because Python automatically converts ints to floats whenever requires anyways.
Good catch, by the way!

@stevenbird stevenbird self-assigned this Sep 29, 2021
@yutanakamura-tky
Copy link
Contributor Author

yutanakamura-tky commented Sep 30, 2021

@tomaarsen
Thank you so much! I have also made the same change to single_meteor_score().
There are no further changes that I am planning to make.

@tomaarsen
Copy link
Member

Wonderful, it all looks great to me. I'll leave the PR like this for now, so perhaps another team member can have a look at it too.

@dannysepler
Copy link
Contributor

dannysepler commented Oct 9, 2021

nice! i love the addition of type hints anywhere in NLTK

given that this is a breaking change, should we have a changelog entry? after that, lgtm! (EDIT: Realizing now that only Steven has edited that file in recent history. Sorry!)

@stevenbird stevenbird merged commit 836b98e into nltk:develop Oct 9, 2021
@stevenbird
Copy link
Member

Thank you @yutanakamura-tky, @tomaarsen, @dannysepler.

Witiko added a commit to MIR-MU/regemt that referenced this pull request Oct 16, 2021
Nltk 3.6.5 introduces a breaking change to METEOR score.
See <nltk/nltk#2822>.
@Witiko
Copy link

Witiko commented Oct 16, 2021

@stevenbird Please, consider incrementing at least the minor version the next time you introduce a breaking change.

@stevenbird
Copy link
Member

@Witiko: noted, my apologies

@tomaarsen tomaarsen mentioned this pull request Nov 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants