Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added multi Bleu functionality and tests #2793

Merged
merged 6 commits into from Nov 20, 2021

Conversation

BatMrE
Copy link
Contributor

@BatMrE BatMrE commented Aug 31, 2021

Feature as per #2320

from bleu_score import corpus_bleu

references = [[['this', 'is', 'a', 'test'], ['this', 'is' 'test']]]
candidates = [['this', 'is', 'a', 'test']]
print('Individual 1-gram: %f' % corpus_bleu(references, candidates, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % corpus_bleu(references, candidates, weights=(0, 1, 0, 0)))
Individual 1-gram: 1.000000
Individual 2-gram: 1.000000

@tomaarsen
Copy link
Member

I've added a few more tests to give myself some more confidence that your changes work correctly - I hope that's okay.

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: I had some time to push some of these suggestions myself. Note that I only modified comments, and not actual code. In short, the following text is outdated, but perhaps it helps explain my commit c730dc3.

The only change request that I still have is to have another look at the if isinstance(weights, tuple). Preferably, I'd like to keep weights=[0.8, 0.2] as a working parameter. In this PR, that throws an exception.


Also, the docstring should be modified to include a modified return and rtype section.

E.g.

    :return: The corpus-level BLEU score. Returns a list if multiple weights were supplied.
    :rtype: float / list(float)

It would also be nice to include an example in the docstring of how weights can be used to finetune the BLEU score.
For example, placing the following comment in the whitespace here:

0.6223...
:param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses

    Custom weights may be supplied to fine-tune the BLEU score further. 
    A tuple of float weights for unigrams, bigrams, trigrams and so on can be given.

    >>> weights = (0.1, 0.3, 0.5, 0.1)
    >>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
    0.5818...

    This particular weight gave extra value to trigrams.

    Furthermore, multiple weights can be given, resulting in multiple BLEU scores.

    >>> weights = [
    ...     (0.5, 0.5),
    ...     (0.333, 0.333, 0.334),
    ...     (0.25, 0.25, 0.25, 0.25),
    ...     (0.2, 0.2, 0.2, 0.2, 0.2)
    ... ]
    >>> corpus_bleu(list_of_references, hypotheses, weights=weights) # doctest: +ELLIPSIS
    [0.8242..., 0.7067..., 0.5920..., 0.4719...]

Also, sentence_bleu should also have their docstring updated. This function simply calls corpus_bleu, so it also gets upgraded with this PR.

nltk/translate/bleu_score.py Outdated Show resolved Hide resolved
nltk/translate/bleu_score.py Outdated Show resolved Hide resolved
Comment on lines 172 to 175
if isinstance(weights, tuple):
weights = [weights]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit dangerous. People might be using list weights right now, and that would stop working. Perhaps a safer solution is to check if weights and isinstance(weights[0], float):?

Currently, the following program will break:

from nltk.translate.bleu_score import corpus_bleu
# define ref's and hyp's here, left out for conciseness.
weights = [0.2, 0.8]
bleu_scores = corpus_bleu(
    list_of_references=[[ref1a, ref1b, ref1c], [ref2a]],
    hypotheses=[hyp1, hyp2],
    weights=weights,
)

Throws the following exception:

Traceback (most recent call last):
  File "[...]\nltk_2793.py", line 110, in <module>
    out = bleu_scores = corpus_bleu(
  File "[...]\nltk\translate\bleu_score.py", line 174, in corpus_bleu
    max_weight_length = max(len(weight) for weight in weights)
  File "[...]\nltk\translate\bleu_score.py", line 174, in <genexpr>  
    max_weight_length = max(len(weight) for weight in weights)
TypeError: object of type 'float' has no len()

This didn't break before. People will expect that something that worked before to still work later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BatMrE I think the functionality of using weights like [0.2, 0.8] like in my example above should still be supported. I would rather not merge this PR until that is resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tomaarsen,

I agree with your point however while testing for the script:

from nltk.translate.bleu_score import corpus_bleu
# define ref's and hyp's here, left out for conciseness.
weights = [0.2, 0.8]
bleu_scores = corpus_bleu(
    list_of_references=[[ref1a, ref1b, ref1c], [ref2a]],
    hypotheses=[hyp1, hyp2],
    weights=weights,
)

with code:

    if isinstance(weights, tuple):
        weights = [weights]
    max_weight_length = max(len(weight) for weight in weights)

I am getting the expected result as 0.7496592017168556
and 460 passed, 32 skipped, 9 xfailed, 1 warning in 222.27s (0:03:42)

also your code:

    if weights and isinstance(weights[0], float):
        max_weight_length = max(len(weight) for weight in weights)

gives the same result 0.7496592017168556
and 12 failed, 448 passed, 32 skipped, 9 xfailed, 1 warning in 210.82s (0:03:30)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you get the expected result with your current implementation. I still only get

TypeError: object of type 'float' has no len()
This is the script I'm using:
hyp1 = [
    "It",
    "is",
    "a",
    "guide",
    "to",
    "action",
    "which",
    "ensures",
    "that",
    "the",
    "military",
    "always",
    "obeys",
    "the",
    "commands",
    "of",
    "the",
    "party",
]
ref1a = [
    "It",
    "is",
    "a",
    "guide",
    "to",
    "action",
    "that",
    "ensures",
    "that",
    "the",
    "military",
    "will",
    "forever",
    "heed",
    "Party",
    "commands",
]
ref1b = [
    "It",
    "is",
    "the",
    "guiding",
    "principle",
    "which",
    "guarantees",
    "the",
    "military",
    "forces",
    "always",
    "being",
    "under",
    "the",
    "command",
    "of",
    "the",
    "Party",
]
ref1c = [
    "It",
    "is",
    "the",
    "practical",
    "guide",
    "for",
    "the",
    "army",
    "always",
    "to",
    "heed",
    "the",
    "directions",
    "of",
    "the",
    "party",
]
hyp2 = [
    "he",
    "read",
    "the",
    "book",
    "because",
    "he",
    "was",
    "interested",
    "in",
    "world",
    "history",
]
ref2a = [
    "he",
    "was",
    "interested",
    "in",
    "world",
    "history",
    "because",
    "he",
    "read",
    "the",
    "book",
]

weights = [0.2, 0.8]

from nltk.translate.bleu_score import corpus_bleu

out = bleu_scores = corpus_bleu(
    list_of_references=[[ref1a, ref1b, ref1c], [ref2a]],
    hypotheses=[hyp1, hyp2],
    weights=weights,
)

print(out)

Can you explain how you get the correct results with weights [0.2, 0.8] currently?

Also, I meant something along the lines of

if weights and isinstance(weights[0], float):
    weights = [weights]
max_weight_length = max(len(weight) for weight in weights)

That said, this does give problems when the weights are for example integers, like (0, 0, 1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated as per weights type, tests are passing, let me know for any issue will fix now

nltk/translate/bleu_score.py Outdated Show resolved Hide resolved
@tomaarsen tomaarsen linked an issue Sep 12, 2021 that may be closed by this pull request
@stevenbird
Copy link
Member

@BatMrE please take a look at the above feedback, thanks

@stevenbird
Copy link
Member

@BatMrE thanks for your further work. Can you report whether you dealt with the requested changes from @tomaarsen please?

@BatMrE
Copy link
Contributor Author

BatMrE commented Oct 11, 2021

@BatMrE thanks for your further work. Can you report whether you dealt with the requested changes from @tomaarsen please?

Yes, the changes expected by @tomaarsen for different data types for weight is resolved in my implementation, tested with different formats

@tomaarsen
Copy link
Member

I'll have a look at your changes when I have some more time!

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BatMrE One more thing, it seems that you've removed some commits that were meant to improve the documentation of this file, such as c730dc3. I'm not sure whether this was accidental, but I believe I should be able to re-push these if it was accidental.

There are some closed conversations on this PR that were solved by this commit.

nltk/translate/bleu_score.py Outdated Show resolved Hide resolved
@tomaarsen tomaarsen merged commit 3c2a5a6 into nltk:develop Nov 20, 2021
@tomaarsen
Copy link
Member

Thank you for these changes @BatMrE!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

One-shot BLEU-[2, 3, 4] computation
3 participants