
LEPOR: A machine translation evaluation metric #3178

Open
wants to merge 11 commits into base: develop

Conversation

@ulhaqi12

Hi,
Work on #3176
This PR contains the code for a machine translation evaluation metric, LEPOR. I have provided 'sentence_lepor' and 'corpus_lepor' methods that can be used to calculate scores for a list of sentences or for individual sentences.

  • Lepor for corpus
>>> hypothesis = ['a bird is on a stone.', 'scary crow was not bad.']
>>> references = ['a bird behind the stone.', 'scary cow was good.']
>>> corpus_lepor(references, hypothesis)
[0.7824248013113159, 0.5639427891892225]
  • Lepor for sentence
>>> hypothesis = 'a bird is on a stone.'
>>> references = 'a bird behind the stone.'
>>> sentence_lepor(references, hypothesis)
0.7824248013113159
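
For readers new to the metric, here is a rough sketch of how the sentence-level score is put together from its factors, following Eq (1) of the LEPOR paper (Han et al., 2012, https://aclanthology.org/C12-2044). The helper names length_penalty and ngram_positional_penalty match the functions discussed in the review below, but this exact decomposition is illustrative rather than the PR's final code:

def lepor_sketch(ref_tokens, hyp_tokens, alpha=1.0, beta=1.0):
    # Length penalty (LP), Eq (2) of the paper.
    lp = length_penalty(ref_tokens, hyp_tokens)
    # N-gram position difference penalty (NPosPenal) and matched-token count.
    npospenal, match_count = ngram_positional_penalty(ref_tokens, hyp_tokens)
    if match_count == 0:
        return 0.0
    # Weighted harmonic mean of recall and precision over matched tokens.
    precision = match_count / len(hyp_tokens)
    recall = match_count / len(ref_tokens)
    harmonic = (alpha + beta) / (alpha / recall + beta / precision)
    # The sentence-level LEPOR score is the product of the three factors.
    return lp * npospenal * harmonic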

Kindly suggest any improvements if required. Thank you for giving me the opportunity to contribute.

BR,
Ikram

@stevenbird (Member)

Let's think about new names for 'sentence_lepor' and 'corpus_lepor' which are more in keeping with NLTK practices.

@tomaarsen (Member) commented Jul 29, 2023

I think sentence_lepor and corpus_lepor are reasonable names given that many other scoring functions use the same format:

from nltk.translate.bleu_score import sentence_bleu as bleu
from nltk.translate.ribes_score import sentence_ribes as ribes
from nltk.translate.meteor_score import meteor_score as meteor
from nltk.translate.metrics import alignment_error_rate
from nltk.translate.stack_decoder import StackDecoder
from nltk.translate.nist_score import sentence_nist as nist
from nltk.translate.chrf_score import sentence_chrf as chrf
from nltk.translate.gale_church import trace
from nltk.translate.gdfa import grow_diag_final_and
from nltk.translate.gleu_score import sentence_gleu as gleu

However, I would propose to use:

from nltk.translate.lepor import sentence_lepor as lepor, corpus_lepor

This matches the other scores more closely, and people can primarily use lepor(..., ...) just like bleu(..., ...) and meteor(..., ...).
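
For instance, with that alias (argument order follows the sentence_lepor example in the PR description; just an illustration, not a final API):

from nltk.translate.lepor import sentence_lepor as lepor

score = lepor('a bird behind the stone.', 'a bird is on a stone.')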

Beyond that, you can use pre-commit install to set up pre-commit for (only) this repo; that way it'll automatically run some formatting scripts. Otherwise, the tests will fail on this PR.

cc: @alvations, who expressed interest in this PR.

  • Tom Aarsen

@ulhaqi12 (Author)

Thank you @tomaarsen. I have made the suggested changes and pushed them after running pre-commit install.
@alvations, your suggestions would be appreciated. Thank you.

Ikram

@tomaarsen (Member)

I think running pre-commit run --all-files should prevent the pre-commit workflow from complaining any further. By default, pre-commit only runs the (formatting) scripts over files edited in that commit, i.e. only over the __init__.py that you edited, so the changes in lepor.py are not compliant with the formatting rules yet.

@ulhaqi12 (Author)

Yes, since the changes in lepor.py were in a previous commit, pre-commit was not tracking that file. I have pushed again and committed the lepor.py code through pre-commit this time.

@alvations (Contributor) left a comment

Thank you again for the contribution!

Generally a good first draft for the PR, but there's some refactoring and function re-shuffling work, as described in the code review.

Then, after the revision, we'll have to figure out some tests, either doctests or unit tests, but first try to incorporate the suggestions/revisions from this first code review.

"""

r_len = len(reference)
o_len = len(hypothesis)
Contributor

The naming of o_len is a little too arbitrary. The variable name should be a little more readable; try:

ref_len = len(reference)
hyp_len = len(hypothesis)

Author

Yes, this way it is more readable. Done.

    elif r_len < o_len:
        return math.exp(1 - (r_len / o_len))
    else:
        return math.exp(1 - (o_len / r_len))
Contributor

It might be good to follow how Eq (2) in the original LEPOR paper lists the three options, or at least annotate with a comment what the else: ... represents (to align with the paper):

    if ref_len == hyp_len:
        return 1
    elif ref_len < hyp_len:
        return math.exp(1 - (ref_len / hyp_len))
    else:  # i.e. ref_len > hyp_len
        return math.exp(1 - (hyp_len / ref_len))

Author

Sure, it's better to have it as close to the paper as possible.

def length_penalty(reference: List[str], hypothesis: List[str]) -> float:
"""
Function will calculate length penalty(LP) one of the components in LEPOR, which is defined to embrace
the penalty for both longer and shorter hypothesis compared with the reference translations.
Contributor

Small grammar suggestion:

This function calculates the length penalty (LP) for the LEPOR metric, which is defined to embrace the penalty for both longer and shorter hypotheses compared with the reference translations.
Refer to Eq (2) in https://aclanthology.org/C12-2044

Author

Thank you for this.

Function will calculate length penalty(LP) one of the components in LEPOR, which is defined to embrace
the penalty for both longer and shorter hypothesis compared with the reference translations.

:param reference: reference sentence
Contributor

Style suggestion: Capitalize start of param description.

:param reference: Reference sentence

Author

Sure, I will do that. Actually, I saw it in bleu_score.py; this wasn't the case there.


:param reference: reference sentence
:type reference: str
:param hypothesis: hypothesis sentence
Contributor

:param hypothesis: Hypothesis sentence 



def ngram_positional_penalty(
    ref_words: List[str], hypothesis_words: List[str]
@alvations (Contributor) Jul 31, 2023

This function should follow the parts that the original LEPOR paper describes, so that it's more readable and accessible to people learning the algorithm. From the paper, "To calculate the value, there are two steps: aligning and calculating".

def alignment(ref_tokens: List[str], hyp_tokens: List[str]):
    """
    This function computes the context-dependent n-gram word alignment task, which
    takes into account the surrounding context (neighbouring words) of the potential
    word to select better matching pairs between the output and the reference.

    This alignment task is used to compute the ngram positional difference penalty
    component of the LEPOR score. Generally, the function finds the matching tokens
    between the reference and hypothesis, then finds the indices of the longest matching
    n-grams by checking the left and right unigram window of the matching tokens.

    :param ref_tokens: A list of tokens in the reference sentence.
    :type ref_tokens: List[str]
    :param hyp_tokens: A list of tokens in the hypothesis sentence.
    :type hyp_tokens: List[str]
    """
    alignments = []

    # Store the reference and hypothesis tokens length.
    hyp_len = len(hyp_tokens)
    ref_len = len(ref_tokens)

    for hyp_index, hyp_token in enumerate(hyp_tokens):

        # If no match.
        if ref_tokens.count(hyp_token) == 0:
            alignments.append(-1)
        # If only one match.
        elif ref_tokens.count(hyp_token) == 1:
            alignments.append(ref_tokens.index(hyp_token))
        # Otherwise, compute the multiple possibilities. 
        else:
            # Keeps an index of where the hypothesis token matches the reference.
            ref_indexes = [i for i, ref_token in enumerate(ref_tokens) if ref_token == hyp_token]

            # Iterate through the matched tokens, and check if 
            # the one token to the left/right also matches.
            is_matched = [False] * len(ref_indexes)
            for ind, ref_index in enumerate(ref_indexes):
                # The token one to the left also matches.
                if (
                    0 < ref_index - 1 < ref_len
                    and 0 < hyp_index - 1 < hyp_len
                    and ref_tokens[ref_index - 1] == hyp_tokens[hyp_index - 1]
                ):
                    is_matched[ind] = True
                # The token one to the right also matches.
                elif (
                    0 < ref_index + 1 < ref_len
                    and 0 < hyp_index + 1 < hyp_len
                    and ref_tokens[ref_index + 1] == hyp_tokens[hyp_index + 1]
                ):
                    is_matched[ind] = True
                # Neither the left nor the right token matches.
                else:
                    is_matched[ind] = False

            # Stores the alignments that have matching phrases.
            # If there's only a single matched alignment. 
            if is_matched.count(True) == 1:
                alignments.append(ref_indexes[is_matched.index(True)])
            # If there are multiple matched alignments that have matching
            # tokens in the left/right window, we pick the matching token
            # whose position is nearest to the hypothesis token.
            elif is_matched.count(True) > 1:
                min_distance = float("inf")
                min_index = 0
                for match, ref_index in zip(is_matched, ref_indexes):
                    if match:
                        distance = abs(hyp_index - ref_index)
                        if distance < min_distance:
                            min_distance = distance
                            min_index = ref_index
                alignments.append(min_index)
            # If there are no matched alignments in the left/right window,
            # we still pick the nearest of the matching tokens,
            # without explicitly checking the window.
            else:
                min_distance = float("inf")
                min_index = 0
                for ref_index in ref_indexes:
                    distance = abs(hyp_index - ref_index)
                    if distance < min_distance:
                        min_distance = distance
                        min_index = ref_index
                alignments.append(min_index)

    # The alignments are one-indexed to keep track of the ending slice pointer of the matching n-grams.
    alignments = [a + 1 for a in alignments if a != -1]
    return alignments

Contributor

Then the calculating part:

def ngram_positional_penalty(
    ref_tokens: List[str], hyp_tokens: List[str]
) -> (float, float):
    """
    This function calculates the n-gram position difference penalty (NPosPenal) described in the LEPOR paper. 
    The NPosPenal is an exponential of the length normalized n-gram matches between the reference and the hypothesis.

    :param ref_tokens: List of words in the reference sentence.
    :type ref_tokens: List[str]
    :param hyp_tokens: List of words in the hypothesis sentence.
    :type hyp_tokens: List[str]

    :return: A tuple containing two elements:
             - NPosPenal: N-gram positional penalty.
             - match_count: Count of matched n-grams.
    :rtype: tuple
    """

    alignments = alignment(ref_tokens, hyp_tokens)
    match_count = len(alignments)

    # Stores the n-gram position values (difference values) of aligned words
    # between output and reference sentences,
    # aka |PD| of eq (4) in https://aclanthology.org/C12-2044
    pd = []
    for i, a in enumerate(alignments):
        pd.append(abs((i + 1) / len(hyp_tokens) - a / len(ref_tokens)))

    npd = sum(pd) / len(hyp_tokens)
    return math.exp(-npd), match_count
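
For illustration, one way to exercise the two pieces together, a minimal sketch assuming both suggested functions live in the same module (the token lists are simply the sentences from the PR description split on whitespace):

ref_tokens = "a bird behind the stone .".split()
hyp_tokens = "a bird is on a stone .".split()

# Step 1: align hypothesis tokens to reference positions.
print(alignment(ref_tokens, hyp_tokens))

# Step 2: turn the alignment into the NPosPenal factor and the matched-token count.
npospenal, match_count = ngram_positional_penalty(ref_tokens, hyp_tokens)
print(npospenal, match_count)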

Author

This is a really great suggestion. Thanks!

Author

Thank you for your time in commenting on the code. It looks better now.

>>> sentence_lepor(reference, hypothesis)
0.7824248013113159

:param reference: reference sentence
Contributor

:param references: Reference sentences

Contributor

Capitalize first word in param description.



def sentence_lepor(
    reference: str, hypothesis: str, alpha: float = 1.0, beta: float = 1.0
Contributor

The sentence_lepor function should follow the general interface of the other metrics and take in a list of references per hypothesis.

Contributor

See

>>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights) # doctest: +ELLIPSIS

Author

Yes, this is the case with the other metrics. I will provide the same facility as well.
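
One possible shape for that interface, sketched very roughly (single_reference_lepor is a hypothetical helper, and taking the maximum over references mirrors how meteor_score aggregates rather than anything decided in this PR):

def sentence_lepor(references, hypothesis, alpha=1.0, beta=1.0):
    # Score the hypothesis against each reference and keep the best score.
    return max(
        single_reference_lepor(reference, hypothesis, alpha, beta)
        for reference in references
    )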

hypothesis = nltk.word_tokenize(hypothesis)

# reference = re.findall(r"[\w']+|[.,!?;]", reference)
# hypothesis = re.findall(r"[\w']+|[.,!?;]", hypothesis)
Contributor

Since this is in NLTK, I think nltk.word_tokenize() would be expected, but you can try adding simple vs. NLTK tokenization as a parameter:

import re

from nltk.tokenize import word_tokenize

def sentence_lepor(
    reference: str, hypothesis: str, alpha: float = 1.0, beta: float = 1.0,
    use_nltk_tokenize=True):
    ....
    if use_nltk_tokenize:
        reference = word_tokenize(reference)
        hypothesis = word_tokenize(hypothesis)
    else:
        reference = re.findall(r"[\w']+|[.,!?;]", reference)
        hypothesis = re.findall(r"[\w']+|[.,!?;]", hypothesis)

Author

Yes, that's better. Perhaps we can also provide an option to pass a callable tokenizer (if not passed, we can use the one already in NLTK).
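
A rough sketch of what that callable-tokenizer option could look like, keeping the single-reference signature from the snippet above for simplicity (the parameter name tokenizer and this exact signature are assumptions, not necessarily what the PR ends up with):

from typing import Callable, List, Optional

from nltk.tokenize import word_tokenize

def sentence_lepor(
    reference: str,
    hypothesis: str,
    alpha: float = 1.0,
    beta: float = 1.0,
    tokenizer: Optional[Callable[[str], List[str]]] = None,
):
    # Fall back to NLTK's word_tokenize when no callable is supplied.
    tokenize = tokenizer if tokenizer is not None else word_tokenize
    reference_tokens = tokenize(reference)
    hypothesis_tokens = tokenize(hypothesis)
    ...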

[0.7824248013113159, 0.5639427891892225]


:param references: reference sentences
Contributor

Same comment as for sentence_lepor: the expected behavior should be to take in multiple references per hypothesis.

@ulhaqi12 (Author)

Thank you very much @alvations for giving me the opportunity to contribute; I really appreciate your time in reviewing the code.

I have made changes in the code as per your comments. Feel free to suggest more things if needed. Only one test is in the docstring for now, but I can add more once we finalize whether to have them as doctests or unit tests.

BR,
Ikram

@ulhaqi12 force-pushed the feature/lepor-tranlation-evaluation-metric branch from a870de1 to 06313e4 on July 31, 2023 20:39
@ulhaqi12 (Author) commented Aug 7, 2023

Hey @alvations,
Thank you for your time and guidance. I have already pushed the required changes as per your recent review. Kindly suggest if any improvement is required.
Can you give me a little guidance on how I should go about testing? Should I implement more tests within docstrings?

BR,
Ikram

@ulhaqi12 (Author) commented Sep 4, 2023

Hi @alvations,
A kind reminder.

@ulhaqi12 (Author)

Hey @alvations, @tomaarsen, and @stevenbird,

Hope you all are doing great.
I was wondering if we could look into the remaining work with your suggestions and close this one.
Thank you

BR,
Ikram
