Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix edit_distance_align() in distance.py #3017

Merged
merged 3 commits into from
Jul 19, 2022
Merged

Conversation

yzhaoinuw
Copy link
Contributor

@yzhaoinuw yzhaoinuw commented Jul 11, 2022

This pull request provides a quick fix to issue #2954, in which the alignment mapping of two strings based on the minimum edit distance is wrong.

The Levenshtein edit distance table, named lev inside edit_distance_align() was correct. The problem lied in the _edit_dist_backtrace() function. During back tracing, in previous version of _edit_dist_backtrace(), the first cell in (i - 1, j), (i, j - 1), (i - 1, j - 1) (in this order) that has the minimum cost value was selected to be the next cell in the path. However, it did not consider that going through either cell in (i - 1, j), (i, j - 1) should always incur an additional cost because they correspond to insertion or deletion. Going though (i - 1, j - 1), however, corresponds to a substitution, which can incur an additional cost of either 0 or a user defined substitution cost (default to 1 in edit_distance_align()). For example (see issue #2954 for an illustration of a similar example), suppose we are at cell (i, j) whose cost is n. Now, if the cost values for (i - 1, j), (i, j - 1), (i - 1, j - 1) are all n, same as the cost value in cell (i, j), then (i - 1, j) would be added to the path, which is an illegal move because you can't go through an insertion or deletion without an additional cost. In this case, the only viable path is through cell (i - 1, j - 1), which is a substitution move of 0 cost. The easiest way to fix that is to change the order to (i - 1, j - 1), (i - 1, j), (i, j - 1). The corresponding docstring inside edit_distance_align() has also been updated to reflect this change.

@tomaarsen
Copy link
Member

I believe you're very right with your reasoning (also in #2954 (comment)).

It is possible for multiple $n$'s to appear next to one another in the grid, but it is not possible for a path to exist between them. I'm fairly certain that this change improves the result of edit_distance_align, and fixes the example we had previously talked about with fubar and foobar.
However, I'm not necessarily convinced yet that this simple fix with an updated order works for all situations. Perhaps @Miguelvs23 can have a look at this too. I know you were interested in this topic.

Oh, one last thing @yzhaoinuw, we use pre-commit to run these scripts on every single commit, to ensure a certain level of code quality. That is why your CI was previously failing so quickly. I would advise performing:

pip install pre-commit
pre-commit install

Now, any time you commit anything to the NLTK git, then it will run those scripts beforehand, notifying you of any issues. It may update files to reduce spaces at the end or change formatting, which you can then add and include in the commit.

(0, 0) into (0, 1) corresponds to an insertion, which makes much more sense for 'rain' and 'brainy' than the previous (0, 0) into (1, 1)
@tomaarsen
Copy link
Member

I've also updated a previously broken test case regarding this. Previously, it was:

>>> edit_distance_align("rain", "brainy")
    [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]

However, (0, 0) into (1, 1) doesn't make much sense for "rain" and "brainy".
It is now:

>>> edit_distance_align("rain", "brainy")
    [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]

@yzhaoinuw
Copy link
Contributor Author

yzhaoinuw commented Jul 11, 2022

Thank you @tomaarsen. Just installed pre-commit. Is there anything that I need to do at this point for this pull request hopefully to be merged?

To your point, this fix does seem a little too "simple". But my reasoning that this fix works for all situation is this: provided that the Levenshtein edit distance table, named lev in edit_distance_align() is correct, then for any cell in the table we can always find at least one cell from its left, upper, and upper left neighbors which is a legitimate move to add to a path. Suppose the current cell in lev has a cost value of n. We can divide all situations into two cases.

  • Case 1. All cells in the left, upper, and upper left neighbors have cost values greater or equal to n.

  • Case 2. At least one cell in the left, upper, and upper left neighbors has cost value smaller than n.

We can see that the two cases above cover all situations. If we can show that the fix leads to a correct path in both cases then we prove that the fix works for all situations. In Case 1, we won't have cost values in all three neighbors greater than that of the current cell, because that would indicate an error in lev, since the cost value in the current cell is computed by adding a non negative integer (1 for insertion or deletion, 0 or a user defined positive integer value for substitution) to the cost value in one of the three neighbors. Then we can infer that the cost value in the upper left neighbor must be equal to n, as neither the left nor upper neighbor is a legitimate move when their cost value is greater or equal to n. So in this case the reordering of neighbors list fix in this pull request would pick the correct cell to add to the path.

In Case 2, whichever of the three neighbors that has a cost value smaller than n is a legitimate move to add to the path. So in this case, if the upper left neighbor's cost value is smaller than n, then it is a legitimate move to add to the path. If not, then we can infer that the upper or the left neighbor has a cost value smaller than n (again based on the premise that lev is correct). Then, the one that has the minimum cost value is a legitimate move to add to the path (when both have equal cost values, then which to pick doesn't matter). The min() function that was previously already in _edit_dist_backtrace() takes care of that.

To conclude, the proposed fix should lead to a correct path in all situations. Please let me know if I missed something.

@stevenbird stevenbird merged commit c4c67a9 into nltk:develop Jul 19, 2022
@stevenbird
Copy link
Member

I'm convinced this is correct... thank you @yzhaoinuw and @tomaarsen.

@tomaarsen
Copy link
Member

That looks sound @yzhaoinuw. Thank you for the simple fix and the analysis! Also thank you to @Miguelvs23 for the discussion in #2954.

@yzhangcs
Copy link

yzhangcs commented May 22, 2023

@yzhaoinuw @tomaarsen @stevenbird Hi, I think there are still some problems in this impl by simply reordering the list.
Considering a simplest case with sub cost INF in the following table, the min edit cost should be 2, and the desired alignments should be [(0, 0), (0, 1), (1, 1)], but edit_distance_align gives [(0, 0), (1, 1)], which is incorrect I think.

# b
# 0 1
a 1 2

because SUBSTITUTION is disabled by INF cost, what we actually expect to get is INSERTION then DELETION.

>>> from nltk.metrics.distance import edit_distance, edit_distance_align
>>> edit_distance('a', 'b')
1
>>> edit_distance_align('a', 'b')
[(0, 0), (1, 1)]
>>> edit_distance('a', 'b', float('inf'))
2
>>> edit_distance_align('a', 'b', float('inf'))  # incorrect
[(0, 0), (1, 1)]

Maybe we need to maintain a path table to record the directions to facilitate the follow-up backtrace?

@yzhaoinuw
Copy link
Contributor Author

Hi @yzhangcs, good catch. I tracked down the cause of this bug to be in _edit_dist_backtrace(),

def _edit_dist_backtrace(lev):
    i, j = len(lev) - 1, len(lev[0]) - 1
    alignment = [(i, j)]

    while (i, j) != (0, 0):
        directions = [
            (i - 1, j - 1),  # substitution
            (i - 1, j),  # skip s1
            (i, j - 1),  # skip s2
        ]

        direction_costs = (
            (lev[i][j] if (i >= 0 and j >= 0) else float("inf"), (i, j))
            for i, j in directions
        )
        _, (i, j) = min(direction_costs, key=operator.itemgetter(0))

        alignment.append((i, j))
    return list(reversed(alignment))

The substitution cost being float(inf) means that the substitution is not allowed. Yet in the code above, when adding directions, paths via substitution are still added. In your example, at (1, 1), the directions_costs is (0, (0, 0)), (1, (0, 1)), (1, (1, 0)), in this order. When the direction with the minimum cost in the lev table gets picked out and appended to alignment, it is the direction via substitution that gets appended, which is incorrect when the substitution cost is float(inf). In fact, it would be incorrect for any substitution cost greater than 2, because substitution would then cost more than a deletion, costing 1, followed by an addition, costing another 1.

My proposed fix is to pass the substitution_cost from edit_distance_align() to _edit_dist_backtrace(). Then in _edit_dist_backtrace(), add a condition to check for substitution_cost. When it is greater than 2, do not add (i - 1, j - 1) to direction. I will do a follow-up pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants