fix edit_distance_align() in distance.py #3017
…at it gives the correct path
I believe you're very right with your reasoning (also in #2954 (comment)). It is possible for multiple
Oh, one last thing @yzhaoinuw, we use
Now, any time you commit anything to the NLTK repository, it will run those scripts beforehand and notify you of any issues. It may update files to remove trailing whitespace or change formatting; you can then add those changes and include them in the commit.
(0, 0) into (0, 1) corresponds to an insertion, which makes much more sense for 'rain' and 'brainy' than the previous (0, 0) into (1, 1).
I've also updated a previously broken test case regarding this. Previously, it was:
>>> edit_distance_align("rain", "brainy")
[(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
However, it is now:
>>> edit_distance_align("rain", "brainy")
[(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
Thank you @tomaarsen. Just installed
To your point, this fix does seem a little too "simple". But my reasoning that this fix works for all situations is this: provided that the Levenshtein edit distance table, named
We can see that the two cases above cover all situations. If we can show that the fix leads to a correct path in both cases, then we have proved that the fix works for all situations.
In Case 1, we won't have cost values in all three neighbors greater than that of the current cell, because that would indicate an error in
In Case 2, whichever of the three neighbors has a cost value smaller than n is a legitimate move to add to the path. So in this case, if the upper left neighbor's cost value is smaller than n, then it is a legitimate move to add to the path. If not, then we can infer that the upper or the left neighbor has a cost value smaller than n (again based on the premise that
To conclude, the proposed fix should lead to a correct path in all situations. Please let me know if I missed something.
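The case analysis above can also be checked by brute force. The following is only a sketch under stated assumptions: `lev_table`, `backtrace_diagonal_first`, `path_cost` and `check_all` are hypothetical helper names written for this illustration, not NLTK functions, and the check covers only the default substitution cost of 1 over short strings.

```python
import itertools
import operator


def lev_table(s1, s2, substitution_cost=1):
    """lev[i][j] is the edit distance between s1[:i] and s2[:j]."""
    lev = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        lev[i][0] = i
    for j in range(len(s2) + 1):
        lev[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            step = 0 if s1[i - 1] == s2[j - 1] else substitution_cost
            lev[i][j] = min(
                lev[i - 1][j] + 1,  # skip a character of s1
                lev[i][j - 1] + 1,  # skip a character of s2
                lev[i - 1][j - 1] + step,  # match or substitution
            )
    return lev


def backtrace_diagonal_first(lev):
    """Walk back from the last cell, trying the diagonal neighbor first."""
    i, j = len(lev) - 1, len(lev[0]) - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        neighbors = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        in_table = [(lev[a][b], (a, b)) for a, b in neighbors if a >= 0 and b >= 0]
        # min() returns the first minimal item, so ties favor the diagonal.
        _, (i, j) = min(in_table, key=operator.itemgetter(0))
        path.append((i, j))
    return path[::-1]


def path_cost(path, s1, s2, substitution_cost=1):
    """Sum the cost of every move along a path that starts at (0, 0)."""
    total = 0
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if (i1 - i0, j1 - j0) == (1, 1):
            total += 0 if s1[i0] == s2[j0] else substitution_cost
        else:
            total += 1  # insertion or deletion
    return total


def check_all(alphabet="ab", max_len=3):
    """Check every pair of short strings; return how many pairs were checked."""
    words = [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
    ]
    checked = 0
    for s1 in words:
        for s2 in words:
            lev = lev_table(s1, s2)
            path = backtrace_diagonal_first(lev)
            # A correct path costs exactly the edit distance.
            assert path_cost(path, s1, s2) == lev[-1][-1], (s1, s2)
            checked += 1
    return checked
```

Calling `check_all()` raises an `AssertionError` naming the offending pair if any path costs more than the edit distance. It says nothing about substitution costs other than 1.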
I'm convinced this is correct... thank you @yzhaoinuw and @tomaarsen.
That looks sound @yzhaoinuw. Thank you for the simple fix and the analysis! Also thank you to @Miguelvs23 for the discussion in #2954.
@yzhaoinuw @tomaarsen @stevenbird Hi, I think there are still some problems in this impl by simply reordering the list.
because
>>> from nltk.metrics.distance import edit_distance, edit_distance_align
>>> edit_distance('a', 'b')
1
>>> edit_distance_align('a', 'b')
[(0, 0), (1, 1)]
>>> edit_distance('a', 'b', float('inf'))
2
>>> edit_distance_align('a', 'b', float('inf'))  # incorrect
[(0, 0), (1, 1)]
Maybe we need to maintain a path table to record the directions to facilitate the follow-up backtrace?
Hi @yzhangcs, good catch. I tracked down the cause of this bug to be in
def _edit_dist_backtrace(lev):
    i, j = len(lev) - 1, len(lev[0]) - 1
    alignment = [(i, j)]
    while (i, j) != (0, 0):
        directions = [
            (i - 1, j - 1),  # substitution
            (i - 1, j),  # skip s1
            (i, j - 1),  # skip s2
        ]
        direction_costs = (
            (lev[i][j] if (i >= 0 and j >= 0) else float("inf"), (i, j))
            for i, j in directions
        )
        _, (i, j) = min(direction_costs, key=operator.itemgetter(0))
        alignment.append((i, j))
    return list(reversed(alignment))
The substitution cost being
My proposed fix is to pass the
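One possible way to make the backtrace respect the substitution cost is sketched below. This is an assumption-laden illustration, not necessarily the fix proposed in this thread: `lev_table` and `backtrace_checked` are hypothetical names, and passing `s1`, `s2` and `substitution_cost` to the backtrace is a design choice made only for this example. The idea is to follow a move only when the neighbor's cost plus the cost of that move reproduces the current cell.

```python
def lev_table(s1, s2, substitution_cost=1):
    """lev[i][j] is the edit distance between s1[:i] and s2[:j]."""
    lev = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        lev[i][0] = i
    for j in range(len(s2) + 1):
        lev[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            step = 0 if s1[i - 1] == s2[j - 1] else substitution_cost
            lev[i][j] = min(
                lev[i - 1][j] + 1,  # skip a character of s1
                lev[i][j - 1] + 1,  # skip a character of s2
                lev[i - 1][j - 1] + step,  # match or substitution
            )
    return lev


def backtrace_checked(lev, s1, s2, substitution_cost=1):
    """Backtrace that only takes moves consistent with how lev was filled."""
    i, j = len(s1), len(s2)
    alignment = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and j > 0:
            step = 0 if s1[i - 1] == s2[j - 1] else substitution_cost
            if lev[i - 1][j - 1] + step == lev[i][j]:
                # The diagonal move explains the current cost.
                i, j = i - 1, j - 1
                alignment.append((i, j))
                continue
        if i > 0 and lev[i - 1][j] + 1 == lev[i][j]:
            i -= 1  # skip a character of s1
        else:
            # By construction of lev, the left move must explain the cost here.
            j -= 1  # skip a character of s2
        alignment.append((i, j))
    return alignment[::-1]


inf = float("inf")
print(backtrace_checked(lev_table("a", "b"), "a", "b"))
# [(0, 0), (1, 1)]
print(backtrace_checked(lev_table("a", "b", inf), "a", "b", inf))
# [(0, 0), (0, 1), (1, 1)]
```

With an infinite substitution cost, the path for 'a' and 'b' now takes two moves, matching the edit distance of 2 shown above. A table of recorded directions, as suggested earlier in the thread, would be another route to the same result.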
This pull request provides a quick fix for issue #2954, in which the alignment mapping of two strings based on the minimum edit distance is wrong.
The Levenshtein edit distance table, named lev inside edit_distance_align(), was correct. The problem lay in the _edit_dist_backtrace() function. During backtracing, the previous version of _edit_dist_backtrace() selected the first cell among (i - 1, j), (i, j - 1), (i - 1, j - 1) (in this order) that has the minimum cost value as the next cell in the path. However, it did not consider that going through either of the cells (i - 1, j), (i, j - 1) should always incur an additional cost, because they correspond to an insertion or a deletion. Going through (i - 1, j - 1), however, corresponds to a substitution, which can incur an additional cost of either 0 or a user-defined substitution cost (defaults to 1 in edit_distance_align()).
For example (see issue #2954 for an illustration of a similar example), suppose we are at cell (i, j), whose cost is n. Now, if the cost values for (i - 1, j), (i, j - 1), (i - 1, j - 1) are all n, the same as the cost value in cell (i, j), then (i - 1, j) would be added to the path, which is an illegal move because you can't go through an insertion or deletion without an additional cost. In this case, the only viable path is through cell (i - 1, j - 1), which is a substitution move of cost 0. The easiest way to fix that is to change the order to (i - 1, j - 1), (i - 1, j), (i, j - 1). The corresponding docstring inside edit_distance_align() has also been updated to reflect this change.
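The effect of the reordering can be reproduced with a small standalone sketch. The helpers `lev_table` and `backtrace` below are hypothetical and written only for this illustration; they mimic the behavior described above rather than copy NLTK's code, and out-of-table neighbors are simply filtered out instead of being given an infinite cost.

```python
import operator


def lev_table(s1, s2, substitution_cost=1):
    """lev[i][j] is the edit distance between s1[:i] and s2[:j]."""
    lev = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        lev[i][0] = i
    for j in range(len(s2) + 1):
        lev[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            step = 0 if s1[i - 1] == s2[j - 1] else substitution_cost
            lev[i][j] = min(
                lev[i - 1][j] + 1,  # skip a character of s1
                lev[i][j - 1] + 1,  # skip a character of s2
                lev[i - 1][j - 1] + step,  # match or substitution
            )
    return lev


def backtrace(lev, diagonal_first):
    """Pick the minimum-cost neighbor; the list order decides ties."""
    i, j = len(lev) - 1, len(lev[0]) - 1
    alignment = [(i, j)]
    while (i, j) != (0, 0):
        diagonal = (i - 1, j - 1)
        if diagonal_first:
            directions = [diagonal, (i - 1, j), (i, j - 1)]  # new order
        else:
            directions = [(i - 1, j), (i, j - 1), diagonal]  # previous order
        candidates = [(lev[a][b], (a, b)) for a, b in directions if a >= 0 and b >= 0]
        # min() returns the first minimal item, so the order above breaks ties.
        _, (i, j) = min(candidates, key=operator.itemgetter(0))
        alignment.append((i, j))
    return alignment[::-1]


lev = lev_table("rain", "brainy")
print(backtrace(lev, diagonal_first=False))
# [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
print(backtrace(lev, diagonal_first=True))
# [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
```

With the previous order, the step from (1, 1) to (1, 2) is an insertion between two cells that both cost 1, which is the illegal move described above. With the diagonal first, the tie at (1, 2) resolves to (0, 1), a zero-cost match of 'r' with 'r'.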