Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edit_distance now computes the actual Damerau-Levenshtein edit-distance #2736

Conversation

avena554
Copy link
Contributor

This is a (rather quick) fix to #2734 () -- assuming one wants edit_distance(s1, s2, transpositions=True) function in nltk to computes the Damerau-Levenshtein distance rather than Optimal String Alignment, this does it (with minimal changes).

Best,
Antoine

@stevenbird stevenbird self-assigned this Jul 1, 2021
@stevenbird
Copy link
Member

stevenbird commented Aug 17, 2021

@avena554 thanks for your contribution, and sorry for the delay!
Could I trouble you to please provide some test cases in a new file:
nltk/develop/nltk/test/unit/test_distance.py

@avena554
Copy link
Contributor Author

Dear @stevenbird ,

sorry for my own response delay. I added a couple unit tests checking the relevant behavior, though very basic ones, at the requested location.

Best,
Antoine

@tomaarsen
Copy link
Member

tomaarsen commented Sep 14, 2021

Hello @avena554,

Thank you for providing some tests for us to look at. I took some time to modify the tests and add some more, and ran into the following quirk, which I believe is not quite right. A simple program to reproduce the issue is the following:

from nltk.metrics import edit_distance

print(edit_distance("wants", "wasp", transpositions=True))
print(edit_distance("wasp", "wants", transpositions=True))

which outputs:

2
3

I might be mistaken, but it should not be possible to change "wants" into "wasp" with just 2 modifications, right? Even when transpositions are allowed.


For comparison, this same program outputs

3
3

on the develop branch.

@avena554
Copy link
Contributor Author

Dear @tomaarsen ,

Thanks a lot, you are definitively right. There is something wrong with my changes too. The "abc" -> "ca" example is done right, but there is something quite wrong with my changes.

I'll try and fix it.

Best,
Antoine

@tomaarsen
Copy link
Member

@avena554 Would a possible problem be that d in https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance (which is called lev in our code), is still instantiated to have dimensions length(a)+1, length(b)+1, instead of the correct length(a)+2, length(b)+2?
As you can see in the Wikipedia algorithm, d is -1 indexed, and the first row and column (labeled as -1 in the pseudocode) should be maxdist.

I believe this detail was not updated in the conversion from the Optimal String Alignment algorithm to the correct Damerau-Levenshtein edit-distance. Updating it will likely cause some nasty off-by-one errors though.

@tomaarsen
Copy link
Member

@avena554 I'm not sure the output is also incorrect with transpositions=False. If I run the following program:

from nltk.metrics import edit_distance
print(edit_distance("wants", "wasp", transpositions=False))

Then I get 3, which seems to be correct? For example:

# "wants" -> "wats" -> "was" -> "wasp"
# Two deletions followed by an insertion

That said, the way it finds that output might still be incorrect.

@avena554
Copy link
Contributor Author

@tomaarsen

Yes, this was nonsense from me sorry.

I am in a bit of a rush now, but I'll be able to look at your suggestion in a couple hours !

Thanks !

@tomaarsen
Copy link
Member

@avena554 Wonderful, I'm curious to see what caused the issue!

@avena554
Copy link
Contributor Author

avena554 commented Sep 15, 2021

Hello again @tomaarsen ,

I think I have now fixed the problem. I don't think there was any array size or indices offset problems (though I haven't extensively tested for that). What caused the mistake was a mix up between the first and the second argument strings towards the end of the main loop in edit_distance, (s2 instead of s1). The thing is, last_left_t is supposed to keep track, for every symbol in the alphabet, of the index at which it was last seen in s1. Now I updated it based on s2 instead. I guess it was luck or (lack thereof) that it worked on the abc -> ca example :)

I have refactored the unit tests accordingly (you said you added some of yours, I don't really know how to merge those with my branch, sorry about that). Many thanks again for your vigilance, I hope it works now.

@tomaarsen
Copy link
Member

@avena554 I'll have a little look when I have a moment! And don't worry, I believe I have permissions to add commits to this PR, so I can just add my tests myself, if that's okay with you.

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avena554 I manually determined some simple test cases, and they all seem to pass. I have some confidence that the program is now correct. Thank you for the fix!

For the record, if I copy the new tests to the develop branch, then the following fail:

FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[abc-ca-1-expecteds0] - assert 3 == 2
FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[abc-ca-5-expecteds1] - assert 3 == 2
FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[wants-swim-2-expecteds15] - assert 7 == 6

The remainder pass for the develop branch too.

@avena554
Copy link
Contributor Author

@tomaarsen

Thanks for checking !

So if the function's correct, did I mess up something with my tests, or do you mean than adding the new tests to the develop branch -- before adding the fix in edit_distance -- some tests using transpositions fail, which was expected ?

If that's the former, sorry about that !

@tomaarsen
Copy link
Member

@avena554 The latter. Some of the new tests to the develop branch (before adding the fix in edit_distance) fail. This was indeed to be expected.
Your tests were correct, I believe.

@stevenbird stevenbird merged commit 77b5945 into nltk:develop Sep 18, 2021
@stevenbird
Copy link
Member

Great work @avena554 and @tomaarsen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

edit_distance() does not compute the actual Damerau-Levenshtein distance when transpositions is set to True
3 participants