New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Edit_distance now computes the actual Damerau-Levenshtein edit-distance #2736
Edit_distance now computes the actual Damerau-Levenshtein edit-distance #2736
Conversation
@avena554 thanks for your contribution, and sorry for the delay! |
Merge branch 'develop' of https://github.com/nltk/nltk into fix/compute-actual-Damereau-Levenshtein-distance
Dear @stevenbird , sorry for my own response delay. I added a couple unit tests checking the relevant behavior, though very basic ones, at the requested location. Best, |
Hello @avena554, Thank you for providing some tests for us to look at. I took some time to modify the tests and add some more, and ran into the following quirk, which I believe is not quite right. A simple program to reproduce the issue is the following: from nltk.metrics import edit_distance
print(edit_distance("wants", "wasp", transpositions=True))
print(edit_distance("wasp", "wants", transpositions=True)) which outputs: 2
3 I might be mistaken, but it should not be possible to change For comparison, this same program outputs 3
3 on the develop branch. |
Dear @tomaarsen , Thanks a lot, you are definitively right. There is something wrong with my changes too. The "abc" -> "ca" example is done right, but there is something quite wrong with my changes. I'll try and fix it. Best, |
@avena554 Would a possible problem be that I believe this detail was not updated in the conversion from the Optimal String Alignment algorithm to the correct Damerau-Levenshtein edit-distance. Updating it will likely cause some nasty off-by-one errors though. |
@avena554 I'm not sure the output is also incorrect with from nltk.metrics import edit_distance
print(edit_distance("wants", "wasp", transpositions=False)) Then I get # "wants" -> "wats" -> "was" -> "wasp"
# Two deletions followed by an insertion That said, the way it finds that output might still be incorrect. |
Yes, this was nonsense from me sorry. I am in a bit of a rush now, but I'll be able to look at your suggestion in a couple hours ! Thanks ! |
@avena554 Wonderful, I'm curious to see what caused the issue! |
Hello again @tomaarsen , I think I have now fixed the problem. I don't think there was any array size or indices offset problems (though I haven't extensively tested for that). What caused the mistake was a mix up between the first and the second argument strings towards the end of the main loop in I have refactored the unit tests accordingly (you said you added some of yours, I don't really know how to merge those with my branch, sorry about that). Many thanks again for your vigilance, I hope it works now. |
@avena554 I'll have a little look when I have a moment! And don't worry, I believe I have permissions to add commits to this PR, so I can just add my tests myself, if that's okay with you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avena554 I manually determined some simple test cases, and they all seem to pass. I have some confidence that the program is now correct. Thank you for the fix!
For the record, if I copy the new tests to the develop branch, then the following fail:
FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[abc-ca-1-expecteds0] - assert 3 == 2
FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[abc-ca-5-expecteds1] - assert 3 == 2
FAILED nltk\test\unit\test_distance.py::TestEditDistance::test_with_transpositions[wants-swim-2-expecteds15] - assert 7 == 6
The remainder pass for the develop branch too.
Thanks for checking ! So if the function's correct, did I mess up something with my tests, or do you mean than adding the new tests to the develop branch -- before adding the fix in If that's the former, sorry about that ! |
@avena554 The latter. Some of the new tests to the develop branch (before adding the fix in |
Great work @avena554 and @tomaarsen! |
This is a (rather quick) fix to #2734 () -- assuming one wants
edit_distance(s1, s2, transpositions=True)
function in nltk to computes the Damerau-Levenshtein distance rather than Optimal String Alignment, this does it (with minimal changes).Best,
Antoine