fix: levenstein distance for duplicated letters #2849
@avena554 Perhaps you'll be able to have a quick look at this? I know you're familiar with the function.
Hello @p9f , @tomaarsen , I'm trying to have a look, but I can't reproduce the behavior at hand. These are the 4 examples mentioned above:
I confirm that the first 3 lines above return 1, which is what I would expect and, I think, conforms to @p9f's expectations as well. When I try the fourth Best,
So small update :
with the duplicated letter on the right is actually the incriminated one, it returns 0 for me as well when it should be 1. Now that I understand the problem, I'll look at the solution :) |
Hello again @p9f, @tomaarsen , so I have looked at the proposed changes. In my understanding, it does the two things mentioned by @p9f :
While we're at it, what I personally find (more) confusing each time I go back to this function is that the loop variables
Fix a bug where a duplicated letter was not contributing to the distance, if transposition was set to true and the duplicated letter was the left argument.

```python3
edit_distance("duuplicated", "duplicated", transpositions=False)
edit_distance("duplicated", "duuplicated", transpositions=True)
edit_distance("duuplicated", "duplicated", transpositions=True)
# all return 1 - correct

edit_distance("duplicated", "duuplicated", transpositions=True)
# returns 0 - incorrect
```

I believe it is a bug introduced three weeks ago by PR [2736].

The fix makes the nltk implementation closer to the [wikipedia] pseudo code, which should make further reviews / iteration easier I believe.

[2736]: #2736
[wikipedia]: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
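For readers who want to check the expected values independently of nltk, here is a minimal sketch that transcribes the Wikipedia pseudo code for the distance with adjacent transpositions. The function name `damerau_levenshtein` is hypothetical, and `da`, `db`, `k` and `ell` mirror the pseudo code's variable names. This is not the nltk implementation, only a reference to compare against, assuming unit costs for every operation.

```python
from collections import defaultdict


def damerau_levenshtein(a: str, b: str) -> int:
    """Distance with adjacent transpositions, unit costs (hypothetical helper)."""
    len_a, len_b = len(a), len(b)
    maxdist = len_a + len_b

    # The pseudo code indexes d from -1, so everything is shifted by one here:
    # d[i + 1][j + 1] holds the pseudo code's d[i, j].
    d = [[0] * (len_b + 2) for _ in range(len_a + 2)]
    d[0][0] = maxdist
    for i in range(len_a + 1):
        d[i + 1][0] = maxdist
        d[i + 1][1] = i
    for j in range(len_b + 1):
        d[0][j + 1] = maxdist
        d[1][j + 1] = j

    # da[c] is the last row (1-based) of `a` in which character c was seen.
    da = defaultdict(int)

    for i in range(1, len_a + 1):
        # db is the last column (1-based) in this row where a[i] matched b[j].
        db = 0
        for j in range(1, len_b + 1):
            k = da[b[j - 1]]
            ell = db  # value from *before* a possible match at this column
            if a[i - 1] == b[j - 1]:
                cost = 0
                db = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,  # substitution (or match)
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                d[k][ell] + (i - k - 1) + 1 + (j - ell - 1),  # transposition
            )
        # Updated once per row, after the inner loop has finished.
        da[a[i - 1]] = i

    return d[len_a + 1][len_b + 1]


print(damerau_levenshtein("duplicated", "duuplicated"))
print(damerau_levenshtein("duuplicated", "duplicated"))
```

Both calls should agree, since with unit costs the distance does not depend on the order of the arguments.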
Thanks @avena554 for your quick replies.
Apologies for that, you are right I had it wrong in my report :/ I fixed it.
Me neither. I can update the PR to just correct the indentation of the line 119 if you prefer.
No strong opinion on this. I agree it is confusing, but I think it will be confusing either way, just due to the nature of the algorithm. It would result in (original version):

    # iterate over the array
    for i in range(1, len1 + 1):
        last_right = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
            if s1[i - 1] == s2[j - 1]:
                last_right = j
        last_left_t[s1[i - 1]] = i
    return lev[len1][len2]

or (close to wikipedia):

    # iterate over the array
    for i in range(1, len1 + 1):
        last_right_buf = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            last_right = last_right_buf
            if s1[i - 1] == s2[j - 1]:
                last_right_buf = j
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
        last_left_t[s1[i - 1]] = i
    return lev[len1][len2]
Thanks @p9f, @avena554, @tomaarsen. I for one like implementations that are as close as possible to published reference versions, since this means their correctness is transparent.
@p9f would you please go ahead with your final proposed change, re loop counter from
Start i / j loops from 1 and not 0 to make the code closer to [wikipedia] pseudo code, as requested by this pull request comment [0].

[wikipedia]: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
[0]: #2849 (comment)

@stevenbird updated
Looks good to me. I messed around with the outputs some more, and it seems to pass all tests, and all other tests I can come up with.
Perhaps @avena554 can have another quick look at the latest changes? I would appreciate it.
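Since the bug reported in this PR showed up as an asymmetry (the result changed when the two arguments were swapped), one cheap extra check is to compare `distance(a, b)` with `distance(b, a)` over a word list. The sketch below assumes unit costs, where the distance should be symmetric. `asymmetric_pairs` is a hypothetical helper, not part of nltk, and the `buggy` function is a toy stand-in that only imitates the reported symptom.

```python
from itertools import combinations


def asymmetric_pairs(distance, words):
    """Return (a, b, d_ab, d_ba) for each pair where swapping arguments changes the result."""
    failures = []
    for a, b in combinations(words, 2):
        d_ab, d_ba = distance(a, b), distance(b, a)
        if d_ab != d_ba:
            failures.append((a, b, d_ab, d_ba))
    return failures


def buggy(a, b):
    # Toy distance: returns 0 whenever the first argument is the shorter one,
    # imitating the symptom described in this PR. Not a real edit distance.
    return 0 if len(a) < len(b) else abs(len(a) - len(b))


print(asymmetric_pairs(buggy, ["duplicated", "duuplicated"]))
```

With nltk installed, the same helper could be called as `asymmetric_pairs(lambda a, b: edit_distance(a, b, transpositions=True), words)`; an empty list means no asymmetry was found for those words.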
Thanks @p9f and @tomaarsen
@avena554 does this look good to you? Thanks
Hi all, sorry, it's been a little busy lately and I forgot to reply. Yes, I've looked at the latest change and it looks good to me! Thanks again @p9f. Best,
Thanks all!
Fix a bug where a duplicated letter was not contributing to the
distance, if transposition was set to true and duplicated letter was the
left argument.
I believe it is a bug introduced three weeks ago by PR 2736.
The fix makes nltk implementation closer to the wikipedia pseudo code,
which should make further reviews / iteration easier I believe.
should fix #2848