Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: levenstein distance for duplicated letters #2849

Merged
merged 2 commits into from Oct 27, 2021
Merged

fix: levenstein distance for duplicated letters #2849

merged 2 commits into from Oct 27, 2021

Conversation

p9f
Copy link
Contributor

@p9f p9f commented Oct 7, 2021

Fix a bug where a duplicated letter was not contributing to the
distance, if transposition was set to true and duplicated letter was the
left argument.

edit_distance("duuplicated", "duplicated", transpositions=False)
edit_distance("duplicated", "duuplicated", transpositions=False)
edit_distance("duuplicated", "duplicated", transpositions=True)
# all return 1 - correct

edit_distance("duplicated", "duuplicated", transpositions=True)
# returns 0 - incorrect

I believe it is a bug introduced three weeks ago by PR 2736.

The fix makes nltk implementation closer to the wikipedia pseudo code,
which should make further reviews / iteration easier I believe.

should fix #2848

@tomaarsen
Copy link
Member

@avena554 Perhaps you'll be able to have a quick look at this? I know you're familiar with the function.

@avena554
Copy link
Contributor

avena554 commented Oct 7, 2021

Hello @p9f , @tomaarsen ,

I'm trying to have a look, but I can't reproduce the behavior at hand. These are the 4 examples mentioned above:

  1. edit_distance("duuplicated", "duplicated", transpositions=False)
  2. edit_distance("duplicated", "duuplicated", transpositions=True)
  3. edit_distance("duuplicated", "duplicated", transpositions=True)
  4. edit_distance("duuplicated", "duplicated", transpositions=True)

I confirm that the 3 first lines above return 1 -- which is what I would expect, and, I think, conform to @p9f 's expectations as well. When I try the fourth edit_distance("duuplicated", "duplicated", transpositions=True) I get 1 as well. Since this line seems identical to the third, I would expect as much. But I wouldn't trust my (very poor) eyes on this, so: is there a typo in the bug report (and in that case, what is the actual line that would return an unexpected 0) or am I misreading the difference here ?

Best,
Antoine

@avena554
Copy link
Contributor

avena554 commented Oct 7, 2021

So small update :

edit_distance("duplicated", "duuplicated", transpositions=True)

with the duplicated letter on the right is actually the incriminated one, it returns 0 for me as well when it should be 1.

Now that I understand the problem, I'll look at the solution :)

@avena554
Copy link
Contributor

avena554 commented Oct 7, 2021

Hello again @p9f, @tomaarsen ,

so I have looked at the proposed changes. In my understanding, it does the two things mentioned by @p9f :

  • It fixes the bug indeed, which originated in an over-indented line (this one:
    last_left_t[s1[i]] = i + 1
    ) which should not part of the nested for-loop. Obviously, this is nice :)
  • As @p9f claimed, it makes the code much closer to wikipedia's pseudocode. I don't have a particularly strong opinion about this one. If the wikipedia page stays as it is, then it's true that it could help to have the code closer to it.

While we're at it, what I personally find (more) confusing each time I go back this function, is that the loop variables i and j range over string indices for the lhs and rhs strings respectively, starting at 0, whereas last_left, last_right, and the arguments expected by edit_distance_step are indices who start at 1.

Fix a bug where a duplicated letter was not contributing to the
distance, if transposition was set to true and duplicated letter was the
left argument.

```python3
edit_distance("duuplicated", "duplicated", transpositions=False)
edit_distance("duplicated", "duuplicated", transpositions=True)
edit_distance("duuplicated", "duplicated", transpositions=True)
# all return 1 - correct

edit_distance("duplicated", "duuplicated", transpositions=True)
# returns 0 - incorrect
```

I believe it is a bug introduced three weeks ago by PR [2736].

The fix make nltk implementation closer to the [wikipedia] pseudo code,
which should make further reviews / iteration easier I believe.

[2736]: #2736
[wikipedia]: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
@p9f
Copy link
Contributor Author

p9f commented Oct 8, 2021

Thanks @avena554 for your quick replies.

@avena554 So small update :
edit_distance("duplicated", "duuplicated", transpositions=True)
with the duplicated letter on the right is actually the incriminated one, it returns 0 for me as well when it should be 1.
Now that I understand the problem, I'll look at the solution :)

Apologies for that, you are right I had it wrong in my report :/ I fixed it.

@avena554 As @p9f claimed, it makes the code much closer to wikipedia's pseudocode. I don't have a particularly strong opinion about this one.

Me neither. I can update the PR to just correct the indentation of the line 119 if you prefer.

@avena554 While we're at it, what I personally find (more) confusing each time I go back this function, is that the loop variables i and j range over string indices for the lhs and rhs strings respectively, starting at 0, whereas last_left, last_right, and the arguments expected by edit_distance_step are indices who start at 1.

No strong opinion on this. I agree it is confusing, but I think it will be confusing either way just due to the nature of the algorithm.
But I can take the opportunity of this PR to change if you like, for example to make the i and j loop variable to start from 1 and not 0?

it would result in (original version):

    # iterate over the array
    for i in range(1, len1 + 1):
        last_right = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
            if s1[i - 1] == s2[j - 1]:
                last_right = j
        last_left_t[s1[i - 1]] = i
    return lev[len1][len2]

or (close to wikipedia):

    # iterate over the array
    for i in range(1, len1 + 1):
        last_right_buf = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            last_right = last_right_buf
            if s1[i - 1] == s2[j - 1]:
                last_right_buf = j
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
        last_left_t[s1[i - 1]] = i
    return lev[len1][len2]

@stevenbird
Copy link
Member

Thanks @p9f, @avena554, @tomaarsen. I for one like implementations that are as close as possible to published reference versions, since this means their correctness is transparent.

@stevenbird
Copy link
Member

stevenbird commented Oct 12, 2021

@p9f would you please go ahead with your final proposed change, re loop counter from 1 (with a comment to reference the Wikipedia version)

Start i / j loops from 1 and not 0 to make the code closer to
[wikipedia] pseudo code, as requested by this pull request comment [0].

[wikipedia]: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
[0]: #2849 (comment)
@p9f
Copy link
Contributor Author

p9f commented Oct 12, 2021

@stevenbird updated

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I messed around with the outputs some more, and it seems to pass all tests, and all other tests I can come up with.
Perhaps @avena554 can have another quick look at the latest changes? I would appreciate it.

@stevenbird
Copy link
Member

Thanks @p9f and @tomaarsen

@stevenbird
Copy link
Member

@avena554 does this look good to you? Thanks

@avena554
Copy link
Contributor

Hi all, sorry, It's been a little busy lately and I forgot to reply. Yes I've looked at the latest change and it looks good me !

Thanks again @p9f .

Best,
Antoine

@stevenbird stevenbird merged commit ad3c84c into nltk:develop Oct 27, 2021
@stevenbird
Copy link
Member

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid levenstein distance for duplicated letters
4 participants