Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing results within edit distance in fuzzy search #390

Closed
lucaong opened this issue Feb 11, 2019 · 5 comments
Closed

Missing results within edit distance in fuzzy search #390

lucaong opened this issue Feb 11, 2019 · 5 comments

Comments

@lucaong
Copy link
Contributor

lucaong commented Feb 11, 2019

Hello,
the issue #375 seems to be still occurring in some cases. Fuzzy search, even with stemming disabled, misses some words within the given edit distance. In particular, it seems to miss terms that are equal to the search term plus a suffix as long as the maximum edit distance.

In other words, searching for abc~2 correctly matches abcx, but misses abcxx. Searching for abc~3 correctly matches abcx and abcxx, but misses abcxxx.

Here is a fiddle reproducing the issue on Lunr v3.2.5 (latest release at the moment of posting): https://jsfiddle.net/3ajf72Ly/1/

I know that #382 addressed this already, and it did fix some cases, but not all.

This is also visible from the expansion of lunr.TokenSet.fromFuzzyString("abc", 2).toArray(), where abc** is missing:

**abc, **bc, **c, *a*bc, *a*c, *ab, *ab*, *ab*c, *abc, *abc*,
*ac, *acb, *b, *b*, *b*c, *bac, *bc, *bc*, *c, *cb, a*, a**, a**bc,
a**c, a*b, a*b*, a*b*c, a*bc, a*bc*, a*c, a*c*, a*cb, ab, ab*,
ab**, ab**c, ab*c, ab*c*, abc, abc*, ac, ac*, ac*b, acb, acb*, b,
b*, b*ac, b*c, ba, ba*, ba*c, bac, bac*, bc, bc*, bca

Thanks again for the great library!

@olivernn
Copy link
Owner

That's annoying! Thanks for reporting, I'll put together some tests and then figure out how to get a fix out.

lucaong added a commit to lucaong/lunr.js that referenced this issue Feb 11, 2019
@lucaong
Copy link
Contributor Author

lucaong commented Feb 11, 2019

Thanks a lot! Here is one simple test reproducing it: lucaong@24f4223

I can send a PR if you wish, but for now it would only include the test, as I did not get the time to delve into the code and devise a fix yet.

@olivernn
Copy link
Owner

Thanks, yeah I had something similar as a test too.

I've got something that fixes that exact bug, but I'm convinced there are other bugs in that code with a similar fix, I'm just trying to find the right test to expose it.

If you look at the changes from the previous diff the fix is to unconditionally add a final frame to the stack. That same pattern is in every other fuzzy operation (insertion, deletion, substitution etc). Indeed, it is the same fix in a different location that fixes what you've reported.

Of course, I could be completely wrong, I'm having a hard time trying to wrap my head around that code at the moment!

I'll spend a bit more time trying to understand the bug in a wider context, otherwise I'll put out a patch in the next couple of days with the more specific fix to unblock you.

@lucaong
Copy link
Contributor Author

lucaong commented Feb 15, 2019

Hi @olivernn
Yes I understand. No need to rush for a fix, better to take the due time to properly figure it out.

There is one way I could maybe help. When working on this kind of algorithmic code, where it’s easy to overlook corner cases, I often find it useful to have one or two property based tests, where I assert some invariant on many examples of randomly generated data.

I could help writing one such test for the fuzzy search if that is useful. I have something like that on my own full text search library, and it helped me discover a number of cases: https://github.com/lucaong/minisearch/blob/master/src/SearchableMap/SearchableMap.test.js#L256

I probably won’t make it this weekend, but I might have some time soon if it’s useful.

@olivernn
Copy link
Owner

I've just pushed a fix for this in 2.3.6. Thanks again for taking the time to investigate and report the bug!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants