Missing expected results in fuzzy search (no stemming) #375

Closed
lucaong opened this issue Sep 27, 2018 · 4 comments

@lucaong
Contributor

lucaong commented Sep 27, 2018

Performing fuzzy search seems to miss some words within the given edit distance.
Here is one example (disabling stemming and all other pipeline functions to ensure that we are only observing the behavior of fuzzy search):

const l = lunr(function () {
  this.field('txt')
  this.pipeline.remove(lunr.stemmer)
  this.pipeline.remove(lunr.trimmer)
  this.pipeline.remove(lunr.stopWordFilter)
  this.searchPipeline.remove(lunr.stemmer)
  this.searchPipeline.remove(lunr.trimmer)
  this.searchPipeline.remove(lunr.stopWordFilter)

  ;[
    { id: 1, txt: 'coscienza' },
    { id: 2, txt: 'scienza' },
    { id: 3, txt: 'conoscienza' },
    { id: 4, txt: 'coscienzaxx' },
  ].forEach(line => this.add(line))
})

l.search('coscienza~2')
// => [ { ref: '3', score: ... }, { ref: '1', score: ... } ]

In the example above, I would expect the words scienza and coscienzaxx to also match, as they are at an edit distance of 2 from the query term coscienza (two deletions or two insertions at a word boundary).
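
To double-check the distances claimed above independently of lunr, here is a minimal sketch of a plain Levenshtein distance. The `editDistance` helper is hypothetical and is not how lunr computes fuzzy matches; it only confirms which of the indexed words are within 2 edits of the query term.

```javascript
// Classic dynamic-programming Levenshtein distance
// (insertions, deletions, substitutions; no transpositions).
// Assumption: this is an independent check, not lunr's internal algorithm.
function editDistance(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep if equal
      )
    }
    prev = curr
  }
  return prev[b.length]
}

// Pairs of [word, distance from 'coscienza'] for the four indexed words
const distances = ['coscienza', 'scienza', 'conoscienza', 'coscienzaxx']
  .map(word => [word, editDistance('coscienza', word)])
```

All four words come out at a distance of at most 2, so by edit distance alone each of them would be expected in the results of `coscienza~2`.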

This is also visible if one observes the fuzzy TokenSet expansion for the term coscienza:

lunr.TokenSet.fromFuzzyString("coscienza", 2).toArray()
// => the result contains `*scienza` and `coscienza`, but not `scienza` or `coscienza**`
// (in the context of fuzzy search the * token is not linked to itself, so it matches exactly 1 character)
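
To make the "exactly 1 character" point concrete, here is a small sketch of how such a pattern would match a word. `matchesPattern` is a hypothetical helper written for this illustration, not part of lunr, and it assumes that `*` always consumes exactly one character, as described above.

```javascript
// Hypothetical helper: does `word` match `pattern`, where `*` stands for
// exactly one arbitrary character (not a glob that can match zero or many)?
function matchesPattern(pattern, word) {
  // Every `*` consumes one character, so the lengths must be equal
  if (pattern.length !== word.length) return false
  for (let i = 0; i < pattern.length; i++) {
    if (pattern[i] !== '*' && pattern[i] !== word[i]) return false
  }
  return true
}

// `*scienza` needs some character in front, so it cannot match `scienza`
const leadingWildcard = matchesPattern('*scienza', 'scienza')         // false
// `coscienza**` would match `coscienzaxx`, but it is absent from the expansion
const trailingWildcards = matchesPattern('coscienza**', 'coscienzaxx') // true
```

Under that assumption, neither of the two missing words can be matched by the patterns that are present, which is consistent with the search results above.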

I am not sure if this is a bug or the intended behavior of fuzzy search. In the latter case, maybe it would deserve a mention in the documentation.

Thanks again for the great work!

@olivernn
Owner

Sorry for taking a while to get to this...

Looks like a bug to me. I put together a simplified reproduction on jsfiddle.

It looks like, for some reason, trailing characters only match if they are the same as the last character in the fuzzy string. Weird! This also explains why the test is passing.

I'll dig into this a bit and come up with a fix, thanks for reporting.

@hoelzro
Contributor

hoelzro commented Oct 25, 2018

Looking at q.toArray() from @olivernn's example, I see the following output:

[ '*oo',
  '*foo',
  'oo',
  'ofo',
  'f*o',
  'f*oo',
  'fo',
  'fo*',
  'fo*o',
  'foo' ]

Would I be incorrect in thinking that foo* should be in there as well? The presence of fo*o explains why fooo is in the intersection but food is not.
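
For comparison, here is a rough sketch of what a complete edit-distance-1 expansion of `foo` could look like when the trailing insertion is included. `expandOnce` is a hypothetical function written for this comment; lunr builds a TokenSet automaton rather than a flat list, so this only illustrates which patterns one would expect, not how the library produces them.

```javascript
// Hypothetical sketch: every pattern within one edit of `term`,
// using `*` as a wildcard for exactly one character.
function expandOnce(term) {
  const patterns = new Set([term])
  for (let i = 0; i <= term.length; i++) {
    const head = term.slice(0, i)
    const tail = term.slice(i)
    // Insertion before position i; i === term.length is the trailing case
    patterns.add(head + '*' + tail)
    if (i < term.length) {
      patterns.add(head + tail.slice(1))        // deletion of term[i]
      patterns.add(head + '*' + tail.slice(1))  // substitution of term[i]
    }
    if (i < term.length - 1) {
      // Transposition of term[i] and term[i + 1]
      patterns.add(head + tail[1] + tail[0] + tail.slice(2))
    }
  }
  return [...patterns]
}

const expansion = expandOnce('foo')
const hasTrailingInsertion = expansion.includes('foo*') // true
```

This sketch yields the ten entries listed above plus `foo*`, which is the pattern that would let `food` match.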

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 26, 2018
Fixes GH olivernn#375

Before, insertions were not made at the end of a fuzzy string for
token sets
@hoelzro
Contributor

hoelzro commented Oct 26, 2018

I've created a PR at #382 that I believe fixes this issue.

olivernn pushed a commit that referenced this issue Oct 29, 2018
Fixes GH #375

Before, insertions were not made at the end of a fuzzy string for
token sets
@olivernn
Owner

I've just pushed 2.3.5, which includes the fix from @hoelzro.
