Lunr throws in Safari sometimes when calling the query method #279

Open
imaustink opened this issue Jun 30, 2017 · 37 comments

@imaustink

imaustink commented Jun 30, 2017

When calling the .query method with certain search terms in Safari, the following error is thrown:

TypeError: undefined is not an object (evaluating 'posting._index') 

The problem is here. Adding a check for posting solves the problem, but I am almost certain this is a sign of a larger problem that should be fixed upstream.

We found that removing the wildcard option from all .term calls fixes the issue, but search results suffer.

We are aware of this being reported here: #276 (comment). We are working to create a reduced test case.
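
The stop-gap check described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not Lunr's actual code: `collectPostings`, the plain-object `invertedIndex`, and the posting shape are hypothetical stand-ins for the lookup that happens inside the query method.

```javascript
// Hypothetical, simplified stand-in for the term lookup inside a query.
// `invertedIndex` maps a term to its posting; `expandedTerms` are the
// terms produced by wildcard expansion.
function collectPostings(invertedIndex, expandedTerms) {
  var found = []

  for (var i = 0; i < expandedTerms.length; i++) {
    var term = expandedTerms[i]
    var posting = invertedIndex[term]

    // The guard: an unknown (or corrupted) term has no posting, so skip it
    // instead of reading `posting._index` from undefined.
    if (posting === undefined) continue

    found.push({ term: term, index: posting._index })
  }

  return found
}

// A prototype-less object avoids accidental hits on inherited keys.
var invertedIndex = Object.create(null)
invertedIndex['can-compon'] = { _index: 0 }

// 'can-centa' is not in the index, so it is skipped rather than throwing.
var results = collectPostings(invertedIndex, ['can-compon', 'can-centa'])
```

A guard like this only hides the symptom: the expanded term that has no posting is still wrong, which is why it is treated here as a temporary measure.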

@chasenlehara

chasenlehara commented Jul 3, 2017

@olivernn, I work with @imaustink and we’d like to contribute to get this bug fixed, but I’m not sure where to start in determining why either the invertedIndex wouldn’t have the expandedTerm for the query, or why a bad expandedTerm would be generated.

Our project is open source and we are happy to provide any info that’s useful, or if you can point us in the right direction we will happily work on it. Thanks for the great project and for your help. 🙏

chasenlehara added a commit to bit-docs/lunr.js that referenced this issue Jul 3, 2017
@chasenlehara

We’re using a fork with this commit (reference above) until we can figure out the root issue.

@olivernn
Owner

olivernn commented Jul 3, 2017

@imaustink @chasenlehara thanks for taking the time to help with this issue, it's a real help!

I think the first step is to get a reduced test case. Without that it's going to be really tricky to track this down. Even a non-reduced test case would be a start.

From the looks of it, it seems the bug is being triggered by some specific combination of query and document content. Why this would only happen in Safari I don't know, there isn't anything particularly special happening.

@chasenlehara

What does the ideal reduced test case look like? A sample project that has data, sets up indexing and querying, and reproduces the exception? Or should we look into modifying any of the current tests?

@olivernn
Owner

olivernn commented Jul 3, 2017

In the past people have put together reproductions with jsfiddle. The ideal would be having an index with a single document and a query that triggers the bug.

An example fiddle for some inspiration - https://jsfiddle.net/of54k0uk/14/

@Julinar

Julinar commented Jul 13, 2017

EDIT: not sure if this is related, but in case it helps others, I fixed it by adding a condition on the keywords:

    listkeywords.forEach(function (keyword) {
        if (keyword !== '') { // adding this check fixes the bug
            q.term(keyword, {
                fields: ['title',
                         'title_broken_space_bar',
                         'keywords',
                         'keywords_why_no_space_bar',
                         'categories',
                         'categoriesnoaccents'],
                wildcard: lunr.Query.wildcard.TRAILING
            });
        }
    });

Hello, in case it helps: this is not limited to Safari. I can reproduce the error in React Native on Android too.

    searchMerchant(keywords) {
        // let found = this.idx.search(keywords)
        if (keywords == '') return {}
        let listkeywords = keywords.split(' ')
        let found = this.idx.query(function (q) {
            q.term(keywords, { usePipeline: true, boost: 30 })

            listkeywords.forEach(function (keyword) {
                q.term(keyword, {
                    fields: ['title',
                             'title_broken_space_bar',
                             'keywords',
                             'keywords_why_no_space_bar',
                             'categories',
                             'categoriesnoaccents'],
                    wildcard: lunr.Query.wildcard.TRAILING
                });
            });
        });
    }

I generate the index if it is not in the cache, and load it otherwise. The error seems to be triggered only when the index is freshly generated; when it is loaded from cache with lunr.Index.load(), the error never seems to occur.
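
One reason empty keywords can reach the query in code like the above is that splitting on a single space yields empty strings whenever the input has leading, trailing, or repeated spaces. A small sketch of filtering them out before building the query follows; `toKeywords` is a hypothetical helper name, not part of Lunr.

```javascript
// Split a raw search string into non-empty keywords.
// Splitting on runs of whitespace still leaves '' at the edges when the
// input starts or ends with whitespace, so empty entries are filtered out.
function toKeywords(input) {
  return input.split(/\s+/).filter(function (keyword) {
    return keyword !== ''
  })
}

var keywordList = toKeywords('  coastal   arctic ')
var emptyList = toKeywords('')
```

Each remaining keyword could then be passed to `q.term` with a trailing wildcard, as in the snippet above, without ever issuing a wildcard search for an empty string.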

@olivernn
Owner

@Julinar interesting, so your workaround was to prevent empty searches? Was this what was causing the issue?

As I mentioned earlier in the issue, without more details to go on, specifically a simple reproduction, it's very difficult for me to try and provide a fix. If you can put together a simple test case that shows the issue I can get to fixing it straight away.

@imaustink
Author

@chasenlehara and I were able to create a reduced test case. Open this jsfiddle in Safari and you will see the error. I hope this helps.

@olivernn
Owner

olivernn commented Sep 1, 2017

@imaustink thanks, that is super useful!

I've only had a very quick look, but certainly something strange is happening! You are searching for "can*", and when I pause the debugger on the thrown exception I see that Lunr has somehow expanded that into "c�a�n�-�c�enta", which is weird because that definitely isn't in the index.

So somehow this.tokenSet.intersect(termTokenSet).toArray() is inventing new terms, and only in Safari.

I'll do some more digging and let you know what I find.

@olivernn
Owner

olivernn commented Sep 1, 2017

Hmm, where did that unicode come from? I pasted that directly from the debugger, so maybe that is the cause of the weirdness....

@olivernn
Owner

olivernn commented Sep 3, 2017

So I've been trying to understand what is going on in Safari using the debugger, if I pause on exceptions I can see that the expanded terms that are being looked for are ["can-compon", "c�a�n�-�c�enta"], however if I step through the lunr.TokenSet#intersect method the result is what I would consider the correct expanded terms: ["can-compon", "can-componenta"].

This might be tricky...

@olivernn
Owner

olivernn commented Sep 3, 2017

I've tried to reduce the test case some more (no doubt it can be reduced further later) and I'm very confused - https://jsfiddle.net/23fujmvf/1/

By alerting the results of the lunr.TokenSet#intersect the problem goes away, i.e. it seems to change the string that is being returned.

Without the alert we get something that looks like "can-centa" which looks like it has dropped the "ompon" part of "can-componenta".

Perhaps alert is somehow normalising the string (remember those weird unicode symbols from above), but I find this very unexpected.

@imaustink
Author

@olivernn, thanks for jumping on this so quickly!

I did some experimentation as well before making the test case, and I also noticed the unicode characters, and the fact that observing the string seems to fix the problem. Merely testing the string with RegEx is one example of this, comparing the string is another example. In addition, I just discovered that even inspecting a copy of the string seems to fix the problem.

This behavior is very unexpected indeed. This definitely seems like a bug in Safari rather than Lunr. Although I am curious how Lunr seems to be the only library having issues with this currently. Perhaps there is a very obscure bug that Lunr is somehow invoking? Please let me know if there is anything I can do to help with this issue. 🍻

@olivernn
Owner

olivernn commented Sep 3, 2017

@imaustink no problem, it's certainly an interesting case!

I'm pretty sure this is a bug in Safari. Not that that helps us much: Lunr will need to work around it somehow, and I don't think using alert is going to cut it!

To be able to get any traction with a browser bug we're going to have to have a much reduced test case, ideally one that does not involve any Lunr code at all. My current theory is that it is an issue with how Lunr is getting characters from strings, and how it is then putting those characters back together again to form a string. Perhaps it isn't correctly handling multi-byte characters? That is just a guess though.

I think the interesting thing is how something like alert 'fixes' the issue. My understanding would be that alert does not modify its parameters, indeed, strings in JavaScript are immutable, so how does it fix the issue? Is it normalising unicode or something? Maybe this would give us a clue as to what is going on.

I'm going to carry on poking around with the debugger, but please let me know if you come up with something too.

@olivernn
Owner

olivernn commented Sep 3, 2017

One thing you could try is to use something similar to lunr-unicode-normalizer. That repo is not updated for Lunr 2, but the idea is the same: remove all diacritic marks. Not a fix, but it might be a reasonable workaround for now.

@olivernn
Owner

olivernn commented Sep 3, 2017

Okay, this is definitely an issue with unicode, I'm not sure where the issue is, but using String#normalize seems to fix the problem - https://jsfiddle.net/23fujmvf/2/

I don't think String#normalize has good enough browser support yet to just use that in Lunr, though I could be wrong. Also, it's still weird that this is required in Safari, but not Chrome or Firefox.

@imaustink
Author

Very interesting. String.prototype.normalize is supported in Safari 10 and up. What do you think about identifying Safari and normalizing if available? I don't see a reason to normalize in other browsers, so support really shouldn't be a concern in my opinion, as long as supporting Safari 10+ is acceptable.
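
The "normalize if available" idea can be sketched with plain feature detection. This is only an illustration: `normalizeIfSupported` is a hypothetical helper, the choice of the NFC form is an assumption (the thread does not say which form the fiddle used), and no Safari detection is attempted here.

```javascript
// Normalize a term when the runtime supports String.prototype.normalize,
// otherwise return it unchanged.
function normalizeIfSupported(term) {
  if (typeof String.prototype.normalize === 'function') {
    return term.normalize('NFC') // NFC is an assumption, see above
  }
  return term
}

// 'e' followed by a combining acute accent (two code units)...
var decomposed = 'cafe\u0301'
// ...becomes the single precomposed character U+00E9 where supported.
var normalized = normalizeIfSupported(decomposed)
```

In a runtime without `normalize` the term passes through untouched, so the helper is safe to call unconditionally.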

@olivernn
Owner

olivernn commented Sep 4, 2017

Ok, I think the problem is being caused by lunr.trimmer. It tries to remove any trailing punctuation from terms before they enter the index, but it does so with a fairly naive regular expression. To be fair, it does say that it's not great at non-Latin characters, but it's certainly not something that would be immediately obvious. By removing the trimmer the test case behaves as expected.

This is probably the cause in this case, since one of the terms has some trailing, non-Latin, characters. My guess is that it is mangling the final 'character' by removing half of the code point that makes up the full character. Without any other test cases it's difficult to say if this is always the cause.
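
To make the "naive regular expression" concrete, here is a sketch that assumes the trimmer amounts to stripping leading and trailing `\W` runs. `naiveTrim` is a stand-in name for illustration, not the library function. Without the `u` flag, `\W` matches everything outside `[A-Za-z0-9_]`, so accented and non-Latin letters are treated like punctuation.

```javascript
// Stand-in for a \W-based trimmer (assumed behaviour, see lead-in).
function naiveTrim(term) {
  return term.replace(/^\W+/, '').replace(/\W+$/, '')
}

var plain = naiveTrim('hello!')           // trailing punctuation removed
var accented = naiveTrim('caf\u00e9')     // the trailing e-acute is removed too
var cyrillic = naiveTrim('\u043f\u0440\u0438\u0432\u0435\u0442') // nothing is left
```

Here `plain` is `'hello'`, `accented` is `'caf'`, and `cyrillic` is the empty string, which shows how terms ending in non-Latin characters can be altered before they reach the index.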

So, as an immediate fix, the trimmer can be removed from the builder pipeline:

lunr(function () {
  this.pipeline.remove(lunr.trimmer)
})

The existing implementation could possibly be made more robust against these cases, I'll have to think about the best way to implement that.

In addition, Safari is certainly doing something unexpected (as far as I'm concerned). It's now probably easier to produce a test case that doesn't directly involve Lunr, I'll update here when I've done that.

@olivernn
Owner

olivernn commented Sep 6, 2017

I've been doing some thinking about this and I think a way forward is to improve the quality of the implementation of lunr.tokenizer. Its job is to turn some text into individual words or tokens. The fact that lunr.trimmer even exists suggests some inadequacy in the implementation of lunr.tokenizer. More generally, unicode is hard, and texts with non-Latin characters are not really well supported by the current approach.

I think an approach based on UAX#29 is probably more robust, though the implementation details are certainly more involved. I'm going to experiment with writing a tokeniser using the rules in the above document, I want to see how much better it is able to deal with these cases, as well as understanding what, if any, impact there is on performance (both speed and library size).
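
For a rough feel of UAX #29-style word segmentation without writing the rules by hand, newer runtimes expose `Intl.Segmenter`, whose word granularity is, in common engines, based on those rules (the specification leaves details to the implementation). The sketch below is not the tokeniser proposed in this thread: `tokenizeWords` is a hypothetical name, the API postdates this discussion, and the whitespace fallback is only there so the snippet runs where `Intl.Segmenter` is missing.

```javascript
// Split text into word-like tokens using Intl.Segmenter when available.
function tokenizeWords(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    // Fallback for runtimes without Intl.Segmenter: plain whitespace split.
    return text.split(/\s+/).filter(Boolean)
  }

  var segmenter = new Intl.Segmenter(locale || 'en', { granularity: 'word' })
  var tokens = []

  // Each segment reports whether it is word-like; spaces and punctuation
  // are not, so they are dropped here.
  for (var part of segmenter.segment(text)) {
    if (part.isWordLike) tokens.push(part.segment)
  }

  return tokens
}

var tokens = tokenizeWords('coastal arctic food web')
```

Comparing the output of such a segmenter with the current tokenizer on non-Latin text would be one way to gauge how much a rule-based approach helps, at the cost of depending on runtime support.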

In the meantime I'm still interested in seeing if this bug can be isolated enough to show to the Safari developers, as its current behaviour is still weird to me.

@escofield

I was faced with the same issue and removing the trimmer from the pipeline solved my error as well.

@fbennett

fbennett commented Feb 2, 2018

We found that cloning the clause for each term in index.js cleared this error for us. I'm not sure if that is related to, or less or more drastic than, removing trimmer, but I'll tie it into this thread for completeness. The issue that we experienced and the patch are at #327.

@nbuonin

nbuonin commented Mar 1, 2018

I'm seeing the same bug on a site I'm working on. I've tried both workarounds mentioned above, removing the trimmer and cloning the clause for each term, but neither worked.

One kludgey thing that did work was to console.log out each expandedTerm here:

var expandedTerm = expandedTerms[j],

In my case this was 3000+ terms. And it only works when I log every term. If I set a conditional to log only the index and term that throws the bug, then it breaks. If I log an arbitrary string it still breaks.

The index of expanded terms that causes the bug seems to change. In my brief testing it was words that began with 'w' and the strings did not contain non-latin chars.

Lastly, the search itself seems off. Searching the words: "coastal arctic food web" returns the correct item as the first result in Chrome, but it doesn't appear when searched in Safari.
Correction: I was wrong on this point. I had slightly different search terms, which threw off the comparison. The results are largely the same, though there are a few dupes.

Talking through this with a coworker, he thought this suggested some kind of race condition - perhaps the logging and evaluating the array index slowed things down just enough to sort itself out.

nbuonin added a commit to PoLAR-Hub/polarhub that referenced this issue Mar 2, 2018
LunrJS has some undefined behavior in Safari, documented here:
olivernn/lunr.js#279

This workaround console-logs out the terms which seems to slow down things
enough for it to work, suggesting a race condition somewhere. As this is
a bug with either Safari or LunrJS, we'll have to wait for them to sort
it out.
@dawez

dawez commented Jun 8, 2018

Is there any update on this issue? I also noticed that search is broken for me in Safari.

coreyward added a commit to coreyward/lunr.js that referenced this issue Jul 18, 2018
@lucaong
Contributor

lucaong commented Jul 20, 2018

Just to add more to this, there is something curious about the "mysterious" fixes to this issue.
It does not really matter what one does with the expanded term: just calling any method defined on String.prototype (even without actually using the result) fixes this bug:

https://jsfiddle.net/egLzL24L/40/ (see the .charAt(0) on line 17. Removing it causes the bug to re-appear)

Interestingly, calling methods higher in the prototype chain (like on Object.prototype), does not work. It seems like Safari "casts" the string to its correct representation only when a string method is first called on it.

It would be interesting to understand the underlying cause of this, but the observation offers a possible viable solution for the moment: just call an inexpensive string method on the expanded term before retrieving it from the inverted index.

lucaong added a commit to lucaong/lunr.js that referenced this issue Jul 20, 2018
Calling any method defined on String.prototype on the expanded term
seems to force the string to be properly represented, fixing an issue
affecting Safari users.

See olivernn#279
@lucaong
Contributor

lucaong commented Jul 20, 2018

I think I managed to pinpoint the exact place in lunr.TokenSet where the string gets sometimes corrupted in Safari:

https://jsfiddle.net/egLzL24L/77/ (see line 38 and relevant comment)

Which corresponds to this line in the repo:

prefix: frame.prefix.concat(edge),

It seems that the string concatenation (no matter if done with .concat or with +) results in the corruption, which is fixed by calling any String.prototype method on the string. The two strings that get concatenated look fine, only the result is corrupted. In fact, calling a string method on the edge variable does not fix this: the method needs to be called on the result of the concatenation.

As for why this happens, and what exactly triggers it, I have no clue. I would be inclined to think that some underlying memory optimization of string concatenation is buggy in Safari in some corner case.
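
The pattern described above, building each word by concatenating edge characters onto a prefix while walking a trie with an explicit stack, can be sketched as follows. This is a simplified stand-alone illustration, not Lunr's code: `makeNode`, `insertWord`, and `trieToArray` are hypothetical names, and the node shape is an assumption. The discarded `charAt(0)` call marks where the workaround would sit.

```javascript
// Hypothetical trie node: `final` marks the end of a word,
// `edges` maps a single character to the next node.
function makeNode() {
  return { final: false, edges: {} }
}

function insertWord(root, word) {
  var node = root
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i)
    if (!node.edges[ch]) node.edges[ch] = makeNode()
    node = node.edges[ch]
  }
  node.final = true
}

// Depth-first walk with an explicit stack, collecting every stored word.
function trieToArray(root) {
  var words = []
  var stack = [{ prefix: '', node: root }]

  while (stack.length) {
    var frame = stack.pop()

    if (frame.node.final) {
      // Workaround position: call any String.prototype method on the
      // concatenated prefix and discard the result before storing it.
      frame.prefix.charAt(0)
      words.push(frame.prefix)
    }

    var edges = Object.keys(frame.node.edges)
    for (var i = 0; i < edges.length; i++) {
      var edge = edges[i]
      stack.push({
        prefix: frame.prefix.concat(edge), // the concatenation in question
        node: frame.node.edges[edge]
      })
    }
  }

  return words
}

var root = makeNode()
insertWord(root, 'can-compon')
insertWord(root, 'can-componenta')
var words = trieToArray(root).sort()
```

Placing the call where a finished word is pushed, rather than at every concatenation, keeps the number of extra calls down to one per result.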

lucaong added a commit to lucaong/lunr.js that referenced this issue Jul 20, 2018
it turns out that a specific string concatenation in
TokenSet.prototype.toArray sometimes results in a corrupted string in
Safari. It is fixed by calling any String.prototype method on it, which
forces the string to the correct representation.

The previous commit did the same, but this commit moves the fix closer
to the source of the problem. It could be applied exactly at the point
of the problematic concatenation, but that would result in some
unnecessary (if inexpensive) calls, so it is instead when pushing each
result string in the returned array.
@lucaong
Contributor

lucaong commented Jul 20, 2018

and here is the smallest script where I can reproduce the bug.

https://jsfiddle.net/egLzL24L/156/

It seems like it's a combination of the trimmer RegExp, a trailing non-word, a Unicode character in a higher block than Latin-1 Supplement (so unicode of at least 2 bytes), and string concatenation.

@lucaong
Contributor

lucaong commented Jul 21, 2018

And here the bug is reproduced without Lunr, just a short snippet of code similar to the way Lunr builds a TokenSet and then turns it into an array.

https://jsfiddle.net/8zn2fj6s/18/

Note that if you copy/paste the output of the alert when the bug occurs, and inspect its binary content with xxd, there are null bytes (00) that corrupt the string. Can anybody get it even smaller than this? I think we could file a bug with Safari.
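
As a scripted alternative to inspecting pasted output with xxd, the positions of NUL characters in a string can be listed directly. `findNullChars` is a hypothetical helper for examining captured output. Given the observations in this thread, calling a string method inside an affected browser may itself change what is observed, so this is best applied to text copied out of the browser.

```javascript
// Return the indexes of all U+0000 characters in a string.
function findNullChars(str) {
  var positions = []
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) === 0) positions.push(i)
  }
  return positions
}

// A sample string with NUL characters between letters.
var corrupted = findNullChars('c\u0000a\u0000n')
var clean = findNullChars('can-componenta')
```

For the sample above `corrupted` is `[1, 3]`, while `clean` is an empty array.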

@lucaong
Contributor

lucaong commented Jul 21, 2018

Ok, this seems the minimum script that reproduces it. No Lunr code nor any complex data structure is involved:

https://jsfiddle.net/DukeLeNoir/mkrfw4g8/

@lucaong
Contributor

lucaong commented Jul 21, 2018

I took the liberty of filing a bug report on Safari, as this is now clearly a browser bug and not a Lunr issue.

@olivernn
Owner

@lucaong nice work! Are you able to share the link to the Safari bug report?

I think that this is really a bug in the way that Lunr handles (or doesn't) unicode. In both lunr.TokenSet and lunr.tokenizer lunr attempts to get a 'character' from a string, either with String#charAt or String#[].

My understanding (mostly from reading this article) is that neither of those methods handle unicode characters particularly well.

Safari is certainly doing something different to other browsers here, maybe it's a bug. The workaround certainly suggests something odd is happening.

I think your workaround is probably the right approach for now, I'll take a look and get a release out in the next day or so. A more long term fix is to make both lunr.tokenizer and lunr.TokenSet less naive when it comes to getting characters from strings.
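
The limitation of String#charAt and String#[] mentioned above is that both index UTF-16 code units, not whole characters. A short sketch of the difference, using a character outside the Basic Multilingual Plane (the sample string is only an illustration):

```javascript
// 'a' followed by U+1F600, which is stored as a surrogate pair.
var text = 'a\uD83D\uDE00'

// Walking by index yields three pieces: 'a' and two lone surrogates.
var byCharAt = []
for (var i = 0; i < text.length; i++) {
  byCharAt.push(text.charAt(i))
}

// The string iterator (used by Array.from) walks by code point instead,
// yielding two pieces: 'a' and the full character.
var byCodePoint = Array.from(text)
```

Here `byCharAt.length` is 3 and `byCodePoint.length` is 2. Iterating by code point still does not keep combining sequences together, so it is only a partial step towards character-aware handling.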

@lucaong
Contributor

lucaong commented Jul 23, 2018

@olivernn I filed the bug at https://bugreport.apple.com, it has number 42468541 but does not seem to be publicly visible. No response so far, but I will keep you updated.

You're right, unicode handling is mostly lacking in JS, but that should not lead to a corrupted string. Besides, the null characters could be the native string terminator leaking out, so there's the risk that someone more skilled than me could devise a way to leverage this bug to expose more memory content.

Regarding the fix, on the good side it should not affect performance nor change the Lunr behavior in any way. It is quite surprising though, so it might be worth writing a test for the toArray method to avoid regressions when this part of code is changed.

By the way, thanks for Lunr, it's really an amazing library!

@chasenlehara

@lucaong It might be worth reporting a WebKit bug so we have some visibility into their process.

Thank you so much for taking the time to debug this issue more!

olivernn pushed a commit that referenced this issue Jul 24, 2018
* Fix issue #279 (bug with Safari)

Calling any method defined on String.prototype on the expanded term
seems to force the string to be properly represented, fixing an issue
affecting Safari users.

See #279

* fix issue #279 at the source, on TokenSet.prototype.toArray

it turns out that a specific string concatenation in
TokenSet.prototype.toArray sometimes results in a corrupted string in
Safari. It is fixed by calling any String.prototype method on it, which
forces the string to the correct representation.

The previous commit did the same, but this commit moves the fix closer
to the source of the problem. It could be applied exactly at the point
of the problematic concatenation, but that would result in some
unnecessary (if inexpensive) calls, so it is instead when pushing each
result string in the returned array.

* remove changes to generated distribution file
@olivernn
Owner

The patch from @lucaong is now on master, so if anyone wants to try out the bleeding edge they can. A proper release will follow shortly.

@lucaong
Contributor

lucaong commented Jul 24, 2018

@chasenlehara thanks for the link, I filed this bug there: https://bugs.webkit.org/show_bug.cgi?id=187947

@olivernn
Owner

I've just pushed 2.3.1 to npm which includes the patch from @lucaong.

I'm going to leave this issue open for now until I get around to updating the tokeniser and token store to be more aware of unicode.

@hftf

hftf commented Aug 26, 2019

With lunr 2.3.6 and the trimmer removed from the pipeline, I still encounter this issue sporadically.

The patch in #361 may have been insufficient or too localized. Maybe it should be guaranteed that an undefined posting doesn't get past that line?

I see that the issue has already been investigated more thoroughly than I can meaningfully contribute to. Anecdotally, it often happens when the term is empty (or a stopword), but I saw it happen with *m* several times too. I also use lunr-unicode-normalizer (monkey-patched for 2.x) for the rare document with Unicode text.

@brutuscat

Apparently the bug reported by @lucaong has been fixed in https://trac.webkit.org/changeset/255975/webkit and the fix now ships in Safari Technology Preview: https://webkit.org/blog/10031/release-notes-for-safari-technology-preview-101/
