Lunr throws in Safari sometimes when calling the query method #279

imaustink · 2017-06-30T21:27:22Z

When calling the .query method with certain search terms in Safari, the following error is thrown:

TypeError: undefined is not an object (evaluating 'posting._index')

The problem is here. Adding a check for posting solves the problem, but I am almost certain this is a sign of a larger problem that should be fixed upstream.

We found that removing the wildcard option from all .term calls fixes the issue, but search results suffer.

We are aware of this being reported here: #276 (comment). We are working to create a reduced test case.

chasenlehara · 2017-07-03T14:37:08Z

@olivernn, I work with @imaustink and we’d like to contribute to get this bug fixed, but I’m not sure where to start in determining why either the invertedIndex wouldn’t have the expandedTerm for the query, or why a bad expandedTerm would be generated.

Our project is open source and we are happy to provide any info that’s useful, or if you can point us in the right direction we will happily work on it. Thanks for the great project and for your help. 🙏

Work around for olivernn#279

chasenlehara · 2017-07-03T16:48:54Z

We’re using a fork with this commit (reference above) until we can figure out the root issue.

olivernn · 2017-07-03T17:18:00Z

@imaustink @chasenlehara thanks for taking the time to help with this issue, it's a real help!

I think the first step is to get a reduced test case. Without that it's going to be really tricky to track this down. Even a non-reduced test case would be a start.

From the looks of it, the bug is being triggered by some specific combination of query and document content. Why this would only happen in Safari I don't know; there isn't anything particularly special happening.

chasenlehara · 2017-07-03T17:28:33Z

What does the ideal reduced test case look like? A sample project that has data, sets up indexing and querying, and reproduces the exception? Or should we look into modifying any of the current tests?

olivernn · 2017-07-03T17:34:00Z

In the past people have put together reproductions with jsfiddle. The ideal would be having an index with a single document and a query that triggers the bug.

An example fiddle for some inspiration - https://jsfiddle.net/of54k0uk/14/

Julinar · 2017-07-13T09:50:04Z

EDIT: not sure if this is related, but in case it helps other people, I fixed it by adding a condition on keywords:

    listkeywords.forEach(function (keyword) {
        if (keyword != '') { // adding this check fixes the bug
            q.term(keyword, {
                fields: ['title',
                         'title_broken_space_bar',
                         'keywords',
                         'keywords_why_no_space_bar',
                         'categories',
                         'categoriesnoaccents'],
                wildcard: lunr.Query.wildcard.TRAILING
            });
        }
    });

Hello, in case it helps: it's not only Safari, I can reproduce the error in React Native on Android too.

    searchMerchant(keywords) {
        //let found = this.idx.search(keywords)
        if (keywords == '') return {}
        let listkeywords = keywords.split(' ')
        let found = this.idx.query(function (q) {
            q.term(keywords, { usePipeline: true, boost: 30 })

            listkeywords.forEach(function (keyword) {
                q.term(keyword, {
                    fields: ['title',
                             'title_broken_space_bar',
                             'keywords',
                             'keywords_why_no_space_bar',
                             'categories',
                             'categoriesnoaccents'],
                    wildcard: lunr.Query.wildcard.TRAILING
                });
            });
        });
    }

I generate the index, or load it if it is already in the cache. The error seems to trigger only when the index is freshly generated; when it is loaded from the cache with lunr.Index.load() the error never seems to trigger.
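The empty-keyword guard above can also be applied at the point where the input is split. This is only a sketch, assuming space-separated input; `splitKeywords` is a hypothetical helper and not part of Lunr. It shows that `String#split` yields empty strings for repeated or trailing spaces, which is how an empty term can reach `q.term`.

```javascript
// Hypothetical helper (not part of Lunr): split user input into keywords,
// dropping the empty strings that String#split produces for leading,
// trailing or repeated spaces.
function splitKeywords(input) {
  return input.split(' ').filter(function (keyword) {
    return keyword !== ''
  })
}

console.log('a  b '.split(' '))      // [ 'a', '', 'b', '' ]
console.log(splitKeywords('a  b '))  // [ 'a', 'b' ]
```

Whether empty terms are the actual trigger of the exception is not established in this thread; the filter only removes one suspected input.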

olivernn · 2017-07-13T18:50:20Z

@Julinar interesting, so your workaround was to prevent empty searches? Was this what was causing the issue?

As I mentioned earlier in the issue, without more details to go on, specifically a simple reproduction, it's very difficult for me to try and provide a fix. If you can put together a simple test case that shows the issue I can get to fixing it straight away.

imaustink · 2017-08-31T20:29:28Z

@chasenlehara and I were able to create a reduced test case. Open this jsfiddle in Safari and you will see the error. I hope this helps.

olivernn · 2017-09-01T06:59:32Z

@imaustink thanks, that is super useful!

I've only had a very quick look, but certainly something strange is happening! You are searching for "can*", and when I pause the debugger on the thrown exception I see that Lunr has somehow expanded that into "c�a�n�-�c�enta", which is weird because that definitely isn't in the index.

So somehow this.tokenSet.intersect(termTokenSet).toArray() is inventing new terms, and only in Safari.

I'll do some more digging and let you know what I find.

olivernn · 2017-09-01T07:00:11Z

Hmm, where did that unicode come from? I pasted that directly from the debugger, so maybe that is the cause of the weirdness....

olivernn · 2017-09-03T10:06:08Z

So I've been trying to understand what is going on in Safari using the debugger, if I pause on exceptions I can see that the expanded terms that are being looked for are ["can-compon", "c�a�n�-�c�enta"], however if I step through the lunr.TokenSet#intersect method the result is what I would consider the correct expanded terms: ["can-compon", "can-componenta"].

This might be tricky...

olivernn · 2017-09-03T10:40:02Z

I've tried to reduce the test case some more (no doubt it can be reduced further later) and I'm very confused - https://jsfiddle.net/23fujmvf/1/

By alerting the results of the lunr.TokenSet#intersect the problem goes away, i.e. it seems to change the string that is being returned.

Without the alert we get something that looks like "can-centa" which looks like it has dropped the "ompon" part of "can-componenta".

Perhaps alert is somehow normalising the string (remember those weird unicode symbols from above), but I find this very unexpected.

imaustink · 2017-09-03T15:21:10Z

@olivernn, thanks for jumping on this so quickly!

I did some experimentation as well before making the test case, and I also noticed the unicode characters, and the fact that observing the string seems to fix the problem. Merely testing the string with RegEx is one example of this, comparing the string is another example. In addition, I just discovered that even inspecting a copy of the string seems to fix the problem.

This behavior is very unexpected indeed. This definitely seems like a bug in Safari rather than Lunr. Although I am curious how Lunr seems to be the only library having issues with this currently. Perhaps there is a very obscure bug that Lunr is somehow invoking? Please let me know if there is anything I can do to help with this issue. 🍻

olivernn · 2017-09-03T17:18:11Z

@imaustink no problem, it's certainly an interesting case!

I'm pretty sure this is a bug in Safari, not that that helps us much. Lunr will need to work around it somehow and I don't think using alert is going to cut it!

To be able to get any traction with a browser bug we're going to have to have a much reduced test case, ideally one that does not involve any Lunr code at all. My current theory is that it is an issue with how Lunr is getting characters from strings, and how it is then putting those characters back together again to form a string, perhaps it isn't correctly handling multi byte characters? That is just a guess though.

I think the interesting thing is how something like alert 'fixes' the issue. My understanding would be that alert does not modify its parameters; indeed, strings in JavaScript are immutable, so how does it fix the issue? Is it normalising unicode or something? Maybe this would give us a clue as to what is going on.

I'm going to carry on poking around with the debugger, but please let me know if you come up with something too.

olivernn · 2017-09-03T17:31:55Z

One thing you could try is to use something similar to lunr-unicode-normalizer. That repo is not updated for Lunr 2, but the idea is the same, remove all diacritic marks. Not a fix, but might be a reasonable work around for now.

olivernn · 2017-09-03T17:55:36Z

Okay, this is definitely an issue with unicode, I'm not sure where the issue is, but using String#normalize seems to fix the problem - https://jsfiddle.net/23fujmvf/2/

I don't think String#normalize has good enough browser support yet to just use that in Lunr, though I could be wrong. Also, it's still weird that this is required in Safari, but not Chrome or Firefox.

imaustink · 2017-09-04T00:42:20Z

Very Interesting. String.prototype.normalize is supported in Safari 10 and up. What do you think about identifying Safari and normalizing if available? I don't see a reason to normalize in other browsers so support really shouldn't be a concern in my opinion. That is as long as supporting Safari 10+ is acceptable.
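The "normalize if available" idea can be sketched with feature detection rather than browser sniffing. This is a minimal sketch under assumptions: `normalizeIfAvailable` is a hypothetical helper, and the choice of the NFC form is an assumption, since the thread does not say which form the fiddle used.

```javascript
// Hypothetical helper (not part of Lunr): normalise to NFC only where
// String#normalize exists, otherwise return the input unchanged.
function normalizeIfAvailable(str) {
  return typeof String.prototype.normalize === 'function'
    ? str.normalize('NFC')
    : str
}

var decomposed = 'e\u0301' // "e" followed by a combining acute accent
var composed = normalizeIfAvailable(decomposed)

// Where normalize is supported, the two code units collapse into U+00E9.
console.log(decomposed.length, composed.length)
```

Feature detection covers engines without `String.prototype.normalize`, but it normalizes everywhere the method exists, not only in Safari.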

olivernn · 2017-09-04T18:45:03Z

Ok, I think the problem is being caused by lunr.trimmer. It tries to remove any trailing punctuation from terms before they enter the index, but it does so with a fairly naive regular expression. To be fair, it does say that it's not great at non-latin characters, but it's certainly not something that would be immediately obvious. By removing the trimmer the test case behaves as expected.

This is probably the cause in this case, since one of the terms has some trailing, non-latin, characters. My guess is that it is mangling the final 'character' by removing half of the code point that makes up the full character. Without any other test cases it's difficult to say if this is always the cause.
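To illustrate the kind of regular expression in question, the sketch below assumes a trimmer that strips leading and trailing `\W` runs. `naiveTrim` is a hypothetical name and the exact pattern Lunr uses may differ. Without the `u` flag, `\W` matches anything outside `[A-Za-z0-9_]`, so non-Latin letters are treated like punctuation.

```javascript
// Hypothetical stand-in for a \W-based trimmer; not Lunr's actual code.
function naiveTrim(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '')
}

console.log(naiveTrim('hello!'))            // "hello"
console.log(naiveTrim('caf\u00e9'))         // "caf"  (the trailing é is removed)
console.log(naiveTrim('\u65e5\u672c\u8a9e')) // ""    (every character is non-ASCII)
```

This shows the trimmer altering terms that end in non-Latin characters; it does not by itself reproduce the Safari corruption.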

So, as an immediate fix, the trimmer can be removed from the builder pipeline:

lunr(function () {
  this.pipeline.remove(lunr.trimmer)
})

The existing implementation could possibly be made more robust against these cases, I'll have to think about the best way to implement that.
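One possible direction, offered only as a sketch and not as what Lunr ships, is to define "punctuation" with Unicode property escapes so that letters, digits and combining marks in any script survive trimming. `unicodeTrim` is a hypothetical name, and the pattern requires an engine that supports ES2018 property escapes with the `u` flag.

```javascript
// Hypothetical Unicode-aware trimmer (not Lunr's implementation).
// Keeps letters (\p{L}), numbers (\p{N}), combining marks (\p{M}) and "_";
// strips anything else from both ends of the token.
var LEADING = /^[^\p{L}\p{N}\p{M}_]+/u
var TRAILING = /[^\p{L}\p{N}\p{M}_]+$/u

function unicodeTrim(token) {
  return token.replace(LEADING, '').replace(TRAILING, '')
}

console.log(unicodeTrim('caf\u00e9!'))                    // "café"
console.log(unicodeTrim('\u00ab\u65e5\u672c\u8a9e\u00bb')) // "日本語"
```

The trade-off is engine support: older browsers will throw a syntax error on these patterns, so they would need a fallback.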

In addition, Safari is certainly doing something unexpected (as far as I'm concerned). It's now probably easier to produce a test case that doesn't directly involve Lunr, I'll update here when I've done that.

olivernn · 2017-09-06T18:20:12Z

I've been doing some thinking about this and I think an approach forward is to improve the quality of the implementation of lunr.tokenizer. Its job is to turn some text into individual words or tokens. The fact that lunr.trimmer even exists suggests some inadequacy in the implementation of lunr.tokenizer. More generally, unicode is hard, and texts with non latin characters are not really well supported by the current approach.

I think an approach based on UAX#29 is probably more robust, though the implementation details are certainly more involved. I'm going to experiment with writing a tokeniser using the rules in the above document, I want to see how much better it is able to deal with these cases, as well as understanding what, if any, impact there is on performance (both speed and library size).
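As an illustration of UAX#29-style word segmentation, engines that provide `Intl.Segmenter` expose word boundaries that, in current implementations, are based on those rules. The following is a minimal sketch that assumes `Intl.Segmenter` is available; `segmentWords` is a hypothetical function, not Lunr's tokenizer, and lowercasing is an assumption about what a search tokenizer would want.

```javascript
// Hypothetical tokenizer sketch built on Intl.Segmenter (not part of Lunr).
function segmentWords(text, locale) {
  var segmenter = new Intl.Segmenter(locale || 'en', { granularity: 'word' })
  var tokens = []
  for (var part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments,
    // so no separate trimming step is needed.
    if (part.isWordLike) tokens.push(part.segment.toLowerCase())
  }
  return tokens
}

console.log(segmentWords('Hello, w\u00f6rld!'))
```

With this approach punctuation never becomes part of a token, which is the role the trimmer currently plays after the fact.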

In the meantime I'm still interested in seeing if this bug can be isolated enough to show to the Safari developers, as its current behaviour is still weird to me.

escofield · 2018-01-31T20:35:08Z

I was faced with the same issue, and removing the trimmer from the pipeline solved my error as well.

fbennett · 2018-02-02T06:53:56Z

We found that cloning the clause for each term in index.js cleared this error for us. I'm not sure if that is related to, or less or more drastic than, removing trimmer, but I'll tie it into this thread for completeness. The issue that we experienced and the patch are at #327.

nbuonin · 2018-03-01T18:16:23Z

I'm seeing the same bug on a site I'm working on. I've tried both workarounds mentioned above (removing the trimmer and cloning the clause for each term), but neither worked.

One kludgey thing that did work was to console.log out each expandedTerm here:

lunr.js/lib/index.js

Line 181 in fd5dccd

var expandedTerm = expandedTerms[j],

In my case this was 3000+ terms. And it only works when I log every term. If I set a conditional to log only the index and term that throws the bug, then it breaks. If I log an arbitrary string it still breaks.

The index of expanded terms that causes the bug seems to change. In my brief testing it was words that began with 'w' and the strings did not contain non-latin chars.

~~Lastly, the search itself seems off. Searching the words: "coastal arctic food web" returns the correct item as the first result in Chrome, but it doesn't appear when searched in Safari.~~
Correction: I was wrong on this point. I had slightly different search terms, which threw off the comparison. The results are largely the same, though there are a few dupes.

Talking through this with a coworker, he thought this suggested some kind of race condition - perhaps the logging and evaluating the array index slowed things down just enough to sort itself out.

LunrJS has some undefined behavior in Safari, documented here: olivernn/lunr.js#279. This workaround console-logs the terms, which seems to slow things down enough for it to work, suggesting a race condition somewhere. As this is a bug with either Safari or LunrJS, we'll have to wait for them to sort it out.

dawez · 2018-06-08T11:15:21Z

Is there any update on this issue? I also noticed that I have broken search on safari.

lucaong · 2018-07-20T10:02:50Z

Just to add more to this, there is something curious about the "mysterious" fixes to this issue.
It does not really matter what one does with the expanded term: just calling any method defined on String.prototype (even without actually using the result) fixes this bug:

https://jsfiddle.net/egLzL24L/40/ (see the .charAt(0) on line 17. Removing it causes the bug to re-appear)

Interestingly, calling methods higher in the prototype chain (like on Object.prototype), does not work. It seems like Safari "casts" the string to its correct representation only when a string method is first called on it.

It would be interesting to understand the underlying cause of this, but the observation offers a possible viable solution for the moment: just call an inexpensive string method on the expanded term before retrieving it from the inverted index.
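The suggestion above can be sketched as follows. This is an illustration only: `lookupPosting` is a hypothetical function, and `invertedIndex` is a plain object standing in for Lunr's inverted index. The `charAt` call has no effect in a correct engine; the thread only reports that it makes the symptom disappear in affected Safari versions.

```javascript
// Hypothetical lookup helper; invertedIndex is a plain object standing in
// for Lunr's inverted index.
function lookupPosting(invertedIndex, expandedTerm) {
  // Workaround described above: call any String.prototype method on the
  // term and discard the result before using the term as a key.
  expandedTerm.charAt(0)
  return invertedIndex[expandedTerm]
}

var index = { 'can-compon': { _index: 0 } }
console.log(lookupPosting(index, 'can-compon')) // { _index: 0 }
```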

Calling any method defined on String.prototype on the expanded term seems to force the string to be properly represented, fixing an issue affecting Safari users. See olivernn#279

lucaong · 2018-07-20T12:40:28Z

I think I managed to pinpoint the exact place in lunr.TokenSet where the string gets sometimes corrupted in Safari:

https://jsfiddle.net/egLzL24L/77/ (see line 38 and relevant comment)

Which corresponds to this line in the repo:

lunr.js/lib/token_set.js

Line 309 in f9aeea2

prefix: frame.prefix.concat(edge),

It seems that the string concatenation (no matter if done with .concat or with +) results in the corruption, which is fixed by calling any String.prototype method on the string. The two strings that get concatenated look fine, only the result is corrupted. In fact, calling a string method on the edge variable does not fix this: the method needs to be called on the result of the concatenation.

As for why this happens, and what exactly triggers it, I have no clue. I would be inclined to think that some underlying memory optimization of string concatenation is buggy in Safari in some corner case.

it turns out that a specific string concatenation in TokenSet.prototype.toArray sometimes results in a corrupted string in Safari. It is fixed by calling any String.prototype method on it, which forces the string to the correct representation. The previous commit did the same, but this commit moves the fix closer to the source of the problem. It could be applied exactly at the point of the problematic concatenation, but that would result in some unnecessary (if inexpensive) calls, so it is instead when pushing each result string in the returned array.
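The placement described in this commit message can be sketched on a simplified trie. This is not the code from token_set.js: the node shape `{ final, edges }` and the standalone `toArray` function are assumptions made for illustration. It shows prefixes being built by concatenation during a stack-based traversal, with the string method called once per result rather than once per concatenation.

```javascript
// Simplified sketch; assumed node shape: { final: boolean, edges: { char: node } }
function toArray(root) {
  var words = []
  var stack = [{ prefix: '', node: root }]

  while (stack.length > 0) {
    var frame = stack.pop()

    if (frame.node.final) {
      // Workaround applied when a result is pushed: call a String.prototype
      // method on the concatenated string and discard the result.
      frame.prefix.charAt(0)
      words.push(frame.prefix)
    }

    Object.keys(frame.node.edges).forEach(function (edge) {
      stack.push({
        prefix: frame.prefix.concat(edge), // the concatenation in question
        node: frame.node.edges[edge]
      })
    })
  }

  return words
}

var leaf = function () { return { final: true, edges: {} } }
var trie = { final: false, edges: { c: { final: false, edges: {
  a: { final: false, edges: { n: leaf(), t: leaf() } } } } } }

console.log(toArray(trie).sort()) // [ 'can', 'cat' ]
```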

lucaong · 2018-07-20T16:48:52Z

and here is the smallest script where I can reproduce the bug.

https://jsfiddle.net/egLzL24L/156/

It seems like it's a combination of the trimmer RegExp, a trailing non-word, a Unicode character in a higher block than Latin-1 Supplement (so unicode of at least 2 bytes), and string concatenation.

lucaong · 2018-07-21T19:47:59Z

And here the bug is reproduced without Lunr, just a short snippet of code similar to the way Lunr builds a TokenSet and then turns it into an array.

https://jsfiddle.net/8zn2fj6s/18/

Note that if you copy/paste the output of the alert when the bug occurs, and inspect its binary content with xxd, there are null bytes (00) that corrupt the string. Can anybody get it even smaller than this? I think we could file a bug with Safari.
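A similar inspection can be done without leaving JavaScript by dumping the UTF-16 code units of a string. This is a sketch for debugging purposes; `codeUnitsHex` is a hypothetical helper. Note that, according to the observations earlier in this thread, calling string methods on an affected string in Safari may itself change what is observed.

```javascript
// Hypothetical debugging helper: list each UTF-16 code unit as 4 hex digits,
// so stray null characters (0000) become visible.
function codeUnitsHex(str) {
  var units = []
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'))
  }
  return units
}

console.log(codeUnitsHex('can'))      // [ '0063', '0061', '006e' ]
console.log(codeUnitsHex('c\u0000a')) // [ '0063', '0000', '0061' ]
```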

lucaong · 2018-07-21T20:43:54Z

Ok, this seems the minimum script that reproduces it. No Lunr code nor any complex data structure is involved:

https://jsfiddle.net/DukeLeNoir/mkrfw4g8/

lucaong · 2018-07-21T20:57:41Z

I took the liberty to file a bug report on Safari, as this is now clearly a browser bug and not a Lunr issue.

olivernn · 2018-07-22T19:47:15Z

@lucaong nice work! Are you able to share the link to the Safari bug report?

I think that this is really a bug in the way that Lunr handles (or doesn't) unicode. In both lunr.TokenSet and lunr.tokenizer lunr attempts to get a 'character' from a string, either with String#charAt or String#[].

My understanding (mostly from reading this article) is that neither of those methods handle unicode characters particularly well.
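A short illustration of that limitation: `String#charAt` and index access work on UTF-16 code units, so a character outside the Basic Multilingual Plane is returned as two separate halves, while string iteration goes by code point. This only demonstrates the general behaviour; whether it contributes to this bug is not established here. `codePoints` is a hypothetical helper.

```javascript
// "a" followed by U+1F600, which needs two UTF-16 code units (a surrogate pair)
var text = 'a\uD83D\uDE00'

console.log(text.length)                 // 3
console.log(text.charAt(1) === '\uD83D') // true: a lone high surrogate
console.log(text[2] === '\uDE00')        // true: a lone low surrogate

// Hypothetical helper: split a string by code point instead of code unit.
function codePoints(str) {
  return Array.from(str)
}

console.log(codePoints(text).length)     // 2
```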

Safari is certainly doing something different to other browsers here, maybe it's a bug. The workaround certainly suggests something odd is happening.

I think your workaround is probably the right approach for now, I'll take a look and get a release out in the next day or so. A more long term fix is to make both lunr.tokenizer and lunr.TokenSet less naive when it comes to getting characters from strings.

lucaong · 2018-07-23T08:46:08Z

@olivernn I filed the bug at https://bugreport.apple.com, it has number 42468541 but does not seem to be publicly visible. No response so far, but I will keep you updated.

You're right, unicode handling is mostly lacking in JS, but that should not lead to a corrupted string. Besides, the null characters could be the native string terminator leaking out, so there's the risk that someone more skilled than me could devise a way to leverage this bug to expose more memory content.

Regarding the fix, on the good side it should not affect performance nor change the Lunr behavior in any way. It is quite surprising though, so it might be worth writing a test for the toArray method to avoid regressions when this part of code is changed.

By the way, thanks for Lunr, it's really an amazing library!

chasenlehara · 2018-07-23T20:00:08Z

@lucaong It might be worth reporting a WebKit bug so we have some visibility into their process.

Thank you so much for taking the time to debug this issue more!

* Fix issue #279 (bug with Safari)

  Calling any method defined on String.prototype on the expanded term seems to force the string to be properly represented, fixing an issue affecting Safari users. See #279

* fix issue #279 at the source, on TokenSet.prototype.toArray

  it turns out that a specific string concatenation in TokenSet.prototype.toArray sometimes results in a corrupted string in Safari. It is fixed by calling any String.prototype method on it, which forces the string to the correct representation. The previous commit did the same, but this commit moves the fix closer to the source of the problem. It could be applied exactly at the point of the problematic concatenation, but that would result in some unnecessary (if inexpensive) calls, so it is instead when pushing each result string in the returned array.

* remove changes to generated distribution file

olivernn · 2018-07-24T07:03:51Z

The patch from @lucaong is now on master, so if anyone wants to try out the bleeding edge they can. A proper release will follow shortly.

lucaong · 2018-07-24T08:20:07Z

@chasenlehara thanks for the link, I filed this bug there: https://bugs.webkit.org/show_bug.cgi?id=187947

olivernn · 2018-07-24T16:56:30Z

I've just pushed 2.3.1 to npm which includes the patch from @lucaong.

I'm going to leave this issue open for now until I get around to updating the tokeniser and token store to be more aware of unicode.

hftf · 2019-08-26T08:09:34Z

With lunr 2.3.6 and the trimmer removed from the pipeline, I still encounter this issue sporadically.

The patch in #361 may have been insufficient or too localized. Maybe it should be guaranteed that an undefined posting doesn't get past that line?
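Such a guarantee could look like the sketch below. It is an illustration under assumptions: `collectPostings` is a hypothetical function, and `invertedIndex` is a plain object standing in for Lunr's inverted index. Terms without a posting are skipped instead of being dereferenced.

```javascript
// Hypothetical guarded lookup; invertedIndex is a plain object standing in
// for Lunr's inverted index.
function collectPostings(invertedIndex, expandedTerms) {
  var found = []
  for (var j = 0; j < expandedTerms.length; j++) {
    var posting = invertedIndex[expandedTerms[j]]
    // Skip terms that are not in the index instead of reading
    // a property of undefined.
    if (posting === undefined) continue
    found.push(posting)
  }
  return found
}

var index = { 'can-compon': { _index: 0 } }
console.log(collectPostings(index, ['can-compon', 'c\u0000an']).length) // 1
```

A guard like this avoids the exception but silently drops the affected term, so it hides the symptom rather than addressing the cause.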

I see that the issue was already thoroughly investigated more than I can meaningfully contribute to. Anecdotally, it often happens when the term is empty (or stopwords), but I saw it happen with *m* several times too. I also use lunr-unicode-normalizer (monkey-patched for 2.x) for the rare document with Unicode text.

brutuscat · 2020-04-19T15:00:00Z

Apparently the bug reported by @lucaong has been fixed in https://trac.webkit.org/changeset/255975/webkit and the fix is now in Safari Technology Preview: https://webkit.org/blog/10031/release-notes-for-safari-technology-preview-101/

chasenlehara added a commit to bit-docs/lunr.js that referenced this issue Jul 3, 2017

Add check to avoid uncaught exception in Safari

edcdc14

Work around for olivernn#279

chasenlehara mentioned this issue Jul 19, 2017

Don’t use our fork of Lunr for search canjs/bit-docs-html-canjs#385

Open

2 tasks

olivernn mentioned this issue Oct 9, 2017

Uncaught TypeError: Cannot read property 'tf' of undefined #243

Open

olivernn mentioned this issue Nov 22, 2017

wildcards search sometimes gives error "TypeError: Cannot read property '_index' of undefined" #314

Closed

nbuonin mentioned this issue Mar 2, 2018

PMT #114394: Fix search in Safari PoLAR-Hub/polarhub#38

Merged

aecorredor mentioned this issue Apr 20, 2018

Search breaks in react-native release build but not in debug build #340

Open

coreyward added a commit to coreyward/lunr.js that referenced this issue Jul 18, 2018

Normalize unicode to avoid index corruption; fixes issue olivernn#279

6244890

lucaong mentioned this issue Jul 20, 2018

Fix issue #279 (bug with Safari) #361

Merged

mortenpi mentioned this issue Aug 1, 2018

Offline documentation search error: Number of results: loading... JuliaDocs/Documenter.jl#743

Closed

max-ci mentioned this issue Nov 14, 2018

Searching for 'search' returns no results in Safari on Mac and iPhone squidfunk/mkdocs-material#915

Closed
