Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolves #1682: Support auto-complete via querying main Lucene directory #1683

Merged
merged 14 commits into from May 25, 2022

Conversation

alecgrieser
Copy link
Contributor

This modifies the auto-complete cursor to do a query over the main Lucene directory and then search the original records to recover the text.

This resolves #1682.

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: fb8d0db
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: fb8d0db
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: e3eb142
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: e3eb142
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

…ucene directory

This modifies the auto-complete cursor so that it reads from the main Lucene index and then reconstitutes the original text by reading from the base record data.
@alecgrieser alecgrieser marked this pull request as ready for review May 19, 2022 20:49
@@ -1153,7 +1144,6 @@ void testAutoCompleteSearchForPhrase() throws Exception {
"states united as a country",
"states have been united as a country",
"all the states united as a country",
"all the states have been united as a country",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text is from a document that is then updated with a new document. The old behavior was to return both versions of the text, whereas the new behavior only retrieves the text from the current version of the document.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I am getting why this test data is not needed anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at addIndexAndSaveRecordsForAutoCompleteOfPhrase (that is, the test-data saving API), there are these two lines:

recordStore.saveRecord(createSimpleDocument(1630L, "all the states have been united as a country", 1));
recordStore.saveRecord(createSimpleDocument(1630L, "united states is a country in the continent of america", 1));

Note that they save data with the same record ID (1630L) but with different texts. In the old approach, both results would be presented because auto-complete didn't handle deletes, whereas in the new approach, the second save causes the first document to be deleted from the index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alecgrieser sorry that was a mistake. My original purpose for that was not to use the same record ID. That text was supposed to be a separate record to verify the phrase search for auto-complete. Could you please revert this change and change its recordID to 1631L?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Made that change


// Assert the corresponding field for the suggestions
List<String> fields = results.stream().map(i -> (String) i.getKey().get(i.getKeySize() - 2)).collect(Collectors.toList());
assertEquals(ImmutableList.of("text", "text", "text", "text", "text2", "text2"), fields);

// Assert the suggestions are sorted according to their values, which are determined by the position of the term into the indexed text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results are now returned in the same order as the original Lucene query, so these asserts sorting by score is no longer true. If we wanted to continue this, we would need to add some kind of sort step, which might actually be possible given that we have bounded results. But I haven't done that in this PR

.setShard(searchAfter.shardIndex)
.build();
return ByteArrayContinuation.fromNullable(continuation.toByteArray());
return LuceneCursorContinuation.fromScoreDoc(searchAfter);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I abstracted this kind of continuation into its own class. This change preserves the continuation value, but it means we will no longer need to serialize the proto to a byte array (if the continuation isn't needed)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 302db72
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 302db72
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 39fdb54
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 39fdb54
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 1989447
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 1989447
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: aaf522e
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: aaf522e
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: f2560b4
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: f2560b4
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

fieldQuery.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.MUST);
}
if (prefixToken != null) {
fieldQuery.add(new PrefixQuery(new Term(field, prefixToken)), BooleanClause.Occur.MUST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we already decided to give up the edge ngram for the tail case that is configurable? I recall the proposal was to still support that by having a system owned ngram field for each customer's field, which is a little complex though, especially when the customer's fields are tokenized with some specific analyzers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I don't know if having edge n-grams built into this is super doable... If this is executed against an n-gram index, it might be able to pick up string interiors, but that's currently a little untested.

}

@Nullable
private Query buildQueryForTermsMatching(@Nonnull Collection<String> fieldNames,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is minor, can you move this method to be next to the buildQueryForPhraseMatching? So it might be easier to read the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it around. I like the new placement better 👍

Set<String> fieldNames = new HashSet<>();
indexReader.leaves().forEach(leaf -> leaf.reader().getFieldInfos().forEach(fieldInfo -> fieldNames.add(fieldInfo.name)));
fieldNames.remove(LuceneIndexMaintainer.PRIMARY_KEY_FIELD_NAME);
fieldNames.remove(LuceneIndexMaintainer.PRIMARY_KEY_SEARCH_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to maintain a list of "system-owned" field names somewhere, so we won't forget to change here when we modify the list in future if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done. I added a set of "unsearchable system fields" that should be excluded from these kinds of searches and then used it here.

final Set<String> tokenSet = new HashSet<>(tokens);
Query finalQuery = phraseQueryNeeded
? buildQueryForPhraseMatching(fieldNames, tokens, prefixToken)
: buildQueryForTermsMatching(fieldNames, tokenSet, prefixToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you refactored the original code to only get the tokens in getQueryTokens and then build the finalQuery in separate methods, to avoid duplicate code. And you have to pass in a list for phrase query building and set for terms query building. This looks fine to me, but maybe it is worthy to have a comment for '''buildQueryForPhraseMatchingmethod to explain why it needs a List but thebuildQueryForTermsMatching``` uses Set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment to that affect. But yeah, you're correct, it is about the order of tokens (and whether they need to be duplicated)

// Not having the primary key is fine for auto-complete queries that just want the
// text, but queries wanting to do something with both the auto-completed text and the
// original record need to do something else
IndexEntry indexEntry = new IndexEntry(state.index, key, Tuple.from(scoreDoc.score));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it expected to use the doc's score rather than the concept "weights", and get rid of the logic to re-calculate the weights based on the positions of the matches in their texts and sort the results based on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the doc score is the best score we have, at this point. In terms of sorting, this version does not return results in that sort order, and adding it back would be somewhat doable, though we'd need some of the in-development work to resort values later because the record layer doesn't support in-memory sorting. So, I've left this returning results in their original order as returned by the query, but maybe that's wrong.


@SuppressWarnings("squid:S3776") // Cognitive complexity is too high. Candidate for later refactoring
@Nullable
private String searchAllMaybeHighlight(String text, Set<String> matchedTokens, @Nullable String prefixToken, boolean highlight) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is worth unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I added LuceneAutoCompleteResultCursorTest and put some unit tests of this functionality in here.

We may want to refactor the tokenize-and-search logic, at some point, because I could see us wanting to re-use this code somewhere else, but I didn't do that here.

}

if (matchedField == null) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make more sense to throw exception here rather than returning null. We need to understand why no matching fields are found from the matched doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add that logic, but actually, there is a case where we don't get a matched document. In particular, if the matched text is too big, then the document might show up in the initial query and then get filtered out by the lookup. (This is something I figured out when I ran the tests locally.)

In theory, we could try and do something like match first, then find the matches for each document, then filter out by size, but then we'd have to perform the filter on the largest text values, which seems like the CPU work we'd most like to avoid. I could still rework it do that if we thought that was valuable, though.

return directoryManager.getAutocompleteSuggester(groupingKey, autoCompleteIndexAnalyzerChooser.chooseAnalyzer(texts),
autoCompleteQueryAnalyzerChooser.chooseAnalyzer(texts), highlightForAutoCompleteIfEnabled, indexOptions);
private Analyzer getAutocompleteQueryAnalyzer(@Nonnull List<String> texts) {
return autoCompleteQueryAnalyzerChooser.chooseAnalyzer(texts).getAnalyzer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the auto-complete uses the same index as basic index now, not sure whether it still needs a separate query analyzer. I think this depends on the behavior we want to support and we can further discuss this offline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sounds good. I moved this to use the index's main analyzer, though there's some weirdness here regarding how synonyms are (or, in the current implementation, aren't) handled. There's also some stuff around getting the analyzer by type that could be cleaned up, though we could also push that to a follow up PR if this approach goes well. (I'm a little hesitant to do it here because it will affect the API of the analyzer registry, and if we decide to revert this PR, that would mean also reverting any changes to implementor of the analyzer registry, too.)

@@ -1153,7 +1144,6 @@ void testAutoCompleteSearchForPhrase() throws Exception {
"states united as a country",
"states have been united as a country",
"all the states united as a country",
"all the states have been united as a country",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I am getting why this test data is not needed anymore.

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 5170dfb
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 5170dfb
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 80b81f5
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 80b81f5
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)


IndexReader indexReader = getIndexReader();
Set<String> fieldNames = new HashSet<>();
indexReader.leaves().forEach(leaf -> leaf.reader().getFieldInfos().forEach(fieldInfo -> fieldNames.add(fieldInfo.name)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also omit field infos with IndexOptions.NONE? We don't have stored-only fields today, but there isn't any reason we couldn't that I can think of.

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 438019c
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 438019c
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

MMcM
MMcM previously approved these changes May 24, 2022
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 662edfc
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 662edfc
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@alecgrieser
Copy link
Contributor Author

We should be sure to squash this when merging, I think

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: 5813e47
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: 5813e47
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: f659f62
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: f659f62
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: d50ca45
  • Result: FAILED
  • Error: Error while executing command: ./gradlew --no-daemon --console=plain -b ./build.gradle build destructiveTest sonarqube -PcoreNotStrict -PreleaseBuild=false -PpublishBuild=false -PspotbugsEnableHtmlReport. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: d50ca45
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@sonarcloud
Copy link

sonarcloud bot commented May 25, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 17 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto2
  • Commit ID: c8624b0
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-record-layer-pr-proto3
  • Commit ID: c8624b0
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@alecgrieser alecgrieser merged commit 606bfb8 into FoundationDB:main May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support auto-complete via querying main Lucene directory
5 participants