
Add aggregation bucket limit #1363

Merged: 10 commits merged into main from refactor_aggregation on Jun 23, 2022

Conversation

@PSeitz (Contributor) commented on May 10, 2022:

Change SegmentCollector.collect to return a Result.

Validation happens in different phases, depending on the aggregation:

Term: during segment collection
Histogram: at the end, when converting into intermediate buckets (we preallocate empty buckets for the range); revisit after #1370
Range: when validating the request

Closes #1331
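
As a rough sketch of the direction this PR takes (not the actual tantivy code; the type names, error type, and limit value below are invented), a fallible collect can count buckets and abort the query once a limit is exceeded:

use std::collections::HashMap;

// Illustrative limit only; not tantivy's actual default.
const MAX_BUCKETS: usize = 65_000;

struct TermBucketCollector {
    // term ordinal -> doc count
    buckets: HashMap<u64, u64>,
}

impl TermBucketCollector {
    // Fallible collect: returning Err aborts the whole query.
    fn collect(&mut self, term_ord: u64) -> Result<(), String> {
        *self.buckets.entry(term_ord).or_insert(0) += 1;
        if self.buckets.len() > MAX_BUCKETS {
            return Err(format!(
                "aborted aggregation: bucket limit of {MAX_BUCKETS} exceeded"
            ));
        }
        Ok(())
    }
}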

@codecov-commenter commented on May 11, 2022:

Codecov Report

Merging #1363 (2e2822f) into main (cbd06ab) will increase coverage by 0.01%.
The diff coverage is 99.15%.

@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
+ Coverage   94.29%   94.31%   +0.01%     
==========================================
  Files         234      236       +2     
  Lines       42769    43524     +755     
==========================================
+ Hits        40331    41050     +719     
- Misses       2438     2474      +36     
Impacted Files Coverage Δ
src/aggregation/agg_result.rs 73.46% <33.33%> (-19.79%) ⬇️
src/aggregation/segment_agg_result.rs 91.50% <97.14%> (+0.53%) ⬆️
src/aggregation/agg_req_with_accessor.rs 94.36% <100.00%> (+0.66%) ⬆️
src/aggregation/bucket/histogram/histogram.rs 99.60% <100.00%> (+<0.01%) ⬆️
src/aggregation/bucket/range.rs 96.31% <100.00%> (+0.07%) ⬆️
src/aggregation/bucket/term_agg.rs 99.10% <100.00%> (+0.03%) ⬆️
src/aggregation/collector.rs 100.00% <100.00%> (ø)
src/aggregation/intermediate_agg_result.rs 98.24% <100.00%> (+0.65%) ⬆️
src/aggregation/metric/stats.rs 97.20% <100.00%> (ø)
src/aggregation/mod.rs 98.77% <100.00%> (+<0.01%) ⬆️
... and 51 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


update CHANGELOG
@PSeitz PSeitz changed the title from "refactor aggregations" to "Add aggregation bucket limit" on May 12, 2022
@PSeitz PSeitz requested a review from fulmicoton on May 12, 2022 04:28
@PSeitz PSeitz force-pushed the refactor_aggregation branch 3 times, most recently from f83dcb7 to 11792a0 on May 13, 2022 05:17
// Per-doc stats collection: fold the doc's fast field value into the running stats.
let value = self.fast_field_reader.get(doc) as f64;
self.stats.count += 1;
self.stats.sum += value;
self.stats.squared_sum += value * value;
Ok(())
A Collaborator commented on this code:

Can you run the bench before and after this change to check that it does not impact collector perf?
(Top + Count on "the" term query)

The MultiCollector is likely to be the trickiest because of the dynamic dispatch. Unfortunately, it is not part of the benchmark.

@PSeitz (Contributor, author) replied:

Yep, but in my benches so far, adding Result didn't add overhead, except in very tight and simple loops.

E.g. here let value = self.fast_field_reader.get(doc) as f64; probably considerably outweighs the Result overhead.

@fulmicoton (Collaborator) left a review:

I'm scared of the tantivy API change (Collector::collect returning a Result).

If I understand correctly, the goal is to detect that we have reached the collection limit and abort the query.

An alternative could be to detect the excess of buckets, keep collecting but avoid adding extra buckets, and return an error when harvesting the segment.
Would that be unreasonable?
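
For illustration, a minimal sketch of that alternative, with all names invented: collect stays infallible, the collector simply stops adding buckets past the limit and remembers that it did, and the error only surfaces when the segment is harvested:

use std::collections::HashMap;

struct BucketCollector {
    buckets: HashMap<u64, u64>,
    limit: usize,
    limit_exceeded: bool,
}

impl BucketCollector {
    // Infallible collect: past the limit, new buckets are silently dropped
    // and the overflow is only recorded.
    fn collect(&mut self, key: u64) {
        if !self.buckets.contains_key(&key) && self.buckets.len() >= self.limit {
            self.limit_exceeded = true;
            return;
        }
        *self.buckets.entry(key).or_insert(0) += 1;
    }

    // The error is reported when harvesting the segment results.
    fn harvest(self) -> Result<HashMap<u64, u64>, String> {
        if self.limit_exceeded {
            return Err("bucket limit exceeded during segment collection".to_string());
        }
        Ok(self.buckets)
    }
}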

@PSeitz (Contributor, author) commented on May 17, 2022:

> I'm scared of the tantivy API change (Collector::collect returning a Result).
>
> If I understand correctly, the goal is to detect that we have reached the collection limit and abort the query.

Scared because of performance?

I checked the performance, and the "the" TOP10_COUNT query is indeed slower (TOP_10 and COUNT alone are the same speed, though). I introduced a change I had been considering for some time, which is to collect hits and pass them on as a block. I need to add a benchmark for the MultiCollector; I think it would benefit more than average from this. Aggregations may also benefit.

In the current change, collect_block is implemented quite simply, just forwarding to the underlying function, but with manual unrolling we could probably gain some more performance.

Orig           With Result          With Result + collect_block
1,632 μs       1,847 μs (+13.2 %)   1,664 μs (+2.3 %)
336,574 docs   336,574 docs         336,574 docs

Something currently untested that may help is to always pass collect_block a fixed-size array of e.g. 64 docs.

Currently we always pass the score, but a mechanism to request docs without scores could yield some gains in certain scenarios.
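
A minimal sketch of the collect_block idea (placeholder names, not tantivy's exact API): the trait gains a provided method that forwards to collect, and a concrete collector can override it with a tighter loop over the block:

type DocId = u32;
type Score = f32;

trait BlockSegmentCollector {
    fn collect(&mut self, doc: DocId, score: Score) -> Result<(), String>;

    // Provided method: the default just forwards doc by doc. Overriding it
    // lets a collector amortize per-call overhead across the whole block.
    fn collect_block(&mut self, docs: &[(DocId, Score)]) -> Result<(), String> {
        for &(doc, score) in docs {
            self.collect(doc, score)?;
        }
        Ok(())
    }
}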

> An alternative could be to detect the excess of buckets, and keep on the collection but avoid adding extra bucket, and return an error when harvesting the segment. Would that be unreasonable?

I thought about that, but returning a Result in collect is the right approach imo, since during segment collection we may also want to read from sources other than infallible in-memory ones.

@fulmicoton (Collaborator) replied:

> Scared because of performance?

Performance and misuse, yes. I prefer the collect method not to return a Result.

Maybe a bool would do the trick?

COUNT and TOP_10 have a special code path. They are irrelevant here.

@PSeitz (Contributor, author) commented on May 17, 2022:

> Performance and misuse yes. I prefer the collect method to not return a Result.

Misuse how? Every tantivy user should know how to handle a Result.

> Maybe a bool would do the trick?

Like a C-style API, where on false you collect the error somewhere? That's likely faster, but also pretty ugly.
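
For comparison, a hypothetical sketch of such a bool-style API (all names invented): collect returns false to signal abort, and the actual error is stashed on the collector for later retrieval:

struct BoolStyleCollector {
    num_buckets: usize,
    limit: usize,
    error: Option<String>,
}

impl BoolStyleCollector {
    // Returns false when collection should stop; the caller must then
    // fetch the error from self.error, C-style.
    fn collect(&mut self, _doc: u32) -> bool {
        self.num_buckets += 1;
        if self.num_buckets > self.limit {
            self.error = Some("bucket limit exceeded".to_string());
            return false;
        }
        true
    }
}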

I added a bench with the MultiCollector (Count + Top10). It is faster with the collect_block approach.

Orig           With Result          collect_block
               2,275 μs (+26.1 %)   1,804 μs
               336,574 docs         336,574 docs

collect_block has other potential performance gains: e.g. for TopN we could check upfront whether the best score in a block would make it into the TopN at all, and if not, skip the whole block. That would probably work pretty well for large result sets, once the TopN is saturated.
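
A hypothetical sketch of that block-skipping idea (invented names; it assumes the collector tracks the worst score currently held in its top-N):

// Skip an entire (doc, score) block if even its best score cannot beat
// the worst score already held in the top-N.
fn collect_block_topn(
    topn_threshold: f32,          // worst score currently in the top-N
    block: &[(u32, f32)],         // (doc id, score) pairs
    push: &mut dyn FnMut(u32, f32),
) {
    let best = block.iter().map(|&(_, score)| score).fold(f32::MIN, f32::max);
    if best <= topn_threshold {
        return; // nothing in this block can enter the top-N
    }
    for &(doc, score) in block {
        if score > topn_threshold {
            push(doc, score);
        }
    }
}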

add collect_block in segment_collector to handle groups of documents as performance optimization
add collect_block for MultiCollector
CHANGELOG.md review thread (outdated, resolved)
@@ -153,7 +154,7 @@ impl SegmentRangeCollector {
     ) -> crate::Result<IntermediateBucketResult> {
         let field_type = self.field_type;

-        let buckets = self
+        let buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry> = self
A Collaborator commented:

that's super helpful! thanks

@PSeitz PSeitz merged commit 6ca5f77 into main Jun 23, 2022
@PSeitz PSeitz deleted the refactor_aggregation branch June 23, 2022 02:27
Linked issue: Aggregation Max Bucket Limit (#1331)