feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This pr improves the merging performance by adding min/max filters to the early filter.
The number of files scanned from the target file table is reduced by using the table statistics.
I have extended the early filter for this purpose. This filter is responsible for pre-filtering the target table.
Previously, the early filter only consisted of partition columns by filtering for all unique values from the source. Now the non-partition columns are also used by aggregating the min/max values from the source and adding a between expression to the early filter.
It is also automatically part of the conflict detection based on the predicate.
I added a property
extended_early_filter
to make this advanced filtering optional. I don't know if this is important, and maybe we can replace the bool with an enum. What do you think about this?Example:
Merge into table t with partition date
Predicate: source.date = target.date and source.timestamp = target.timestamp and source.id = target.id and frob > 42
Early filter before:
date = '2024-‚05-14' and frob > 42
Early filter now:
date = '2024-05-14' and timestamp BETWEEN '…15:00' AND '…15:05' and id BETWEEN 'A' AND 'B' and frob > 42