feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513

JonasDev1 · 2024-05-14T14:36:01Z

Description

This PR improves merge performance by adding min/max filters to the early filter.
Using the table statistics reduces the number of files scanned from the target table.
I have extended the early filter for this purpose. This filter is responsible for pre-filtering the target table.
Previously, the early filter only covered partition columns, filtering on all unique values from the source. Now the non-partition columns in the predicate are used as well: their min/max values are aggregated from the source and added to the early filter as a BETWEEN expression.

Because conflict detection is based on the predicate, the extended filter automatically becomes part of it as well.

I added a property `extended_early_filter` to make this advanced filtering optional. I am not sure whether the option is needed, and the bool could also be replaced with an enum. What do you think?

Example:

Merge into table t with partition date

Predicate: source.date = target.date and source.timestamp = target.timestamp and source.id = target.id and frob > 42

Early filter before: date = '2024-05-14' and frob > 42
Early filter now: date = '2024-05-14' and timestamp BETWEEN '…15:00' AND '…15:05' and id BETWEEN 'A' AND 'B' and frob > 42
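
To illustrate the idea, here is a minimal sketch in plain Rust, not the code from this PR. The helper names (`min_max`, `between_clause`, `file_may_match`) are hypothetical and exist only for this example; the actual implementation builds the expression inside the merge operation and relies on the table's file statistics for pruning. The sketch assumes string-valued columns and shows how the min/max of the source values turn into a BETWEEN clause, and why a target file whose statistics range does not overlap the source range can be skipped.

```rust
/// Hypothetical helper: min and max of the source values for one column.
/// Returns None for an empty source, in which case no filter can be derived.
fn min_max<T: Ord + Clone>(values: &[T]) -> Option<(T, T)> {
    let min = values.iter().min()?.clone();
    let max = values.iter().max()?.clone();
    Some((min, max))
}

/// Hypothetical helper: render the BETWEEN clause that is added to the
/// early filter for a non-partition column.
fn between_clause(column: &str, values: &[&str]) -> Option<String> {
    let (lo, hi) = min_max(values)?;
    Some(format!("{column} BETWEEN '{lo}' AND '{hi}'"))
}

/// Hypothetical helper: a target file can only contain matching rows if its
/// [file_min, file_max] statistics range overlaps the source [src_min, src_max].
fn file_may_match(file_min: &str, file_max: &str, src_min: &str, src_max: &str) -> bool {
    file_min <= src_max && file_max >= src_min
}
```

With source ids `["B", "A", "A"]`, `between_clause("id", ...)` yields `id BETWEEN 'A' AND 'B'`, and a file whose id statistics span `'E'` to `'F'` would not need to be scanned.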

github-actions · 2024-05-14T14:36:22Z

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

JonasDev1 · 2024-05-14T14:39:04Z

#2411

ion-elgreco · 2024-05-17T23:04:30Z

@JonasDev1 why did you make the advanced filtering optional?

If this provides better performance across the board, we should always enable it (including for the Python bindings).

JonasDev1 · 2024-05-22T07:41:13Z

My concern was merges on columns that contain nulls: those would not work with the advanced filtering. But I think they would not work without it either, because equality on NULL is not defined in SQL.

Spark has an extra null-safe operator `<=>` for this, which is not yet available in DataFusion.
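
As a small illustration of the difference (a sketch only, modelling a nullable column as `Option<i64>`; the function names `sql_eq` and `null_safe_eq` are made up for this example and are not DataFusion or Spark APIs):

```rust
/// SQL-style `=`: the result is unknown (None) as soon as either side is NULL,
/// so a join predicate `source.id = target.id` never matches NULL keys.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// Null-safe equality in the style of Spark's `<=>`:
/// NULL compared with NULL is true, NULL compared with a value is false.
fn null_safe_eq(a: Option<i64>, b: Option<i64>) -> bool {
    a == b
}
```

Since SQL `MIN`/`MAX` aggregates skip NULLs, a min/max BETWEEN filter would exclude NULL rows as well, which is consistent with how the plain `=` predicate already behaves.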

JonasDev1 · 2024-05-24T13:41:16Z

What about the review?

I can of course also remove the flag again

Jonas Schmitz added 2 commits May 14, 2024 16:25

Add logic

dc169cd

Update name

f518d46

JonasDev1 requested review from wjones127, roeap and rtyler as code owners May 14, 2024 14:36

github-actions bot added the binding/rust Issues for the Rust crate label May 14, 2024

JonasDev1 changed the title ~~feat: Improve merge performance by using predicate non-partition columns min/max for prefiltering~~ feat: improve merge performance by using predicate non-partition columns min/max for prefiltering May 14, 2024

ion-elgreco requested a review from Blajda May 14, 2024 15:03

Merge branch 'main' into merge-non-partition-col-filtering

27db66a
