Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable parquet filter pushdown by default #3463

Open
5 of 6 tasks
Tracked by #3462
alamb opened this issue Sep 13, 2022 · 13 comments
Open
5 of 6 tasks
Tracked by #3462

Enable parquet filter pushdown by default #3463

alamb opened this issue Sep 13, 2022 · 13 comments
Assignees

Comments

@alamb
Copy link
Contributor

alamb commented Sep 13, 2022

In #3380 @thinkharderdev added support for evaluating filters during the parquet scan via the RowIndex mechanism 馃帀

This feature is currently enabled via a feature flag, which is disabled by default.

This ticket tracks enabling this feature by default.

Currently known items are:

@alamb alamb self-assigned this Oct 13, 2022
@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2022

I plan to review the parquet test coverage over the next day or two

My basic plan is to:

  1. Make a draft PR that enables pushdown by default and see if any regression tests fail
  2. Do some manual testing using datafusion-cli and some internal datasets

@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2022

The PR that enables the feature is #3828 and it looks quite promising

Also, I have a proposed config change that I would like to get in, #3822, which would allow users to quickly disable this feature if they found issues

@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2022

Update here: I am working on some fuzz testing for the predicate pushdown code

@alamb
Copy link
Contributor Author

alamb commented Oct 28, 2022

I found a bug in pushdown by writing a test #4005

@alamb
Copy link
Contributor Author

alamb commented Oct 28, 2022

I found another one with my test. I will keep the list on this ticket updated

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2022

I think we are getting close

@alamb
Copy link
Contributor Author

alamb commented Nov 2, 2022

Update here is I think once we get #3976 merged I'll put up the PR to enable the feature by default

@Dandandan
Copy link
Contributor

@alamb What's the remaining work that needs to be done to enable it by default? Anything I can do to get this over the finish line?

@alamb
Copy link
Contributor Author

alamb commented Feb 5, 2023

@Dandandan 鉂わ笍 thank you

I know of two major items:

  1. Enable page index by default Enable parquet page level skipping (page index pruning) by default聽#4085
  2. Ensure that enabling filter pushdown by default doesn't cause performance regressions in cases where the filter isn't very selective. I know @tustvold had some thoughts in this area.

For item 1, I tried a few times, but am currently blocked by #5104 and haven't had time to return to it

For item 2, I think someone needs to look at some benchmark results (perhaps TPCH) and figure out what we can do to avoid the regression (or measure and determine it isn't significant)

I keep hoping to have time to work on item 1, as I am pretty sure I know what the problem is, but other things keep coming up :(

I haven't had a chance to work on 2 yet

@alamb
Copy link
Contributor Author

alamb commented Mar 2, 2023

Update here is I am working on a larger benchmarking story, part of which would give us more confidence to merge changes like this in. I hope to have that done early next week

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2023

With the introduction of bench.sh with tpch and clickbench (#6994) I think we are now in a better position to do this experiment again and fix any performance regressions

@tustvold
Copy link
Contributor

apache/arrow-rs#5523 might help mitigate the impact of pushing down predicates that turn out to not be very selective

@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2024

It would be really nice to (finally) be able to turn this optimization on -- thank you @tustvold

cc @Dandandan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants