Write a blog about parquet predicate pushdown #3464

Closed
Tracked by #3462 ...
alamb opened this issue Sep 13, 2022 · 4 comments · Fixed by apache/arrow-site#280

Comments

@alamb
Contributor

alamb commented Sep 13, 2022

I think it would be super valuable to write a blog post about all the work from @thinkharderdev, @Ted-Jiang, @tustvold, and others to make reading from parquet in DataFusion very fast.

I have gathered a list of items on #3462 which will perhaps spark some thoughts / ideas.

@tustvold
Contributor

I made a bit of a start on collecting some data for this. In particular I created something to allow generating parquet files for use in some test benchmarks here.

The basic idea was to show the performance of a selection of relatively simple queries across datafusion-cli and compare it to some other systems like duckdb, trino, polars, spark, etc... Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

We could also potentially run benchmarks with various forms of pushdown disabled, to quantify the impact of those changes, or against older versions of the parquet reader, to quantify the performance impact of things like dictionary preservation.

@alamb
Contributor Author

alamb commented Sep 13, 2022

> Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

I agree -- this would be a great start

@alamb
Contributor Author

alamb commented Nov 30, 2022

We have a draft of this post ready here: apache/arrow-site#280

@alamb
Contributor Author

alamb commented Dec 7, 2022

For a variety of reasons we posted this on the Influxdata site first: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/ -- very cool stuff
