Write a blog about parquet predicate pushdown #3464

alamb · 2022-09-13T10:46:12Z

I think it would be super valuable to write a blog post about all the work from @thinkharderdev, @Ted-Jiang, @tustvold, and others to make reading from parquet in DataFusion very fast.

I have gathered a list of items on #3462 which will perhaps spark some thoughts / ideas.

tustvold · 2022-09-13T11:09:01Z

I made a bit of a start on collecting some data for this. In particular I created something to allow generating parquet files for use in some test benchmarks here.

The basic idea was to show the performance of a selection of relatively simple queries across datafusion-cli and compare it to some other systems like duckdb, trino, polars, spark, etc... Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

We could also potentially run benchmarks with various forms of pushdown disabled to quantify the impact of those changes, or against older versions of the parquet reader to quantify the performance impact of things like dictionary preservation.
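To make the "pushdown disabled vs. enabled" comparison concrete, here is a toy sketch in plain Python. It is not DataFusion or the parquet reader: `RowGroup` and `scan` are hypothetical names, and the "file" is just a list of in-memory row groups with min/max statistics. It only illustrates why skipping row groups based on statistics reduces the amount of data decoded while returning the same rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a parquet row group holding one integer column."""
    values: List[int]

    @property
    def max_value(self) -> int:
        # Real parquet files store this statistic in the footer metadata;
        # here it is simply computed from the values.
        return max(self.values)


def scan(row_groups: List[RowGroup], threshold: int, pushdown: bool) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`; return (matching values, rows decoded)."""
    matches: List[int] = []
    rows_decoded = 0
    for rg in row_groups:
        # With pushdown, skip any row group whose max cannot satisfy the predicate.
        if pushdown and rg.max_value <= threshold:
            continue
        rows_decoded += len(rg.values)
        matches.extend(v for v in rg.values if v > threshold)
    return matches, rows_decoded


# Ten row groups of 1000 sorted values each: 0..999, 1000..1999, ..., 9000..9999.
row_groups = [RowGroup(list(range(start, start + 1000))) for start in range(0, 10_000, 1000)]

no_pushdown_matches, no_pushdown_rows = scan(row_groups, 8500, pushdown=False)
pushdown_matches, pushdown_rows = scan(row_groups, 8500, pushdown=True)

print(f"rows decoded without pushdown: {no_pushdown_rows}")
print(f"rows decoded with pushdown:    {pushdown_rows}")
```

A real benchmark would of course measure wall-clock time on actual parquet files; the sketch only shows the kind of difference such a benchmark would be quantifying.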

alamb · 2022-09-13T19:09:22Z

> Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

I agree -- this would be a great start

alamb · 2022-11-30T21:53:38Z

We have a draft of this post ready here: apache/arrow-site#280

alamb · 2022-12-07T20:58:58Z

For a variety of reasons we posted this on the Influxdata site first: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/ -- very cool stuff

alamb mentioned this issue Sep 13, 2022: [EPIC] Parquet filter pushdown into scan #3462 (Open, 27 tasks)

tustvold mentioned this issue Oct 20, 2022: Add documentation for support for skipping Parquet row groups #825 (Open)

alamb mentioned this issue Nov 30, 2022: [WEBSITE]: Querying Parquet with Millisecond Latency apache/arrow-site#280 (Merged)

alamb closed this as completed Dec 7, 2022

alamb mentioned this issue Mar 7, 2023: [Epic]: Improve Documentation, Tutorials, and Examples #3058 (Closed, 22 tasks)
