Implement parquet page-level skipping with column index, using min/max stats #847

alamb · 2021-08-10T11:02:20Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
(This is a summarized version of some comments in a discussion thread between @sunchao, @jorgecarleitao, and @nju_yaho; I am not sure the last one is the correct GitHub handle.)

While reading data from Parquet files, the more data that can be ruled out immediately without decompressing it, the faster the query will go.

@sunchao pointed out that the structure of Parquet also allows page-level skipping with column index, using min/max stats, which is pretty effective when data is sorted. The data being sorted is important because otherwise a data page could contain random data within a big range [min, max] and predicates such as col < 42 won’t be very effective.
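To illustrate why sort order matters, here is a minimal sketch (not DataFusion or parquet code) that splits 100 integers into pages of 10 values, records each page's min/max, and counts how many pages a `col < 42` predicate must still read. The helper names `page_min_max` and `pages_to_read_for_less_than` are hypothetical, and the "scattered" layout is just a deterministic permutation chosen for the example.

```python
from typing import List, Tuple


def page_min_max(values: List[int], page_size: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size pages and return (min, max) per page."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]


def pages_to_read_for_less_than(stats: List[Tuple[int, int]], bound: int) -> List[int]:
    """Indexes of pages that may contain a value < bound (page min < bound)."""
    return [i for i, (page_min, _page_max) in enumerate(stats) if page_min < bound]


# Sorted data: each page covers a narrow, disjoint value range.
sorted_values = list(range(100))
# Unsorted data: a deterministic permutation, so every page spans a wide range.
scattered_values = [(i * 37) % 100 for i in range(100)]

sorted_pages = pages_to_read_for_less_than(page_min_max(sorted_values, 10), 42)
scattered_pages = pages_to_read_for_less_than(page_min_max(scattered_values, 10), 42)

print("sorted data, pages to read:", sorted_pages)
print("scattered data, pages to read:", scattered_pages)
```

With sorted data only the first five pages need to be read; with the scattered layout every page has a minimum below 42, so nothing can be skipped.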

There is a good blog post about this feature: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

Note that page-level min/max statistics are a relatively new feature. We only know of parquet-mr and Impala as having implemented it. Spark also recently added support in apache/spark#32753. The location of the page indexes is recorded in the column chunk metadata: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L798

Describe the solution you'd like

To achieve this, recent versions of the Parquet format introduced two kinds of index: ColumnIndex and OffsetIndex.

Both are stored near the file footer, with one of each per ColumnChunk of each RowGroup.

The ColumnIndex includes the min and max values for each page, while the OffsetIndex includes the row range and file offset range for each page.
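As a rough picture of the two structures, here is a simplified Python sketch. The field names follow the `ColumnIndex`, `OffsetIndex`, and `PageLocation` structs in parquet.thrift, but this is only an illustration: the real `ColumnIndex` has further fields (such as null information) and stores min/max as binary values, whereas this sketch assumes integers. `page_row_ranges` is a hypothetical helper showing how row ranges can be derived from `first_row_index`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageLocation:
    offset: int                # file offset of the page
    compressed_page_size: int  # size of the page in bytes
    first_row_index: int       # index of the page's first row within the row group


@dataclass
class OffsetIndex:
    page_locations: List[PageLocation]


@dataclass
class ColumnIndex:
    min_values: List[int]  # one entry per page (binary in the real format)
    max_values: List[int]  # one entry per page (binary in the real format)


def page_row_ranges(offset_index: OffsetIndex, num_rows: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) row range of each page within the row group."""
    starts = [p.first_row_index for p in offset_index.page_locations]
    ends = starts[1:] + [num_rows]
    return list(zip(starts, ends))


example_offsets = OffsetIndex([
    PageLocation(offset=4, compressed_page_size=1000, first_row_index=0),
    PageLocation(offset=1004, compressed_page_size=1500, first_row_index=100),
    PageLocation(offset=2504, compressed_page_size=900, first_row_index=250),
])
example_ranges = page_row_ranges(example_offsets, num_rows=400)
print(example_ranges)
```

Each page's row range ends where the next page begins, and the last page ends at the row group's row count.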

For example, to use a predicate on Column A to skip data in Column B:

	1. For Column A:
		a. Use the ColumnIndex to select the pages that may satisfy the predicate
		b. Use the OffsetIndex to obtain the row ranges of the selected pages
	2. For Column B:
		a. Use the row ranges from Column A and Column B's OffsetIndex to find the pages whose row ranges overlap them
		b. Read only those pages, using the offsets from the filtered OffsetIndex

For Column B above, you also need to use the row ranges when scanning a page and skip any rows that fall outside them. With multiple predicates on different columns, you would also need to compute the intersection or union of the row ranges.
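The steps above could be sketched roughly as follows. This is not the arrow-rs or DataFusion implementation, just a minimal illustration under a few assumptions: row ranges are half-open `(start, end)` tuples, the predicate is a closed value range `[lo, hi]`, and all function names are hypothetical.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open [start, end) within a row group


def matching_row_ranges(page_stats: List[Tuple[int, int]],
                        page_rows: List[RowRange],
                        lo: int, hi: int) -> List[RowRange]:
    """Step 1: row ranges of pages whose [min, max] overlaps [lo, hi]."""
    return [rows for (pmin, pmax), rows in zip(page_stats, page_rows)
            if pmax >= lo and pmin <= hi]


def overlapping_pages(page_rows: List[RowRange], selected: List[RowRange]) -> List[int]:
    """Step 2: indexes of pages whose row range overlaps any selected range."""
    return [i for i, (s, e) in enumerate(page_rows)
            if any(s < sel_end and sel_start < e for sel_start, sel_end in selected)]


def intersect(a: List[RowRange], b: List[RowRange]) -> List[RowRange]:
    """AND of two sorted, non-overlapping range lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a: List[RowRange], b: List[RowRange]) -> List[RowRange]:
    """OR of two range lists, merging overlapping or adjacent ranges."""
    out: List[RowRange] = []
    for s, e in sorted(a + b):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out


# Column A: three pages of 100 rows each, predicate 12 <= A <= 15.
a_stats = [(0, 9), (10, 19), (20, 29)]
a_rows = [(0, 100), (100, 200), (200, 300)]
selected = matching_row_ranges(a_stats, a_rows, 12, 15)

# Column B has different page boundaries; only overlapping pages are read.
b_rows = [(0, 80), (80, 180), (180, 300)]
b_pages = overlapping_pages(b_rows, selected)

print(selected, b_pages)
```

In this example only the second page of Column A qualifies, so Column B's first page is skipped entirely. The two remaining pages of Column B still contain rows outside `[100, 200)`, which is why the row ranges are also needed while scanning.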

Describe alternatives you've considered
TBD



nevi-me · 2021-09-01T22:59:00Z

I've followed the Spark implementation for a while, and I think this would be a great feature to have.

Hoeze · 2021-10-18T23:45:38Z

Also, pyspark 3.2 supports Parquet column index:
https://issues.apache.org/jira/browse/SPARK-26345

alamb · 2022-08-15T11:55:09Z

I believe through the hard work of @Ted-Jiang @thinkharderdev @tustvold and others, this feature is nearing fruition - apache/arrow-rs#1191 and apache/arrow-rs#2270

Ted-Jiang · 2022-10-08T02:25:07Z

working on it.😊

tustvold · 2022-10-20T00:51:25Z

I believe this has been implemented by #3780, feel free to reopen if I have missed anything

alamb added the enhancement (New feature or request) and datafusion (Changes in the datafusion crate) labels Aug 10, 2021

westonpace mentioned this issue Apr 11, 2022

does arrow support parquet column index feathure? apache/arrow#12851

Closed

alamb mentioned this issue Aug 15, 2022

Push additional parquet filtering into the parquet scan [EPIC] #3147

Closed

5 tasks

alamb mentioned this issue Sep 13, 2022

[EPIC] Parquet filter pushdown into scan #3462

Open

27 tasks

Ted-Jiang mentioned this issue Oct 10, 2022

Implement parquet page-level skipping with column index, using min/ma… #3780

Merged

tustvold closed this as completed Oct 20, 2022

alamb mentioned this issue Nov 2, 2022

Enable parquet page level skipping (page index pruning) by default #4085

Closed

4 tasks
