Support reading PageIndex from parquet metadata, prepare for skipping pages at reading #1762
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1762      +/-  ##
==========================================
+ Coverage   83.27%   83.40%    +0.12%
==========================================
  Files         195      198        +3
  Lines       55896    56145      +249
==========================================
+ Hits        46549    46825      +276
+ Misses       9347     9320       -27
```
Continue to review full report at Codecov.
```rust
fn test_page_index_reader() {
    let test_file = get_test_file("data_index_bloom_encoding_stats.parquet");
```
Will add more tests after adding the file to parquet-testing.

Awesome, will review tomorrow 😀
This is a really nice start, awesome work, I've left a few thoughts 😄
parquet/src/file/page_index/index.rs
Outdated
```rust
} else {
    let min = min.as_slice();
    let max = max.as_slice();
    (Some(from_ne_slice::<T>(min)), Some(from_ne_slice::<T>(max)))
```
I can't find any documentation on what the endianness is supposed to be, but I suspect little endian. Using native endian feels off?
I found that Int96 only supports `from_ne_slice`:
arrow-rs/parquet/src/data_type.rs
Lines 1197 to 1214 in 90cf78c
```rust
impl FromBytes for Int96 {
    type Buffer = [u8; 12];
    fn from_le_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_be_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_ne_bytes(bs: Self::Buffer) -> Self {
        let mut i = Int96::new();
        i.set_data(
            from_ne_slice(&bs[0..4]),
            from_ne_slice(&bs[4..8]),
            from_ne_slice(&bs[8..12]),
        );
        i
    }
}
```
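For context on the endianness concern above: Parquet stores fixed-width statistics values in little-endian byte order, so decoding with an explicit `from_le_bytes` (rather than native-endian) gives the same result on every host. A minimal self-contained sketch (the helper name is illustrative, not the crate's actual API):

```rust
// Hypothetical helper: decode a 4-byte Parquet statistics value explicitly as
// little-endian, so big-endian hosts produce the same value as little-endian ones.
fn decode_i32_le(bytes: &[u8]) -> i32 {
    let arr: [u8; 4] = bytes.try_into().expect("expected exactly 4 bytes");
    i32::from_le_bytes(arr)
}

fn main() {
    // 258 = 0x0102, stored little-endian as [0x02, 0x01, 0x00, 0x00]
    let buf = [0x02u8, 0x01, 0x00, 0x00];
    assert_eq!(decode_i32_le(&buf), 258);
    println!("{}", decode_i32_le(&buf));
}
```

With `from_ne_bytes` instead, the same buffer would decode to a different value on a big-endian host, which is the portability issue being raised.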
I will add more tests in a future PR after generating test data.
Aah - should be easy enough to fix FWIW
Thanks, fixed in d8bd4ce.
Where should I have found the information that `from_le_bytes` should be used? Should all Java apps use little-endian for deserialization?
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Index {
    BOOLEAN(BooleanIndex),
```
CamelCase is generally preferred to shouty case
I think it should align with Type
Lines 45 to 53 in 90cf78c
```rust
pub enum Type {
    BOOLEAN,
    INT32,
    INT64,
    INT96,
    FLOAT,
    DOUBLE,
    BYTE_ARRAY,
    FIXED_LEN_BYTE_ARRAY,
```
Fair, I mean to change that but it has been low on my priority list 😅
```rust
@@ -189,6 +203,27 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
        })
    }

    /// Creates file reader from a Parquet file with page Index.
    /// Returns error if Parquet file does not exist or is corrupt.
    pub fn new_with_page_index(chunk_reader: R) -> Result<Self> {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you've removed the new_with_options approach, I think I would prefer you revert that and remove this additional constructor instead. With this I'm not sure how you would set ReadOptions and also enable the page index.
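The concern above is about composability: a dedicated constructor per feature does not scale, whereas an options struct lets callers combine settings. A minimal self-contained sketch of the builder-style `ReadOptions` approach being suggested (all names here are illustrative, not the crate's actual API):

```rust
// Hypothetical sketch: page-index reading as one flag on a ReadOptions struct,
// so it composes with other read options instead of needing its own constructor.
#[derive(Default)]
struct ReadOptions {
    enable_page_index: bool,
}

#[derive(Default)]
struct ReadOptionsBuilder {
    enable_page_index: bool,
}

impl ReadOptionsBuilder {
    fn new() -> Self {
        Self::default()
    }
    // Consuming builder method: enables page-index reading and returns self
    // so further options can be chained.
    fn with_page_index(mut self) -> Self {
        self.enable_page_index = true;
        self
    }
    fn build(self) -> ReadOptions {
        ReadOptions {
            enable_page_index: self.enable_page_index,
        }
    }
}

fn main() {
    let options = ReadOptionsBuilder::new().with_page_index().build();
    assert!(options.enable_page_index);
    println!("{}", options.enable_page_index);
}
```

A reader constructor taking `ReadOptions` can then honor any combination of flags, which is what a separate `new_with_page_index` cannot do.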
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
In `read_pages_locations` you simply reuse the same cursor to continue where the previous one left off; here you instead explicitly slice the data buffer and feed that into `TCompactInputProtocol`. I couldn't see a particular reason why there are two different approaches to reading the same "style" of data.
I think this is a good step forward, and can be iterated on in subsequent PRs. Thank you 😀
parquet/src/data_type.rs
Outdated
```diff
-fn from_le_bytes(_bs: Self::Buffer) -> Self {
-    unimplemented!()
+fn from_le_bytes(bs: Self::Buffer) -> Self {
+    Self::from_ne_bytes(bs)
```
This will be incorrect on any platform that isn't little endian...
I am not familiar with this. Is 2b5264b the right fix? How should I test it?
It looks correct to me, it is a bit funky because Int96 is internally represented as a [u32; 3] with the least significant u32 first (i.e. little endian).
So on a big endian machine, you need to decode to big endian u32, which are then stored in a little endian array 🤯
In practice Int96 is deprecated, and I'm not sure there are any big endian platforms supported by this crate, but good to be thorough 👍.
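The layout described above can be illustrated with a small self-contained sketch (a hypothetical helper, not the crate's actual code), assuming Int96 is three `u32` words with the least significant word first:

```rust
// Hypothetical sketch: decode an Int96 from its 12 on-disk bytes. Each u32
// word is decoded explicitly as little-endian, so little- and big-endian
// hosts both recover the same [u32; 3] (least significant word first).
fn decode_int96(bytes: &[u8; 12]) -> [u32; 3] {
    [
        u32::from_le_bytes(bytes[0..4].try_into().unwrap()),
        u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        u32::from_le_bytes(bytes[8..12].try_into().unwrap()),
    ]
}

fn main() {
    // Three little-endian u32 words: 1, 2, 3
    let bytes = [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0];
    assert_eq!(decode_int96(&bytes), [1, 2, 3]);
    println!("{:?}", decode_int96(&bytes));
}
```

The "funky" part is only conceptual: on a big-endian host the decoded `u32` values live in native (big-endian) registers, yet the array ordering of the three words stays little-endian-first.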
how should i test this
It should be covered by the existing tests, I'll create a ticket for clarifying how this is being handled crate-wide
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
But surely the thrift decoder will only read what it needs, i.e. regardless of the variable length nature of the messages it will only read the corresponding bytes? It's not a big deal, I just think we should consistently either rely on the data to be "self-delimiting" or use the lengths to decode slices
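The two styles being compared can be shown without thrift at all: reading consecutive records from one buffer either by reusing a single cursor (relying on the data being self-delimiting) or by slicing the buffer with known lengths up front. A self-contained sketch with illustrative names:

```rust
use std::io::{Cursor, Read};

// Approach 1: reuse one cursor; each read continues where the previous one
// stopped, so the records delimit themselves.
fn read_with_cursor(data: &[u8]) -> Vec<[u8; 4]> {
    let mut cursor = Cursor::new(data);
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    while cursor.read_exact(&mut buf).is_ok() {
        out.push(buf);
    }
    out
}

// Approach 2: slice the buffer explicitly using known record lengths.
fn read_with_slices(data: &[u8], record_len: usize) -> Vec<&[u8]> {
    data.chunks_exact(record_len).collect()
}

fn main() {
    let data = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let a = read_with_cursor(&data);
    let b = read_with_slices(&data, 4);
    assert_eq!(a.len(), 2);
    assert_eq!(b, vec![&[1u8, 2, 3, 4][..], &[5u8, 6, 7, 8][..]]);
    println!("{} records", a.len());
}
```

Both produce the same records; the review point is simply to pick one convention and use it consistently.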
I intend to merge this once CI clears.
I only skimmed this PR but it looks really nice -- thank you @Ted-Jiang and @tustvold
🎉
Which issue does this PR close?
Closes #1761.
Rationale for this change
Once this info is in memory, we can apply page-level filtering in the future.
What changes are included in this PR?
Add an option to read the page index in `parquet/src/file/serialized_reader.rs`.
Are there any user-facing changes?
In `parquet-testing`, only `data_index_bloom_encoding_stats.parquet` has a single row group with a page index. I will generate a test file based on `alltypes_plain.parquet` (which does not contain any page index) in the `parquet-testing` repo, and support multiple row groups in the next PR.