
Support reading PageIndex from parquet metadata, prepare for skipping pages at reading #1762

Merged
merged 13 commits into apache:master on May 31, 2022

Conversation

Ted-Jiang
Member

Which issue does this PR close?

Closes #1761 .

Rationale for this change

Get this info into memory so that we can apply page-level filters in the future.

What changes are included in this PR?

Add an option to read the page index in parquet/src/file/serialized_reader.rs.

Are there any user-facing changes?

In parquet-testing, only data_index_bloom_encoding_stats.parquet has a row group with a page index.
I will generate a test file based on alltypes_plain.parquet (this file does not contain any page index) in the parquet-testing repo, and support multiple row groups in the next PR.

 parquet-tools column-index ./alltypes_plain.parquet
row group 0:
column index for column id:
NONE
offset index for column id:
NONE
.
.
.
column index for column timestamp_col:
NONE
offset index for column timestamp_col:

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on May 28, 2022
@codecov-commenter

codecov-commenter commented May 28, 2022

Codecov Report

Merging #1762 (d8bd4ce) into master (722fcfc) will increase coverage by 0.12%.
The diff coverage is 56.93%.

@@            Coverage Diff             @@
##           master    #1762      +/-   ##
==========================================
+ Coverage   83.27%   83.40%   +0.12%     
==========================================
  Files         195      198       +3     
  Lines       55896    56145     +249     
==========================================
+ Hits        46549    46825     +276     
+ Misses       9347     9320      -27     
Impacted Files Coverage Δ
parquet/src/basic.rs 91.49% <ø> (ø)
parquet/src/data_type.rs 75.66% <0.00%> (-0.19%) ⬇️
parquet/src/util/bit_util.rs 93.01% <0.00%> (-1.02%) ⬇️
parquet/src/file/page_index/index.rs 23.07% <23.07%> (ø)
parquet/src/file/page_index/index_reader.rs 80.30% <80.30%> (ø)
parquet/src/file/serialized_reader.rs 95.57% <94.87%> (-0.08%) ⬇️
parquet/src/file/metadata.rs 95.15% <100.00%> (+0.72%) ⬆️
arrow/src/datatypes/datatype.rs 65.42% <0.00%> (-0.38%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.23%) ⬇️
arrow/src/array/array_struct.rs 88.40% <0.00%> (-0.05%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ba1ef4...d8bd4ce.

@Ted-Jiang
Member Author

@alamb @tustvold please have a review

///```
//
fn test_page_index_reader() {
let test_file = get_test_file("data_index_bloom_encoding_stats.parquet");
Member Author

Will add more tests after adding the file to parquet-testing.

@tustvold
Contributor

Awesome, will review tomorrow 😀

Contributor

@tustvold tustvold left a comment

This is a really nice start, awesome work, I've left a few thoughts 😄

} else {
let min = min.as_slice();
let max = max.as_slice();
(Some(from_ne_slice::<T>(min)), Some(from_ne_slice::<T>(max)))
Contributor

I can't find any documentation on what the endianness is supposed to be, but I suspect little endian. Using native endian feels off?

Member Author

I found Int96 only supports from_ne_slice:

impl FromBytes for Int96 {
    type Buffer = [u8; 12];
    fn from_le_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_be_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_ne_bytes(bs: Self::Buffer) -> Self {
        let mut i = Int96::new();
        i.set_data(
            from_ne_slice(&bs[0..4]),
            from_ne_slice(&bs[4..8]),
            from_ne_slice(&bs[8..12]),
        );
        i
    }
}

I will add more tests in a future PR after generating test data.

Contributor

Aah - should be easy enough to fix FWIW

Member Author

Thanks, fixed in d8bd4ce.
Could you tell me where I should find the information that from_le_bytes should be used?
Should all Java apps use little-endian for deserialization?

Ted-Jiang and others added 3 commits May 30, 2022 21:00
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@Ted-Jiang Ted-Jiang requested a review from tustvold May 30, 2022 15:31

#[derive(Debug, Clone, PartialEq)]
pub enum Index {
BOOLEAN(BooleanIndex),
Contributor

CamelCase is generally preferred to shouty case

Member Author

I think it should align with Type

pub enum Type {
    BOOLEAN,
    INT32,
    INT64,
    INT96,
    FLOAT,
    DOUBLE,
    BYTE_ARRAY,
    FIXED_LEN_BYTE_ARRAY,

Contributor

Fair, I've been meaning to change that, but it has been low on my priority list 😅


@@ -189,6 +203,27 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
})
}

/// Creates file reader from a Parquet file with page Index.
/// Returns error if Parquet file does not exist or is corrupt.
pub fn new_with_page_index(chunk_reader: R) -> Result<Self> {
Contributor

I see you've removed the new_with_options approach; I think I would prefer you revert that and remove this additional constructor instead. With this, I'm not sure how you would set ReadOptions and also enable the page index.
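
A sketch of the options-based construction being suggested, for comparison (new_with_options and ReadOptionsBuilder are existing API referenced above; the with_page_index builder method is hypothetical):

```rust
use std::fs::File;

use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

fn open_with_options(path: &str) -> parquet::errors::Result<SerializedFileReader<File>> {
    let file = File::open(path).expect("open parquet file");
    // Hypothetical flag: enabling the page index composes with other read
    // options (e.g. row-group predicates) instead of needing a new constructor.
    let options = ReadOptionsBuilder::new().with_page_index().build();
    SerializedFileReader::new_with_options(file, options)
}
```
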

let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
Contributor

In read_pages_locations you simply reuse the same cursor to continue where the previous read left off; here you instead explicitly slice the data buffer and feed each slice into TCompactInputProtocol. I couldn't see a particular reason for two different approaches to reading the same "style" of data.
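
An illustrative sketch of the two reading styles being compared (placeholder reads only, not the PR's actual code):

```rust
use std::io::Cursor;

use thrift::protocol::TCompactInputProtocol;

// Style A: slice the buffer using the serialized lengths from the column
// chunk metadata and give each slice its own protocol reader.
fn decode_with_slices(data: &[u8], lengths: &[usize]) {
    let mut start = 0;
    for &len in lengths {
        let slice = &data[start..start + len];
        let _prot = TCompactInputProtocol::new(slice);
        // ...read one ColumnIndex / OffsetIndex message from `_prot`...
        start += len;
    }
}

// Style B: keep a single cursor and let each thrift message delimit itself,
// continuing from wherever the previous read stopped.
fn decode_with_cursor(data: &[u8], count: usize) {
    let mut cursor = Cursor::new(data);
    let _prot = TCompactInputProtocol::new(&mut cursor);
    for _ in 0..count {
        // ...read the next message from `_prot`...
    }
}
```
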

Contributor

@tustvold tustvold left a comment

I think this is a good step forward, and can be iterated on in subsequent PRs. Thank you 😀

fn from_le_bytes(_bs: Self::Buffer) -> Self {
unimplemented!()
fn from_le_bytes(bs: Self::Buffer) -> Self {
Self::from_ne_bytes(bs)
Contributor

This will be incorrect on any platform that isn't little endian...

Member Author

I am not familiar with this. Is this fixed in 2b5264b? How should I test this?

Contributor

It looks correct to me; it is a bit funky because Int96 is internally represented as a [u32; 3] with the least significant u32 first (i.e. little endian).

So on a big endian machine, you need to decode to big endian u32, which are then stored in a little endian array 🤯

In practice Int96 is deprecated, and I'm not sure there are any big endian platforms supported by this crate, but good to be thorough 👍.

how should i test this

It should be covered by the existing tests, I'll create a ticket for clarifying how this is being handled crate-wide
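
For reference, a minimal sketch of an endianness-safe decode, simplified to a bare [u32; 3] rather than the crate's Int96 type, and not necessarily the exact change in 2b5264b:

```rust
// Decode 12 little-endian bytes into the [u32; 3] layout described above,
// least significant u32 first. This behaves the same on big- and
// little-endian hosts because from_le_bytes fixes the byte order explicitly.
fn int96_from_le_bytes(bs: [u8; 12]) -> [u32; 3] {
    let word = |i: usize| u32::from_le_bytes([bs[i], bs[i + 1], bs[i + 2], bs[i + 3]]);
    [word(0), word(4), word(8)]
}
```
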


let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
Contributor

But surely the thrift decoder will only read what it needs, i.e. regardless of the variable-length nature of the messages it will only read the corresponding bytes? It's not a big deal; I just think we should consistently either rely on the data being "self-delimiting" or use the lengths to decode slices.

@tustvold
Contributor

I intend to merge this once CI clears

@alamb alamb changed the title Prepare and construct index from col metadata for skipping pages at reading Support reading PageIndex from parquet metadata, prepare for skipping pages at reading May 31, 2022
Contributor

@alamb alamb left a comment

I only skimmed this PR but it looks really nice -- thank you @Ted-Jiang and @tustvold

@tustvold tustvold merged commit ac5073f into apache:master May 31, 2022
@alamb
Contributor

alamb commented May 31, 2022

🎉

@Ted-Jiang
Member Author

@tustvold @alamb Thanks a lot again! ❤️
