Support reading PageIndex from parquet metadata, prepare for skipping pages at reading #1762
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1762      +/-  ##
==========================================
+ Coverage   83.27%   83.40%    +0.12%
==========================================
  Files         195      198        +3
  Lines       55896    56145      +249
==========================================
+ Hits        46549    46825      +276
+ Misses       9347     9320       -27
```
Continue to review full report at Codecov.
```rust
fn test_page_index_reader() {
    let test_file = get_test_file("data_index_bloom_encoding_stats.parquet");
```
Will add more tests after adding the file to parquet-testing.

Awesome, will review tomorrow 😀
This is a really nice start, awesome work, I've left a few thoughts 😄
parquet/src/file/page_index/index.rs
Outdated
```rust
} else {
    let min = min.as_slice();
    let max = max.as_slice();
    (Some(from_ne_slice::<T>(min)), Some(from_ne_slice::<T>(max)))
```
I can't find any documentation on what the endianness is supposed to be, but I suspect little endian. Using native endian feels off?
I found that Int96 only supports `from_ne_slice`:
arrow-rs/parquet/src/data_type.rs
Lines 1197 to 1214 in 90cf78c
```rust
impl FromBytes for Int96 {
    type Buffer = [u8; 12];
    fn from_le_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_be_bytes(_bs: Self::Buffer) -> Self {
        unimplemented!()
    }
    fn from_ne_bytes(bs: Self::Buffer) -> Self {
        let mut i = Int96::new();
        i.set_data(
            from_ne_slice(&bs[0..4]),
            from_ne_slice(&bs[4..8]),
            from_ne_slice(&bs[8..12]),
        );
        i
    }
}
```
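For context on the endianness concern above: Parquet stores fixed-width statistics values in little-endian byte order, so decoding with an explicit `from_le_bytes` (rather than native-endian) gives the same result on every host. A minimal self-contained sketch (the helper name is illustrative, not the crate's actual API):

```rust
// Hypothetical helper: decode a 4-byte Parquet statistics value explicitly as
// little-endian, so big-endian hosts produce the same value as little-endian ones.
fn decode_i32_le(bytes: &[u8]) -> i32 {
    let arr: [u8; 4] = bytes.try_into().expect("expected exactly 4 bytes");
    i32::from_le_bytes(arr)
}

fn main() {
    // 258 = 0x0102, stored little-endian as [0x02, 0x01, 0x00, 0x00]
    let buf = [0x02u8, 0x01, 0x00, 0x00];
    assert_eq!(decode_i32_le(&buf), 258);
    println!("{}", decode_i32_le(&buf));
}
```

With `from_ne_bytes` instead, the same buffer would decode to a different value on a big-endian host, which is the portability issue being raised.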
I will add more tests in a future PR after generating test data.
Aah - should be easy enough to fix FWIW
Thanks, fixed in d8bd4ce.
Where should I have found the information that `from_le_bytes` should be used? Should all Java apps use little-endian for deserialization?
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Index {
    BOOLEAN(BooleanIndex),
```
CamelCase is generally preferred to shouty case
I think it should align with Type
Lines 45 to 53 in 90cf78c
```rust
pub enum Type {
    BOOLEAN,
    INT32,
    INT64,
    INT96,
    FLOAT,
    DOUBLE,
    BYTE_ARRAY,
    FIXED_LEN_BYTE_ARRAY,
```
Fair, I mean to change that but it has been low on my priority list 😅
```rust
@@ -189,6 +203,27 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
        })
    }

    /// Creates file reader from a Parquet file with page Index.
    /// Returns error if Parquet file does not exist or is corrupt.
    pub fn new_with_page_index(chunk_reader: R) -> Result<Self> {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you've removed the new_with_options approach, I think I would prefer you revert that and remove this additional constructor instead. With this I'm not sure how you would set ReadOptions and also enable the page index.
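The concern above is about composability: a dedicated constructor per feature does not scale, whereas an options struct lets callers combine settings. A minimal self-contained sketch of the builder-style `ReadOptions` approach being suggested (all names here are illustrative, not the crate's actual API):

```rust
// Hypothetical sketch: page-index reading as one flag on a ReadOptions struct,
// so it composes with other read options instead of needing its own constructor.
#[derive(Default)]
struct ReadOptions {
    enable_page_index: bool,
}

#[derive(Default)]
struct ReadOptionsBuilder {
    enable_page_index: bool,
}

impl ReadOptionsBuilder {
    fn new() -> Self {
        Self::default()
    }
    // Consuming builder method: enables page-index reading and returns self
    // so further options can be chained.
    fn with_page_index(mut self) -> Self {
        self.enable_page_index = true;
        self
    }
    fn build(self) -> ReadOptions {
        ReadOptions {
            enable_page_index: self.enable_page_index,
        }
    }
}

fn main() {
    let options = ReadOptionsBuilder::new().with_page_index().build();
    assert!(options.enable_page_index);
    println!("{}", options.enable_page_index);
}
```

A reader constructor taking `ReadOptions` can then honor any combination of flags, which is what a separate `new_with_page_index` cannot do.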
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
In `read_pages_locations` you simply reuse the same cursor to continue where the previous one left off; here you instead explicitly slice the data buffer and feed that into `TCompactInputProtocol`. I couldn't see a particular reason why there are two different approaches to reading the same "style" of data.
I think this is a good step forward, and can be iterated on in subsequent PRs. Thank you 😀
parquet/src/data_type.rs
Outdated
```diff
-fn from_le_bytes(_bs: Self::Buffer) -> Self {
-    unimplemented!()
+fn from_le_bytes(bs: Self::Buffer) -> Self {
+    Self::from_ne_bytes(bs)
```
This will be incorrect on any platform that isn't little endian...
I am not familiar with this. Is 2b5264b the right fix? How should I test it?
It looks correct to me, it is a bit funky because Int96 is internally represented as a [u32; 3] with the least significant u32 first (i.e. little endian).
So on a big endian machine, you need to decode to big endian u32, which are then stored in a little endian array 🤯
In practice Int96 is deprecated, and I'm not sure there are any big endian platforms supported by this crate, but good to be thorough 👍.
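The layout described above can be illustrated with a small self-contained sketch (a hypothetical helper, not the crate's actual code), assuming Int96 is three `u32` words with the least significant word first:

```rust
// Hypothetical sketch: decode an Int96 from its 12 on-disk bytes. Each u32
// word is decoded explicitly as little-endian, so little- and big-endian
// hosts both recover the same [u32; 3] (least significant word first).
fn decode_int96(bytes: &[u8; 12]) -> [u32; 3] {
    [
        u32::from_le_bytes(bytes[0..4].try_into().unwrap()),
        u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        u32::from_le_bytes(bytes[8..12].try_into().unwrap()),
    ]
}

fn main() {
    // Three little-endian u32 words: 1, 2, 3
    let bytes = [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0];
    assert_eq!(decode_int96(&bytes), [1, 2, 3]);
    println!("{:?}", decode_int96(&bytes));
}
```

The "funky" part is only conceptual: on a big-endian host the decoded `u32` values live in native (big-endian) registers, yet the array ordering of the three words stays little-endian-first.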
how should i test this
It should be covered by the existing tests, I'll create a ticket for clarifying how this is being handled crate-wide
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
But surely the thrift decoder will only read what it needs, i.e. regardless of the variable length nature of the messages it will only read the corresponding bytes? It's not a big deal, I just think we should consistently either rely on the data to be "self-delimiting" or use the lengths to decode slices
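The two styles being compared can be shown without thrift at all: reading consecutive records from one buffer either by reusing a single cursor (relying on the data being self-delimiting) or by slicing the buffer with known lengths up front. A self-contained sketch with illustrative names:

```rust
use std::io::{Cursor, Read};

// Approach 1: reuse one cursor; each read continues where the previous one
// stopped, so the records delimit themselves.
fn read_with_cursor(data: &[u8]) -> Vec<[u8; 4]> {
    let mut cursor = Cursor::new(data);
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    while cursor.read_exact(&mut buf).is_ok() {
        out.push(buf);
    }
    out
}

// Approach 2: slice the buffer explicitly using known record lengths.
fn read_with_slices(data: &[u8], record_len: usize) -> Vec<&[u8]> {
    data.chunks_exact(record_len).collect()
}

fn main() {
    let data = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let a = read_with_cursor(&data);
    let b = read_with_slices(&data, 4);
    assert_eq!(a.len(), 2);
    assert_eq!(b, vec![&[1u8, 2, 3, 4][..], &[5u8, 6, 7, 8][..]]);
    println!("{} records", a.len());
}
```

Both produce the same records; the review point is simply to pick one convention and use it consistently.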
I intend to merge this once CI clears.
I only skimmed this PR but it looks really nice -- thank you @Ted-Jiang and @tustvold
🎉
Which issue does this PR close?
Closes #1761.
Rationale for this change
Once this info is in memory, we can apply page-level filtering in the future.
What changes are included in this PR?
Add an option to read the page index in `parquet/src/file/serialized_reader.rs`.
Are there any user-facing changes?
In `parquet-testing`, only `data_index_bloom_encoding_stats.parquet` has a single row group with a page index. I will generate a test file based on `alltypes_plain.parquet` (which does not contain any page index) in the `parquet-testing` repo, and support multiple row groups in the next PR.