Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support skip_def_levels for ColumnLevelDecoder #2111

Merged
merged 6 commits into from Jul 24, 2022

Conversation

Ted-Jiang
Copy link
Member

@Ted-Jiang Ted-Jiang commented Jul 20, 2022

Which issue does this PR close?

Closes #2107 .

Rationale for this change

This pr only support skip def_levels in DefinitionLevelBufferDecoder,
@tustvold Just make sure all page with max_def_level = 1 and max_rep_level = 0,
will only use DefinitionLevelBufferDecoder decode def_level.

In another word, this will support the skip val in col like OPTIONAL type R:0 D:1

Added :
Seems it still use ColumnLevelDecoderImpl in some case facing R:0 D:1
when decoding some col`s def_level

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 20, 2022
@Ted-Jiang Ted-Jiang changed the title Support skip_def_levels (only max_def_levels=1) for ColumnLevelDecoder Support skip_def_levels (only max_def_levels=1) for ColumnLevelDecoder Jul 20, 2022
@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Jul 20, 2022

@tustvold @thinkharderdev PTAL 😊, this block the #2106 because all type are R:0 D:1.

parquet-tools meta  ./alltypes_tiny_pages_plain.parquet
file:            ~/github/arrow-rs/parquet-testing/data/alltypes_tiny_pages_plain.parquet
creator:         parquet-mr version 1.12.0-SNAPSHOT (build 6901a2040848c6b37fa61f4b0a76246445f396db)
extra:           writer.model.name = 2.1.1-cdh6.x-SNAPSHOT

file schema:     hive_schema
--------------------------------------------------------------------------------
id:              OPTIONAL INT32 R:0 D:1
bool_col:        OPTIONAL BOOLEAN R:0 D:1
tinyint_col:     OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col:    OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col:         OPTIONAL INT32 R:0 D:1
bigint_col:      OPTIONAL INT64 R:0 D:1
float_col:       OPTIONAL FLOAT R:0 D:1
double_col:      OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY L:STRING R:0 D:1
string_col:      OPTIONAL BINARY L:STRING R:0 D:1
timestamp_col:   OPTIONAL INT96 R:0 D:1
year:            OPTIONAL INT32 R:0 D:1
month:           OPTIONAL INT32 R:0 D:1

@codecov-commenter
Copy link

codecov-commenter commented Jul 20, 2022

Codecov Report

Merging #2111 (24c4519) into master (7746e7d) will decrease coverage by 0.01%.
The diff coverage is 72.72%.

@@            Coverage Diff             @@
##           master    #2111      +/-   ##
==========================================
- Coverage   82.86%   82.85%   -0.02%     
==========================================
  Files         237      237              
  Lines       61296    61381      +85     
==========================================
+ Hits        50792    50856      +64     
- Misses      10504    10525      +21     
Impacted Files Coverage Δ
parquet/src/column/reader/decoder.rs 50.00% <0.00%> (-11.34%) ⬇️
...rquet/src/arrow/record_reader/definition_levels.rs 87.76% <91.42%> (+1.96%) ⬆️
arrow/src/datatypes/datatype.rs 62.69% <0.00%> (-0.32%) ⬇️
parquet/src/encodings/encoding.rs 93.62% <0.00%> (+0.19%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us.

@@ -399,6 +442,55 @@ mod tests {
assert_eq!(decoded.as_slice(), expected.as_slice());
}

#[test]
fn test_packed_decoder_skip() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Ted-Jiang Ted-Jiang changed the title Support skip_def_levels (only max_def_levels=1) for ColumnLevelDecoder Support skip_def_levels (only max_def_levels=1) for DefinitionLevelBufferDecoder Jul 20, 2022
@Ted-Jiang Ted-Jiang changed the title Support skip_def_levels (only max_def_levels=1) for DefinitionLevelBufferDecoder Support skip_def_levels (only max_def_levels=1) for ColumnLevelDecoder Jul 20, 2022
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately max_def_level = 1 is an insufficient condition for using a PackedDecoder, as the nulls may not exist at the leaf level.

The way this currently works is the ArrayReader implementation will pass either a packed or full DefinitionLevelBuffer to RecordReader, which will in turn pass this to DefinitionLevelBufferDecoder. This is a bit of a hack but avoided leaking arrow-specific details into the GenericRecordReader.

Unfortunately this means that DefinitionLevelBufferDecoder does not know if it should use a PackedDecoder until it has actually been asked to read data.

I will have a brief play and see if I can sort this out for you, it's my mess to cleanup 😆

) -> Result<(usize, usize)> {
Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
// For now only support max_def_level == 1
if max_def_level == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't technically correct, if you have a nullable StructArray, containing a non-nullable child. The child will have a max definition level of 1, but it will be using ColumnLevelDecoderImpl not PackedDecoder.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, i understand it 🤔 thanks!
I think when using PackedDecoder just need the bitmask of the col.
And the unit of the bitmask is per row.
So you use the max_def_level == 1 and max_rep_level == 0 as the checker
am i right 😂?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, we only preserve the bitmask if this is a nullable leaf.

I.e we would preserve the bitmask for this

message Schema {
    OPTIONAL int32 field;
}

But not

message Schema {
    OPTIONAL GROUP group {
        REQUIRED int32 field;
    }
}

Despite field having a max_def_level == 1 in both cases.

Following #2117 I would remove all max_def_level logic, and just match the type of decoder.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it
One question: if we preserve the bitmask if this is a nullable leaf
like

message Schema {
    OPTIONAL GROUP group {
        OPTIONAL int32 field;
    }
}

we can not use the PackedDecoder right ? because the value can be 0,1,2

Copy link
Contributor

@tustvold tustvold Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, i will read carefully. Seems you know the everything about parquet 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet/src/arrow/record_reader/definition_levels.rs Outdated Show resolved Hide resolved
parquet/src/arrow/record_reader/definition_levels.rs Outdated Show resolved Hide resolved
parquet/src/arrow/record_reader/definition_levels.rs Outdated Show resolved Hide resolved
parquet/src/column/reader/decoder.rs Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

I've created #2116 which should hopefully help with the DefinitionLevelBufferDecoder logic, which I think is currently incorrect in this PR.

@tustvold
Copy link
Contributor

#2117 has now been merged which should make it possible to significantly simplify this PR (i.e. remove the max_def_level shenanigans)

@Ted-Jiang
Copy link
Member Author

#2117 has now been merged which should make it possible to significantly simplify this PR (i.e. remove the max_def_level shenanigans)

@tustvold i think you linked the wrong pr.

@tustvold
Copy link
Contributor

@Ted-Jiang I meant #2116

for _ in 0..num_levels {
// Values are delimited by max_def_level
if max_def_level
== reader
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will test in #2106

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just some minor nits

let indicator_value = self.decode_header()?;
let indicator_value = self
.decode_header()
.expect("decode_header fail in PackedDecoder");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before met not set decode_header when coding. Better have the error info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we propogated the error to the caller, instead of panicking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! my fault. it should use previous code pass the error. fix in fix comment

parquet/src/column/reader/decoder.rs Show resolved Hide resolved
@tustvold tustvold merged commit fce6626 into apache:master Jul 24, 2022
@ursabot
Copy link

ursabot commented Jul 24, 2022

Benchmark runs are scheduled for baseline = 7746e7d and contender = fce6626. fce6626 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@Ted-Jiang Ted-Jiang changed the title Support skip_def_levels (only max_def_levels=1) for ColumnLevelDecoder Support skip_def_levels for ColumnLevelDecoder Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support skip_def_levels for ColumnLevelDecoder
5 participants