Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support skip_page missing OffsetIndex Fallback in SerializedPageReader #2460

Merged
merged 3 commits into from Aug 18, 2022

Conversation

Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #2459

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@Ted-Jiang Ted-Jiang marked this pull request as draft August 16, 2022 03:43
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 16, 2022
@@ -480,13 +480,17 @@ pub(crate) fn decode_page(
Ok(result)
}

#[allow(clippy::large_enum_variant)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this clippy get

error: large size difference between variants
   --> parquet/src/file/serialized_reader.rs:484:5
    |
484 | /     Values {
485 | |         /// The current byte offset in the reader
486 | |         offset: usize,
487 | |
...   |
492 | |         next_page_header: Option<PageHeader>,
493 | |     },
    | |_____^ this variant is 336 bytes
    |
    = note: `-D clippy::large-enum-variant` implied by `-D warnings`
note: and the second-largest variant is 72 bytes:
   --> parquet/src/file/serialized_reader.rs:494:5

but i think the 336 is wrong because in PageHeader, most of the large arg are option, clippy can not know when running only one is some , so i think remove this check here is reasonable (or wrap 'Box' i think this will downgrade)🤔

  pub data_page_header: Option<DataPageHeader>,
  pub index_page_header: Option<IndexPageHeader>,
  pub dictionary_page_header: Option<DictionaryPageHeader>,
  pub data_page_header_v2: Option<DataPageHeaderV2>,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold PTAL😊

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option takes up the same space regardless of it is set, they're like C++ union types in that sense. I think you should box the page header probably...

Copy link
Member Author

@Ted-Jiang Ted-Jiang Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right! tested in ut

#[test]
    fn test() {
        enum  Test {
            Large {
                page: Option<PageHeader>,
            }
        }
        let test1 = Test::Large { page: None };

        println!("{}", std::mem::size_of_val(&test1));
        ()
    }

got 320 😂 thanks for your info!
I should try before ask😔

@Ted-Jiang Ted-Jiang closed this Aug 18, 2022
@Ted-Jiang Ted-Jiang reopened this Aug 18, 2022
@Ted-Jiang Ted-Jiang marked this pull request as ready for review August 18, 2022 04:17
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just some minor nits

parquet/src/arrow/arrow_reader/selection.rs Outdated Show resolved Hide resolved
parquet/src/file/serialized_reader.rs Outdated Show resolved Hide resolved
if *remaining_bytes == 0 {
return Ok(None);
}
return if let Some(header) = next_page_header.take() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably shouldn't take but just be as_ref?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the only way set it none. Is after skip_next_page.

@tustvold tustvold merged commit 12cc067 into apache:master Aug 18, 2022
@tustvold
Copy link
Contributor

tustvold commented Aug 18, 2022

I'm going to get this in as the fix for parquet_derive is in #2494 and I want this to make the release 😄

@ursabot
Copy link

ursabot commented Aug 18, 2022

Benchmark runs are scheduled for baseline = 9f77e4e and contender = 12cc067. 12cc067 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support SerializedPageReader::skip_page without OffsetIndex
3 participants