
Generify parquet write path (#1764) #2045

Merged (8 commits) on Jul 17, 2022

Conversation

tustvold (Contributor):

Which issue does this PR close?

Part of #1764

Rationale for this change

This will allow encoding arrow arrays directly, instead of the current approach, which must first convert them to a row format.

What changes are included in this PR?

Adds generics to ColumnWriterImpl to allow custom encoders

Are there any user-facing changes?

No

github-actions bot added the parquet label (Changes to the parquet crate) on Jul 11, 2022
@@ -212,7 +212,7 @@ pub enum Repetition {
/// Encodings supported by Parquet.
/// Not all encodings are valid for all types. These enums are also used to specify the
/// encoding of definition and repetition levels.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd)]
tustvold (Contributor Author):
This is needed to allow for stable encoding ordering
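Deriving `Ord` lets `Encoding` live in a `BTreeSet`, whose iteration order follows `Ord` regardless of insertion order. A minimal sketch of the effect, using a hypothetical stand-in enum rather than the crate's real `Encoding`:

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for parquet's `Encoding`; the real crate now also
// derives `Ord`/`PartialOrd` so the set below can be a `BTreeSet`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd)]
enum Encoding {
    Plain,
    Rle,
    DeltaBinaryPacked,
}

fn main() {
    // Insertion order differs from declaration order...
    let mut encodings = BTreeSet::new();
    encodings.insert(Encoding::DeltaBinaryPacked);
    encodings.insert(Encoding::Rle);
    encodings.insert(Encoding::Plain);

    // ...but iteration always follows `Ord` (derived from variant order),
    // so metadata written from this set is deterministic across runs.
    let ordered: Vec<Encoding> = encodings.into_iter().collect();
    assert_eq!(
        ordered,
        vec![Encoding::Plain, Encoding::Rle, Encoding::DeltaBinaryPacked]
    );
}
```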

{
const PHYSICAL_TYPE: Type;
tustvold (Contributor Author):
This is moved onto ParquetValueType, as it allows implementing methods in terms of ParquetValueType, which has the advantage that type inference works.
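A simplified sketch of why this helps inference (the trait and names below are illustrative, not the crate's exact definitions): with the constant hung off the value-type trait, a function generic over `T: ParquetValueType` can recover the physical type directly from the values it is given, with no explicit `DataType` turbofish needed.

```rust
#[derive(Debug, PartialEq)]
enum PhysicalType {
    Int32,
    Double,
}

// Illustrative value-type trait carrying the physical type as an
// associated const, mirroring the shape of the change in this PR.
trait ParquetValueType {
    const PHYSICAL_TYPE: PhysicalType;
}

impl ParquetValueType for i32 {
    const PHYSICAL_TYPE: PhysicalType = PhysicalType::Int32;
}

impl ParquetValueType for f64 {
    const PHYSICAL_TYPE: PhysicalType = PhysicalType::Double;
}

// `T` is inferred from the slice element type at the call site.
fn physical_type_of<T: ParquetValueType>(_values: &[T]) -> PhysicalType {
    T::PHYSICAL_TYPE
}

fn main() {
    assert_eq!(physical_type_of(&[1i32, 2, 3]), PhysicalType::Int32);
    assert_eq!(physical_type_of(&[1.0f64]), PhysicalType::Double);
}
```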


// Find out number of batches to process.
let write_batch_size = self.props.write_batch_size();
-let num_batches = min_len / write_batch_size;
+let num_batches = num_levels / write_batch_size;
tustvold (Contributor Author):
I think this was a bug before: previously, the presence of nulls (which leads to fewer values than levels) would result in skewed batching.
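An illustrative calculation (the numbers are invented for the example, not taken from the crate) showing how the two formulas diverge once nulls are present:

```rust
fn main() {
    // A column chunk with 1000 level entries, 400 of which are null,
    // leaving only 600 actual values.
    let num_levels = 1000usize;
    let min_len = 600usize; // non-null value count
    let write_batch_size = 100usize;

    // Old computation: nulls shrink the value count, so fewer batches
    // are processed than the level count requires.
    let old_num_batches = min_len / write_batch_size;
    assert_eq!(old_num_batches, 6);

    // Fixed computation: batching follows the level count, which is the
    // quantity actually consumed per batch.
    let new_num_batches = num_levels / write_batch_size;
    assert_eq!(new_num_batches, 10);
}
```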

pub(crate) mod private {
use super::*;

pub trait MakeStatistics {
tustvold (Contributor Author):
I'm not a huge fan of this, but it was the only way I could work out how to do it. It is crate-local, so if we come up with a better approach we can change it in the future.
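A rough sketch of the pattern being discussed (the types and method here are illustrative, not the crate's exact code): a crate-private trait lets generic code construct the type-erased `Statistics` enum from typed values, without matching on a physical type at every call site.

```rust
// Illustrative type-erased statistics enum with two variants.
#[derive(Debug, PartialEq)]
enum Statistics {
    Int32 { min: Option<i32>, max: Option<i32> },
    Double { min: Option<f64>, max: Option<f64> },
}

// Crate-private construction trait: each value type knows which
// `Statistics` variant it maps to.
trait MakeStatistics: Sized {
    fn make_statistics(min: Option<Self>, max: Option<Self>) -> Statistics;
}

impl MakeStatistics for i32 {
    fn make_statistics(min: Option<i32>, max: Option<i32>) -> Statistics {
        Statistics::Int32 { min, max }
    }
}

impl MakeStatistics for f64 {
    fn make_statistics(min: Option<f64>, max: Option<f64>) -> Statistics {
        Statistics::Double { min, max }
    }
}

fn main() {
    // Generic code can now erase the type in one call.
    let s = i32::make_statistics(Some(1), Some(9));
    assert_eq!(s, Statistics::Int32 { min: Some(1), max: Some(9) });
}
```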

Contributor:

Maybe another reason to not switch from TypedStatistics to ValueStatistics 🤔

codecov-commenter commented Jul 11, 2022

Codecov Report

Merging #2045 (496bb70) into master (c585544) will increase coverage by 0.00%.
The diff coverage is 88.31%.

@@           Coverage Diff           @@
##           master    #2045   +/-   ##
=======================================
  Coverage   83.73%   83.73%           
=======================================
  Files         224      225    +1     
  Lines       59391    59373   -18     
=======================================
- Hits        49733    49718   -15     
+ Misses       9658     9655    -3     
Impacted Files Coverage Δ
parquet/src/basic.rs 91.49% <ø> (ø)
parquet/src/data_type.rs 74.25% <50.00%> (-1.49%) ⬇️
parquet/src/file/statistics.rs 91.49% <78.57%> (-1.07%) ⬇️
parquet/src/column/writer/encoder.rs 88.46% <88.46%> (ø)
parquet/src/column/writer/mod.rs 92.53% <91.37%> (ø)
parquet/src/encodings/encoding.rs 93.62% <0.00%> (+0.19%) ⬆️
arrow/src/datatypes/datatype.rs 64.41% <0.00%> (+0.35%) ⬆️
...row/src/array/builder/string_dictionary_builder.rs 91.36% <0.00%> (+0.71%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb (Contributor) reviewed on Jul 11, 2022:
I can't say I followed all the details of this PR (especially all the generics), but the code seems easier to read, and other than a few questions on the tests, things look good to me.

Epic 🏅

num_column_nulls: u64,
column_distinct_count: Option<u64>,
encodings: BTreeSet<Encoding>,
Contributor:
Why is this a BTreeSet (which requires Ord on Encoding) rather than a HashSet? Is there any reason the ordering is important?

tustvold (Contributor Author):
It makes the output deterministic, which makes testing easier, but in terms of correctness, there should be no requirement for a particular order

Contributor:
Makes sense -- Perhaps it is worth a comment explaining the rationale

parquet/src/basic.rs (resolved)
-pub struct TypedStatistics<T: DataType> {
-min: Option<T::T>,
-max: Option<T::T>,
+pub type TypedStatistics<T> = ValueStatistics<<T as DataType>::T>;
Contributor:
I am sorry, my rust-fu isn't good enough -- what is the purpose of renaming TypedStatistics to ValueStatistics (and doing the DataType::T nonsense in the typedef)? Is that needed so that #[derive] works?

tustvold (Contributor Author) on Jul 11, 2022:
It is needed so you can create Statistics from a T: ParquetValueType, whereas previously it was being created (in a very hacky way) using DataType and casting via byte slices
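A simplified sketch of the shape of this refactor (the struct body and constructor are illustrative; only the alias mirrors the diff): statistics become generic over the value type `T` itself, so they can be built straight from values, and the old `TypedStatistics` name survives as a type alias.

```rust
// Illustrative stand-in for the crate's `DataType` trait, which ties a
// parquet type marker to its in-memory value type.
trait DataType {
    type T;
}

struct Int32Type;
impl DataType for Int32Type {
    type T = i32;
}

// New shape: generic over the value type directly, so it can be created
// from values with no byte-slice casting through `DataType`.
#[derive(Debug, PartialEq)]
struct ValueStatistics<T> {
    min: Option<T>,
    max: Option<T>,
}

impl<T> ValueStatistics<T> {
    fn new(min: Option<T>, max: Option<T>) -> Self {
        Self { min, max }
    }
}

// Old name preserved as an alias, as in the diff above.
type TypedStatistics<T> = ValueStatistics<<T as DataType>::T>;

fn main() {
    // Both spellings name the same concrete type.
    let a: TypedStatistics<Int32Type> = ValueStatistics::new(Some(1), Some(5));
    assert_eq!(a, ValueStatistics::<i32>::new(Some(1), Some(5)));
}
```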

);
check_encoding_write_support::<Int32Type>(
WriterVersion::PARQUET_2_0,
false,
&[1, 2],
None,
-&[Encoding::DELTA_BINARY_PACKED, Encoding::RLE],
+&[Encoding::RLE, Encoding::DELTA_BINARY_PACKED],
Contributor:
I don't understand these changes. Why did the order of supported encodings change? Is it due to the use of a Set?

tustvold (Contributor Author):
The way we track the encodings has changed slightly (using a set), which results in these changes.

// Column writer properties
descr: ColumnDescPtr,
props: WriterPropertiesPtr,
statistics_enabled: EnabledStatistics,

page_writer: Box<dyn PageWriter + 'a>,
has_dictionary: bool,
Contributor:
this appears to be a significant improvement / change -- not tracking DictEncoder specially

parquet/src/column/writer/encoder.rs (resolved)
parquet/src/column/writer/encoder.rs (resolved)
alamb (Contributor) commented Jul 11, 2022:

It might be worth asking @sunchao or @nevi-me to take a look at this PR as well

tustvold (Contributor Author):

@nevi-me are you still reviewing this?

tustvold (Contributor Author):

I'm going to get this in; @nevi-me, if you have any feedback I'll be more than happy to address it in a follow-up.

@tustvold tustvold merged commit 02371d2 into apache:master Jul 17, 2022
ursabot commented Jul 17, 2022

Benchmark runs are scheduled for baseline = c585544 and contender = 02371d2. 02371d2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
parquet Changes to the parquet crate

4 participants