Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MapArrayReader (#2484) (#1699) (#1561) #2500

Merged
merged 3 commits into from Aug 18, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #2484
Closes #1699
Closes #1561

Rationale for this change

The ListArrayReader logic is better tested, more complete and can be used to read MapArrays

What changes are included in this PR?

Replaces the logic in MapArrayReader with ListArrayReader

Are there any user-facing changes?

No

use arrow::record_batch::RecordBatch;
use bytes::Bytes;

#[test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credit to @d-willis for this test

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 18, 2022
ArrowType::Map(element, _) => match element.data_type() {
ArrowType::Struct(fields) if fields.len() == 2 => {
// The inner map field must always non-nullable (#1697)
assert!(!element.is_nullable(), "map struct cannot be nullable");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in #1697 the encoding of DataType::Map permits greater nullability than the type can actually contain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it is more, if this level is nullable the schema is inconsistent as there is no way to represent that in parquet 😅 The schema inference logic will never generate this - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/complex.rs#L350

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth a comment there

@tvincent2
Copy link

Hi!
You just made my week with this PR!

It's quite a coincidence, but we've been hit by this very problem and spent some time on it. We created a small project to show how to reproduce our issue and just found #2484 as we were about to open our issue.
We can confirm that we can no longer reproduce our issue when using this PR branch.

Thanks a lot!

Cheers,
Thomas

@alamb alamb changed the title Fix MapArrayReader (#2484) (#1699) (#1561) Fix MapArrayReader (#2484) (#1699) (#1561) Aug 18, 2022
@d-willis
Copy link

Giving this a test against our code now. Hopefully will have run by tomorrow...
I can confirm that the reproducer test now works correctly locally.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold - I think this looks really nice (it is awesome to fix bugs by deleting code).

I think the test should be updated to assert that nullness on the read array was correct, but otherwise this PR looks good to go to me.

The feedback from @tvincent2 (thanks for that btw 👋 ) also makes a compelling case for the benefits of this PR. 👍

cc @nevi-me and @viirya

ArrowType::Map(element, _) => match element.data_type() {
ArrowType::Struct(fields) if fields.len() == 2 => {
// The inner map field must always non-nullable (#1697)
assert!(!element.is_nullable(), "map struct cannot be nullable");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1697 is a question -- so is the solution "we will assume that the inner struct can not be nullable until we have an existence proof to the contrary?"

map_def_level: rep_level,
map_rep_level: def_level,
}
let struct_def_level = match nullable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could document what nullable means in this context as part of a docstring

assert!(!element.is_nullable(), "map struct cannot be nullable");
element
}
_ => unreachable!("expected struct with two fields"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually it might be a nicer UX to return an error rather than panic'ing here (e.g. MapArrayReader::try_new()) but that would also change the API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a public API so I could change it, but that also makes me less inclined to do so. It is a bug in the builder if you run into this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense 👍

let data = array.data().clone();
let builder = data.into_builder().data_type(self.data_type.clone());
Ok(Arc::new(MapArray::from(unsafe {
builder.build_unchecked()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unsafe is OK because the inner StructArrayReader already validated the contents of the structs, and MapArrayReader::new() already validated the DataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much, will add a doc comment

map_def_level: i16,
#[allow(unused)]
map_rep_level: i16,
reader: ListArrayReader<i32>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling this inner might me more idiomatic

image

for maybe_record_batch in record_batch_reader {
let record_batch = maybe_record_batch.expect("Getting current batch");
let col = record_batch.column(0);
let map_entry = array::as_map_array(col).value(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this test also validate that the first two entries are null? Something like

assert!(!array::as_map_array(col).is_valid(0));
assert!(!array::as_map_array(col).is_valid(1));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was originally written to test my very specific failure condition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a worthwhile addition though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry I probably should have phrased that as "I suggest this test also validate the first two entries are null"

@tustvold tustvold merged commit c34f07f into apache:master Aug 18, 2022
@ursabot
Copy link

ursabot commented Aug 18, 2022

Benchmark runs are scheduled for baseline = f3afdd2 and contender = c34f07f. c34f07f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@@ -193,6 +201,8 @@ mod tests {
for maybe_record_batch in record_batch_reader {
let record_batch = maybe_record_batch.expect("Getting current batch");
let col = record_batch.column(0);
assert!(col.is_null(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@d-willis
Copy link

Thanks for the quick turn around on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
5 participants