Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation logic for StructBuilder::finish #2413

Merged
merged 4 commits into from Aug 13, 2022
Merged

Add validation logic for StructBuilder::finish #2413

merged 4 commits into from Aug 13, 2022

Conversation

psvri
Copy link
Contributor

@psvri psvri commented Aug 11, 2022

Which issue does this PR close?

Closes #2252.

Rationale for this change

StructBuilder::finish calls ArrayDataBuilder::build_unchecked without verifying the length of the child arrays. This allows forming invalid ArrayData

What changes are included in this PR?

I have added a new private function StructBuilder::validate_content which is called first before running the finish logic. This validates contents in the struct to ensure that fields and field_builders are of equal length and the number of items in individual field_builders are equal to self.len

Are there any user-facing changes?

No, Since I have used panics to handle the errors.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 11, 2022
@liukun4515
Copy link
Contributor

liukun4515 commented Aug 11, 2022

The validation should be done in each push operation for Builder or not the finish

@psvri
Copy link
Contributor Author

psvri commented Aug 11, 2022

In the linked issue the expected behavior was for finish to panic or return an error if its children are not all the same length. Hence I made the code this way.

tustvold
tustvold previously approved these changes Aug 11, 2022
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you

}
if !self.field_builders.iter().all(|x| x.len() == self.len) {
panic!("Field_builders are of unequal lengths.")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also verify the length of the null buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this logic in 29ff8a6

@tustvold tustvold dismissed their stale review August 11, 2022 14:47

null buffer

/// - the number of null items in individual field_builders are equal to null_count in self.null_buffer_builder.
fn validate_and_construct(&mut self) -> (Vec<ArrayData>, Option<Buffer>) {
if self.fields.len() != self.field_builders.len() {
panic!("Number of fields is not equal to the number of field_builders.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check this when constructing the struct builder, such as moving this check to StructBuilder::new()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too was thinking the same. But I made my code according to issues expected behavior. If the team needs it, I shall move this check to new function.

}

if child_null_count != self_null_count {
panic!("StructBuilder and field_builders contain unequal null counts.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't struct builder and child builders have different null counts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we say a struct is null, its child elements should be null. Hence they should have same null counts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we say a struct is null, its child elements should be null.

Sorry, I'm afraid this is not the truth. Each array has its own null buffer, which means the validation of elements in different arrays are independent.

For example, let's say we have a struct array ["name": String, "age": Int]. We could have elements whose fields could be None:

[
    ["Name": Some("Bob"), "age": Some(32)],
    ["Name": Some("Andy"), "age": None],
    ["Name": None, "age": None],
]

Also, we could have a null field in the Struct array:

[
    Some(["Name": Some("Bob"), "age": Some(32)]),
    None, // the underlying "Name" and "age" fields could be anything. Because they are masked by the null buffer. 
    Some(["Name": None, "age": None]),
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In array::data::tests::test_try_new_sliced_struct , there was this below code and thats why I had understood it that way.

// struct = { a: int, b: bool }
// struct[1] = null
        builder
            .field_builder::<Int32Builder>(0)
            .unwrap()
            .append_option(None);
        builder
            .field_builder::<BooleanBuilder>(1)
            .unwrap()
            .append_option(None);
        builder.append(false);

So I added this logic. My bad,I had understood it wrong. I shall remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check should be that the bit length of null_buffer_builder is the same as the length of the inner builders, the null counts can and likely will differ

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if the latest commit looks good ?

Copy link
Contributor

@HaoYang670 HaoYang670 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Leave some nits.

if !self
.field_builders
.iter()
.all(|x| x.len() == self.null_buffer_builder.len())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have checked

self.field_builders.iter().all(|x| x.len() == self.len)

before, we don't need to check

self .field_builders.iter().all(|x| x.len() == self.null_buffer_builder.len())

Based on the transitivity of integer equality, checking

self.len == self.null_buffer_builder.len()

is enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait! We don't need to check the null buffer length at all. As we can make sure that the length of null buffer builder is always equal to self.len. @tustvold

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah yes, does beg the question why we even have self.len as a separate variable...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #2429 to track this.


let mut builder = StructBuilder::new(fields, field_builders);
builder.finish();
}

This comment was marked as abuse.

@HaoYang670
Copy link
Contributor

Could we merge this? And then we can do some clean-up.

@tustvold tustvold merged commit f0d7d0b into apache:master Aug 13, 2022
@ursabot
Copy link

ursabot commented Aug 13, 2022

Benchmark runs are scheduled for baseline = 98eeb01 and contender = f0d7d0b. f0d7d0b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StructBuilder Does not Verify Child Lengths
5 participants