Ignore null buffer when creating ArrayData if null count is zero #2056

jhorstmann · 2022-07-12T21:25:55Z

Which issue does this PR close?

Closes #2055.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Semantically, a null buffer with all bits set (that is, a null count of 0) is the same as a `None` null buffer, but this is an observable behavior change.
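
To illustrate the equivalence described above, here is a minimal sketch that does not use the arrow crate. The helpers `is_valid` and `null_count` are hypothetical names introduced only for this example, and the bitmap is assumed to be least-significant-bit first. It shows that an all-set bitmap and a missing bitmap answer every validity query identically, while the presence of the buffer itself remains something a caller can observe.

```rust
/// Hypothetical helper (not an arrow-rs API): validity of slot `i`.
/// `None` stands for "no null buffer", meaning every slot is valid.
fn is_valid(null_bitmap: Option<&[u8]>, i: usize) -> bool {
    match null_bitmap {
        None => true,
        // Assumed bit order: least-significant bit first within each byte.
        Some(bits) => (bits[i / 8] & (1u8 << (i % 8))) != 0,
    }
}

/// Hypothetical helper: number of null slots among the first `len` slots.
fn null_count(null_bitmap: Option<&[u8]>, len: usize) -> usize {
    (0..len).filter(|&i| !is_valid(null_bitmap, i)).count()
}

fn main() {
    let len = 10;
    let all_set = vec![0xFFu8; 2]; // 16 bits, all marked valid

    // Both representations describe the same logical content...
    assert_eq!(null_count(Some(all_set.as_slice()), len), 0);
    assert_eq!(null_count(None, len), 0);
    assert!((0..len).all(|i| is_valid(Some(all_set.as_slice()), i) == is_valid(None, i)));

    // ...but whether a buffer is present at all can still be observed.
    let with_buffer: Option<&[u8]> = Some(all_set.as_slice());
    let without_buffer: Option<&[u8]> = None;
    assert_ne!(with_buffer.is_some(), without_buffer.is_some());
}
```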

codecov-commenter · 2022-07-12T21:44:44Z

Codecov Report

Merging #2056 (a0cac4d) into master (330505c) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2056      +/-   ##
==========================================
+ Coverage   83.55%   83.57%   +0.01%     
==========================================
  Files         222      222              
  Lines       58230    58244      +14     
==========================================
+ Hits        48656    48679      +23     
+ Misses       9574     9565       -9

Impacted Files	Coverage Δ
arrow/src/array/array_boolean.rs	`94.15% <100.00%> (+0.77%)`	⬆️
arrow/src/array/builder/generic_list_builder.rs	`95.09% <0.00%> (-1.61%)`	⬇️
...row/src/array/builder/string_dictionary_builder.rs	`90.64% <0.00%> (-0.72%)`	⬇️
arrow/src/datatypes/datatype.rs	`65.31% <0.00%> (-0.37%)`	⬇️
parquet_derive/src/parquet_field.rs	`65.98% <0.00%> (-0.23%)`	⬇️
arrow/src/array/equal/mod.rs	`96.48% <0.00%> (+0.28%)`	⬆️
arrow/src/ffi.rs	`87.52% <0.00%> (+0.34%)`	⬆️
arrow/src/compute/kernels/boolean.rs	`97.69% <0.00%> (+0.88%)`	⬆️
arrow/src/datatypes/ffi.rs	`76.56% <0.00%> (+3.83%)`	⬆️
arrow/src/array/builder/generic_string_builder.rs	`92.13% <0.00%> (+10.08%)`	⬆️
... and 1 more

HaoYang670

Why don't we use the BooleanBuilder directly? It appears to use the same logic:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/builder/boolean_builder.rs#L128-L138

If we don't do further optimization, I guess we could write the code like this:

    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (_, data_len) = iter.size_hint();
        let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.
        let mut array_builder = BooleanBuilder::new(data_len);

        iter.for_each(|i| {
            // Option<bool> is Copy, so dereference instead of cloning.
            array_builder.append_option(*i.borrow()).unwrap();
        });

        array_builder.finish()
    }

BTW, we could optimize the BooleanBuilder further, for example by lazily materializing the null builder, as is already done in the primitive builder: https://github.com/apache/arrow-rs/blob/master/arrow/src/array/builder/primitive_builder.rs#L33-L35
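
As a rough illustration of what "lazily materializing the null builder" could mean, here is a self-contained sketch. `LazyNullBuilder` is a hypothetical type written for this example; it is not the arrow-rs builder and makes no claim about how the primitive builder is actually implemented. The bitmap is only allocated when the first null is appended, and earlier slots are back-filled as valid at that point.

```rust
/// Hypothetical sketch, not the arrow-rs implementation: tracks validity
/// but allocates a bitmap only once the first null is appended.
struct LazyNullBuilder {
    len: usize,
    bitmap: Option<Vec<u8>>, // stays `None` while every slot is valid
}

impl LazyNullBuilder {
    fn new() -> Self {
        Self { len: 0, bitmap: None }
    }

    fn append(&mut self, valid: bool) {
        if !valid && self.bitmap.is_none() {
            // Materialize: all slots appended so far were valid.
            let mut bits = vec![0u8; (self.len + 7) / 8];
            for i in 0..self.len {
                bits[i / 8] |= 1u8 << (i % 8);
            }
            self.bitmap = Some(bits);
        }
        if let Some(bits) = self.bitmap.as_mut() {
            let i = self.len;
            if i / 8 >= bits.len() {
                bits.push(0);
            }
            if valid {
                bits[i / 8] |= 1u8 << (i % 8);
            }
            // A null slot keeps its bit at 0.
        }
        self.len += 1;
    }

    /// Returns the slot count and the bitmap, if one was ever needed.
    fn finish(self) -> (usize, Option<Vec<u8>>) {
        (self.len, self.bitmap)
    }
}

fn main() {
    let mut all_valid = LazyNullBuilder::new();
    for _ in 0..5 {
        all_valid.append(true);
    }
    // No null was appended, so no bitmap was allocated.
    assert_eq!(all_valid.finish(), (5, None));

    let mut with_null = LazyNullBuilder::new();
    for v in [true, true, false, true] {
        with_null.append(v);
    }
    // Slots 0, 1 and 3 are valid; slot 2 is null.
    assert_eq!(with_null.finish(), (4, Some(vec![0b0000_1011u8])));
}
```

With this shape, an array without nulls never pays for a validity bitmap, which is the case the PR is concerned with.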

alamb · 2022-07-13T20:31:36Z

Not sure what happened with the Windows builder, but I have restarted the failed CI check; hopefully it will pass on rerun.

jhorstmann · 2022-07-13T22:11:54Z

> Why don't we use the BooleanBuilder directly?

From a correctness perspective that would probably be better and would also solve the reliance on size_hint (#138). We would probably need to do some optimizations on the builders to get similar performance.
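
To make the `size_hint` concern concrete, here is a small standard-library-only sketch. `collect_bools` is a hypothetical function written for this example; it is neither the existing `FromIterator` implementation nor the builder. It shows that some iterators report no upper bound, and that a growable collection stays correct without one because the hint is used only for capacity.

```rust
/// Hypothetical sketch: collect optional booleans into (values, null_count)
/// by growing as needed, so no upper bound from `size_hint` is required.
fn collect_bools<I: IntoIterator<Item = Option<bool>>>(iter: I) -> (Vec<bool>, usize) {
    let iter = iter.into_iter();
    // The lower bound is only a capacity hint; correctness does not depend on it.
    let mut values = Vec::with_capacity(iter.size_hint().0);
    let mut null_count = 0;
    for item in iter {
        match item {
            Some(v) => values.push(v),
            None => {
                values.push(false); // placeholder value for a null slot
                null_count += 1;
            }
        }
    }
    (values, null_count)
}

fn main() {
    // A filtered range still reports an upper bound, but a lower bound of 0.
    assert_eq!((0..10).filter(|x| x % 2 == 0).size_hint(), (0, Some(10)));

    // `from_fn` reports no upper bound at all.
    let mut n = 0;
    let unbounded = std::iter::from_fn(move || {
        n += 1;
        match n {
            1 => Some(Some(true)),
            2 => Some(None),
            3 => Some(Some(false)),
            _ => None,
        }
    });
    assert_eq!(unbounded.size_hint(), (0, None));

    // Collection still works because nothing relies on the upper bound.
    let (values, nulls) = collect_bools(unbounded);
    assert_eq!(values, vec![true, false, false]);
    assert_eq!(nulls, 1);
}
```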

HaoYang670 · 2022-07-13T23:43:23Z

> Why don't we use the BooleanBuilder directly?

> We would probably need to do some optimizations on the builders to get similar performance.

Do you mean that the Boolean builder is slower than this implementation?

tustvold · 2022-07-15T15:12:11Z

FWIW #2038 by @heyrutvik will overlap with this. This isn't a problem, just an FYI

jhorstmann · 2022-07-15T15:40:31Z

#2038 looks like the more extensive and correct solution. I'll take a look at the performance results for that PR.

heyrutvik · 2022-07-15T16:19:57Z

@jhorstmann FYI, see #2038 (comment) and the subsequent comments for some discussion about the performance of that PR. I tried restoring the original buffer size for some builders but did not see much difference.

tustvold · 2022-07-17T16:48:11Z

arrow/src/array/array_boolean.rs

@@ -242,14 +242,19 @@ impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray {
            }
        });

+        let null_buf: Buffer = null_builder.into();


I wonder if we could push this optimisation into ArrayData::new_unchecked??

jhorstmann · 2022-07-17

That's an interesting idea, and it passes all the existing test cases. It also changes behavior when someone passes an explicit null buffer, but I can't think of a reason why people would rely on that.
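
The idea of moving the check into the constructor can be sketched as follows. `ArrayDataSketch` and its `new_unchecked` are hypothetical stand-ins written for this example; they do not reproduce the real `ArrayData` layout or the signature of `ArrayData::new_unchecked`. The sketch trusts the caller-supplied null count, as an unchecked constructor would, and drops the bitmap when that count is zero.

```rust
/// Hypothetical stand-in for array metadata; not the real `ArrayData`.
#[derive(Debug, PartialEq)]
struct ArrayDataSketch {
    len: usize,
    null_count: usize,
    null_bitmap: Option<Vec<u8>>,
}

impl ArrayDataSketch {
    /// Assumed behavior for illustration: the null count is taken on trust,
    /// and an explicitly passed bitmap is ignored when there are no nulls.
    fn new_unchecked(len: usize, null_count: usize, null_bitmap: Option<Vec<u8>>) -> Self {
        let null_bitmap = null_bitmap.filter(|_| null_count > 0);
        Self { len, null_count, null_bitmap }
    }
}

fn main() {
    // An explicit all-valid bitmap is normalized away.
    let no_nulls = ArrayDataSketch::new_unchecked(8, 0, Some(vec![0xFF]));
    assert_eq!(no_nulls.len, 8);
    assert_eq!(no_nulls.null_count, 0);
    assert_eq!(no_nulls.null_bitmap, None);

    // A bitmap that actually encodes a null is kept.
    let one_null = ArrayDataSketch::new_unchecked(8, 1, Some(vec![0xFE]));
    assert_eq!(one_null.null_bitmap, Some(vec![0xFE]));
}
```

A side effect of normalizing in one place is that two arrays with the same logical content compare equal regardless of how the caller supplied the buffer, which is the behavior change for explicit null buffers noted above.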

ursabot · 2022-07-19T21:31:55Z

Benchmark runs are scheduled for baseline = efd3152 and contender = b2cf02c. b2cf02c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Avoid creating null buffer for BooleanArray if null count is zero

58f3aee

github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2022

viirya approved these changes Jul 12, 2022

Clippy fix

a0cac4d

HaoYang670 mentioned this pull request Jul 13, 2022

Add support of converting FixedSizeBinaryArray to DecimalArray #2041

Merged

HaoYang670 reviewed Jul 13, 2022

HaoYang670 mentioned this pull request Jul 13, 2022

Lazily materialize the null buffer builder of BooleanBuilder #2058

Closed

tustvold reviewed Jul 17, 2022

Check null_count in ArrayData::new_unchecked and ignore null_bit_buffer if there are no null values

f936030

tustvold approved these changes Jul 18, 2022

jhorstmann changed the title ~~Avoid creating null buffer for BooleanArray if null count is zero~~ Ignore null buffer when creating ArrayData if null count is zero Jul 18, 2022

tustvold merged commit b2cf02c into apache:master Jul 19, 2022
