Implement FromIterator for Builders #2038

Conversation

heyrutvik
Contributor

@heyrutvik heyrutvik commented Jul 10, 2022

Which issue does this PR close?

Closes #1841

Rationale for this change

  1. It allows us to create builders from iterators.
  2. It helps us avoid wiring FromIterator for specific array types by reusing the builder implementation instead.

What changes are included in this PR?

It adds FromIterator to the builder types and makes the specific array types use the corresponding builder type.

Note: it doesn't add a new FromIterator implementation but "moves" the existing implementations to the builder types. I haven't added new tests for this and have relied on the existing tests, since the specific array types will call these implementations. Let me know if we want to add tests for this change. (cc @tustvold @alamb)
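
The "move" described above can be pictured with a small, std-only sketch. `ToyBooleanBuilder` and `ToyBooleanArray` are hypothetical stand-ins invented for illustration, not the arrow types or their API; the only point is that the array's `FromIterator` delegates to the builder's.

```rust
use std::borrow::Borrow;
use std::iter::FromIterator;

/// Hypothetical stand-in for a builder (NOT the arrow API).
#[derive(Debug)]
pub struct ToyBooleanBuilder {
    values: Vec<bool>,
    validity: Vec<bool>,
}

/// Hypothetical stand-in for an immutable array (NOT the arrow API).
#[derive(Debug, PartialEq)]
pub struct ToyBooleanArray {
    values: Vec<bool>,
    validity: Vec<bool>,
}

impl ToyBooleanBuilder {
    pub fn with_capacity(capacity: usize) -> Self {
        Self {
            values: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    /// Append a value, or a null when `v` is `None`.
    pub fn append_option(&mut self, v: Option<bool>) {
        self.values.push(v.unwrap_or(false));
        self.validity.push(v.is_some());
    }

    pub fn finish(self) -> ToyBooleanArray {
        ToyBooleanArray { values: self.values, validity: self.validity }
    }
}

impl ToyBooleanArray {
    pub fn len(&self) -> usize {
        self.values.len()
    }

    pub fn null_count(&self) -> usize {
        self.validity.iter().filter(|valid| !**valid).count()
    }
}

// The iteration logic lives on the builder...
impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for ToyBooleanBuilder {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (lower, upper) = iter.size_hint();
        let mut builder = ToyBooleanBuilder::with_capacity(upper.unwrap_or(lower));
        for item in iter {
            let value: &Option<bool> = item.borrow();
            builder.append_option(*value);
        }
        builder
    }
}

// ...and the array simply delegates to it.
impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for ToyBooleanArray {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        iter.into_iter().collect::<ToyBooleanBuilder>().finish()
    }
}
```

With this shape, `vec![Some(true), None].into_iter().collect::<ToyBooleanArray>()` and collecting into the builder share one code path.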

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 10, 2022
@codecov-commenter

codecov-commenter commented Jul 10, 2022

Codecov Report

Merging #2038 (642e065) into master (a7181dd) will decrease coverage by 0.03%.
The diff coverage is 57.04%.

@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
- Coverage   83.68%   83.64%   -0.04%     
==========================================
  Files         224      224              
  Lines       58833    58796      -37     
==========================================
- Hits        49234    49181      -53     
- Misses       9599     9615      +16     
Impacted Files Coverage Δ
arrow/src/array/builder/fixed_size_list_builder.rs 89.00% <0.00%> (ø)
arrow/src/array/builder/generic_list_builder.rs 95.09% <16.66%> (ø)
arrow/src/array/builder/decimal_builder.rs 82.85% <47.82%> (-2.51%) ⬇️
arrow/src/array/builder/primitive_builder.rs 91.16% <52.63%> (-1.42%) ⬇️
arrow/src/array/builder/generic_binary_builder.rs 82.41% <56.52%> (-1.37%) ⬇️
...row/src/array/builder/string_dictionary_builder.rs 88.96% <57.14%> (-1.69%) ⬇️
arrow/src/array/array_dictionary.rs 93.53% <66.66%> (+1.43%) ⬆️
arrow/src/array/builder/boolean_builder.rs 85.34% <70.00%> (-1.92%) ⬇️
arrow/src/array/builder/generic_string_builder.rs 89.32% <71.42%> (-2.82%) ⬇️
arrow/src/array/array_binary.rs 92.88% <100.00%> (-0.31%) ⬇️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -222,38 +222,7 @@ impl<'a> BooleanArray {

impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (_, data_len) = iter.size_hint();
        let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.
Contributor Author

@tustvold I noticed that some implementations panic if the upper size hint is not available, while others use whatever is available. Is this a deliberate decision? AFAIK we use the size hint only as a slight optimization, so it should not be enforced in this way. Thoughts?

Contributor

It should be optional for the non-trusted length variants
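
As a rough illustration of treating the size hint as optional, the sketch below picks a starting capacity from the hint and never panics when the upper bound is missing. The helper names `capacity_from_hint` and `collect_with_hint` are made up for this example and are not part of arrow.

```rust
/// Choose an initial capacity from `Iterator::size_hint` output:
/// the upper bound when present, otherwise the lower bound.
/// (Hypothetical helper, for illustration only.)
pub fn capacity_from_hint(hint: (usize, Option<usize>)) -> usize {
    let (lower, upper) = hint;
    upper.unwrap_or(lower)
}

/// Collect into a `Vec`, using the hint only to pre-size the buffer.
/// The result is correct even if the hint is `(0, None)`.
pub fn collect_with_hint<I: IntoIterator<Item = i32>>(iter: I) -> Vec<i32> {
    let iter = iter.into_iter();
    let mut out = Vec::with_capacity(capacity_from_hint(iter.size_hint()));
    for item in iter {
        // `push` grows the buffer on demand if the hint was too small.
        out.push(item);
    }
    out
}
```

An iterator built with `std::iter::from_fn` is one that reports no upper bound, so it would hit the `expect("Iterator must be sized")` panic in the excerpt above but works fine here.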

@heyrutvik heyrutvik marked this pull request as ready for review July 12, 2022 08:03
@heyrutvik heyrutvik changed the title [WIP] Implement FromIterator for Builders Implement FromIterator for Builders Jul 12, 2022
let size_hint = upper.unwrap_or(lower);
let fixed_len = 16_usize;

let mut builder = FixedSizeListBuilder::with_capacity(
Contributor

Why not use DecimalBuilder directly?

Contributor Author

@heyrutvik heyrutvik Jul 12, 2022

Ah, right. I was too focused on building builders out of their sub-components and totally forgot that the builder has handy methods to append values. It was a fun experience, though.

You will notice this in almost all implementations (except one). I'll update the PR.

Contributor

@tustvold tustvold left a comment

Had a quick review, looking good. I'll try to run the benchmarks later today to double check this hasn't introduced any regressions

let fixed_len = 16_usize;

let mut builder = FixedSizeListBuilder::with_capacity(
UInt8Builder::new(size_hint),
Contributor

I think this should be size_hint * fixed_len, although I do think the UX of this isn't great. I'll file a ticket for this.
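
The arithmetic behind that suggestion can be sketched with a toy flat buffer: if each list holds `value_length` bytes, the child buffer needs `list_capacity * value_length` slots to avoid growing. `ToyFixedSizeListBuilder` is a hypothetical type written for this example and does not mirror arrow's `FixedSizeListBuilder` signature.

```rust
/// Hypothetical toy: fixed-size lists of bytes stored in one flat buffer.
pub struct ToyFixedSizeListBuilder {
    value_length: usize,
    values: Vec<u8>,
    len: usize,
}

impl ToyFixedSizeListBuilder {
    /// `list_capacity` is the expected number of lists, so the child
    /// buffer is sized for `list_capacity * value_length` elements.
    pub fn with_capacity(list_capacity: usize, value_length: usize) -> Self {
        Self {
            value_length,
            values: Vec::with_capacity(list_capacity * value_length),
            len: 0,
        }
    }

    /// Append one list; it must contain exactly `value_length` bytes.
    pub fn append(&mut self, item: &[u8]) {
        assert_eq!(item.len(), self.value_length);
        self.values.extend_from_slice(item);
        self.len += 1;
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn values_capacity(&self) -> usize {
        self.values.capacity()
    }
}
```

Sizing the child for only `list_capacity` elements would leave it a factor of `value_length` (16 in the excerpt above) too small.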

@heyrutvik heyrutvik requested a review from tustvold July 12, 2022 15:30
@tustvold
Contributor

Currently this represents a non-trivial performance regression for strings:

array_from_vec 128      time:   [200.16 ns 200.18 ns 200.20 ns]                               
                        change: [-10.896% -10.821% -10.772%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

array_from_vec 256      time:   [317.37 ns 317.43 ns 317.50 ns]                               
                        change: [+35.360% +35.522% +35.637%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

array_from_vec 512      time:   [407.97 ns 408.07 ns 408.20 ns]                               
                        change: [+2.4341% +2.4621% +2.4892%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

array_string_from_vec 128                                                                             
                        time:   [3.3538 us 3.3546 us 3.3554 us]
                        change: [+65.462% +65.526% +65.588%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

array_string_from_vec 256                                                                             
                        time:   [4.8467 us 4.8477 us 4.8488 us]
                        change: [+57.833% +57.953% +58.074%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [7.5808 us 7.5830 us 7.5855 us]
                        change: [+52.732% +52.786% +52.842%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 128                                                                             
                        time:   [4.6968 us 4.6996 us 4.7025 us]
                        change: [+52.108% +52.200% +52.281%] (p = 0.00 < 0.05)
                        Performance has regressed.

struct_array_from_vec 256                                                                             
                        time:   [6.7720 us 6.7772 us 6.7843 us]
                        change: [+52.493% +52.585% +52.681%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [10.426 us 10.432 us 10.439 us]
                        change: [+45.728% +45.796% +45.867%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [17.381 us 17.391 us 17.404 us]
                        change: [+41.415% +41.488% +41.575%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

I suspect this is down to buffer resizing and should just be a case of setting the correct capacity for the builder
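
To make the buffer-resizing hypothesis concrete, here is a std-only sketch that counts how often a `Vec` changes capacity while being filled. It is not the arrow benchmark and uses no arrow types; `count_reallocations` is a name invented for this example. Note that the later benchmark run in this thread still shows a regression after the capacities were aligned, so resizing may not be the whole story.

```rust
/// Push `n` bytes into `buf` and count how many times its capacity
/// changes, as a rough proxy for the number of reallocations.
/// (Hypothetical helper, for illustration only.)
pub fn count_reallocations(mut buf: Vec<u8>, n: usize) -> usize {
    let mut reallocations = 0;
    let mut capacity = buf.capacity();
    for i in 0..n {
        buf.push(i as u8);
        if buf.capacity() != capacity {
            reallocations += 1;
            capacity = buf.capacity();
        }
    }
    reallocations
}
```

Starting from `Vec::with_capacity(n)` yields zero capacity changes for `n` pushes, while starting from `Vec::new()` forces the buffer to grow at least once.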

let (lower, upper) = iter.size_hint();
let size_hint = upper.unwrap_or(lower);

let mut builder = GenericStringBuilder::new(size_hint);
Contributor

We probably want to use GenericStringBuilder::with_capacity here

let (lower, upper) = iter.size_hint();
let size_hint = upper.unwrap_or(lower);

let mut builder = GenericBinaryBuilder::new(size_hint);
Contributor

@tustvold tustvold Jul 12, 2022

We likely want to use a GenericBinaryBuilder::with_capacity (which will need to be added) here, to avoid this being a performance regression

Contributor Author

@tustvold I have a question about the with_capacity method. It takes two parameters: the capacity in items and the capacity of the data (in bytes). We can't compute the data capacity without iterating (and hence consuming the elements), and cloning the iterator could be costly.

Can we have some default value for the data capacity parameter? And if yes, what should it be?

It applies to both GenericStringBuilder and GenericBinaryBuilder.

Contributor

Looking at the original code, it uses a values capacity of 0 bytes - https://github.com/apache/arrow-rs/pull/2038/files#diff-d90c701e089ed03b4767c4c4ee8b7fe20410d22f98ea95313370a2c259550d3eL227

It should therefore not represent a regression to just do the same. We can then revisit this in the context of #2054

Contributor Author

Right. So should we wait until #2054 is closed, or use 0 for the data capacity for now? If the latter, then keeping the size_hint won't hurt.

Contributor

@tustvold tustvold Jul 14, 2022

We should preserve the pre-existing behaviour of using a values capacity of 0, and an offset capacity from the size hint (upper bound if available, lower bound otherwise). Otherwise this PR will represent a non-trivial performance regression.
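
A minimal sketch of that capacity choice, assuming a toy string builder with separate offsets and values buffers: the offsets are pre-sized from the size hint, while the values buffer starts at 0 bytes because the total byte length is unknown without consuming the iterator. `ToyStringBuilder` and its methods are hypothetical and only loosely modelled on the two-argument `with_capacity` discussed above.

```rust
use std::iter::FromIterator;

/// Hypothetical toy string builder (NOT the arrow API).
pub struct ToyStringBuilder {
    offsets: Vec<i32>,
    values: Vec<u8>,
    validity: Vec<bool>,
}

impl ToyStringBuilder {
    /// `item_capacity`: expected number of strings.
    /// `data_capacity`: expected total number of bytes.
    pub fn with_capacity(item_capacity: usize, data_capacity: usize) -> Self {
        let mut offsets = Vec::with_capacity(item_capacity + 1);
        offsets.push(0);
        Self {
            offsets,
            values: Vec::with_capacity(data_capacity),
            validity: Vec::with_capacity(item_capacity),
        }
    }

    pub fn append_option(&mut self, v: Option<&str>) {
        if let Some(s) = v {
            self.values.extend_from_slice(s.as_bytes());
        }
        self.validity.push(v.is_some());
        self.offsets.push(self.values.len() as i32);
    }

    pub fn len(&self) -> usize {
        self.validity.len()
    }

    pub fn value(&self, i: usize) -> Option<&str> {
        if !self.validity[i] {
            return None;
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).ok()
    }
}

impl<'a> FromIterator<Option<&'a str>> for ToyStringBuilder {
    fn from_iter<I: IntoIterator<Item = Option<&'a str>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (lower, upper) = iter.size_hint();
        // Offsets sized from the hint (upper bound if available, lower
        // bound otherwise); values capacity left at 0 bytes.
        let mut builder = ToyStringBuilder::with_capacity(upper.unwrap_or(lower), 0);
        for item in iter {
            builder.append_option(item);
        }
        builder
    }
}
```

Here the offsets buffer is sized once from the hint, while the values buffer grows on demand as strings are appended.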

@heyrutvik
Contributor Author

@tustvold I don't trust the benchmark results on my machine, since I get 'Performance has regressed' even when benchmarking the same code I used to generate the baseline. 😄 Let me know whether this has improved things. I also double-checked that the builders are created with the same initial values as in the original code.

@heyrutvik heyrutvik requested a review from tustvold July 15, 2022 18:29
@tustvold
Contributor

I'll re-run the benchmarks later this afternoon or failing that tomorrow morning

@tustvold
Contributor

Unfortunately this still represents a non-trivial performance regression for strings...

array_string_from_vec 128                                                                             
                        time:   [3.0274 us 3.0277 us 3.0279 us]
                        change: [+49.381% +49.449% +49.526%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

array_string_from_vec 256                                                                             
                        time:   [4.2068 us 4.2092 us 4.2123 us]
                        change: [+37.016% +37.296% +37.749%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [6.6659 us 6.6680 us 6.6700 us]
                        change: [+34.300% +34.358% +34.416%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 128                                                                             
                        time:   [4.2799 us 4.2823 us 4.2844 us]
                        change: [+38.535% +38.605% +38.688%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

struct_array_from_vec 256                                                                             
                        time:   [6.2571 us 6.2585 us 6.2599 us]
                        change: [+40.854% +40.924% +40.986%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [9.6797 us 9.6819 us 9.6845 us]
                        change: [+35.304% +35.364% +35.419%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [16.488 us 16.493 us 16.500 us]
                        change: [+33.968% +34.039% +34.116%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

I wonder if you might like to try using something like hotspot or cargo-flamegraph to see where the additional slowdown is coming from? I can try to take a look, but I'm a little bit swamped at the moment so not sure when I'll have time to investigate this

@tustvold
Contributor

Perhaps we could split this up and get the change for the primitive readers in or something?

@heyrutvik
Contributor Author

> Perhaps we could split this up and get the change for the primitive readers in or something?

Hey @tustvold, apologies for the late reply. I didn't get much time to work on this.

Good suggestion. I'll split it up and benchmark each part against the baseline. Thanks.

@tustvold
Contributor

I've marked this PR as a draft to make it easier to see that it isn't waiting for review. Feel free to unmark once ready.

Btw #2181 might have helped with the string regression

@tustvold
Contributor

This PR has been inactive for a while, so I'm closing it to clear the backlog. Please feel free to reopen it if you come back to this.

@tustvold tustvold closed this Jan 15, 2023
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Extend for Builder
3 participants