Implement FromIterator for Builders #2038

heyrutvik · 2022-07-10T19:19:46Z

Which issue does this PR close?

Rationale for this change

It allows us to create builders from iterators.
It lets us avoid wiring FromIterator for each specific array type by reusing the builder implementation instead.

What changes are included in this PR?

It adds FromIterator to the builder types and makes the specific array types use the corresponding builder type.

Note: it doesn't add a new FromIterator implementation but "moves" the existing implementations to the builder types. I haven't added new tests for this and relied on the existing tests, since the specific array types will call these implementations. Let me know if we want to add tests for this change. (cc @tustvold @alamb)
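As a rough illustration of the "move" described above, the following self-contained sketch shows an array type whose FromIterator only delegates to its builder's FromIterator. BoolBuilderSketch and BoolArraySketch are hypothetical stand-ins written for this example; they are not the arrow crate's BooleanBuilder or BooleanArray, and the PR's actual code may differ.

```rust
use std::borrow::Borrow;

/// Hypothetical stand-in for a builder; not the arrow crate's type.
#[derive(Debug, Default)]
struct BoolBuilderSketch {
    values: Vec<bool>,
    validity: Vec<bool>,
}

/// Hypothetical stand-in for the finished array.
#[derive(Debug, PartialEq)]
struct BoolArraySketch {
    values: Vec<bool>,
    validity: Vec<bool>,
}

impl BoolBuilderSketch {
    /// Appends one optional value; nulls store `false` as a placeholder.
    fn append_option(&mut self, v: Option<bool>) {
        self.validity.push(v.is_some());
        self.values.push(v.unwrap_or(false));
    }

    /// Consumes the builder and produces the array.
    fn finish(self) -> BoolArraySketch {
        BoolArraySketch { values: self.values, validity: self.validity }
    }
}

// The iteration logic lives on the builder...
impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BoolBuilderSketch {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        let mut builder = BoolBuilderSketch::default();
        for item in iter {
            // Explicit trait call so the borrowed type is unambiguous.
            let value: Option<bool> = *Borrow::<Option<bool>>::borrow(&item);
            builder.append_option(value);
        }
        builder
    }
}

// ...and the array type only delegates to it.
impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BoolArraySketch {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
        let builder: BoolBuilderSketch = iter.into_iter().collect();
        builder.finish()
    }
}
```

With this shape, `vec![Some(true), None].into_iter().collect::<BoolArraySketch>()` and the equivalent `collect::<BoolBuilderSketch>()` share one code path, which is why existing array tests would also exercise the builder implementation.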

Are there any user-facing changes?

N/A

codecov-commenter · 2022-07-10T19:41:59Z

Codecov Report

Merging #2038 (642e065) into master (a7181dd) will decrease coverage by 0.03%.
The diff coverage is 57.04%.

@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
- Coverage   83.68%   83.64%   -0.04%     
==========================================
  Files         224      224              
  Lines       58833    58796      -37     
==========================================
- Hits        49234    49181      -53     
- Misses       9599     9615      +16

Impacted Files	Coverage Δ
arrow/src/array/builder/fixed_size_list_builder.rs	`89.00% <0.00%> (ø)`
arrow/src/array/builder/generic_list_builder.rs	`95.09% <16.66%> (ø)`
arrow/src/array/builder/decimal_builder.rs	`82.85% <47.82%> (-2.51%)`	⬇️
arrow/src/array/builder/primitive_builder.rs	`91.16% <52.63%> (-1.42%)`	⬇️
arrow/src/array/builder/generic_binary_builder.rs	`82.41% <56.52%> (-1.37%)`	⬇️
...row/src/array/builder/string_dictionary_builder.rs	`88.96% <57.14%> (-1.69%)`	⬇️
arrow/src/array/array_dictionary.rs	`93.53% <66.66%> (+1.43%)`	⬆️
arrow/src/array/builder/boolean_builder.rs	`85.34% <70.00%> (-1.92%)`	⬇️
arrow/src/array/builder/generic_string_builder.rs	`89.32% <71.42%> (-2.82%)`	⬇️
arrow/src/array/array_binary.rs	`92.88% <100.00%> (-0.31%)`	⬇️
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

heyrutvik · 2022-07-11T09:23:23Z

arrow/src/array/array_boolean.rs

@@ -222,38 +222,7 @@ impl<'a> BooleanArray {

 impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray {
    fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
-        let iter = iter.into_iter();
-        let (_, data_len) = iter.size_hint();
-        let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.


@tustvold I noticed that some implementations panic if the upper size hint is not available, while others use whatever is available. Is this a deliberate decision? AFAIK we use the size hint only as a slight optimization, so it should not be enforced this way. Thoughts?

tustvold · 2022-07-12

It should be optional for the non-trusted length variants.
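A minimal sketch of the non-panicking variant discussed here, assuming plain `Vec` buffers rather than arrow's own buffer types. The function names are hypothetical; the `upper.unwrap_or(lower)` fallback mirrors the snippets quoted later in this thread.

```rust
/// Chooses an initial capacity from an iterator's size hint.
/// Uses the upper bound when present and falls back to the lower bound,
/// so an iterator without an upper bound does not cause a panic.
fn capacity_from_hint<I: Iterator>(iter: &I) -> usize {
    let (lower, upper) = iter.size_hint();
    upper.unwrap_or(lower)
}

/// Collects optional booleans into (values, validity) vectors.
/// The hint is used only to preallocate; correctness does not depend on it.
fn collect_optional_bools<I>(iter: I) -> (Vec<bool>, Vec<bool>)
where
    I: IntoIterator<Item = Option<bool>>,
{
    let iter = iter.into_iter();
    let capacity = capacity_from_hint(&iter);
    let mut values = Vec::with_capacity(capacity);
    let mut validity = Vec::with_capacity(capacity);
    for item in iter {
        validity.push(item.is_some());
        values.push(item.unwrap_or(false));
    }
    (values, validity)
}
```

One caveat of this approach: an upper bound can overestimate the real length (for example after `filter`), so the preallocation may be larger than needed.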

tustvold · 2022-07-12T13:26:54Z

arrow/src/array/builder/decimal_builder.rs

+        let size_hint = upper.unwrap_or(lower);
+        let fixed_len = 16_usize;
+
+        let mut builder = FixedSizeListBuilder::with_capacity(


Why not use DecimalBuilder directly?

heyrutvik · 2022-07-12

Ah, right. I was too focused on building builders out of their subcomponents and totally forgot that the builder has handy methods to append values. It was a fun experience, though.

You will notice this in almost all of the implementations (except one). I'll update the PR.

tustvold

Had a quick review, looking good. I'll try to run the benchmarks later today to double-check this hasn't introduced any regressions.

tustvold · 2022-07-12T13:43:17Z

arrow/src/array/builder/decimal_builder.rs

+        let fixed_len = 16_usize;
+
+        let mut builder = FixedSizeListBuilder::with_capacity(
+            UInt8Builder::new(size_hint),


I think this should be size_hint * fixed_len, although I do think the UX of this isn't great. I'll file a ticket for this.
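To make the arithmetic concrete: when each item occupies a fixed number of bytes, the byte buffer has to be sized for items × width, not just items. The sketch below uses a hypothetical FixedWidthBytesSketch type backed by a `Vec<u8>`; it is not the arrow crate's FixedSizeListBuilder or UInt8Builder.

```rust
/// Hypothetical stand-in for a builder of fixed-width byte values.
struct FixedWidthBytesSketch {
    width: usize,
    values: Vec<u8>,
}

impl FixedWidthBytesSketch {
    /// Reserves room for `item_capacity` items of `width` bytes each.
    fn with_capacity(item_capacity: usize, width: usize) -> Self {
        assert!(width > 0, "width must be non-zero");
        Self {
            width,
            // The child byte buffer needs item_capacity * width bytes.
            values: Vec::with_capacity(item_capacity * width),
        }
    }

    /// Appends one value, which must be exactly `width` bytes long.
    fn append_value(&mut self, value: &[u8]) {
        assert_eq!(value.len(), self.width, "value has the wrong width");
        self.values.extend_from_slice(value);
    }

    /// Number of items appended so far.
    fn len(&self) -> usize {
        self.values.len() / self.width
    }
}
```

With a width of 16 and a size hint of 4, this reserves at least 64 bytes up front, so appending four values does not need to grow the buffer.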

tustvold · 2022-07-12T16:09:23Z

Currently this represents a non-trivial performance regression for strings:

array_from_vec 128      time:   [200.16 ns 200.18 ns 200.20 ns]                               
                        change: [-10.896% -10.821% -10.772%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

array_from_vec 256      time:   [317.37 ns 317.43 ns 317.50 ns]                               
                        change: [+35.360% +35.522% +35.637%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

array_from_vec 512      time:   [407.97 ns 408.07 ns 408.20 ns]                               
                        change: [+2.4341% +2.4621% +2.4892%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

array_string_from_vec 128                                                                             
                        time:   [3.3538 us 3.3546 us 3.3554 us]
                        change: [+65.462% +65.526% +65.588%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

array_string_from_vec 256                                                                             
                        time:   [4.8467 us 4.8477 us 4.8488 us]
                        change: [+57.833% +57.953% +58.074%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [7.5808 us 7.5830 us 7.5855 us]
                        change: [+52.732% +52.786% +52.842%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 128                                                                             
                        time:   [4.6968 us 4.6996 us 4.7025 us]
                        change: [+52.108% +52.200% +52.281%] (p = 0.00 < 0.05)
                        Performance has regressed.

struct_array_from_vec 256                                                                             
                        time:   [6.7720 us 6.7772 us 6.7843 us]
                        change: [+52.493% +52.585% +52.681%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [10.426 us 10.432 us 10.439 us]
                        change: [+45.728% +45.796% +45.867%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [17.381 us 17.391 us 17.404 us]
                        change: [+41.415% +41.488% +41.575%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

I suspect this is down to buffer resizing and should just be a case of setting the correct capacity for the builder
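The buffer-resizing hypothesis can be illustrated with a plain `Vec<u8>`. This is only a sketch of the general effect, assuming a growable buffer that reallocates as it fills; it is not a benchmark and does not reproduce the numbers above. The helper name is hypothetical.

```rust
/// Counts how often a byte buffer changes capacity while `n` bytes are
/// pushed into it, starting from `initial_capacity`.
fn count_reallocations(initial_capacity: usize, n: usize) -> usize {
    let mut buf: Vec<u8> = Vec::with_capacity(initial_capacity);
    let mut last_capacity = buf.capacity();
    let mut reallocations = 0;
    for i in 0..n {
        buf.push(i as u8);
        if buf.capacity() != last_capacity {
            // The buffer had to grow, which implies a reallocation.
            reallocations += 1;
            last_capacity = buf.capacity();
        }
    }
    reallocations
}
```

Calling `count_reallocations(1024, 1024)` reports no growth, while `count_reallocations(0, 1024)` reports several growth steps, each of which may copy the existing contents.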

tustvold · 2022-07-12T16:10:42Z

arrow/src/array/builder/generic_string_builder.rs

+        let (lower, upper) = iter.size_hint();
+        let size_hint = upper.unwrap_or(lower);
+
+        let mut builder = GenericStringBuilder::new(size_hint);


We probably want to use GenericStringBuilder::with_capacity here

tustvold · 2022-07-12T16:11:55Z

arrow/src/array/builder/generic_binary_builder.rs

+        let (lower, upper) = iter.size_hint();
+        let size_hint = upper.unwrap_or(lower);
+
+        let mut builder = GenericBinaryBuilder::new(size_hint);


We likely want to use a GenericBinaryBuilder::with_capacity (which will need to be added) here, to avoid this being a performance regression

heyrutvik · 2022-07-14

@tustvold I have a question about the with_capacity method. It takes two parameters: the item capacity and the data capacity (in bytes). We can't compute the data capacity without iterating (and hence consuming the elements), and cloning the iterator could be costly.

Can we have a default value for the data capacity parameter? If so, what should it be?

This applies to both GenericStringBuilder and GenericBinaryBuilder.

tustvold · 2022-07-14

Looking at the original code, it uses a values capacity of 0 bytes - https://github.com/apache/arrow-rs/pull/2038/files#diff-d90c701e089ed03b4767c4c4ee8b7fe20410d22f98ea95313370a2c259550d3eL227

It should therefore not represent a regression to just do the same. We can then revisit this in the context of #2054

heyrutvik · 2022-07-14

Right. So should we wait until #2054 is closed, or use 0 for the data capacity for now? If the latter, then keeping the size_hint won't hurt.

tustvold · 2022-07-14

We should preserve the pre-existing behaviour of using a values capacity of 0, and an offset capacity from the size hint (upper bound if available, lower bound otherwise). Otherwise this PR will represent a non-trivial performance regression.
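The behaviour described in this exchange can be sketched as follows, assuming a simplified offsets-plus-bytes layout backed by `Vec`. StringBuilderSketch and its method names are hypothetical stand-ins and do not reflect the actual signatures of GenericStringBuilder in the arrow crate.

```rust
/// Hypothetical stand-in for a variable-length string builder.
struct StringBuilderSketch {
    offsets: Vec<i32>,   // one more entry than the number of items
    values: Vec<u8>,     // concatenated UTF-8 bytes
    validity: Vec<bool>, // false marks a null item
}

impl StringBuilderSketch {
    /// `item_capacity` sizes the offsets; `data_capacity` sizes the bytes.
    fn with_capacity(item_capacity: usize, data_capacity: usize) -> Self {
        let mut offsets = Vec::with_capacity(item_capacity + 1);
        offsets.push(0);
        Self {
            offsets,
            values: Vec::with_capacity(data_capacity),
            validity: Vec::with_capacity(item_capacity),
        }
    }

    fn append_option(&mut self, value: Option<&str>) {
        if let Some(s) = value {
            self.values.extend_from_slice(s.as_bytes());
        }
        self.validity.push(value.is_some());
        let end = i32::try_from(self.values.len()).expect("offset overflows i32");
        self.offsets.push(end);
    }
}

impl<'a> FromIterator<Option<&'a str>> for StringBuilderSketch {
    fn from_iter<I: IntoIterator<Item = Option<&'a str>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (lower, upper) = iter.size_hint();
        // Offsets are sized from the hint; the byte length is unknown
        // without consuming the iterator, so the values capacity is 0.
        let mut builder = Self::with_capacity(upper.unwrap_or(lower), 0);
        for value in iter {
            builder.append_option(value);
        }
        builder
    }
}
```

The design choice here is to spend the size hint only where it is reliable (the number of items) and let the byte buffer grow on demand.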

heyrutvik · 2022-07-15T18:29:00Z

@tustvold I don't trust the regression test on my machine, since I'm getting 'Performance has regressed' even with the same code I used to generate the baseline. 😄 Let me know if this has made any improvement. I also double-checked that we use the same initial values as the original code when creating the builders.

tustvold · 2022-07-15T19:00:11Z

I'll re-run the benchmarks later this afternoon or failing that tomorrow morning

tustvold · 2022-07-15T22:40:20Z

Unfortunately this still represents a non-trivial performance regression for strings...

array_string_from_vec 128                                                                             
                        time:   [3.0274 us 3.0277 us 3.0279 us]
                        change: [+49.381% +49.449% +49.526%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

array_string_from_vec 256                                                                             
                        time:   [4.2068 us 4.2092 us 4.2123 us]
                        change: [+37.016% +37.296% +37.749%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [6.6659 us 6.6680 us 6.6700 us]
                        change: [+34.300% +34.358% +34.416%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 128                                                                             
                        time:   [4.2799 us 4.2823 us 4.2844 us]
                        change: [+38.535% +38.605% +38.688%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

struct_array_from_vec 256                                                                             
                        time:   [6.2571 us 6.2585 us 6.2599 us]
                        change: [+40.854% +40.924% +40.986%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [9.6797 us 9.6819 us 9.6845 us]
                        change: [+35.304% +35.364% +35.419%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [16.488 us 16.493 us 16.500 us]
                        change: [+33.968% +34.039% +34.116%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

I wonder if you might like to try using something like hotspot or cargo-flamegraph to see where the additional slowdown is coming from? I can try to take a look, but I'm a little bit swamped at the moment so not sure when I'll have time to investigate this

tustvold · 2022-07-21T22:05:25Z

Perhaps we could split this up and get the change for the primitive readers in or something?

heyrutvik · 2022-07-24T18:23:47Z

> Perhaps we could split this up and get the change for the primitive readers in or something?

Hey @tustvold, apologies for the late reply. I didn't get much time to work on this.

Good suggestion. I'll split it up and benchmark it against the baseline. Thanks.

tustvold · 2022-07-29T21:33:33Z

I've marked this PR as a draft to make it easier to see that it isn't waiting for review. Feel free to unmark once ready.

Btw #2181 might have helped with the string regression

tustvold · 2023-01-15T23:16:38Z

This PR has been inactive for a while, so closing to clear the backlog. Please feel free to reopen if you come back to this.

moves boolean array fromiter to its builder

daf1209

github-actions bot added the arrow Changes to the arrow crate label Jul 10, 2022

heyrutvik mentioned this pull request Jul 10, 2022

Implement Extend for Builder #1841

Closed

heyrutvik added 4 commits July 11, 2022 10:46

Merge branch 'master' into 1841-implement-fromiterator-for-builders

66a0b92

improves existing implementation

01d11cd

improves existing implementation

fa67101

impls fromiterator for primitive builder

8737ec6

heyrutvik commented Jul 11, 2022

heyrutvik added 4 commits July 11, 2022 17:02

impls fromiterator for decimal builder

1d35164

impls fromiterator for binary array builder

b810c3c

impls fromiterator for string array builder

6eadc59

impls fromiterator for string dict builder

eddcf11

heyrutvik marked this pull request as ready for review July 12, 2022 08:03

heyrutvik changed the title ~~[WIP] Implement FromIterator for Builders~~ Implement FromIterator for Builders Jul 12, 2022

tustvold reviewed Jul 12, 2022

heyrutvik added 2 commits July 12, 2022 19:38

updates fromiterator implemetation to use builder value and their methods to append values

8c7beb2

runs fmt

dfd9b7c

heyrutvik requested a review from tustvold July 12, 2022 15:30

tustvold reviewed Jul 12, 2022

This was referenced Jul 12, 2022

Inconsistent Builder Constructors #2054

Closed

Ignore null buffer when creating ArrayData if null count is zero #2056

Merged

mimics original data capacity value

a9d3337

heyrutvik requested a review from tustvold July 15, 2022 18:29

resolves conflict

642e065

tustvold mentioned this pull request Jul 28, 2022

Huge amount of llvm code generated by comparison kernels, potentially slowing compile times #1858

Closed

tustvold marked this pull request as draft July 29, 2022 21:32

tustvold closed this Jan 15, 2023
