New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement FromIterator for Builders #2038
Implement FromIterator for Builders #2038
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2038 +/- ##
==========================================
- Coverage 83.68% 83.64% -0.04%
==========================================
Files 224 224
Lines 58833 58796 -37
==========================================
- Hits 49234 49181 -53
- Misses 9599 9615 +16
Continue to review full report at Codecov.
|
@@ -222,38 +222,7 @@ impl<'a> BooleanArray { | |||
|
|||
impl<Ptr: Borrow<Option<bool>>> FromIterator<Ptr> for BooleanArray { | |||
fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self { | |||
let iter = iter.into_iter(); | |||
let (_, data_len) = iter.size_hint(); | |||
let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold I noticed some implementations panic if the upper size hint is not available and others pick whatever is available. Is this a deliberate decision? AFAIK we use size hint for slight optimization and it should not be enforced such a way. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be optional for the non-trusted length variants
let size_hint = upper.unwrap_or(lower); | ||
let fixed_len = 16_usize; | ||
|
||
let mut builder = FixedSizeListBuilder::with_capacity( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use DecimalBuilder directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. I was too focused on building builders out of its sub components. I totally forgot that builder contains handy methods to append values. It was fun experience though.
You will notice this in almost all implementations (except one). I'll update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a quick review, looking good. I'll try to run the benchmarks later today to double check this hasn't introduced any regressions
let fixed_len = 16_usize; | ||
|
||
let mut builder = FixedSizeListBuilder::with_capacity( | ||
UInt8Builder::new(size_hint), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be size_hint * fixed_len although I do think the UX of this isn't great. I'll file a ticket for this
…hods to append values
Currently this represents a non-trivial performance regression for strings,
I suspect this is down to buffer resizing and should just be a case of setting the correct capacity for the builder |
let (lower, upper) = iter.size_hint(); | ||
let size_hint = upper.unwrap_or(lower); | ||
|
||
let mut builder = GenericStringBuilder::new(size_hint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to use GenericStringBuilder::with_capacity here
let (lower, upper) = iter.size_hint(); | ||
let size_hint = upper.unwrap_or(lower); | ||
|
||
let mut builder = GenericBinaryBuilder::new(size_hint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We likely want to use a GenericBinaryBuilder::with_capacity
(which will need to be added) here, to avoid this being a performance regression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold I have a doubt about the with_capacity
method. It takes two parameters: capacity of items and capacity of data (bytes). We can't compute data capacity without iterating (hence consuming elements) and cloning it could be costly.
Can we have some default value for the data capacity parameter? And if yes, what should it be?
It applies to both GenericStringBuilder
and GenericBinaryBuilder
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the original code, it uses a values capacity of 0 bytes - https://github.com/apache/arrow-rs/pull/2038/files#diff-d90c701e089ed03b4767c4c4ee8b7fe20410d22f98ea95313370a2c259550d3eL227
It should therefore not represent a regression to just do the same. We can then revisit this in the context of #2054
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. So should we wait till we close #2054 or use 0
for data capacity for now? If later is the answer, then keeping the size_hint
won't hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should preserve the pre-existing behaviour of using a values capacity of 0, and an offset capacity from the size hint (upper bound if available, lower bound otherwise). Otherwise this PR will represent a non-trivial performance regression.
@tustvold I don't trust regression test on my machine since I'm getting 'Performance has regressed' even with the same code which I used to generate a baseline. 😄 Let me know if it has made any improvements. I also double checked that we use the same initial values as the original code while creating builders. |
I'll re-run the benchmarks later this afternoon or failing that tomorrow morning |
Unfortunately this still represents a non-trivial performance regression for strings...
I wonder if you might like to try using something like hotspot or cargo-flamegraph to see where the additional slowdown is coming from? I can try to take a look, but I'm a little bit swamped at the moment so not sure when I'll have time to investigate this |
Perhaps we could split this up and get the change for the primitive readers in or something? |
Hey @tustvold, apologies for the late reply. I didn't get much time to work on this. Good suggestion. I'll split it up and test bench against the baseline. Thanks. |
I've marked this PR as a draft to make it easier to see that it isn't waiting for review. Feel free to unmark once ready. Btw #2181 might have helped with the string regression |
This PR has been inactive for a while so closing to clear the backlog, please feel free to reopen if you come back to this |
Which issue does this PR close?
Closes #1841
Rationale for this change
FromIterator
for specific array types and utilize builder implementation for the sameWhat changes are included in this PR?
It adds
FromIterator
to builder types and make specific array types to use corresponding builder type.Note: it doesn't add new
FromIterator
implementation but "moves" existing implementation to builder types. I haven't added new tests for this and relied upon existing tests since specific types will call this implementations. Let me know if we want to add tests this change. (cc @tustvold @alamb)Are there any user-facing changes?
N/A