Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Builder Constructors #2054

Closed
tustvold opened this issue Jul 12, 2022 · 12 comments · Fixed by #2592
Closed

Inconsistent Builder Constructors #2054

tustvold opened this issue Jul 12, 2022 · 12 comments · Fixed by #2592
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@tustvold
Copy link
Contributor

tustvold commented Jul 12, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The various builder constructors take capacities to pre-allocate. However, they aren't consistent about whether they take a capacity in terms of elements or bytes, or what these are capacities for.

  • PrimitiveBuilder::new - capacity in elements of T
  • BooleanBuilder::new - capacity in elements of bool
  • DecimalBuilder::new - capacity in terms of bytes
  • FixedSizeBinaryBuilder::new - capacity in terms of bytes
  • FixedSizeListBuilder::with_capacity - capacity in terms of list elements
  • GenericStringBuilder::new - capacity in terms of bytes of string data
  • GenericStringBuilder::with_capacity - capacity in terms of number of items and bytes of string data
  • MapBuilder::with_capacity - capacity in terms of number of items
  • UnionBuilder::with_capacity - capacity in terms of number of items

Describe the solution you'd like

I would like to propose the following:

  • Remove capacity from the new constructors, instead using a static default capacity (e.g. 1024)
  • Make with_capacity take capacities in terms of elements, potentially with an additional number of bytes for variable length arrays

This has a couple of advantages:

The only major disadvantage being that it results in API churn.

Describe alternatives you've considered

We could not do this, but the current situation leads to hard to spot performance bugs.

Additional context

Noticed whilst reviewing #2038

Thoughts @alamb @viirya @jhorstmann

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Jul 12, 2022
@jhorstmann
Copy link
Contributor

Sounds good, especially forcing users to specify both item and nested capacities for List/String/Binary arrays. My gut feeling also says that 1024 could be a good default, I'm assuming that arrays will usually be much larger than general purpose collection types.

@alamb
Copy link
Contributor

alamb commented Jul 13, 2022

I agree the proposal sounds good as well

@HaoYang670
Copy link
Contributor

HaoYang670 commented Jul 20, 2022

While working on #2104, I find it might be good to let users decide both the value_capacity and the number of elements:

pub struct GenericBinaryBuilder<OffsetSize: OffsetSizeTrait> {
    value_builder: UInt8BufferBuilder,
    offsets_builder: BufferBuilder<OffsetSize>,
    bitmap_builder: BooleanBufferBuilder,
}

impl<OffsetSize: OffsetSizeTrait> GenericBinaryBuilder<OffsetSize> {
    /// Creates a new `GenericBinaryBuilder` with at least 
    /// `num_elements` binary slots in the array and
    /// `value_capacity` bytes in the values buffer.
    pub fn new(num_elements: Option<usize>, value_capacity: Option<usize>) -> Self {
        Self { 
            value_builder: UInt8BufferBuilder::new(value_capacity.unwrap_or(1024)), 
            offsets_builder: BufferBuilder::<OffsetSize>::new(num_elements.unwrap_or(1023) + 1), 
            bitmap_builder: BooleanBufferBuilder::new(num_elements.unwrap_or(1024)) 
        }
    }

@tustvold
Copy link
Contributor Author

Thanks, I've updated the ticket to be hopefully slightly clearer. We should not remove the ability to pass a value_capacity to with_capacity, as we currently support for StringArray 👍

@psvri
Copy link
Contributor

psvri commented Aug 16, 2022

Hello,

Is this still a valid issue ? . If so I would like to pick this up.

Given that this can lead to big code changes, may I request the team to split this issue into multiple smaller issues or tasks so that its easier to implement and review.

@tustvold
Copy link
Contributor Author

Hi, this is definitely still an issue. I would recommend working on each builder separately in turn, perhaps starting with the more esoteric builders such as unions, lists, and then working through to strings and eventually primitives. The latter builders are likely to cause significantly more churn and so leaving them to last will help I think?

@psvri
Copy link
Contributor

psvri commented Aug 19, 2022

So based on this issue the following PR's are created. Some I have kept in draft, but the rest are ready for review. The buffer builder refactoring I will take it last once the rest are approved.

On a side note , I have replace all new calls with with_capacity in the API's so that the logic would remain the same.

Only in test module I have used both API's to test them.

@andygrove andygrove added the api-change Changes to the arrow API label Aug 22, 2022
@psvri
Copy link
Contributor

psvri commented Aug 23, 2022

Hello @tustvold ,

Just confirming since it was not mentioned in the issue description. Should I also refactor buffer builder constructors. I think thats the only one left ?

@tustvold
Copy link
Contributor Author

I think we can leave the BufferBuilder constructors for now, I think all that remains now are the builders for BinaryArray and StringArray

@psvri
Copy link
Contributor

psvri commented Aug 25, 2022

Those 2 builders already have new and with capacity, But since the sizes of strings are not constant the API was taking the number of bytes to pre allocate the buffer as a parameter. Whats the best way to handle this.

Shall we remove the buffer_capacity parameters from the constructors and allocate a fixed buffer of size 1024 ?

@tustvold
Copy link
Contributor Author

I think it is just a case of removing the capacity parameter from new and leaving with_capacity alone

@psvri
Copy link
Contributor

psvri commented Aug 25, 2022

Okay

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 8, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 8, 2022
tustvold added a commit that referenced this issue Sep 8, 2022
* Simplify DictionaryBuilder constructors (#2684) (#2054)

* Apply suggestions from code review

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants