Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make sure that only concat preallocates buffers #382

Merged
merged 7 commits into from
Jun 8, 2021

Conversation

ritchie46
Copy link
Contributor

Which issue does this PR close?

Partial fix for #347. In #348 @jorgecarleitao pointed out that the memory savings don't work for the filter and the zip kernel.
This PR restores the implementation for those two kernels, and keeps the new preallocation for the concat kernel.

This is achieved by create a builder pattern for MutableArrayData. This way we don't break the API, and we may choose to preallocate buffers.

Could you take a look at this @jorgecarleitao ?

Are there any user-facing changes?

There is an builder pattern struct for MutableArrayData.

If there are any breaking changes to public APIs, please add the breaking change label.

No

@codecov-commenter
Copy link

codecov-commenter commented May 30, 2021

Codecov Report

Merging #382 (531a5f3) into master (4ff2e56) will decrease coverage by 0.04%.
The diff coverage is 59.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #382      +/-   ##
==========================================
- Coverage   82.61%   82.57%   -0.05%     
==========================================
  Files         162      162              
  Lines       44228    44280      +52     
==========================================
+ Hits        36538    36563      +25     
- Misses       7690     7717      +27     
Impacted Files Coverage Δ
arrow/src/array/transform/mod.rs 85.97% <49.15%> (-3.22%) ⬇️
arrow/src/compute/kernels/concat.rs 100.00% <100.00%> (ø)
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ff2e56...531a5f3. Read the comment docs.

@jorgecarleitao
Copy link
Member

Thanks @ritchie46 . My small concerns with this PR:

  1. We are introducing yet another builder API for something relatively easy to accomplish with an extra method
  2. preallocate_buffers assumes that "preallocate" means allocate the maximum possible capacity, which IMO is not really pre-allocating, is just being very conservative and allocating the maximum.
  3. The API is rather limited to the concatenate case.

I left some ideas as a comment on the issue

@ritchie46
Copy link
Contributor Author

@jorgecarleitao I implemented your proposal from #347. As we now need to define an enum in the with_capacities constructor, I want to be extra certain that I added all possible variants, as adding one later is backwards incompatable. Could you check if I missed something.

For now, only the (large)-utf8 ones uses this, but if the design is ok, I can add more in later PRs.

@alamb
Copy link
Contributor

alamb commented Jun 8, 2021

@jorgecarleitao what is the status of this one?

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go: this is a solid improvement to me.

@alamb alamb merged commit 0cbf85a into apache:master Jun 8, 2021
alamb pushed a commit that referenced this pull request Jun 8, 2021
* MutableArrayData::with_capacities

* better pattern matching

* add binary capacities

* add list child data

* add struct capacities

* add panic for dictionary type

* change dictionary capacity enum variant
@alamb alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Jun 8, 2021
@alamb
Copy link
Contributor

alamb commented Jun 8, 2021

Included in #411 cherry pick

alamb added a commit that referenced this pull request Jun 9, 2021
…se (#411)

* Reduce memory usage of concat (large)utf8 (#348)

* reduce memory needed for concat

* reuse code for str allocation buffer

* make sure that only concat preallocates buffers (#382)

* MutableArrayData::with_capacities

* better pattern matching

* add binary capacities

* add list child data

* add struct capacities

* add panic for dictionary type

* change dictionary capacity enum variant

Co-authored-by: Ritchie Vink <ritchie46@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked PR that was backported to active release (will be included in maintenance release)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants