Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IntoIter trait for write_batch/write_mini_batch #43

Closed
alamb opened this issue Apr 26, 2021 · 2 comments
Closed

Use IntoIter trait for write_batch/write_mini_batch #43

alamb opened this issue Apr 26, 2021 · 2 comments
Labels
parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5153

Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:

{{struct MyData {}}{{  name: String,}}{{  address: Option}}{{}}}

Over the course of working sets of this data, you'll have the bulk data Vec,  the names column in a Vec<&String>, the address column in a Vec<Option>. This puts extra memory pressure on the system, at the minimum we have to allocate a Vec the same size as the bulk data even if we are using references.

What I'm proposing is to use an IntoIter style. This will maintain backward compat as a slice automatically implements IntoIter. Where ColumnWriterImpl#write_batch goes from "values: &[T::T]"to values "values: IntoIter<Item=T::T>". Then you can do things like

{{  write_batch(bulk.iter().map(|x| x.name), None, None)}}{{  write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.is_some())), None)}}

and you can see there's no need for an intermediate Vec, so no short-term allocations to write out the data.

I am writing data with many columns and I think this would really help to speed things up.

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2021

Comment from Xavier Lange(xrl) @ 2019-04-09T19:22:29.534+0000:

[~csun] [~sadikovi] what do you think of this potentially breaking change? I need to confirm the backwards compatibility but I think it might still be a useful change.

Comment from Ivan Sadikov(sadikovi) @ 2019-04-09T19:33:52.271+0000:

[~xrl] Yes, sure. I will be happy to review if you open a PR with changes. We can create a new method "write_batch_iter", which implements new API and make "write_batch" to call the new method, since as you pointed out slice implements IntoIter.

@jorgecarleitao jorgecarleitao changed the title [Parquet] Use IntoIter trait for write_batch/write_mini_batch Use IntoIter trait for write_batch/write_mini_batch Apr 29, 2021
@jorgecarleitao jorgecarleitao added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 29, 2021
@tustvold
Copy link
Contributor

Closed by #2045

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants