Optimize memory usage of json reader #3150

Closed
rockyzhengwu opened this issue Nov 21, 2022 · 0 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

rockyzhengwu commented Nov 21, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current implementation decodes JSON into Arrow arrays by first converting batch_size JSON strings into serde_json Values.
This requires a lot of memory for the serde_json Values, and a big batch_size can cause an OOM. A large batch_size is usually desirable because it gives a better compression rate.

let mut rows: Vec<Value> = Vec::with_capacity(batch_size);

Describe the solution you'd like
Current implementation in pseudocode:

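for batch in value_iter {
    // All batch_size rows are parsed and held as serde_json Values at once.
    let mut rows: Vec<Value> = Vec::with_capacity(batch_size);
    // ... fill `rows` with up to batch_size parsed values ...
    let arrays = convert_function(rows);
}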
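A rough way to see why the loop above is memory hungry is to count how many parsed rows are alive at the same time. The sketch below uses only the standard library: `ParsedRow`, `LiveCounter`, `peak_batched`, and `peak_streaming` are hypothetical stand-ins invented for illustration, not types from serde_json or the arrow crate, and the "parsing" is a no-op. It only shows that a batch loop keeps `batch_size` parsed rows alive, while a loop that drops each row before parsing the next keeps one.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Tracks how many parsed rows are alive right now and the highest count seen.
#[derive(Default)]
pub struct LiveCounter {
    live: Cell<usize>,
    peak: Cell<usize>,
}

/// Stand-in for one parsed row (hypothetical; the real reader holds a serde_json Value).
pub struct ParsedRow {
    counter: Rc<LiveCounter>,
}

impl ParsedRow {
    /// Pretend to parse a line; only the bookkeeping is real.
    pub fn parse(_line: &str, counter: &Rc<LiveCounter>) -> Self {
        let live = counter.live.get() + 1;
        counter.live.set(live);
        counter.peak.set(counter.peak.get().max(live));
        ParsedRow { counter: Rc::clone(counter) }
    }
}

impl Drop for ParsedRow {
    fn drop(&mut self) {
        self.counter.live.set(self.counter.live.get() - 1);
    }
}

/// Batch-at-a-time: collect a whole batch of parsed rows before converting.
/// `batch_size` must be non-zero, because `chunks` panics on 0.
pub fn peak_batched(lines: &[&str], batch_size: usize) -> usize {
    let counter = Rc::new(LiveCounter::default());
    for chunk in lines.chunks(batch_size) {
        let rows: Vec<ParsedRow> =
            chunk.iter().map(|l| ParsedRow::parse(l, &counter)).collect();
        drop(rows); // the conversion to arrays would happen here
    }
    counter.peak.get()
}

/// Row-at-a-time: each parsed row is dropped before the next one is parsed.
pub fn peak_streaming(lines: &[&str]) -> usize {
    let counter = Rc::new(LiveCounter::default());
    for line in lines {
        let _row = ParsedRow::parse(line, &counter);
        // the column builders would be appended to here
    }
    counter.peak.get()
}
```

Calling `peak_batched(&lines, batch_size)` and `peak_streaming(&lines)` on the same input gives the two peaks to compare. The counts say nothing about bytes; the actual saving depends on how large each parsed value is relative to the column buffers.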

Converting only one JSON string at a time to a serde_json Value would save 3x-5x memory or more; I didn't record the numbers carefully.
I have implemented a version this way in our online product, because we use a large batch_size. The pseudocode is:

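let mut field_builders: Vec<Box<dyn ArrayBuilder>> = create_array_builder(batch_size);
for (i, row) in value_iter.enumerate() {
    // Parse one row at a time; the Value is dropped at the end of each iteration.
    let value: Value = serde_json::from_str(row)?;
    for (index, field) in schema.fields().iter().enumerate() {
        let col_name = field.name();
        // Index the builders by column (index), not by row (i).
        field_builders[index].append(value.get(col_name));
    }
    // Flush once a full batch of rows has been appended.
    if (i + 1) % batch_size == 0 {
        let array_refs = field_builders.iter_mut().map(|builder| builder.finish()).collect();
        .....
    }
}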
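The row-at-a-time loop above can be sketched as runnable code under some loud assumptions: `ColumnBuilder`, `parse_row`, and `read_batches` are hypothetical names, the row format is a toy `key=value` list standing in for JSON, and every column is a nullable `i64`. None of this is the arrow crate's API; it only illustrates appending each parsed row into per-column builders and flushing every `batch_size` rows.

```rust
use std::collections::HashMap;

/// Stand-in for one parsed JSON object (hypothetical; the real code uses serde_json Value).
type Row = HashMap<String, i64>;
/// One finished batch: one Vec of nullable values per column.
type Batch = Vec<Vec<Option<i64>>>;

/// Stand-in for an Arrow column builder: one nullable slot per appended row.
#[derive(Default)]
pub struct ColumnBuilder {
    values: Vec<Option<i64>>,
}

impl ColumnBuilder {
    pub fn append(&mut self, v: Option<i64>) {
        self.values.push(v);
    }
    /// Hand out the accumulated values and reset the builder for the next batch.
    pub fn finish(&mut self) -> Vec<Option<i64>> {
        std::mem::take(&mut self.values)
    }
}

/// Toy parser for lines like "a=1,b=2", standing in for serde_json::from_str.
pub fn parse_row(line: &str) -> Row {
    line.split(',')
        .filter_map(|kv| {
            let (k, v) = kv.split_once('=')?;
            Some((k.trim().to_string(), v.trim().parse::<i64>().ok()?))
        })
        .collect()
}

/// Parse rows one at a time, append to column builders, flush every `batch_size` rows.
pub fn read_batches(lines: &[&str], columns: &[&str], batch_size: usize) -> Vec<Batch> {
    let mut builders: Vec<ColumnBuilder> =
        columns.iter().map(|_| ColumnBuilder::default()).collect();
    let mut batches: Vec<Batch> = Vec::new();
    let mut rows_in_batch = 0;
    for line in lines {
        let row = parse_row(line); // only this one parsed row is alive
        for (index, name) in columns.iter().enumerate() {
            // A missing key becomes a null slot in that column.
            builders[index].append(row.get(*name).copied());
        }
        rows_in_batch += 1;
        if rows_in_batch == batch_size {
            batches.push(builders.iter_mut().map(|b| b.finish()).collect());
            rows_in_batch = 0;
        }
    }
    // Flush the final, possibly shorter, batch.
    if rows_in_batch > 0 {
        batches.push(builders.iter_mut().map(|b| b.finish()).collect());
    }
    batches
}
```

For example, `read_batches(&["a=1,b=2", "a=3", "b=4"], &["a", "b"], 2)` yields two batches, the second holding the single leftover row. Tracking `rows_in_batch` separately avoids the off-by-one that comes from comparing a zero-based row index with `batch_size`.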

This implementation didn't affect performance.
However, it doesn't support deeply nested lists and maps.
I'm not sure whether this is an elegant approach, or whether it's possible to support deeply nested lists and maps.
If this is a good idea, I can try to make a PR for it.
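On the open question of nesting, one possible direction (an assumption for discussion, not something the arrow crate is known to do this way) is to make the builders recursive: a list builder records offsets and validity and forwards the elements to a child builder, which matches how Arrow lays out list arrays as offsets over a child array. The sketch below is standard-library only; `Json` and `Builder` are hypothetical stand-ins for serde_json Value and Arrow builders, and only integers and lists are handled.

```rust
/// Minimal JSON-like value (stand-in for serde_json Value; only the cases needed here).
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Int(i64),
    List(Vec<Json>),
}

/// Hypothetical recursive builder: a leaf column, or a list column wrapping a child builder.
#[derive(Debug, PartialEq)]
pub enum Builder {
    Int { values: Vec<Option<i64>> },
    List { offsets: Vec<usize>, valid: Vec<bool>, child: Box<Builder> },
}

impl Builder {
    /// Builder for an integer column wrapped in `depth` levels of lists.
    pub fn nested_list(depth: usize) -> Builder {
        let mut b = Builder::Int { values: Vec::new() };
        for _ in 0..depth {
            b = Builder::List { offsets: vec![0], valid: Vec::new(), child: Box::new(b) };
        }
        b
    }

    /// Number of slots appended so far at this level.
    pub fn len(&self) -> usize {
        match self {
            Builder::Int { values } => values.len(),
            Builder::List { valid, .. } => valid.len(),
        }
    }

    /// Append one value; lists recurse into the child, then record the new end offset.
    pub fn append(&mut self, value: &Json) {
        match self {
            Builder::Int { values } => values.push(match value {
                Json::Int(i) => Some(*i),
                _ => None, // null or mismatched type becomes a null slot
            }),
            Builder::List { offsets, valid, child } => {
                if let Json::List(items) = value {
                    for item in items {
                        child.append(item);
                    }
                    valid.push(true);
                } else {
                    valid.push(false); // null list: no child slots are added
                }
                offsets.push(child.len());
            }
        }
    }
}
```

Because `append` recurses through `child`, the same code handles any list depth while still holding only one parsed row at a time. Maps could plausibly follow the same pattern, since Arrow represents a map as a list of key/value structs, but that would need a struct builder that this sketch does not include.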

Describe alternatives you've considered

Additional context

@rockyzhengwu rockyzhengwu added the enhancement Any new improvement worthy of a entry in the changelog label Nov 21, 2022
@alamb alamb added the arrow Changes to the arrow crate label Dec 1, 2022