Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

Open
sundy-li opened this issue Dec 9, 2022 · 3 comments · May be fixed by #1337
Open

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

sundy-li opened this issue Dec 9, 2022 · 3 comments · May be fixed by #1337

Comments

@sundy-li
Copy link
Collaborator

sundy-li commented Dec 9, 2022

Let's look at these codes in
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226

  State::Required(page) => {
                values.extend(
                    page.values
                        .by_ref()
                        .map(decode)
                        .map(self.op)
                        .take(remaining),
                );
            }

It had extra memcpy in values.extend and decode, I think maybe we could optimize it by using Buffer clone.

The first motivation is to move

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Vec<u8>,
    ...
}

to

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Buffer<u8>,
    ...
}

@jorgecarleitao what do you think about this?

I found arrow-rs had addressed this improvement in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138

@jorgecarleitao
Copy link
Owner

parquet's plain encoding of each row is [length as i32][bytes]. Could you explain how would this could be more efficiently decoded?

I think the root cause of the issue is that when the batch is very large, we need to perform large re-allocations. I do think we can do better, though: given encoded binary values, we can estimate their size via encoded.len() - 4 * num_values. This only works for required. For optional, we could use encoded.len() - 4 * (num_values - num_nulls)

@sundy-li
Copy link
Collaborator Author

parquet's plain encoding of each row is [length as i32][bytes]. Could you explain how would this could be more efficiently decoded?

  1. For binary column with plain encoding, I have no good idea.

  2. But for DeltaLengthByteArray, we may can ref the offset/values to the page state.

  3. For Primitive Column with plain encoding, we can also ref the values to page decode state instead of values.extend and decode row by row.

But the decoder is mutable, so I still can't figure out a better idea that avoids too much memory copy.

@jorgecarleitao jorgecarleitao linked a pull request Dec 18, 2022 that will close this issue
@jorgecarleitao
Copy link
Owner

Makes sense. I tried to modify the design to support this in #1337 .

That has two main ideas:

  • required pages should not allocate the bitmap
  • with capacity now accepts the "state of the page". This allows using the page's details to reserve.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants