Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

sundy-li · 2022-12-09T03:45:21Z

Let's look at these codes in
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226

  State::Required(page) => {
                values.extend(
                    page.values
                        .by_ref()
                        .map(decode)
                        .map(self.op)
                        .take(remaining),
                );
            }

It had extra memcpy in values.extend and decode, I think maybe we could optimize it by using Buffer clone.

The first motivation is to move

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Vec<u8>,
    ...
}

to

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Buffer<u8>,
    ...
}

@jorgecarleitao what do you think about this?

I found arrow-rs had addressed this improvement in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138

The text was updated successfully, but these errors were encountered:

jorgecarleitao · 2022-12-12T07:14:05Z

parquet's plain encoding of each row is [length as i32][bytes]. Could you explain how would this could be more efficiently decoded?

I think the root cause of the issue is that when the batch is very large, we need to perform large re-allocations. I do think we can do better, though: given encoded binary values, we can estimate their size via encoded.len() - 4 * num_values. This only works for required. For optional, we could use encoded.len() - 4 * (num_values - num_nulls)

sundy-li · 2022-12-12T07:41:15Z

parquet's plain encoding of each row is [length as i32][bytes]. Could you explain how would this could be more efficiently decoded?

For binary column with plain encoding, I have no good idea.
But for DeltaLengthByteArray, we may can ref the offset/values to the page state.
For Primitive Column with plain encoding， we can also ref the values to page decode state instead of values.extend and decode row by row.

But the decoder is mutable, so I still can't figure out a better idea that avoids too much memory copy.

jorgecarleitao · 2022-12-18T18:30:48Z

Makes sense. I tried to modify the design to support this in #1337 .

That has two main ideas:

required pages should not allocate the bitmap
with capacity now accepts the "state of the page". This allows using the page's details to reserve.

sundy-li mentioned this issue Dec 10, 2022

add bytes_estimate for binary push in parquet deserialize #1308

Merged

jorgecarleitao linked a pull request Dec 18, 2022 that will close this issue

Reduced re-alloc in parquet #1337

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

sundy-li commented Dec 9, 2022 •

edited

jorgecarleitao commented Dec 12, 2022

sundy-li commented Dec 12, 2022

jorgecarleitao commented Dec 18, 2022

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

Comments

sundy-li commented Dec 9, 2022 • edited

jorgecarleitao commented Dec 12, 2022

sundy-li commented Dec 12, 2022

jorgecarleitao commented Dec 18, 2022

sundy-li commented Dec 9, 2022 •

edited