
Improve Documentation of Parquet ChunkReader #4118

Closed
zilder opened this issue Apr 24, 2023 · 2 comments · Fixed by #4156
Labels
documentation Improvements or additions to documentation parquet Changes to the parquet crate

Comments

zilder (Contributor) commented Apr 24, 2023

Describe the bug
Not sure that it's a bug, but arrow-rs version 37 seems to perform more read operations from parquet files than version 19 (which we had been using so far), and some of the byte ranges overlap (see the output below). For context, we use a custom implementation of ChunkReader with ParquetRecordBatchReader (and with SerializedFileReader in v19) to access S3 storage. Here's a reduced implementation:

pub struct S3Request {
    client: Client,
    bucket: String,
    key: String,
    len: u64,
    rt: Runtime,
}

impl ChunkReader for S3Request {
    type T = ByteBuf;

    fn get_read(
        &self,
        start: u64,
        length: usize,
    ) -> Result<Self::T, parquet::errors::ParquetError> {
        // HTTP range requests use an inclusive end offset.
        let end = start + length as u64 - 1;
        println!("S3Request::get_read(): {}, {}", start, end);

        // Fetch the requested byte range from S3, blocking on the runtime
        // because ChunkReader is a synchronous interface.
        let data = self.rt.block_on(async {
            let resp = self
                .client
                .get_object()
                .bucket(&self.bucket)
                .key(&self.key)
                .range(format!("bytes={}-{}", start, end))
                .send()
                .await
                .map_err(|e| parquet::errors::ParquetError::General(e.to_string()))?;

            resp.body
                .collect()
                .await
                .map_err(|e| parquet::errors::ParquetError::General(e.to_string()))
        })?;

        Ok(ByteBuf(data))
    }
}

(I added the println!("S3Request::get_read(): {}, {}", start, end); call to track each read operation.)

In the output we get 8 read operations (v37):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 18, 121
S3Request::get_read(): 43, 121
S3Request::get_read(): 214, 331
S3Request::get_read(): 228, 331
S3Request::get_read(): 253, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+

With the same implementation, we get only 4 read operations using SerializedFileReader and ParquetFileArrowReader (in v19):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 214, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+

Was that an intended change?

@zilder zilder added the bug label Apr 24, 2023
tustvold (Contributor) commented Apr 24, 2023

This is a consequence of #2464, which causes ChunkReader::get_read to be called per page instead of per row group. This change was made to enable page-level predicate pushdown. We should definitely improve the documentation around ChunkReader and its implicit assumptions about buffering at the application and/or OS level. I will add this to my list.

The reason for the overlapping byte ranges is that if the OffsetIndex isn't read, the reader doesn't know where the pages are located, or even how many there are, only the end position of the column chunk. It therefore has to assume a given page may run to the end of that range. If you enable reading the PageIndex, it shouldn't perform overlapping reads (although it will then need additional IO to read the page index).
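As a rough sketch of enabling the page index with a synchronous reader (assuming the parquet 37 arrow_reader API; `reader` stands for any ChunkReader implementation, e.g. the S3Request above, and error handling is elided):

```rust
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Ask the reader to load the page index up front, so page locations are
// known exactly and per-page reads no longer need to over-fetch.
let options = ArrowReaderOptions::new().with_page_index(true);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(reader, options)?;
let batch_reader = builder.build()?;
for batch in batch_reader {
    let batch = batch?;
    // process batch ...
}
```

The trade-off is one extra read to fetch the index versus over-reading on every page.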

Taking a step back, I wonder if you've considered using the async_reader. Not only does this provide a native async interface, but the AsyncFileReader interface naturally lends itself to pre-fetching IO for an entire row group at a time.

There is also out-of-the-box integration with object_store, which may be of interest.
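The object_store path might look roughly like the following (a hedged sketch: bucket name and file path are placeholders, S3 credentials are assumed to come from the environment, and error handling is elided):

```rust
use std::sync::Arc;
use futures::TryStreamExt;
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};
use parquet::arrow::async_reader::{ParquetObjectReader, ParquetRecordBatchStreamBuilder};

// Build an S3-backed ObjectStore from environment credentials.
let store: Arc<dyn ObjectStore> = Arc::new(
    AmazonS3Builder::from_env().with_bucket_name("my-bucket").build()?,
);
let path = Path::from("path/to/file.parquet");
let meta = store.head(&path).await?;

// ParquetObjectReader implements AsyncFileReader over the store,
// so no hand-rolled ChunkReader is needed.
let reader = ParquetObjectReader::new(store, meta);
let stream = ParquetRecordBatchStreamBuilder::new(reader).await?.build()?;
let batches: Vec<_> = stream.try_collect().await?;
```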

@tustvold tustvold changed the title ParquetRecordBatchReader reads overlapping byte ranges Improve Documentation of Parquet ChunkReader Apr 24, 2023
@tustvold tustvold added documentation Improvements or additions to documentation and removed bug labels Apr 24, 2023
zilder (Contributor, Author) commented Apr 25, 2023

Hi @tustvold, thanks for the explanation! Yes, we considered the async reader and object_store, and this is now a compelling enough reason to prioritize working on it : ) The API looks simple enough and should be easy to integrate. We'll have to extend ParquetObjectReader a little to collect some statistics (number of requests, amount of data) for internal use, but otherwise it looks good.
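The statistics-collecting idea could take the shape below. This is a stdlib-only sketch (CountingReader is a hypothetical name): it wraps any std::io::Read and tallies calls and bytes; the real extension would wrap ParquetObjectReader's async interface the same way.

```rust
use std::io::{Cursor, Read};

// Hypothetical wrapper that records how many read calls were issued
// and how many bytes they returned.
pub struct CountingReader<R> {
    inner: R,
    pub reads: u64,
    pub bytes: u64,
}

impl<R> CountingReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, reads: 0, bytes: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        // Delegate to the wrapped reader, then update the counters.
        let n = self.inner.read(buf)?;
        self.reads += 1;
        self.bytes += n as u64;
        Ok(n)
    }
}
```

Because it only implements Read, the wrapper is transparent to whatever consumes it; the counters can be inspected after the file has been processed.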

PS: happy to see how fast this project is developing, awesome job, guys!

tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 27, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 28, 2023
tustvold added a commit that referenced this issue May 2, 2023
* Remove length from ChunkReader (#4118)

* Remove ChunkReader::T Send bound

* Remove FileSource

* Tweak docs
@tustvold tustvold added the parquet Changes to the parquet crate label May 18, 2023