
Cleanup ChunkReader (#4118) #4156

Merged
merged 5 commits into apache:master on May 2, 2023

Conversation

@tustvold (Contributor) commented Apr 28, 2023

Which issue does this PR close?

Closes #4118
Closes #1163

Rationale for this change

The length parameter provided to get_read is very confusing, and whilst attempting to document it in #4147 it became clear that we should probably just remove it. Similarly, whilst looking at FileSource, I realised that it doesn't actually protect against concurrent reads (#4157).

I therefore opted to just bite the bullet and close off #1163 by removing the length parameter from get_read and removing FileSource entirely.

What changes are included in this PR?

  • ChunkReader::get_read no longer takes a length parameter
  • Removed FileSource
  • More clearly document that ChunkReader does not support concurrent or interleaved reads, much like File itself
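The changes above can be sketched in a simplified, standard-library-only form. This is not the exact published signature (the real `ChunkReader` in the parquet crate also requires `Length + Send + Sync` and offers a `get_bytes` method returning `bytes::Bytes`); the point is the shape of the new `get_read`, which takes only a start offset:

```rust
use std::io::{self, Cursor, Read};

/// Simplified sketch of the post-change trait: `get_read` takes only a
/// start offset, with no `length` parameter. The caller reads as much as
/// it needs rather than pre-declaring a length.
trait ChunkReader {
    type T: Read;
    fn get_read(&self, start: u64) -> io::Result<Self::T>;
}

/// Illustrative in-memory implementation over a byte buffer.
impl ChunkReader for Vec<u8> {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64) -> io::Result<Self::T> {
        let start = start as usize;
        if start > self.len() {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "start offset past end of data",
            ));
        }
        // The returned Read runs from `start` to the end of the data.
        Ok(Cursor::new(self[start..].to_vec()))
    }
}

/// Helper: read everything from `start` to the end.
fn read_from(data: &Vec<u8>, start: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    data.get_read(start)?.read_to_end(&mut out)?;
    Ok(out)
}

fn main() {
    let data = b"parquet".to_vec();
    assert_eq!(read_from(&data, 3).unwrap(), b"quet".to_vec());
    println!("ok");
}
```

Note that, per the documentation point above, nothing in this shape promises safe concurrent or interleaved use of the returned readers.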

Are there any user-facing changes?

Yes, this changes some (admittedly low-level) public interfaces.

@tustvold tustvold added the api-change Changes to the arrow API label Apr 28, 2023
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 28, 2023
@@ -112,9 +109,6 @@ impl<R: ParquetReader> FileSource<R> {

impl<R: ParquetReader> Read for FileSource<R> {
fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
let bytes_to_read = cmp::min(buf.len(), (self.end - self.start) as usize);
let buf = &mut buf[0..bytes_to_read];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could actually benefit performance, as it avoids artificially truncating the input buffer, which previously caused reads to go through the FileSource buffer when they could have bypassed it.

@tustvold tustvold marked this pull request as draft April 28, 2023 19:24
@tustvold tustvold changed the title Remove length from ChunkReader (#4118) Cleanup ChunkReader (#4118) Apr 28, 2023

```rust
/// Get a [`Read`] starting at the provided file offset
///
/// Subsequent or concurrent calls to [`Self::get_read`] or [`Self::get_bytes`] may
```

@tustvold (Author) commented Apr 28, 2023:
FileSource provided protection against subsequent calls to get_read, by calling Seek on every read, but provided no protection against concurrent access. I think it is less risky to clearly not support non-serial usage than to break only on concurrent usage.

To be clear, there are no safety implications to not synchronising this access: you will just get interleaved reads, which is no different from reading gibberish.

One option would be to add a Mutex to synchronise access; however, this solution is necessarily incomplete, as a user can simply call File::try_clone. Ultimately there is no reliable way to synchronise file IO, and I think if no synchronisation is fine for the standard library, it is fine for the parquet crate.
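The File::try_clone point can be demonstrated directly: a cloned File duplicates the OS handle, and duplicated handles share a single seek position, so locking one handle cannot serialise reads made through its clone. A minimal sketch (the temp file name is arbitrary, chosen for the demo):

```rust
use std::fs::File;
use std::io::{Read, Write};

/// Shows that `File::try_clone` produces a handle sharing the original's
/// seek position: reading from one handle advances the offset seen by the
/// other. Returns the two 2-byte reads for inspection.
fn cloned_handles_share_offset() -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let path = std::env::temp_dir().join("chunk_reader_clone_demo.bin");
    File::create(&path)?.write_all(b"abcdef")?;

    let mut a = File::open(&path)?;
    let mut b = a.try_clone()?; // duplicated handle, shared offset

    let mut first = [0u8; 2];
    a.read_exact(&mut first)?; // advances the shared offset to 2
    let mut second = [0u8; 2];
    b.read_exact(&mut second)?; // continues from offset 2, not from 0

    std::fs::remove_file(&path)?;
    Ok((first.to_vec(), second.to_vec()))
}

fn main() {
    let (first, second) = cloned_handles_share_offset().unwrap();
    assert_eq!(first, b"ab".to_vec());
    assert_eq!(second, b"cd".to_vec()); // the clone did not restart at 0
    println!("ok");
}
```

This is why a Mutex around one handle is necessarily incomplete, as the comment above argues.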

@tustvold tustvold marked this pull request as ready for review April 28, 2023 20:47
@crepererum (Contributor) left a comment:
Simply better 👍

@viirya (Member) left a comment:
Looks much cleaner.

@tustvold tustvold merged commit eb5ac69 into apache:master May 2, 2023
14 checks passed
@zeevm (Contributor) commented May 18, 2023

@tustvold the length argument was quite useful, I think. My main use case is processing Parquet files from cloud storage (Azure, S3, etc.), so I try to minimise the data downloaded, to save both time and cost. I used the length argument (with some additional extension if it was below some threshold) to download only what the reader required in a single 'read' op (cloud storage services usually charge per 10K ops plus bandwidth).

Now I don't know how much to download; I'll have to develop a bunch of heuristics to arrive at sensible values.

Since the reader always needs some specific entity (say a page, an entire column chunk, or the file metadata), it knows what 'length' it needs, so why not provide this as a 'hint' to the implementation?

@tustvold (Author) commented:
It was removed precisely to discourage this style of usage (see the linked issues), as the ranges passed to it are fairly arbitrary. I would recommend using the metadata to determine the necessary byte ranges up front, or using the async reader, which handles this for you.
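For stores that charge per request, one common approach once the byte ranges are known from the footer metadata is to coalesce nearby ranges into single larger reads, trading a little extra bandwidth for fewer operations. A hypothetical, standard-library-only sketch (the function name, tuple representation, and gap threshold are illustrative, not parquet crate APIs):

```rust
/// Coalesce (start, length) byte ranges whose gap is at most `max_gap`
/// bytes into single larger requests. Fetching a small amount of extra
/// data between ranges is often cheaper than issuing another request
/// against a per-operation-billed object store.
fn coalesce_ranges(mut ranges: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (start, len) in ranges {
        match out.last_mut() {
            // Extend the previous range when the gap is small enough.
            Some((prev_start, prev_len)) if start <= *prev_start + *prev_len + max_gap => {
                let end = (start + len).max(*prev_start + *prev_len);
                *prev_len = end - *prev_start;
            }
            _ => out.push((start, len)),
        }
    }
    out
}

fn main() {
    // Two column chunks 10 bytes apart coalesce; a distant one does not.
    let merged = coalesce_ranges(vec![(0, 100), (110, 50), (10_000, 20)], 64);
    assert_eq!(merged, vec![(0, 160), (10_000, 20)]);
    println!("ok");
}
```

The async reader performs this kind of range planning internally; the sketch is only meant to show what "determine the byte ranges up front" can look like for a hand-rolled reader.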

@zeevm (Contributor) commented May 18, 2023

@tustvold IIUC the async reader reads into the Arrow in-memory format, and doesn't provide lower-level direct column or page reading, correct?

This doesn't work for me, as we have our own in-memory format for our db; going through Arrow would incur a lot of expensive allocations and format translation.

@tustvold (Author) commented:
Correct, the high-level API reads into arrow and supports object stores, in addition to predicate pushdown, late materialization, row filtering, etc. The low-level APIs support advanced usage but are not nearly as batteries-included. I would encourage you to try out the arrow APIs; you may find they are actually faster 😅

@zeevm (Contributor) commented May 19, 2023 via email

Successfully merging this pull request may close these issues:

  • Improve Documentation of Parquet ChunkReader
  • Use Standard Library IO Abstractions in Parquet