Make current position available in `FileWriter`. #1691

pacman82 · 2022-05-12T06:05:28Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like a way to track the file size of a parquet file I am writing, so I can split my dataset into chunks of roughly the same size. For more context, please see this issue in the downstream odbc2parquet crate: pacman82/odbc2parquet#190

Describe the solution you'd like
Make the current stream position (i.e., the number of bytes written so far into the inner io::Write) available in the implementation of SerializedFileWriter, or even through the FileWriter trait.

Describe alternatives you've considered
As a workaround I could create a wrapper of File which shares an Rc<usize> counter with the application logic.
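
A minimal sketch of that workaround, using only the standard library. `CountingWriter` is a hypothetical name, not part of the parquet crate. The sketch also assumes the counter is an `Rc<Cell<u64>>` rather than a bare `Rc<usize>`, since a value behind a plain `Rc` cannot be updated in place.

```rust
use std::cell::Cell;
use std::io::{self, Write};
use std::rc::Rc;

/// Hypothetical wrapper: forwards writes to `inner` and counts the bytes
/// that the inner writer actually accepted.
struct CountingWriter<W: Write> {
    inner: W,
    written: Rc<Cell<u64>>,
}

impl<W: Write> CountingWriter<W> {
    /// Returns the wrapper together with a shared handle to the counter,
    /// which the application logic can keep and read at any time.
    fn new(inner: W) -> (Self, Rc<Cell<u64>>) {
        let written = Rc::new(Cell::new(0));
        let handle = Rc::clone(&written);
        (Self { inner, written }, handle)
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only the bytes the inner writer reports as written.
        let n = self.inner.write(buf)?;
        self.written.set(self.written.get() + n as u64);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The wrapper would be handed to the file writer in place of the raw `File`, while the application polls the shared counter between row groups. Bytes still held in any buffer above the wrapper are not counted until they are written through.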

pacman82 · 2022-05-30T18:18:10Z

Hello, since I opened this feature request, parquet 15.0.0 has been released. Thanks to everybody involved in that effort! While breaking changes are of course a pain, I loved each and every one of them. The API got a lot cleaner and, as a consequence, so did my use of it. Keep up the good work!

While adapting to the breaking changes, I got the feeling that RowGroupMetadata::compressed_size may already do what I want. So feel free to close this issue. If I am not able to use it downstream to the satisfaction of my users, I'll just open a new one.

Cheers, Markus

tustvold · 2022-06-02T06:55:22Z

Thank you for your feedback, and glad to hear the API is moving in a direction that you like 😄

I think RowGroupMetadata::total_byte_size is probably what you're after, as this will tell you the size of the written row groups. I'm not sure there is an API that would give you access to this whilst writing a file, but it should be fairly straightforward to add one. Will see what I can come up with.

Edit: I think you're right and compressed_size is what you want. I think we might also be writing total_byte_size wrong 🤔

pacman82 · 2022-06-02T18:18:40Z

Hello @tustvold, thanks for your help here. Now I am a little bit confused. In order to implement pacman82/odbc2parquet#190 (tl;dr: I want to stop writing row groups as soon as the file size surpasses a user-defined threshold, and start writing the next row group into a new file), should I add up the compressed_size of the row groups or use the total_byte_size of the flushed groups? What would be the difference between the two? It shouldn't be hard to change on my end. If you do not advise otherwise, I'll run with compressed_size and see if my users are happy with it.

tustvold · 2022-06-02T18:22:45Z

Sorry for confusing things, compressed size is the correct thing to use. I think the crate might be writing the wrong thing for total_byte_size but that's a separate issue I'll file if/when I confirm it.
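
The splitting logic discussed above could look roughly like the following sketch. `SizeSplitter` and its methods are hypothetical names, and the sizes are passed in as plain `u64` values standing in for whatever compressed_size reports for each flushed row group. The sum ignores the file header and footer metadata, so it only approximates the final file size.

```rust
/// Hypothetical helper: accumulates the compressed size of the row groups
/// flushed into the current file and signals when to start a new file.
struct SizeSplitter {
    threshold: u64,
    current_file_bytes: u64,
    files_started: usize,
}

impl SizeSplitter {
    fn new(threshold: u64) -> Self {
        Self {
            threshold,
            current_file_bytes: 0,
            files_started: 1,
        }
    }

    /// Call once after each row group has been flushed.
    /// Returns `true` if the accumulated size now surpasses the threshold,
    /// meaning the next row group should go into a new file.
    fn row_group_flushed(&mut self, compressed_size: u64) -> bool {
        self.current_file_bytes += compressed_size;
        if self.current_file_bytes > self.threshold {
            // Roll over: the next row group starts a fresh file.
            self.current_file_bytes = 0;
            self.files_started += 1;
            true
        } else {
            false
        }
    }
}
```

Because the check runs only after a row group is complete, a file can exceed the threshold by up to one row group, which matches the "roughly the same size" goal stated in the request.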

pacman82 · 2022-06-06T09:46:39Z

Both my test cases and my users are happy. See: pacman82/odbc2parquet#190. So compressed_size has indeed been what I was looking for. Closing this issue would be fine with me.

pacman82 added the enhancement label May 12, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 2, 2022

Access metadata of flushed row groups on write (apache#1691)

6c69397

tustvold mentioned this issue Jun 2, 2022

Access metadata of flushed row groups on write (#1691) #1774

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 2, 2022

Access metadata of flushed row groups on write (apache#1691)

4cad205

tustvold closed this as completed in #1774 Jun 6, 2022

tustvold added a commit that referenced this issue Jun 6, 2022

Access metadata of flushed row groups on write (#1691) (#1774)

d4df1d9

* Access metadata of flushed row groups on write (#1691)
* Add tests

alamb added the parquet label Jun 9, 2022
