Make current position available in FileWriter. #1691

Closed
pacman82 opened this issue May 12, 2022 · 5 comments · Fixed by #1774
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

pacman82 commented May 12, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like a way to track the file size of a parquet file I am writing, so I can split my dataset into chunks of roughly the same size. For more context, please see this issue in the downstream odbc2parquet crate: pacman82/odbc2parquet#190

Describe the solution you'd like
Make the current stream position (i.e. the number of bytes written into the inner io::Write so far) available in the implementation of SerializedFileWriter or even through the FileWriter trait.

Describe alternatives you've considered
As a workaround, I could create a wrapper around File which shares an Rc<Cell<usize>> counter with the application logic.
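
The workaround described above could look roughly like the following. This is a minimal sketch using only the standard library, not an API of the parquet crate: `CountingWriter` is a hypothetical name, and the shared counter is an `Rc<Cell<u64>>` because a plain integer behind an `Rc` cannot be updated through a shared handle. A `Vec<u8>` stands in for the `File`.

```rust
use std::cell::Cell;
use std::io::{self, Write};
use std::rc::Rc;

/// Hypothetical wrapper: forwards writes to `inner` and counts the bytes
/// that the inner writer actually accepted.
struct CountingWriter<W: Write> {
    inner: W,
    written: Rc<Cell<u64>>,
}

impl<W: Write> CountingWriter<W> {
    /// Returns the wrapper plus a second handle to the shared counter,
    /// which the application logic can keep and read at any time.
    fn new(inner: W) -> (Self, Rc<Cell<u64>>) {
        let written = Rc::new(Cell::new(0));
        let handle = Rc::clone(&written);
        (Self { inner, written }, handle)
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the inner writer reports as written.
        let n = self.inner.write(buf)?;
        self.written.set(self.written.get() + n as u64);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    // In the real use case `inner` would be a `File` handed to the Parquet writer.
    let (mut writer, written) = CountingWriter::new(Vec::<u8>::new());
    writer.write_all(b"PAR1")?;
    writer.write_all(&[0u8; 100])?;
    println!("bytes written so far: {}", written.get()); // prints "bytes written so far: 104"
    Ok(())
}
```

Because the counter is shared through `Rc`, this only works on a single thread; a multi-threaded variant would presumably need `Arc<AtomicU64>` instead.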

@pacman82 pacman82 added the enhancement Any new improvement worthy of a entry in the changelog label May 12, 2022
pacman82 commented

Hello, since I opened this feature request, parquet 15.0.0 has been released. Thanks to everybody involved in that effort! While breaking changes are of course a pain, I loved each and every one of them. The API got a lot cleaner and, as a consequence, so did my use of it. Keep up the good work!

While adapting to the breaking changes, I got the feeling that RowGroupMetadata::compressed_size may already do what I want, so feel free to close this issue. If I am not able to use it downstream to the satisfaction of my users, I'll just open a new one.

Cheers, Markus

tustvold commented Jun 2, 2022

Thank you for your feedback, and glad to hear the API is moving in a direction that you like 😄

I think RowGroupMetadata::total_byte_size is probably what you're after, as this will tell you the size of the written row groups. I'm not sure there is an API that would give you access to this whilst writing a file, but it should be fairly straightforward to add one. I'll see what I can come up with.

Edit: I think you're right and compressed_size is what you want. I also think we might be writing total_byte_size wrong 🤔

pacman82 commented Jun 2, 2022

Hello @tustvold, thanks for your help here. Now I am a little bit confused. In order to implement pacman82/odbc2parquet#190 (tl;dr: I want to stop writing row groups as soon as the file size surpasses a user-defined threshold, and start writing the next row group into a new file), should I add up the compressed_size of the row groups or use the total_byte_size of the flushed groups? What would be the difference between the two? It shouldn't be hard to change on my end. Unless you advise otherwise, I'll run with compressed_size and see if my users are happy with it.
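
The splitting rule from the tl;dr above can be sketched as follows. This is an illustration only: it does not call the parquet crate, `assign_row_groups_to_files` is a hypothetical helper, and the input is assumed to be the compressed size in bytes reported for each row group after it has been flushed. The sum is an approximation of the file size, since file-level metadata such as the footer is not part of any row group.

```rust
/// Hypothetical helper: groups row groups into files. A file is closed as soon
/// as the sum of the compressed sizes written to it surpasses `threshold`;
/// the next row group then goes into a new file.
fn assign_row_groups_to_files(compressed_sizes: &[u64], threshold: u64) -> Vec<Vec<u64>> {
    let mut files: Vec<Vec<u64>> = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut current_bytes: u64 = 0;

    for &size in compressed_sizes {
        // The row group is always written to the current file first ...
        current.push(size);
        current_bytes += size;

        // ... and only afterwards do we check whether the threshold was surpassed.
        if current_bytes > threshold {
            files.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }

    // Keep a trailing file that never reached the threshold.
    if !current.is_empty() {
        files.push(current);
    }
    files
}

fn main() {
    // Made-up compressed sizes (in bytes) of five flushed row groups.
    let sizes = [400, 500, 300, 900, 200];
    let files = assign_row_groups_to_files(&sizes, 1000);
    for (i, row_groups) in files.iter().enumerate() {
        let total: u64 = row_groups.iter().sum();
        println!("file {}: {} row groups, {} bytes", i, row_groups.len(), total);
    }
}
```

Since the check happens after a row group is written, every closed file overshoots the threshold by at most one row group, which matches the "roughly the same size" goal rather than a hard limit.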

tustvold commented Jun 2, 2022

Sorry for confusing things; compressed_size is the correct thing to use. I think the crate might be writing the wrong thing for total_byte_size, but that's a separate issue I'll file if/when I confirm it.

pacman82 commented Jun 6, 2022

Both my test cases and my users are happy. See: pacman82/odbc2parquet#190. So compressed_size has indeed been what I was looking for. Closing this issue would be fine with me.

tustvold added a commit that referenced this issue Jun 6, 2022
* Access metadata of flushed row groups on write (#1691)

* Add tests
@alamb alamb added the parquet Changes to the parquet crate label Jun 9, 2022