[FEA] Explicitly guarantee row group ordering in the parquet reader. #15697

Open
nvdbaranec opened this issue May 7, 2024 · 0 comments
Labels: cuIO (cuIO issue); feature request (New feature or request); improvement (Improvement / enhancement to an existing function); libcudf (Affects libcudf (C++/CUDA) code)

nvdbaranec commented May 7, 2024

@devavret raised the question of whether the parquet reader guarantees the relative ordering of row groups across multiple input files. That is, if you have two files [f1, f2] and the row groups to read from each file (in one column) are specified as [[r0, r3], [r0, r1]], do we guarantee that the output ordering is [f1r0, f1r3, f2r0, f2r1]?

The code does in fact do this, both when row groups are explicitly specified and when they are unspecified (empty user input, meaning all row groups), but we don't make any guarantees about it. Making the guarantee explicit seems like a safe and easy thing to add.

for (size_t src_idx = 0; src_idx < row_group_indices.size(); ++src_idx) {
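
To make the proposed guarantee concrete, here is a minimal standalone sketch of the ordering described above. It is not libcudf code: `expected_row_group_order`, its `num_row_groups_per_file` parameter, and the `f<N>r<M>` labels are hypothetical names chosen for illustration, and only the loop variable `src_idx` and the name `row_group_indices` are borrowed from the line quoted above. It assumes that an empty selection means "all row groups of every file, in file order".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): flattens per-file row group
// selections into the output order this issue proposes to guarantee.
// An empty `row_group_indices` is treated as "no explicit selection", in which
// case every row group of every file is taken, in file order.
std::vector<std::string> expected_row_group_order(
  std::vector<std::vector<int>> const& row_group_indices,
  std::vector<int> const& num_row_groups_per_file)
{
  bool const select_all = row_group_indices.empty();
  // With an explicit selection, there must be one index list per input file.
  assert(select_all || row_group_indices.size() == num_row_groups_per_file.size());

  std::vector<std::string> order;
  // Outer loop: files in the order they were passed in.
  for (std::size_t src_idx = 0; src_idx < num_row_groups_per_file.size(); ++src_idx) {
    std::string const file_label = "f" + std::to_string(src_idx + 1);
    if (select_all) {
      // Inner loop: all row groups of this file, in ascending order.
      for (int rg = 0; rg < num_row_groups_per_file[src_idx]; ++rg) {
        order.push_back(file_label + "r" + std::to_string(rg));
      }
    } else {
      // Inner loop: row groups in the order the user listed them for this file.
      for (int rg : row_group_indices[src_idx]) {
        order.push_back(file_label + "r" + std::to_string(rg));
      }
    }
  }
  return order;
}
```

Under these assumptions, calling `expected_row_group_order({{0, 3}, {0, 1}}, {4, 2})` yields `f1r0, f1r3, f2r0, f2r1`, which is the ordering asked about in the issue. A test in this style could be used to pin the behavior down once the guarantee is made explicit.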

Projects
Status: In Progress