
[FEA] Incorporate chunked parquet reading into cuDF-python #14966

Open
GregoryKimball opened this issue Feb 4, 2024 · 0 comments · May be fixed by #15728
GregoryKimball commented Feb 4, 2024

Is your feature request related to a problem? Please describe.
libcudf provides a chunked_parquet_reader in its public API. This reader processes the data in a parquet file in sub-file units, controlled by two new reader options. The chunk_read_limit option caps the size in bytes of the table returned per read by decoding only a subset of pages per chunked read. The pass_read_limit option caps the memory used for reading and decompressing data by decompressing only a subset of pages per chunked read.

The chunked parquet reader allows cuDF-python to expose two types of useful functionality:

  1. an API that acts as an iterator, yielding dataframe chunks. This is similar to the iter_row_groups behavior in fastparquet. This approach would let users work with parquet files that contain more than 2.1 billion rows (see [FEA] Add 64-bit size type option at build-time for libcudf #13159 for more information about the row limit in libcudf).
  2. a "low_memory" mode that reads the full file but has a lower peak memory footprint thanks to the smaller sizes of intermediate allocations. This is similar to the low_memory argument in polars. This approach would make it easier to read large parquet datasets with limited GPU memory.
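The two modes above can both be built on the has_next()/read_chunk() protocol that libcudf's chunked_parquet_reader exposes. The sketch below is illustrative only, not the actual cuDF-python API: the function names (read_parquet_chunked, read_parquet_low_memory) are hypothetical, and a stub class stands in for the libcudf reader.

```python
# Illustrative sketch: function names are hypothetical, and _StubChunkedReader
# stands in for libcudf's chunked_parquet_reader (has_next()/read_chunk()).

class _StubChunkedReader:
    """Stand-in for the libcudf reader: serves pre-split chunks in order."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def has_next(self):
        return bool(self._chunks)

    def read_chunk(self):
        return self._chunks.pop(0)


def read_parquet_chunked(reader):
    """Mode 1: a generator that yields one dataframe chunk per read."""
    while reader.has_next():
        yield reader.read_chunk()


def read_parquet_low_memory(reader):
    """Mode 2: return the full file, materialized via smaller chunked reads."""
    parts = list(read_parquet_chunked(reader))
    # A real implementation would concatenate cuDF tables here;
    # the stub just flattens lists of rows.
    return [row for part in parts for row in part]


chunks = list(read_parquet_chunked(_StubChunkedReader([[1, 2], [3], [4, 5]])))
full = read_parquet_low_memory(_StubChunkedReader([[1, 2], [3], [4, 5]]))
```

In mode 2, peak memory savings come from the smaller decode/decompress intermediates during each chunked read, not from the final table, which is the same size either way.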

Describe the solution you'd like
We should make chunked parquet reading available to cuDF-python users. Perhaps this functionality could be made available to cudf.pandas users as well.

Additional context
Pandas does not seem to have a method for chunking parquet reads, and I'm not sure if pandas makes use of the iter_row_groups behavior in fastparquet as a pass-through parameter.

API docs references:

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment cuDF (Python) Affects Python cuDF API. labels Feb 4, 2024
@galipremsagar galipremsagar self-assigned this May 10, 2024
@galipremsagar galipremsagar removed the 0 - Backlog In queue waiting for assignment label May 10, 2024
@galipremsagar galipremsagar linked a pull request May 13, 2024 that will close this issue
Projects
Status: In Progress
Status: To be revisited