
[FEA] Incorporate chunked parquet reading into cuDF-python #14966

Open
GregoryKimball opened this issue Feb 4, 2024 · 0 comments · May be fixed by #15728
GregoryKimball commented Feb 4, 2024

Is your feature request related to a problem? Please describe.
libcudf provides a chunked_parquet_reader in its public API. This reader processes the data in a parquet file in sub-file units, controlled by two new reader options. The chunk_read_limit option caps the size in bytes of the table returned per read by decoding only a subset of pages per chunked read. The pass_read_limit option caps the memory used for reading and decompressing data by decompressing only a subset of pages per chunked read.

The chunked parquet reader allows cuDF-python to expose two types of useful functionality:

  1. an API that acts as an iterator, yielding dataframe chunks. This is similar to the iter_row_groups behavior in fastparquet. This approach would let users work with parquet files that contain more than 2.1 billion rows (see [FEA] Add 64-bit size type option at build-time for libcudf #13159 for more information about the row limit in libcudf).
  2. a "low_memory" mode that reads the full file but has a lower peak memory footprint thanks to the smaller sizes of intermediate allocations. This is similar to the low_memory argument in polars. This approach would make it easier to read large parquet datasets with limited GPU memory.
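The two modes above can both be built on the has_next()/read_chunk() protocol that libcudf's chunked_parquet_reader exposes. The sketch below is illustrative only, not the actual cuDF-python API: the function names (read_parquet_chunked, read_parquet_low_memory) are hypothetical, and a stub class stands in for the libcudf reader.

```python
# Illustrative sketch: function names are hypothetical, and _StubChunkedReader
# stands in for libcudf's chunked_parquet_reader (has_next()/read_chunk()).

class _StubChunkedReader:
    """Stand-in for the libcudf reader: serves pre-split chunks in order."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def has_next(self):
        return bool(self._chunks)

    def read_chunk(self):
        return self._chunks.pop(0)


def read_parquet_chunked(reader):
    """Mode 1: a generator that yields one dataframe chunk per read."""
    while reader.has_next():
        yield reader.read_chunk()


def read_parquet_low_memory(reader):
    """Mode 2: return the full file, materialized via smaller chunked reads."""
    parts = list(read_parquet_chunked(reader))
    # A real implementation would concatenate cuDF tables here;
    # the stub just flattens lists of rows.
    return [row for part in parts for row in part]


chunks = list(read_parquet_chunked(_StubChunkedReader([[1, 2], [3], [4, 5]])))
full = read_parquet_low_memory(_StubChunkedReader([[1, 2], [3], [4, 5]]))
```

In mode 2, peak memory savings come from the smaller decode/decompress intermediates during each chunked read, not from the final table, which is the same size either way.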

Describe the solution you'd like
We should make chunked parquet reading available to cuDF-python users. Perhaps this functionality could be made available to cudf.pandas users as well.

Additional context
Pandas does not seem to have a method for chunking parquet reads, and I'm not sure if pandas makes use of the iter_row_groups behavior in fastparquet as a pass-through parameter.

API docs references:

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment cuDF (Python) Affects Python cuDF API. labels Feb 4, 2024
@galipremsagar galipremsagar self-assigned this May 10, 2024
@galipremsagar galipremsagar removed the 0 - Backlog In queue waiting for assignment label May 10, 2024
@galipremsagar galipremsagar linked a pull request May 13, 2024 that will close this issue
Projects
Status: In Progress
Status: To be revisited