
Should we deprecate aggregate_files from read_parquet? #9051

Open · rjzamora opened this issue May 6, 2022 · 3 comments · May be fixed by #9197
Comments

@rjzamora (Member) commented May 6, 2022

This issue is similar to #8937 (already done) and #9043 in the sense that it aims to remove unnecessary (and rarely-utilized) options from dd.read_parquet.

TLDR: I’d like to propose that we deprecate the aggregate_files argument from dd.read_parquet.

Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation of this feature actually aggregates row-groups (rather than files), which is extremely inefficient at scale (or on remote storage). This implementation “snafu” is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both aggregate_files and chunksize (see #9043) should allow us to cut out a lot of unnecessarily-complex core/engine code and reduce general maintenance burden.

In the future, we may wish to re-introduce an aggregate_files-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) that need more flexibility than simple per-row-group or per-file partitioning should be able to feed their own custom logic into the new from_map API.
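
For reference, a minimal sketch of what that custom logic could look like with `from_map` (the file list, batch size, and `read_batch` helper below are hypothetical, not an agreed-upon design):

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file list and batch size -- just for illustration.
paths = [f"data/part.{i}.parquet" for i in range(100)]
files_per_partition = 10

# Group the file list into fixed-size batches; each batch becomes
# one Dask partition.
batches = [
    paths[i : i + files_per_partition]
    for i in range(0, len(paths), files_per_partition)
]

def read_batch(batch):
    # Read every file in the batch and concatenate into a single
    # pandas DataFrame (i.e. a single output partition).
    return pd.concat(pd.read_parquet(path) for path in batch)

# from_map calls read_batch once per element of batches.
ddf = dd.from_map(read_batch, batches)
```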

jsignell pushed a commit that referenced this issue May 10, 2022
…` and ``aggregate_files`` (#9052)

As discussed in #9043 (for `chunksize`) and #9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`.

This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant Issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
@alienscience

If aggregate_files=True is removed, would it be possible to use split_row_groups=N where N is a number to get similar functionality?

If not, then I guess a df.repartition() would have to be used. I have seen parquet datasets that cause problems in Dask for any non-trivial operation. I think these datasets start off with too many partitions; since we started using aggregate_files=True, processing them has been less problematic. My guess is that datasets like these would be painful to process without aggregate_files or something similar.
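
A rough sketch of the df.repartition() fallback mentioned above (the dataset path and target partition size are hypothetical):

```python
import dask.dataframe as dd

# Read with one partition per file (no aggregation at read time).
# The dataset path is made up for illustration.
ddf = dd.read_parquet("s3://bucket/small-file-dataset/")

# Merge many small partitions into larger ones after the fact. Unlike
# aggregate_files=True, this adds an extra step to the task graph
# instead of aggregating during the read itself.
ddf = ddf.repartition(partition_size="100MB")
```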

@rjzamora (Member, Author)

since we started using aggregate_files=True, processing the datasets has been less problematic.

Thank you for sharing this, @alienscience! Do your datasets typically contain many single-row-group files, such that using read_parquet(..., split_row_groups=N, aggregate_files=True) means that you are aggregating N small files into each Dask partition? If so, we may be able to change the specific meaning of aggregate_files to preserve that behavior (e.g. aggregate_files=N). The logic I am most interested in removing supports the current relationship between aggregate_files and chunksize/split_row_groups.
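
For concreteness, the usage pattern being discussed looks something like this (the path and row-group count are hypothetical):

```python
import dask.dataframe as dd

# Many small, single-row-group files: split_row_groups=8 caps each
# partition at 8 row-groups, and aggregate_files=True lets those
# row-groups come from different files -- so roughly 8 small files
# end up in each Dask partition.
ddf = dd.read_parquet(
    "s3://bucket/many-small-files/",
    split_row_groups=8,
    aggregate_files=True,
)
```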

@alienscience

Do your datasets typically contain many single row-group files such that using read_parquet(..., split_row_groups=N, aggregate_files=True) means that you are aggregating N small files into each Dask partition?

Yes, and it would be good to have a way to combine multiple small files into a single Dask partition.
