
Should we deprecate aggregate_files from read_parquet? #9051

Open · rjzamora opened this issue May 6, 2022 · 3 comments · May be fixed by #9197
Comments

@rjzamora (Member) commented May 6, 2022

This issue is similar to #8937 (already done) and #9043 in the sense that it aims to remove unnecessary (and rarely-utilized) options from dd.read_parquet.

TLDR: I’d like to propose that we deprecate the aggregate_files argument from dd.read_parquet.

Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation of this feature actually aggregates row-groups (rather than files), which is extremely inefficient at scale (or on remote storage). This implementation “snafu” is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both aggregate_files and chunksize (see #9043) should allow us to cut out a lot of unnecessarily-complex core/engine code and reduce general maintenance burden.

In the future, we may wish to re-introduce an aggregate_files-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) that need more flexibility than simple per-row-group or per-file partitioning should be able to feed their own custom logic into the new from_map API.
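
For reference, a minimal sketch of what that custom logic could look like with `from_map` (the file list, batch size, and `read_batch` helper below are hypothetical, not an agreed-upon design):

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file list and batch size -- just for illustration.
paths = [f"data/part.{i}.parquet" for i in range(100)]
files_per_partition = 10

# Group the file list into fixed-size batches; each batch becomes
# one Dask partition.
batches = [
    paths[i : i + files_per_partition]
    for i in range(0, len(paths), files_per_partition)
]

def read_batch(batch):
    # Read every file in the batch and concatenate into a single
    # pandas DataFrame (i.e. a single output partition).
    return pd.concat(pd.read_parquet(path) for path in batch)

# from_map calls read_batch once per element of batches.
ddf = dd.from_map(read_batch, batches)
```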

jsignell pushed a commit that referenced this issue May 10, 2022
…` and ``aggregate_files`` (#9052)

As discussed in #9043 (for `chunksize`) and #9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`.

This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant Issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
@alienscience

If aggregate_files=True is removed, would it be possible to use split_row_groups=N where N is a number to get similar functionality?

If not, then I guess a df.repartition() would have to be used. I have seen parquet datasets that cause problems in Dask for any non-trivial operation. I think these datasets start off with too many partitions; since we started using aggregate_files=True, processing them has been less problematic. My guess is that datasets like these would be painful to process without aggregate_files or something similar.
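
A rough sketch of the df.repartition() fallback mentioned above (the dataset path and target partition size are hypothetical):

```python
import dask.dataframe as dd

# Read with one partition per file (no aggregation at read time).
# The dataset path is made up for illustration.
ddf = dd.read_parquet("s3://bucket/small-file-dataset/")

# Merge many small partitions into larger ones after the fact. Unlike
# aggregate_files=True, this adds an extra step to the task graph
# instead of aggregating during the read itself.
ddf = ddf.repartition(partition_size="100MB")
```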

@rjzamora (Member, Author)

since we started using aggregate_files=True, processing the datasets has been less problematic.

Thank you for sharing this, @alienscience! Do your datasets typically contain many single-row-group files, such that using read_parquet(..., split_row_groups=N, aggregate_files=True) means that you are aggregating N small files into each Dask partition? If so, we may be able to change the specific meaning of aggregate_files to preserve that behavior (e.g. aggregate_files=N). The logic I am most interested in removing supports the current relationship between aggregate_files and chunksize/split_row_groups.
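
For concreteness, the usage pattern being discussed looks something like this (the path and row-group count are hypothetical):

```python
import dask.dataframe as dd

# Many small, single-row-group files: split_row_groups=8 caps each
# partition at 8 row-groups, and aggregate_files=True lets those
# row-groups come from different files -- so roughly 8 small files
# end up in each Dask partition.
ddf = dd.read_parquet(
    "s3://bucket/many-small-files/",
    split_row_groups=8,
    aggregate_files=True,
)
```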

@alienscience

Do your datasets typically contain many single row-group files such that using read_parquet(..., split_row_groups=N, aggregate_files=True) means that you are aggregating N small files into each Dask partition?

Yes, and it would be good to have a way to combine multiple small files into a single Dask partition.
