Should we deprecate aggregate_files from read_parquet? #9051
Comments
…` and ``aggregate_files`` (#9052) As discussed in #9043 (for `chunksize`) and #9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`. This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
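The "pre-deprecation" warnings described in the PR could follow the usual pattern of emitting a `FutureWarning` when the targeted keyword is passed. The sketch below is illustrative only (the function name and signature are simplified stand-ins, not dask's actual implementation):

```python
import warnings


def read_parquet(path, aggregate_files=None, chunksize=None, **kwargs):
    """Simplified stand-in showing the pre-deprecation warning pattern."""
    if aggregate_files is not None:
        warnings.warn(
            "The `aggregate_files` argument will be deprecated in the future. "
            "See https://github.com/dask/dask/issues/9051 for details.",
            FutureWarning,
        )
    if chunksize is not None:
        warnings.warn(
            "The `chunksize` argument will be deprecated in the future. "
            "See https://github.com/dask/dask/issues/9043 for details.",
            FutureWarning,
        )
    ...  # actual reading logic elided in this sketch
```

Passing either keyword then surfaces a warning (visible to users and catchable in test suites), while all other calls are unaffected.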
If not, then I guess that a …
Thank you for sharing this @alienscience! Do your datasets typically contain many single-row-group files such that using …
Yes, and it would be good to have a way to combine multiple small files into a single Dask partition.
This issue is similar to #8937 (already done) and #9043 in the sense that it aims to remove unnecessary (and rarely-utilized) options from `dd.read_parquet`.

TLDR: I'd like to propose that we deprecate the `aggregate_files` argument from `dd.read_parquet`.

Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation of this feature actually aggregates row-groups (rather than files), which is extremely inefficient at scale (or on remote storage). This implementation "snafu" is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both `aggregate_files` and `chunksize` (see #9043) should allow us to cut out a lot of unnecessarily-complex core/engine code and reduce general maintenance burden.

In the future, we may wish to re-introduce an `aggregate_files`-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) that need more flexibility than simple "per-row-group" or "per-file" partitioning should be able to feed their own custom logic into the new `from_map` API.