
[WIP] Revise aggregate_files behavior in read_parquet #9197

Draft · rjzamora wants to merge 12 commits into main
Conversation

rjzamora (Member) commented on Jun 17, 2022:

Closes #9043
Closes #9051
Closes #8829

This PR (mostly) preserves the existing chunksize/aggregate_files options in read_parquet by adding two new arguments:

  • sort_input_paths (default True): Whether Dask should re-order the files in the dataset to use "natural" ordering. This feature could be added in a separate PR, but I wanted to make sure that the new design allows for such an argument to exist.
  • file_groups (default None): A dictionary mapping paths to "file-group" indices, or a list of directory-partitioned column names that must match for two or more files to belong to the same "file group" (see the sketch after this list for the dict form). This PR introduces the file-group concept to dd.read_parquet. The meaning is simple: two files must belong to the same file group for Dask to consider aggregating them into the same output DataFrame partition. Matching file-group membership is necessary, but not sufficient, for file aggregation; there must also be some other option (like aggregate_files=int|True or chunksize) to specify how files should be aggregated within each group. The engines are always allowed to reorder paths by file group to improve file-aggregation behavior (even if sort_input_paths=False). Note that I originally added this option in order to drop support for str arguments to aggregate_files in favor of int support (explained below). However, it is also much more flexible and powerful than the original aggregate_files=<str> behavior.
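Here is a minimal sketch of how the dict form of file_groups might look. The paths are hypothetical placeholders (not part of this PR or any real dataset), and the exact behavior is only as described above:

import dask.dataframe as dd

# Hypothetical file_groups dict: files mapped to the same index belong to the
# same file group, so only those files may be aggregated into one output partition.
file_groups = {
    "data/part.0.parquet": 0,
    "data/part.1.parquet": 0,  # may be combined with part.0.parquet
    "data/part.2.parquet": 1,  # never combined with group-0 files
    "data/part.3.parquet": 1,
}

ddf = dd.read_parquet(
    "data/",
    engine="pyarrow",
    file_groups=file_groups,
    aggregate_files=True,  # still needed to enable aggregation within a group
)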

In addition to these new arguments, I also modified the existing aggregate_files argument to accept only bool or int types. That is, aggregate_files is now the "file equivalent" of split_row_groups: specifying aggregate_files=100 means that up to 100 files from the same file group may be aggregated into the same output partition.

The most important result of this PR is likely the support for aggregate_files=<int> (in combination with file_groups=). For example, in main, one would need to use chunksize (or split_row_groups) with aggregate_files="year" to read a single large partition for each distinct year in a partitioned NYC-taxi dataset:

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://ursa-labs-taxi-data/",
    engine="pyarrow",
    storage_options={"anon": True},
    dataset={
        "partitioning": ["year", "month"],
        "partition_base_dir": "ursa-labs-taxi-data",
    },
    chunksize="6GB",
    aggregate_files="year",
)
# (takes several seconds on my workstation - but scales poorly with the total number of row groups)

However, with this branch, file aggregation can be much faster and simpler:

ddf = dd.read_parquet(
    "s3://ursa-labs-taxi-data/",
    engine="pyarrow",
    storage_options={"anon": True},
    dataset={
        "partitioning": ["year", "month"],
        "partition_base_dir": "ursa-labs-taxi-data",
    },
    file_groups=["year"],
    aggregate_files=True,
)
# (takes less than 1s on my workstation)
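For completeness, a sketch of the aggregate_files=<int> form described above. The value 2 is arbitrary and purely illustrative; it would mean that at most two files from the same "year" group are combined into a single output partition:

ddf = dd.read_parquet(
    "s3://ursa-labs-taxi-data/",
    engine="pyarrow",
    storage_options={"anon": True},
    dataset={
        "partitioning": ["year", "month"],
        "partition_base_dir": "ursa-labs-taxi-data",
    },
    file_groups=["year"],
    aggregate_files=2,  # combine at most 2 files per "year" group (illustrative)
)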

@ian-r-rose self-requested a review on August 2, 2022
charlesbluca (Member) commented:

@rjzamora bumping this - are there any blockers here, or is this ready for review?
