Shuffle-based groupby aggregation by default #9406
Just checking in here. Based on the comment from @ian-r-rose in #9302 (comment), it sounds like we should move to turn on the shuffle-based groupby by default.
I'm going to try to run some more targeted tests today and tomorrow, but broadly speaking, I think this is correct. It is likely not an improvement when the group cardinality is low.
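For background, the existing default follows an apply-concat-apply (tree reduction) pattern: each partition produces a small partial aggregate, and partials are merged `split_every` at a time. Here is a rough pure-Python sketch of that pattern (the `chunk`/`combine` helpers and the row layout are mine for illustration, not dask internals), which shows why it stays cheap when there are few distinct groups, since every partial result is tiny:

```python
from collections import Counter

def chunk(partition):
    # "apply": per-partition partial aggregation of (key, value) rows
    c = Counter()
    for key, x in partition:
        c[key] += x
    return c

def combine(partials, split_every=2):
    # "concat" + "apply": merge `split_every` partials at a time,
    # building a reduction tree until one result remains
    while len(partials) > 1:
        partials = [
            sum(partials[i:i + split_every], Counter())
            for i in range(0, len(partials), split_every)
        ]
    return partials[0]

parts = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 5), ("c", 1)]]
result = combine([chunk(p) for p in parts])
# result == Counter({"a": 4, "b": 7, "c": 1})
```

When the number of distinct keys is large, each partial is nearly as big as its input partition, so every level of the tree re-materializes most of the data, which is where a shuffle-based approach can pull ahead.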
As @ian-r-rose mentioned, there are still cases where the current default wins. Another alternative may be to introduce a more intuitive knob for users to turn (and to discourage the explicit use of the low-level `shuffle=` keyword).
I took a closer look at whether it ever makes sense to have shuffle-based groupby turned off at high cardinality. I tested a high-cardinality groupby on a ~50 GB, 600MM-row, 7000-partition timeseries dataset using a cluster of 15 AWS instances:
```python
@pytest.mark.parametrize("shuffle", [False, "tasks"])
@pytest.mark.parametrize("n_groups", [50_000_000, 100_000_000, 200_000_000])
@pytest.mark.parametrize("split_out", [2, 4, 8, 16])
@pytest.mark.parametrize("partition_freq", ["1d"])
def test_groupby(groupby_client, split_out, n_groups, shuffle, partition_freq):
    ddf = timeseries_of_size(
        cluster_memory(groupby_client) // 4,
        id_maximum=n_groups,
        partition_freq=partition_freq,
    )  # ~600MM rows
    print(ddf)
    agg = ddf.groupby("id").agg(
        {"x": "sum", "y": "mean"},
        split_out=split_out,
        split_every=2,
        shuffle=shuffle,
    )
    print(len(agg))
```

Here is a faceted chart showing the results:

Note that the aca-based groupby/agg is missing for some of the lower parameter values.

The main thing to note is that the shuffle-based groupby wins in all cases. If I were to go to lower cardinalities, the aca-based groupby would start winning, but I stand by the initial assertion that shuffling makes sense basically whenever the group cardinality is high.

**Caveat**

Okay, there is a wrinkle to the above story, and that has to do with how we repartition before shuffling. We come along and repartition according to dask/dask/dataframe/groupby.py, lines 2499 to 2502 (at 19a5147):
Even worse, I don't think the implementation as it stands today provides a way to keep the same number of partitions before shuffling. It always repartitions to a smaller number. I think we should allow for keeping the input partition count through the shuffle. What do you think @rjzamora?
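To make the concern concrete, here is a tiny hypothetical sketch (the function and its arguments are illustrative, not dask's API) of how repartitioning first shrinks the number of partitions the expensive aggregation step actually runs on:

```python
def plan_partitions(npartitions_in, split_out, repartition_first):
    # Number of partitions the shuffle/aggregation actually operates on.
    # Repartitioning down to split_out first collapses parallelism before
    # any of the expensive per-group work happens.
    return min(npartitions_in, split_out) if repartition_first else npartitions_in

# With the current behaviour the shuffle sees only split_out partitions:
print(plan_partitions(7000, 16, repartition_first=True))   # 16
# Keeping the input partition count would preserve parallelism:
print(plan_partitions(7000, 16, repartition_first=False))  # 7000
```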
In #9302, @rjzamora added a new shuffle-based groupby algorithm which @ian-r-rose demonstrated has significant performance improvements in certain cases (#9302 (comment)). Today this new shuffle-based groupby is gated behind an optional `shuffle=` keyword argument (off by default). We should probably turn it on by default in cases where it makes sense.

Here's a brief summary of @ian-r-rose's findings (see #9302 (comment) for more details).
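For intuition, here is a minimal pure-Python sketch of the shuffle-based strategy (illustrative only, not dask's implementation): hash-partition rows by group key into `split_out` buckets so each key lands wholly in one output partition, then aggregate each bucket independently in a single pass:

```python
from collections import defaultdict

def shuffle_groupby(partitions, split_out):
    # Step 1 (the "shuffle"): route every (key, value) row to an output
    # bucket by hashing the group key, so all rows for a given key end
    # up in the same output partition.
    buckets = [defaultdict(list) for _ in range(split_out)]
    for part in partitions:
        for key, x in part:
            buckets[hash(key) % split_out][key].append(x)
    # Step 2: aggregate each output partition independently, in one
    # pass, with no tree of combine steps.
    return [{k: sum(v) for k, v in b.items()} for b in buckets]

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
out = shuffle_groupby(parts, split_out=2)
merged = {}
for d in out:
    merged.update(d)
# merged == {"a": 4, "b": 2, "c": 4}
```

Because each output partition is aggregated without a cross-partition combine tree, the cost scales with the data size rather than with the number of distinct groups times the depth of the reduction tree.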
Opening this issue so we don't lose track of this topic.