
Pandas-only P2P shuffling #8635

Closed · wants to merge 27 commits

Conversation

hendrikmakait (Member)

Supersedes #8606

  • Tests added / passed
  • Passes pre-commit run --all-files

Comment on lines +48 to +49
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
hendrikmakait (Member, Author):

Reminder to revert 712e211

github-actions bot (Contributor) commented May 6, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 29 files (+2), 29 suites (+2), 10h 21m 53s ⏱️ (+1h 22m 10s)
  • 4 053 tests (-3): 3 784 ✅ (-159), 168 💤 (+62), 100 ❌ (+97), 1 🔥 (-3)
  • 51 387 runs (+7 177): 48 343 ✅ (+5 922), 2 857 💤 (+1 088), 185 ❌ (+182), 2 🔥 (-15)

For more details on these failures and errors, see this check.

Results for commit 98b1c32. ± Comparison against base commit 1ec61a1.

This pull request removes 141 tests and adds 138. Note that renamed tests count towards both.
distributed.shuffle.tests.test_graph ‑ test_does_not_raise_on_stringified_numeric_column_name
distributed.shuffle.tests.test_graph ‑ test_raise_on_complex_numbers[cdouble]
distributed.shuffle.tests.test_graph ‑ test_raise_on_complex_numbers[clongdouble]
distributed.shuffle.tests.test_graph ‑ test_raise_on_complex_numbers[csingle]
distributed.shuffle.tests.test_graph ‑ test_raise_on_custom_objects
distributed.shuffle.tests.test_graph ‑ test_raise_on_non_string_column_name
distributed.shuffle.tests.test_graph ‑ test_raise_on_sparse_data
distributed.shuffle.tests.test_merge ‑ test_merge[False-inner]
distributed.shuffle.tests.test_merge ‑ test_merge[False-left]
distributed.shuffle.tests.test_merge ‑ test_merge[False-outer]
…
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0]
distributed.shuffle.tests.test_merge ‑ test_merge[disk-inner]
distributed.shuffle.tests.test_merge ‑ test_merge[disk-left]
…
This pull request removes 1 skipped test and adds 59 skipped tests. Note that renamed tests count towards both.
distributed.shuffle.tests.test_graph ‑ test_raise_on_custom_objects
distributed.shuffle.tests.test_merge ‑ test_merge[memory-inner]
distributed.shuffle.tests.test_merge ‑ test_merge[memory-left]
distributed.shuffle.tests.test_merge ‑ test_merge[memory-outer]
distributed.shuffle.tests.test_merge ‑ test_merge[memory-right]
distributed.shuffle.tests.test_shuffle ‑ test_basic_integration[memory-1]
distributed.shuffle.tests.test_shuffle ‑ test_basic_integration[memory-20]
distributed.shuffle.tests.test_shuffle ‑ test_basic_integration[memory-None]
distributed.shuffle.tests.test_shuffle ‑ test_basic_lowlevel_shuffle[False-memory-False-1-1-10]
distributed.shuffle.tests.test_shuffle ‑ test_basic_lowlevel_shuffle[False-memory-False-1-1-1]
distributed.shuffle.tests.test_shuffle ‑ test_basic_lowlevel_shuffle[False-memory-False-1-10-10]
…
This pull request skips 5 tests and un-skips 1.
distributed.shuffle.tests.test_merge
distributed.shuffle.tests.test_metrics
distributed.shuffle.tests.test_shuffle
distributed.shuffle.tests.test_shuffle ‑ test_processing_chain[False]
distributed.shuffle.tests.test_shuffle ‑ test_processing_chain[True]
distributed.tests.test_nanny ‑ test_no_unnecessary_imports_on_worker[pandas]

♻️ This comment has been updated with latest results.

rjzamora (Member) commented May 6, 2024

@hendrikmakait - Is there any background or motivation you can share about this work? Seems very pandas-specific at the moment.

index: pd.Index, blocks: Sequence[Block], meta: pd.DataFrame
) -> pd.DataFrame:
import pandas as pd
from pandas.core.internals import BlockManager, make_block
rjzamora (Member):

Would be nice in the long run if any logic using pandas internals could be dispatched.
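
For illustration, a minimal sketch (not from this PR; the function name and reconstruction details are hypothetical) of how logic like the snippet above could be routed through dask's existing Dispatch utility so that other backends can register their own implementations:

    import pandas as pd
    from dask.utils import Dispatch

    # Hypothetical dispatch point; Dispatch resolves on the type of the
    # first argument, so cudf could register against cudf.DataFrame.
    shard_from_blocks = Dispatch("shard_from_blocks")

    @shard_from_blocks.register(pd.DataFrame)
    def _shard_from_blocks_pandas(meta, index, blocks):
        from pandas.core.internals import BlockManager

        # Rebuild the dataframe from its raw blocks, mirroring the diff above.
        return pd.DataFrame(BlockManager(tuple(blocks), [meta.columns, index]))

dask-cudf could then register a cudf-specific implementation without distributed ever importing cudf.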

@@ -22,7 +22,7 @@ dependencies:
# Distributed depends on the latest version of Dask
- pip
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/hendrikmakait/dask@p2p-pandas
hendrikmakait (Member, Author):

Reminder to revert 814c73a

mrocklin (Member) commented May 6, 2024

> Is there any background or motivation you can share about this work? Seems very pandas-specific at the moment.

I'll let @hendrikmakait answer fully, but my sense is that early experiments showed substantial speedups (like 30% on full TPC-H benchmarks, not just on the shuffling portion).

> Would be nice in the long run if any logic using pandas internals could be dispatched.

I can see how this would be convenient for RAPIDS folks (although probably a little inconvenient for others due to complexity). If we go this direction, then maybe helping address/reduce this complexity is something that RAPIDS people could help design/execute/maintain?

rjzamora (Member) commented May 6, 2024

> my sense is that early experiments showed substantial speedups (like 30% on full TPC-H benchmarks, not just on the shuffling portion).

Very nice!

> I can see how this would be convenient for RAPIDS folks (although probably a little inconvenient for others due to complexity).

This is a breaking change for RAPIDS "as is". I suppose it would be "convenient" for RAPIDS to continue working with P2P shuffling :)

> If we go this direction then maybe helping address/reduce this complexity is something that RAPIDS people could help design/execute/maintain?

I will certainly help with all of the above if doing so is both welcome and feasible. Feasibility probably depends on it being possible to isolate the logic that interacts with pandas internals. If organizing the code in this way really is "too complex", then cudf may need to rely on a completely distinct version of "p2p" to avoid slowing down Coiled engineers. However, my hope is that this is a natural way to organize things even if cudf weren't in the picture. (Fingers crossed!)

For this PR, it seems fine to focus on pandas-only details (especially for now). @hendrikmakait - Perhaps we can meet offline so I can figure out whether RAPIDS needs to prioritize (1) rolling back p2p support, or (2) enabling the pyarrow-free approach with cudf-backed data.

hendrikmakait (Member, Author):

> @hendrikmakait - Is there any background or motivation you can share about this work? Seems very pandas-specific at the moment.

There are three main benefits I see here:

  1. As @mrocklin mentioned, we have seen significant performance improvements with this. I plan to share some benchmarks later. (The Arrow-based implementation was never optimized for performance and has some shortcomings, such as excessive data copying.)
  2. It removes the Arrow dependency, a consistent source of pain because its semantics differ from pandas and its typing is strict.
  3. It removes complexity that never worked as intended, e.g., disk-write buffering.

> For this PR, it seems fine to focus on pandas-only details (especially for now). @hendrikmakait - Perhaps we can meet offline so I can figure out whether RAPIDS needs to prioritize (1) rolling back p2p support, or (2) enabling the pyarrow-free approach with cudf-backed data.

Sounds good! I'll ping you offline. Looking at the code we currently have in place, I expect it to be fairly straightforward either to dispatch the library-specific pieces of code, or to dispatch the instantiation of shuffle runs and implement a cuDF-specific version.

concatenate3.
"""
with path.open(mode="r+b") as fh:
buffer = memoryview(mmap.mmap(fh.fileno(), 0))
hendrikmakait (Member, Author):

The mmap causes us to lose strict control over the file-descriptor lifecycle. This creates problems with cleanup on Windows, where we can't remove files that still have an open file descriptor or mapping.
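
A standalone sketch of the failure mode (illustrative, not code from this PR): a live memory map keeps the file locked on Windows even after the file handle is closed, so cleanup has to tear down every layer explicitly.

    import mmap
    import os
    import tempfile

    fd, path = tempfile.mkstemp()
    os.write(fd, b"shard data")
    os.close(fd)

    with open(path, "r+b") as fh:
        mm = mmap.mmap(fh.fileno(), 0)
    buffer = memoryview(mm)

    # On Windows, os.remove(path) would raise PermissionError at this point:
    # the mapping outlives the file handle, so the OS considers the file in use.
    buffer.release()  # mm.close() raises BufferError while views are exported
    mm.close()
    os.remove(path)  # only now can the file be deleted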

t = t.drop([column])
splits = np.where(partition[1:] != partition[:-1])[0] + 1
splits = np.concatenate([[0], splits])
base = df[column].values.base # type: ignore[union-attr]
rjzamora (Member):
Can you explain the base logic a bit?
In cudf/cupy, df[column].values.base produces None.

Does it make sense to protect against the case where base is None here? (Otherwise, len(base) will produce an error.)
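
For concreteness, a hypothetical guard along these lines (the right fallback depends on what the surrounding code uses base for):

    arr = df[column].values
    # cupy-backed columns return base=None, unlike a pandas block slice;
    # fall back to the array itself so the len(base) check stays valid.
    base = arr.base if arr.base is not None else arr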

Comment on lines +60 to +62
return pickle_bytelist(
(input_part_id, shard.index, *shard._mgr.blocks), prelude=False
)
rjzamora (Member) commented May 8, 2024:

Fun fact: This PR works fine with cudf-backed data when I do something like the following:

    shard = shard.to_pandas() if hasattr(shard, "to_pandas") else shard
    return pickle_bytelist(
        (input_part_id, shard.index, *shard._mgr.blocks), prelude=False
    )

as long as I also modify the last line of unpickle_and_concat_dataframe_shards to be:

    result = dd.methods.concat(shards, copy=True)
    if hasattr(type(meta), "from_pandas"):
        return type(meta).from_pandas(result)
    return result

I'm sure there are cases where the cudf->pandas->cudf round trip isn't perfect, but the small amount of logic I needed to work around the changes is promising. (The performance is certainly not good, but that's a separate concern.)

rjzamora (Member):

I've been experimenting with this PR using cudf-backed data. I think it will take some work on the RAPIDS side to bring performance in line with main (TPC-H queries take a ~2x perf hit with dask-cudf), but I wouldn't consider cudf + p2p performance to be a big priority yet. My near-term priority is to keep cudf + p2p "functional". This can be achieved in a fairly simple way by defining/using the following dispatch functions:

def deconstruct_dataframe_shard(shard: pd.DataFrame) -> tuple[Any, ...]:
    """Deconstruct a DataFrame shard into the information needed to reconstruct it later.

    Dispatch on the type of `shard`.
    The elements of the result should be whatever is needed by `restore_dataframe_shard`.
    """
    ...


def restore_dataframe_shard(meta: pd.DataFrame, unpickled_shard: tuple[Any, ...]) -> pd.DataFrame:
    """Reconstruct an un-pickled DataFrame shard.

    Dispatch on the type of `meta`.
    `unpickled_shard` corresponds to the round-tripped output of `deconstruct_dataframe_shard`.
    Converts a tuple of un-pickled data (e.g. index and blocks) back to a DataFrame.
    """
    ...

For the sake of generality, pickle_dataframe_shard would look something like:

def pickle_dataframe_shard(input_part_id: int, shard: pd.DataFrame) -> list[pickle.PickleBuffer]:
    return pickle_bytelist(
        (input_part_id,) + deconstruct_dataframe_shard(shard),
        prelude=False,
    )

and unpickle_and_concat_dataframe_shards would then look something like:

def unpickle_and_concat_dataframe_shards(
    b: bytes | bytearray | memoryview, meta: pd.DataFrame
) -> pd.DataFrame:
    import dask.dataframe as dd

    unpickled_shards = list(unpickle_bytestream(b))
    unpickled_shards.sort(key=first)
    shards = []

    for _, unpickled_shard in unpickled_shards:
        shards.append(restore_dataframe_shard(meta, unpickled_shard))

    return dd.methods.concat(shards, copy=True)

I suppose the only "complex" part of this general suggestion is that the dispatch functions themselves would need to be exposed/defined in dask.dataframe.
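
For illustration, a sketch of how those two functions could be exposed with dask's Dispatch utility and registered for pandas (the names come from the sketch above; none of this exists in dask yet):

    import pandas as pd
    from dask.utils import Dispatch

    deconstruct_dataframe_shard = Dispatch("deconstruct_dataframe_shard")
    restore_dataframe_shard = Dispatch("restore_dataframe_shard")

    @deconstruct_dataframe_shard.register(pd.DataFrame)
    def _deconstruct_pandas(shard):
        # The pandas backend ships the index plus the raw internal blocks.
        return (shard.index, *shard._mgr.blocks)

    @restore_dataframe_shard.register(pd.DataFrame)
    def _restore_pandas(meta, unpickled_shard):
        from pandas.core.internals import BlockManager

        index, *blocks = unpickled_shard
        return pd.DataFrame(BlockManager(tuple(blocks), [meta.columns, index]))

dask-cudf would register its own pair against cudf.DataFrame, keeping pandas internals out of the cudf code path.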

hendrikmakait (Member, Author) commented May 22, 2024

TL;DR: After running several A/B tests and profiling the code, I have concluded not to move forward with this PR because it is not a universal improvement.

Analysis

A/B tests have shown improvements ranging from 5%-30% in end-to-end runtime for almost all TPC-H queries:
[Screenshot: A/B test results across TPC-H queries]

However, they have also shown disastrous results for test_set_index and, to a lesser extent, test_set_index_uber_lyft.

[Screenshot: A/B test results for test_set_index and test_set_index_uber_lyft]

It looks like this PR overfits to OLAP-style queries, which involve narrow dataframes with few columns and heavily reduce data, at the cost of ETL-style transformations, which involve wide dataframes with many columns.

Profiling the code before and after, I saw that both the transfer and the unpack phase have become significantly more expensive in test_set_index (by factors of 3.5x and 2.4x, respectively), and we lose much time to what looks like GIL contention. The Coiled clusters I ran these on also report GIL contention very close to 1 (and higher than on main).

This is likely due to many small shards with very few rows. I have tried multiple possible quick wins, such as removing mmap to avoid the impact of page loads during GIL-restricted code paths, or using df.drop() instead of del, but none of these had any positive impact. It's currently unclear how to reduce this implementation's GIL contention enough that ETL-style workloads are not hit as harshly. There may be upstream improvements we could make that reduce GIL contention, but those would also take a while to get released. I also suspect that this is yet another example of the convoy effect, so reducing GIL contention in some parts could have non-trivial effects, as it did in previous attempts at improving P2P performance.

hendrikmakait (Member, Author):

Example py-spy profiles for future reference:

profiles.zip
