
Fix delayed in fusing with multiple dependencies #1038

Merged
merged 4 commits into dask:main from 11067 on Apr 30, 2024

Conversation

@phofl (Collaborator) commented Apr 25, 2024

def test_from_delayed_fusion():
    df = from_delayed([_load(x) for x in range(10)], meta={"x": "int64", "y": "int64"})
    result = df.map_partitions(lambda x: None, meta={}).optimize().dask
    assert len(result) == 30
Member commented:

can you help me out...? 10 x load + 10 x lambda = 20. What am I missing?

Member commented:

if you want to test that it doesn't fuse, maybe

df.map_partitions(lambda x: None, meta={}).optimize(fuse=False).dask == df.map_partitions(lambda x: None, meta={}).optimize().dask

is a better test. or even

ddf = df.map_partitions(lambda x: None, meta={})
dsk_opt = ddf.optimize().dask
dsk_raw = ddf.lower_completely().dask

phofl (Collaborator, Author) commented:

10 x an intermediate task representing the from_delayed step, which isn't fused anymore. Previously that was fused into the map_partitions task as well. This is the tradeoff for now.
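
For illustration, a self-contained sketch of that counting (assuming from_delayed is importable from the dask_expr top level; _load here is a hypothetical stand-in for the loader defined in the test module):

import pandas as pd
from dask import delayed
from dask_expr import from_delayed

@delayed
def _load(x):
    # hypothetical stand-in for the test module's loader
    return pd.DataFrame({"x": [x], "y": [x]})

df = from_delayed([_load(x) for x in range(10)], meta={"x": "int64", "y": "int64"})
graph = df.map_partitions(lambda x: None, meta={}).optimize().dask
# 10 delayed tasks + 10 intermediate from_delayed tasks + 10 map_partitions tasks
assert len(graph) == 30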

phofl (Collaborator, Author) commented:

I like that, adjusted

@jtilly commented Apr 29, 2024

Thank you for the PR!

I just wanted to flag that this PR doesn't (yet) change the behavior for the example that I provided in dask/dask#11067:

(dask) ➜  ~/dask (main) ✗ pip install git+https://github.com/phofl/dask-expr@11067
...
(dask) ➜  ~/dask (main) ✗ python example.py                                             
Loading chunk 9
Loading chunk 8
Loading chunk 7
Loading chunk 6
Loading chunk 5
Loading chunk 4
Loading chunk 3
Loading chunk 2
Loading chunk 1
Loading chunk 0
Storing chunk 9
Storing chunk 8
Storing chunk 7
Storing chunk 6
Storing chunk 5
Storing chunk 4
Storing chunk 3
Storing chunk 2
Storing chunk 1
Storing chunk 0
(dask) ➜  ~/dask (main) ✗ DASK_DATAFRAME__QUERY_PLANNING=False python example.py                     
Loading chunk 9
Storing chunk 9
Loading chunk 8
Storing chunk 8
Loading chunk 7
Storing chunk 7
Loading chunk 6
Storing chunk 6
Loading chunk 5
Storing chunk 5
Loading chunk 4
Storing chunk 4
Loading chunk 3
Storing chunk 3
Loading chunk 2
Storing chunk 2
Loading chunk 1
Storing chunk 1
Loading chunk 0
Storing chunk 0

@phofl (Collaborator, Author) commented Apr 29, 2024

Yes, we still have some overhead. It works mostly as expected if your delayed function runs longer (e.g. if you increase the size of your DataFrame), but this will still need a follow-up on our end since the behaviour isn't as intended yet.

On a side note: you might want to use from_map if your delayed function is indeed only loading data from disk; it is better suited for that use case and doesn't suffer from this problem.
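
For reference, a minimal sketch of what that could look like with from_map (load_chunk is a hypothetical stand-in for the delayed loader in dask/dask#11067):

import pandas as pd
import dask.dataframe as dd

def load_chunk(i):
    # hypothetical stand-in for a function that reads one chunk from disk
    print(f"Loading chunk {i}")
    return pd.DataFrame({"x": [i], "y": [i]})

# one task per element of the iterable, with no intermediate from_delayed layer to fuse
ddf = dd.from_map(
    load_chunk,
    range(10),
    meta=pd.DataFrame({"x": pd.Series(dtype="int64"), "y": pd.Series(dtype="int64")}),
)

Because each partition is produced directly by load_chunk, downstream per-partition work such as storing a chunk can run as soon as that chunk is loaded.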

@phofl merged commit 4854a85 into dask:main on Apr 30, 2024
7 checks passed
@phofl deleted the 11067 branch on April 30, 2024 10:19