Fix delayed fusing with multiple dependencies #1038
Conversation
dask_expr/io/tests/test_delayed.py
Outdated
def test_from_delayed_fusion():
    df = from_delayed([_load(x) for x in range(10)], meta={"x": "int64", "y": "int64"})
    result = df.map_partitions(lambda x: None, meta={}).optimize().dask
    assert len(result) == 30
can you help me out...? 10 x load + 10 x lambda = 20. What am I missing?
if you want to test that it doesn't fuse, maybe
df.map_partitions(lambda x: None, meta={}).optimize(fuse=False).dask == df.map_partitions(lambda x: None, meta={}).optimize().dask
is a better test. or even
ddf = df.map_partitions(lambda x: None, meta={})
dsk_opt = ddf.optimize().dask
dsk_raw = ddf.lower_completely().dask
Plus 10 x an intermediate task representing the from_delayed step, which isn't fused anymore. Previously that was fused into the map_partitions tasks as well. This is the tradeoff for now.
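The count in the test can be sketched as simple arithmetic (illustrative only; the layer names here are descriptive labels, not dask internals):

```python
# 30 tasks = 10 partitions x 3 layers:
# one load task, one from_delayed intermediate, one map_partitions task each.
npartitions = 10
layers = ["load", "from_delayed-intermediate", "map_partitions"]
total_tasks = npartitions * len(layers)
print(total_tasks)  # 30, matching the assertion in the test
```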
I like that, adjusted
Thank you for the PR! I just wanted to flag that this PR doesn't (yet) change the behavior for the example that I provided in dask/dask#11067:
Yes, we still have some overhead. It works mostly as expected if your delayed function runs longer (e.g. increase the size of your DataFrame), but this will still need a follow-up on our end since the behaviour isn't as intended yet. On a side note: you might want to use from_map if your delayed function is indeed only loading data from disk; it is better suited to your task anyway and doesn't suffer from this problem.
xref dask/dask#11067