-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement map_overlap #406
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together. This PR is relatively large, and you are signaling that it is also time sensitive. Are there specific areas/concerns that you are unsure of or are interested in feedback on?
# Bug in dask/dask | ||
# result = df.map_overlap(func, before=0, after="1D") | ||
# expected = lib.DataFrame([4, 4, 4, 3, 3], index=idx, columns=["a"]) | ||
# assert_eq(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue open?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The complexity is in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still working through this whenever I have free moments. Not seeing any serious problems yet.
@@ -1275,6 +1276,196 @@ def _task(self, index: int): | |||
) | |||
|
|||
|
|||
class MapOverlap(MapPartitions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly a "note to self": I'm wondering if there is any "danger" in treating the pre-lowered version of MapOverlap
as a Blockwise
expression. It is very important that we don't do any partition-related optimizations (e.g. culling) until after this expression is lowered. It seems like we are in the clear for the current optimize
implementation, but it isn't clear to me that we are gaining much by inheriting from MapPartitions
to begin with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Culling in general has to happen after lower, but I agree that this isn't ideal.
Maybe we have to create a new class that sits somewhere in between Expr and Blockwise, but removing this information removes a lot of information that is helpful
dask_expr/_expr.py
Outdated
) | ||
|
||
|
||
class MapOverlapInterleavePartitions(Expr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another note to self: The Interleave
term comes from the fact that we are extracting before/after rows for each partition and effectively "interleaving" the duplicated data between the "real" partitions. I don't love this name, but I also don't really have a better suggestion (maybe CreateOverlappingPartitions
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion either way
I am merging this to unblock rolling work, but happy to iterate further |
Planning on merging this soonish to unblock rolling work, but feedback welcome