Implement map_overlap #406

Merged
merged 2 commits into from
Nov 22, 2023

Conversation

Collaborator

@phofl phofl commented Nov 21, 2023

Planning to merge this soon to unblock the rolling work, but feedback is welcome.

@phofl phofl requested a review from rjzamora November 21, 2023 15:56
Member

@rjzamora rjzamora left a comment

Thanks for putting this together. This PR is relatively large, and you are signaling that it is also time sensitive. Are there specific areas/concerns that you are unsure of or are interested in feedback on?

dask_expr/_expr.py (resolved)
Comment on lines +1311 to +1314
# Bug in dask/dask
# result = df.map_overlap(func, before=0, after="1D")
# expected = lib.DataFrame([4, 4, 4, 3, 3], index=idx, columns=["a"])
# assert_eq(result, expected)
Member

Is there an issue open?

Collaborator Author

Collaborator Author

phofl commented Nov 21, 2023

The complexity is in class MapOverlapInterleavePartitions(Expr); everything else is fairly straightforward, so a second pair of eyes would be appreciated there.

Member

@rjzamora rjzamora left a comment

Still working through this whenever I have free moments. Not seeing any serious problems yet.

@@ -1275,6 +1276,196 @@ def _task(self, index: int):
)


class MapOverlap(MapPartitions):
Member

Mostly a "note to self": I'm wondering whether there is any "danger" in treating the pre-lowered version of MapOverlap as a Blockwise expression. It is very important that we don't do any partition-related optimizations (e.g. culling) until after this expression is lowered. We seem to be in the clear with the current optimize implementation, but it isn't clear to me that we gain much by inheriting from MapPartitions to begin with.

Collaborator Author

Culling in general has to happen after lowering, but I agree that this isn't ideal.

Maybe we have to create a new class that sits somewhere between Expr and Blockwise, but dropping the MapPartitions inheritance would throw away a lot of helpful information.
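
To make the culling concern concrete, here is a rough sketch in plain Python. It is not dask-expr code: `required_input_partitions` is a hypothetical helper, and it assumes integer `before`/`after` values and an overlap that never reaches past the immediate neighbour. The point is only that, before lowering, output partition `i` cannot be treated as depending on input partition `i` alone, which is what a Blockwise-style culling step would assume.

```python
def required_input_partitions(index, npartitions, before, after):
    """Hypothetical helper: input partitions needed for one output partition.

    Assumes the overlap only ever reaches into the immediate neighbours.
    """
    needed = [index]
    # Rows from the end of the previous partition are needed when before > 0.
    if before and index > 0:
        needed.insert(0, index - 1)
    # Rows from the start of the next partition are needed when after > 0.
    if after and index < npartitions - 1:
        needed.append(index + 1)
    return needed


# A naive blockwise cull that keeps only input partition 1 for output
# partition 1 would drop the neighbours that supply the overlap rows.
middle = required_input_partitions(1, npartitions=3, before=1, after=1)
first = required_input_partitions(0, npartitions=3, before=1, after=0)
```

With no overlap at all (`before=0`, `after=0`) the helper returns just `[index]`, which is the only case where the plain Blockwise assumption holds.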

)


class MapOverlapInterleavePartitions(Expr):
Member

Another note to self: The Interleave term comes from the fact that we are extracting before/after rows for each partition and effectively "interleaving" the duplicated data between the "real" partitions. I don't love this name, but I also don't really have a better suggestion (maybe CreateOverlappingPartitions?)
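
For readers following along, the idea can be sketched with plain lists standing in for partitions. This is only an illustration under stated assumptions, not the implementation in this PR: `overlapping_partitions`, `map_overlap_sketch` and `rolling_sum_2` are hypothetical names, `before`/`after` are row counts, and the overlap is taken from the immediate neighbours only.

```python
def overlapping_partitions(partitions, before, after):
    """Pair each partition with the rows borrowed from its neighbours."""
    extended = []
    last = len(partitions) - 1
    for i, part in enumerate(partitions):
        # Last `before` rows of the previous partition (none for the first).
        head = partitions[i - 1][-before:] if i > 0 and before > 0 else []
        # First `after` rows of the next partition (none for the last).
        tail = partitions[i + 1][:after] if i < last and after > 0 else []
        extended.append((head, part, tail))
    return extended


def map_overlap_sketch(func, partitions, before, after):
    """Apply func to each extended partition, then trim the borrowed rows."""
    out = []
    for head, part, tail in overlapping_partitions(partitions, before, after):
        result = func(head + part + tail)
        out.append(result[len(head):len(result) - len(tail)])
    return out


def rolling_sum_2(values):
    """Sum each value with its predecessor; the first value has none."""
    if not values:
        return []
    return [values[0]] + [values[j - 1] + values[j] for j in range(1, len(values))]


parts = [[1, 2, 3], [4, 5, 6], [7, 8]]
per_partition = map_overlap_sketch(rolling_sum_2, parts, before=1, after=0)
flat = [x for chunk in per_partition for x in chunk]
```

Here `flat` matches `rolling_sum_2` applied to the unpartitioned list, which is the property the borrowed rows are there to preserve.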

Collaborator Author

No strong opinion either way

Collaborator Author

phofl commented Nov 22, 2023

I am merging this to unblock rolling work, but happy to iterate further

@phofl phofl merged commit a620d8c into dask:main Nov 22, 2023
7 checks passed
@phofl phofl deleted the map_overlap branch November 22, 2023 10:00