Implement map_overlap #406

phofl · 2023-11-21T15:56:31Z

Planning on merging this soonish to unblock rolling work, but feedback welcome

rjzamora

Thanks for putting this together. This PR is relatively large, and you are signaling that it is also time sensitive. Are there specific areas/concerns that you are unsure of or are interested in feedback on?

dask_expr/_expr.py

rjzamora · 2023-11-21T16:17:47Z

dask_expr/tests/test_collection.py

+    # Bug in dask/dask
+    # result = df.map_overlap(func, before=0, after="1D")
+    # expected = lib.DataFrame([4, 4, 4, 3, 3], index=idx, columns=["a"])
+    # assert_eq(result, expected)


Is there an issue open?

dask/dask#10639

phofl · 2023-11-21T16:45:49Z

The complexity is in class MapOverlapInterleavePartitions(Expr); everything else is fairly straightforward, so a second set of eyes there would be appreciated.

rjzamora

Still working through this whenever I have free moments. Not seeing any serious problems yet.

rjzamora · 2023-11-21T18:13:54Z

dask_expr/_expr.py

@@ -1275,6 +1276,196 @@ def _task(self, index: int):
            )


+class MapOverlap(MapPartitions):


Mostly a "note to self": I'm wondering if there is any "danger" in treating the pre-lowered version of MapOverlap as a Blockwise expression. It is very important that we don't do any partition-related optimizations (e.g. culling) until after this expression is lowered. It seems like we are in the clear with the current optimize implementation, but it isn't clear to me that we gain much by inheriting from MapPartitions to begin with.

Culling in general has to happen after lower, but I agree that this isn't ideal.

Maybe we have to create a new class that sits somewhere between Expr and Blockwise, but dropping the MapPartitions inheritance would throw away a lot of helpful information.
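
To make the culling concern above concrete, here is a toy model of partition dependencies. It is only an illustration: the function names are hypothetical and nothing here reflects how dask-expr actually implements culling. It shows that if an overlapping operation were culled as though it were blockwise, the neighbouring input partitions it needs would be dropped.

```python
def blockwise_deps(i, npartitions):
    """Input partitions a purely blockwise op needs for output partition i."""
    return {i}


def overlap_deps(i, npartitions, before=True, after=True):
    """Input partitions an overlapping op needs for output partition i.

    Hypothetical helper: assumes the overlap never reaches further than
    the immediate neighbour on either side.
    """
    deps = {i}
    if before and i > 0:
        deps.add(i - 1)
    if after and i + 1 < npartitions:
        deps.add(i + 1)
    return deps


# Partitions that a blockwise-style cull would wrongly discard
# when only output partition 2 (of 4) is requested.
missing = overlap_deps(2, 4) - blockwise_deps(2, 4)
```

In this model `missing` contains the two neighbouring partitions, which is why the cull has to wait until the expression has been lowered into a form that exposes those dependencies.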

rjzamora · 2023-11-21T18:25:01Z

dask_expr/_expr.py

+        )
+
+
+class MapOverlapInterleavePartitions(Expr):


Another note to self: The Interleave term comes from the fact that we are extracting before/after rows for each partition and effectively "interleaving" the duplicated data between the "real" partitions. I don't love this name, but I also don't really have a better suggestion (maybe CreateOverlappingPartitions?)

No strong opinion either way
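
For readers following along, the "interleaving" idea described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the code in this PR: `overlap_partitions` and `map_overlap_sketch` are hypothetical names, `before` and `after` are integer row counts only (no time-based offsets such as "1D"), and each neighbouring partition is assumed to hold at least that many rows.

```python
import pandas as pd


def overlap_partitions(parts, before, after):
    """Yield (padded, n_before, n_after) for each partition in ``parts``."""
    for i, part in enumerate(parts):
        pieces = []
        n_before = n_after = 0
        if before and i > 0:
            # Rows borrowed from the end of the previous partition.
            prev_rows = parts[i - 1].tail(before)
            pieces.append(prev_rows)
            n_before = len(prev_rows)
        pieces.append(part)
        if after and i + 1 < len(parts):
            # Rows borrowed from the start of the next partition.
            next_rows = parts[i + 1].head(after)
            pieces.append(next_rows)
            n_after = len(next_rows)
        yield pd.concat(pieces), n_before, n_after


def map_overlap_sketch(func, parts, before, after):
    """Apply ``func`` to each padded partition, then trim the borrowed rows."""
    out = []
    for padded, n_before, n_after in overlap_partitions(parts, before, after):
        result = func(padded)
        # Drop the rows that only existed to provide context.
        out.append(result.iloc[n_before : len(result) - n_after])
    return out


# Usage: a 2-row rolling sum needs one row of context from the previous partition.
df = pd.DataFrame({"a": range(6)})
parts = [df.iloc[:3], df.iloc[3:]]
rolled = pd.concat(
    map_overlap_sketch(lambda d: d.rolling(2).sum(), parts, before=1, after=0)
)
```

The borrowed rows are the duplicated data that sits "between" the real partitions; trimming them after `func` runs keeps each output partition aligned with its original input partition.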

phofl · 2023-11-22T10:00:10Z

I am merging this to unblock rolling work, but happy to iterate further

Implement map_overlap

baf12ea

phofl requested a review from rjzamora November 21, 2023 15:56

rjzamora reviewed Nov 21, 2023


Rename class

3457fd7

phofl merged commit a620d8c into dask:main Nov 22, 2023
7 checks passed

phofl deleted the map_overlap branch November 22, 2023 10:00
