New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up get_num_partitions for common cases #21791
Conversation
e7f4552
to
22abf4e
Compare
@@ -1032,6 +1062,28 @@ def is_basic_daily(self) -> bool: | |||
def is_basic_hourly(self) -> bool: | |||
return self.cron_schedule == "0 * * * *" | |||
|
|||
def get_fixed_minute_interval(self) -> Optional[int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this make more sense in some sort of util module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a short docstring would be helpful - i.e. what exactly is a fixed-minute interval?
except ValueError: | ||
return None | ||
|
||
if 60 % interval == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be interval % 60
? Not used to seeing the constant on the left side of this operator.
return None | ||
|
||
try: | ||
interval = int(cron_parts[0][2:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding this a little bit difficult to follow - more specific names or comments with example cron strings would help?
python_modules/dagster/dagster/_core/definitions/time_window_partitions.py
Show resolved
Hide resolved
fixed_minute_interval = self.get_fixed_minute_interval() | ||
if fixed_minute_interval: | ||
return ( | ||
int( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to disregard, but I think this would read more clearly as something like:
minutes_in_window = (last_partition_window.start.timestamp() - first_partition_window.start.timestamp()) / 60
return minutes_in_window // fixed_minute_interval + 1
@@ -1228,6 +1228,19 @@ def my_partitioned_config_2(_start, _end): | |||
== partitions_def.get_partition_keys(current_time=current_time)[50:53] | |||
) | |||
|
|||
partitions_def = TimeWindowPartitionsDefinition( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern with this sort of scheme is how it handles DST weirdnesses and leap years etc. It'd be great to add some tests which span larger time ranges to ensure that we have coverage for those sorts of things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took out the end time so now it includes both! (not really an issue for these types of cronstrings since you can always just advance the time by that many minutes, but good to be certain for sure)
Summary: Test Plan: A representative large asset graph with many partition sets goes from taking 65 second to computing counts on all assets to taking 0.4 seconds to compute counts on all assets
22abf4e
to
4dbd5ef
Compare
did a pass based on feedback - moved to separate module, added comments, made sure new test case spans leap years (there are some existing test cases in that same test that also cover that for hourly and daily partitions definitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
minutes_in_window = ( | ||
last_partition_window.start.timestamp() - first_partition_window.start.timestamp() | ||
) / 60 | ||
return int(minutes_in_window // fixed_minute_interval) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think //
returns an int, so the cast isn't needed
Summary: For daily, hourly, and minute-ly partition sets, do basic math rather than enumerating every single partition key and counting them. Test Plan: A representative large asset graph with many large partition sets goes from taking 65 seconds to computing counts on all assets to taking 0.4 seconds to compute counts on all assets ## Summary & Motivation ## How I Tested These Changes
Summary: For daily, hourly, and minute-ly partition sets, do basic math rather than enumerating every single partition key and counting them. Test Plan: A representative large asset graph with many large partition sets goes from taking 65 seconds to computing counts on all assets to taking 0.4 seconds to compute counts on all assets ## Summary & Motivation ## How I Tested These Changes
Summary:
For daily, hourly, and minute-ly partition sets, do basic math rather than enumerating every single partition key and counting them.
Test Plan: A representative large asset graph with many large partition sets goes from taking 65 seconds to computing counts on all assets to taking 0.4 seconds to compute counts on all assets
Summary & Motivation
How I Tested These Changes