Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up get_num_partitions for common cases #21791

Merged
merged 1 commit into from May 17, 2024
Merged

Conversation

gibsondan
Copy link
Member

@gibsondan gibsondan commented May 10, 2024

Summary:
For daily, hourly, and minute-ly partition sets, do basic math rather than enumerating every single partition key and counting them.

Test Plan: A representative large asset graph with many large partition sets goes from taking 65 seconds to computing counts on all assets to taking 0.4 seconds to compute counts on all assets

Summary & Motivation

How I Tested These Changes

@gibsondan gibsondan force-pushed the getnumpartitions branch 2 times, most recently from e7f4552 to 22abf4e Compare May 10, 2024 20:47
@@ -1032,6 +1062,28 @@ def is_basic_daily(self) -> bool:
def is_basic_hourly(self) -> bool:
return self.cron_schedule == "0 * * * *"

def get_fixed_minute_interval(self) -> Optional[int]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this make more sense in some sort of util module?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a short docstring would be helpful - i.e. what exactly is a fixed-minute interval?

except ValueError:
return None

if 60 % interval == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be interval % 60? Not used to seeing the constant on the left side of this operator.

return None

try:
interval = int(cron_parts[0][2:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding this a little bit difficult to follow - more specific names or comments with example cron strings would help?

fixed_minute_interval = self.get_fixed_minute_interval()
if fixed_minute_interval:
return (
int(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to disregard, but I think this would read more clearly as something like:

minutes_in_window = (last_partition_window.start.timestamp() - first_partition_window.start.timestamp()) / 60
return minutes_in_window // fixed_minute_interval + 1

@@ -1228,6 +1228,19 @@ def my_partitioned_config_2(_start, _end):
== partitions_def.get_partition_keys(current_time=current_time)[50:53]
)

partitions_def = TimeWindowPartitionsDefinition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern with this sort of scheme is how it handles DST weirdnesses and leap years etc. It'd be great to add some tests which span larger time ranges to ensure that we have coverage for those sorts of things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took out the end time so now it includes both! (not really an issue for these types of cronstrings since you can always just advance the time by that many minutes, but good to be certain for sure)

Summary:

Test Plan: A representative large asset graph with many partition sets goes from taking 65 second to computing counts on all assets to taking 0.4 seconds to compute counts on all assets
@gibsondan
Copy link
Member Author

did a pass based on feedback - moved to separate module, added comments, made sure new test case spans leap years (there are some existing test cases in that same test that also cover that for hourly and daily partitions definitions)

Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

minutes_in_window = (
last_partition_window.start.timestamp() - first_partition_window.start.timestamp()
) / 60
return int(minutes_in_window // fixed_minute_interval) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think // returns an int, so the cast isn't needed

@gibsondan gibsondan merged commit 0980dd8 into master May 17, 2024
1 check passed
@gibsondan gibsondan deleted the getnumpartitions branch May 17, 2024 14:51
gibsondan added a commit that referenced this pull request May 22, 2024
Summary:
For daily, hourly, and minute-ly partition sets, do basic math rather
than enumerating every single partition key and counting them.

Test Plan: A representative large asset graph with many large partition
sets goes from taking 65 seconds to computing counts on all assets to
taking 0.4 seconds to compute counts on all assets

## Summary & Motivation

## How I Tested These Changes
nikomancy pushed a commit that referenced this pull request May 22, 2024
Summary:
For daily, hourly, and minute-ly partition sets, do basic math rather
than enumerating every single partition key and counting them.

Test Plan: A representative large asset graph with many large partition
sets goes from taking 65 seconds to computing counts on all assets to
taking 0.4 seconds to compute counts on all assets

## Summary & Motivation

## How I Tested These Changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants