A common use case in data engineering is data that needs to be updated very frequently for the current day but can otherwise be represented less granularly. For example, imagine you have a pipeline that pulls product order data from an upstream service, where each order is represented as a file. You may want to get all the new order data every 5 minutes during the current day. What is the best way to represent this pipeline so that it is observable and actionable?

One option would be to represent each order as a partition in Dagster. However, Dagster's partition system is optimized for O(1000s) of partitions, so if your system has millions of orders, you will quickly surpass this limit.

Another option would be to create a partitioning scheme that matches your data SLA, e.g., a partition for every 5 minutes. This would be easy to schedule in Dagster and would allow a human to understand the state of your incremental data processing throughout the day. However, a 5-minute interval would still quickly surpass Dagster's partition limit. This granularity would also make backfills a challenge, and it is unlikely the 5-minute granularity would be useful for historic days.

Instead, we recommend a middle ground: a daily partition scheme where the current day's partition is updated multiple times throughout the day. How is this accomplished?

Keywords for search engines: run daily partitioned asset on hourly schedule
Related: #14612
@slopp what would be the recommended way to do the same kind of multiple processing when it is triggered by AMP?
The key to this type of pipeline is making sure your asset functions are safe to run the same partition multiple times. Here is example code showing a potential structure: