A common use case in data engineering is data that needs to be updated very frequently for the current day but can otherwise be represented less granularly. For example, imagine you have a pipeline that pulls product order data from an upstream service, where each order is represented as a file. You may want to get all the new order data every 5 minutes during the current day. What is the best way to represent this pipeline so that it is observable and actionable?

One option would be to represent each order as a partition in Dagster. However, Dagster's partition system is optimized for O(1000s) of partitions, so if your system has millions of orders, you will quickly surpass this limit.

Another option would be to create a partitioning scheme that matches your data SLA, e.g., a partition for every 5 minutes. This would be easy to schedule in Dagster and would allow a human to understand the state of your incremental data processing throughout the day. However, a 5-minute interval would still quickly surpass Dagster's partition limit. This granularity would also make backfills a challenge, and it is unlikely the 5-minute granularity would be useful for historic days.

Instead, we recommend a middle ground: a daily partition scheme where the current day's partition is updated multiple times throughout the day. How is this accomplished?

Keywords for search engines: run daily partitioned asset on hourly schedule
Related: #14612
@slopp what would be the recommended way to do the same kind of multiple processing when it is triggered by AMP?
The key to this type of pipeline is making sure your asset functions are safe to run the same partition multiple times. Here is example code showing a potential structure: