add batch fetching of data version records #21798
Conversation
Force-pushed from 75b0afe to fa64bb7
python_modules/dagster/dagster/_core/execution/context/data_version_cache.py
Force-pushed from af3e771 to 009f3b5
Nice. All very reasonable and clean.
Requesting changes to drive discussion on the configurability piece.
```python
if TYPE_CHECKING:
    from dagster._core.execution.context.compute import StepExecutionContext

ASSET_RECORD_BATCH_SIZE = 100
```
Is there a way we can make this configurable and driven from a source of truth in Dagster Plus when the user is a Plus user? It would be great to have control over this to tune perf and control incidents.
I think the main way to do this (since this is called from user code) is to make it a property of the instance.
Something like:

```python
# in DagsterInstance
@property
def max_recommended_batch_size(self) -> int:
    ...
```
And then the question is how narrowly we scope it. Is it for fetching events? All records?
I guess the thing to do for this particular case is to set the recommended batch size and then warn in OSS if a caller exceeds it.
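A minimal sketch of the idea above: an instance-level recommended batch size plus an OSS-side warning when a caller exceeds it. The class and function names here are hypothetical, not the actual Dagster API.

```python
import warnings

class DagsterInstanceSketch:
    """Stand-in for DagsterInstance; the property name is illustrative."""

    @property
    def max_recommended_batch_size(self) -> int:
        # A Plus deployment could override this to tune perf remotely.
        return 100

def fetch_in_batches(instance, keys, batch_size):
    """Yield chunks of `keys`, warning (not failing) in OSS when the
    requested batch size exceeds the instance's recommendation."""
    if batch_size > instance.max_recommended_batch_size:
        warnings.warn(
            f"Batch size {batch_size} exceeds the recommended maximum of "
            f"{instance.max_recommended_batch_size}."
        )
    for i in range(0, len(keys), batch_size):
        yield keys[i : i + batch_size]
```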
Yeah that makes sense. @gibsondan do you have a position on doing this approach?
I think the limit here should probably be tuned to this specific query callsite (as opposed to prescribing the maximum number of asset records you would want to fetch in any situation; it probably depends on the context).
I think it depends on whether we want to be able to tune it remotely on the cloud servers without the user taking any action.
If we do, a simple option would be to make it an env var. This is how we handled it for batch writing store events, and there are various options for both users and us to do remote tuning of env var values.
(if we do this we should just be sure to re-fetch the env var from os.environ every time it is accessed)
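To illustrate the point about re-fetching: reading the variable from `os.environ` on every access means a value injected after process start (e.g. by remote tuning) still takes effect. The env var name and default below are hypothetical.

```python
import os

def get_asset_record_batch_size(default: int = 100) -> int:
    """Read the batch size from the environment on every call, so a value
    set after startup is picked up; fall back to the default if the var is
    unset or not a valid integer."""
    raw = os.environ.get("DAGSTER_ASSET_RECORD_BATCH_SIZE")
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```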
Is this true even though this code is run in user-code (e.g. run containers)? Edit: I guess that makes sense if we use it for store event calls
I DMed daniel this exact question :-)
python_modules/dagster/dagster/_core/execution/context/data_version_cache.py
…#21809)

## Summary & Motivation

The current implementation of `get_latest_data_version_record` would always fetch the latest observation AND the latest materialization. It would then return one or the other, preferring the materialization in the ambiguous case. This PR short-circuits the ambiguous case such that we only fetch the observation when the materialization is not present. This effectively halves (from 2 calls => 1 call) the round-trips to storage in the most common case (materializable assets).

Additionally, this PR replaces calls to the generic `get_event_records` API with the narrower `fetch_X` APIs.

This PR is orthogonal to #21798, which replaces calls to `get_latest_data_version_record` with calls to the batchable `get_asset_records` in the simple case where storage_id filters and partition filters are not applied.

Should note: this does not change the logic of which record is returned. The API name is a little misleading because we will return the latest materialization record even if the observation record is more recent, for materializable assets with both types of records.

## How I Tested These Changes

BK
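The short-circuit described above can be sketched as follows. The `fetch_latest_materialization` / `fetch_latest_observation` helpers are stand-ins for the storage's narrower `fetch_X` APIs, not the actual signatures.

```python
def get_latest_data_version_record(storage, asset_key):
    """Return the latest materialization if one exists; only fall back to
    the (second) storage round-trip for the observation when it does not."""
    materialization = storage.fetch_latest_materialization(asset_key)
    if materialization is not None:
        # Prefer the materialization and skip the observation fetch entirely.
        return materialization
    # No materialization: pay for the second call to get the observation.
    return storage.fetch_latest_observation(asset_key)
```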
Cool great
Summary & Motivation
Asset graphs with large fan-in can incur a hefty data-fetching cost when used with data versions. This PR fetches the asset record for a batched set of asset keys. The asset record has the last materialization record, and potentially the last observation record (in Plus), reducing the number of serial fetches we have to make to get the input data versions.
This batching of calls is only possible because we're not filtering the records (obs/mats) that we're fetching (either by partition or by storage id).
How I Tested These Changes
Added an explicit fan-in data version test that checks the underlying data-fetching calls. It went from 200 calls to `get_event_records` to 1 call to `get_asset_records`.
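The batching described in the summary can be sketched as below, using the `ASSET_RECORD_BATCH_SIZE` constant from the diff. The storage object and the shape of its `get_asset_records` return value are assumptions for illustration.

```python
ASSET_RECORD_BATCH_SIZE = 100  # constant from the diff above

def get_asset_records_batched(storage, asset_keys):
    """Fetch asset records for many keys in chunks, replacing N serial
    per-key fetches with ceil(N / ASSET_RECORD_BATCH_SIZE) batched calls."""
    records = {}
    for i in range(0, len(asset_keys), ASSET_RECORD_BATCH_SIZE):
        batch = asset_keys[i : i + ASSET_RECORD_BATCH_SIZE]
        for record in storage.get_asset_records(batch):
            records[record.asset_key] = record
    return records
```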