Avoid unintentional data loss when deleting DAGs #20758

Merged

Conversation

SamWheating
Contributor

@SamWheating SamWheating commented Jan 7, 2022

We encountered some data loss today due to a user deleting a DAG from the UI called project.load, which then deleted all of the history from other DAGs called project.load.bigquery and project.load.trino, which also caused them to run unexpectedly due to the resetting of run history.

Note - we don't use SubDAGs, we're just using . in the DAG ID as a separator for a hierarchical naming system.

As it turns out, deleting a DAG my_dag will delete all of the metadata for any DAG which starts with my_dag., as it is assumed that the latter are subdags of the former:

            cond = or_(model.dag_id == dag_id, model.dag_id.like(dag_id + ".%"))
            count += session.query(model).filter(cond).delete(synchronize_session='fetch')

This isn't always the case.
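The prefix match can be reproduced outside Airflow. The following is only a minimal sketch using an in-memory SQLite table; the table name, the sample rows, and the helper `rows_matched_by_old_filter` are hypothetical stand-ins, not Airflow code. It applies the same `==` / `LIKE '<dag_id>.%'` condition as the snippet above:

```python
import sqlite3

# Hypothetical stand-in for one metadata table; only dag_id matters here.
conn_old = sqlite3.connect(":memory:")
conn_old.execute("CREATE TABLE dag_run (dag_id TEXT)")
conn_old.executemany(
    "INSERT INTO dag_run VALUES (?)",
    [
        ("project.load",),
        ("project.load.bigquery",),  # independent DAG, not a SubDAG
        ("project.load.trino",),     # independent DAG, not a SubDAG
        ("project.loader",),         # shares a prefix but has no "." separator
    ],
)


def rows_matched_by_old_filter(dag_id):
    """Return the dag_ids selected by: dag_id == X OR dag_id LIKE 'X.%'."""
    cur = conn_old.execute(
        "SELECT dag_id FROM dag_run "
        "WHERE dag_id = ? OR dag_id LIKE ? ORDER BY dag_id",
        (dag_id, dag_id + ".%"),
    )
    return [row[0] for row in cur]


matched = rows_matched_by_old_filter("project.load")
print(matched)
# ['project.load', 'project.load.bigquery', 'project.load.trino']
```

Both dotted DAGs are selected even though nothing marks them as SubDAGs, which matches the data loss described above.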

Anyways, this PR changes the delete_dag function so that it only deletes the intended DAG and DAGs starting with <dag_id>. which are also SubDAGs. I think that there may still be some other edge cases where DAGs can be unintentionally deleted, but this patches the most apparent case.
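As a rough illustration of the narrowed condition (exact match, or prefix match that is also flagged as a SubDAG), here is a small SQLite sketch. The table layout, the sample rows, and the helper name `select_dags_to_delete` are assumptions made for the example; only the shape of the filter follows this PR:

```python
import sqlite3

# Hypothetical, simplified version of the DAG table with an is_subdag flag.
conn_new = sqlite3.connect(":memory:")
conn_new.execute("CREATE TABLE dag (dag_id TEXT, is_subdag INTEGER)")
conn_new.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [
        ("project.load", 0),
        ("project.load.bigquery", 0),  # independent DAG with a dotted name
        ("project.load.cleanup", 1),   # hypothetical genuine SubDAG
    ],
)


def select_dags_to_delete(dag_id):
    """Select the DAG itself plus prefixed DAGs that are flagged as SubDAGs."""
    cur = conn_new.execute(
        "SELECT dag_id FROM dag "
        "WHERE dag_id = ? OR (dag_id LIKE ? AND is_subdag = 1) "
        "ORDER BY dag_id",
        (dag_id, dag_id + ".%"),
    )
    return [row[0] for row in cur]


to_delete = select_dags_to_delete("project.load")
print(to_delete)
```

In this sketch `project.load.bigquery` is left alone because its `is_subdag` flag is not set, while the flagged SubDAG is still cleaned up with its parent.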

This can all be cleaned up even more once the deprecation of SubDAGs is complete (Airflow 3?)

@boring-cyborg boring-cyborg bot added the area:API Airflow's REST/HTTP API label Jan 7, 2022
@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Jan 10, 2022
@github-actions

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

Member

@potiuk potiuk left a comment


Small NIT :)

@potiuk
Member

potiuk commented Jan 10, 2022

We encountered some data loss today due to a user deleting a DAG from the UI called project.load, which then deleted all of the history from other DAGs called project.load.bigquery and project.load.trino, which also caused them to run unexpectedly due to the resetting of run history.

Side comment: I was almost sure that a dag_id cannot contain "." (precisely because of the subdag convention), but now I see this is not the case :). We excluded "." for task groups (because task groups are also "."-separated) but not for the task ids:

KEY_REGEX = re.compile(r'^[\w.-]+$')
GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')
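A quick check of the two patterns quoted above, using only the standard `re` module. The helper `is_valid` and the sample IDs are made up for this illustration:

```python
import re

# The two patterns exactly as quoted in the comment above.
KEY_REGEX = re.compile(r'^[\w.-]+$')
GROUP_KEY_REGEX = re.compile(r'^[\w-]+$')


def is_valid(pattern, key):
    """Return True if the whole key matches the given pattern."""
    return pattern.match(key) is not None


# A dotted ID passes the general key check...
print(is_valid(KEY_REGEX, "project.load.bigquery"))        # True
# ...but is rejected by the stricter group pattern.
print(is_valid(GROUP_KEY_REGEX, "project.load.bigquery"))  # False
# IDs without dots pass both.
print(is_valid(GROUP_KEY_REGEX, "project_load-bigquery"))  # True
```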

Nice catch.

@potiuk potiuk merged commit 5980d2b into apache:main Jan 10, 2022
@kaxil kaxil added this to the Airflow 2.2.4 milestone Jan 11, 2022
Comment on lines +58 to +64
    dags_to_delete_query = session.query(DagModel.dag_id).filter(
        or_(
            DagModel.dag_id == dag_id,
            and_(DagModel.dag_id.like(f"{dag_id}.%"), DagModel.is_subdag),
        )
    )
    dags_to_delete = [dag_id for dag_id, in dags_to_delete_query]
Member

@kaxil kaxil Jan 11, 2022


        or_(
            DagModel.dag_id == dag_id,
            DagModel.root_dag_id == dag_id,
        )

might have also worked @SamWheating
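To see how the suggested `root_dag_id` condition would behave, here is a minimal SQLite sketch. It assumes that `root_dag_id` is empty for top-level DAGs and holds the root DAG's ID for SubDAGs; the table, the rows, and the helper `select_by_root` are hypothetical:

```python
import sqlite3

# Hypothetical, simplified DAG table; root_dag_id is assumed to be NULL for
# top-level DAGs and to point at the root DAG for SubDAGs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dag (dag_id TEXT, root_dag_id TEXT)")
db.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [
        ("project.load", None),
        ("project.load.bigquery", None),            # independent DAG
        ("project.load.cleanup", "project.load"),   # hypothetical SubDAG
    ],
)


def select_by_root(dag_id):
    """Select the DAG itself plus any DAG whose root_dag_id points at it."""
    cur = db.execute(
        "SELECT dag_id FROM dag "
        "WHERE dag_id = ? OR root_dag_id = ? ORDER BY dag_id",
        (dag_id, dag_id),
    )
    return [row[0] for row in cur]


selected = select_by_root("project.load")
print(selected)
```

Under that assumption no string prefix matching is involved, so dotted names of unrelated DAGs are not selected.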

@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jan 27, 2022
jedcunningham pushed a commit that referenced this pull request Jan 27, 2022
jedcunningham pushed a commit that referenced this pull request Jan 28, 2022
jedcunningham pushed a commit that referenced this pull request Feb 17, 2022
Labels
area:API Airflow's REST/HTTP API okay to merge It's ok to merge this PR as it does not require more tests type:bug-fix Changelog: Bug Fixes