
Fix Scheduler crash when executing task instances of missing DAG #20349

Merged: 9 commits merged into apache:main on Jan 13, 2022

Conversation

ephraimbuddy (Contributor):

When executing task instances, we do not check whether the DAG is missing from the dagbag. This PR fixes that by skipping task instances whose DAG cannot be found in the serialized DAG table.

Closes: #20099
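
For illustration, the shape of the guard this description refers to is roughly the following sketch (this is not the exact merged diff, which appears further down in this conversation; `task_instances_to_examine`, `session`, and `log` are assumed from the surrounding scheduler code):

from airflow.models import DagBag

# DagBag backed by the serialized_dag table rather than by DAG files
dagbag = DagBag(read_dags_from_db=True)

for task_instance in task_instances_to_examine:
    serialized_dag = dagbag.get_dag(task_instance.dag_id, session=session)
    if not serialized_dag:
        # The DAG was deleted (or was never serialized); skip this task
        # instance instead of letting a later attribute access on a missing
        # DAG crash the scheduler with a NoneType error.
        log.error(
            "DAG '%s' for taskinstance %s not found in serialized_dag table",
            task_instance.dag_id,
            task_instance,
        )
        continue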



boring-cyborg bot added the area:Scheduler (Scheduler or dag parsing Issues) label on Dec 16, 2021
@@ -316,6 +316,15 @@ def _executable_task_instances_to_queued(self, max_tis: int, session: Session =

        pool_to_task_instances: DefaultDict[str, List[models.Pool]] = defaultdict(list)
        for task_instance in task_instances_to_examine:
            # If the dag is no longer in the dagbag, don't bother
eladkal (Contributor) commented on Dec 16, 2021:

How is that case possible?

The path that leads here is:
_do_scheduling -> _critical_section_execute_task_instances -> _executable_task_instances_to_queued

It's odd that DAG runs were created but the task eventually ends up “dagless”. How was the DAG run created if the DAG is not in the dagbag?
Is this designed to handle the edge case of removing the DAG after a DAG run was created?

ephraimbuddy (Contributor, Author):

This rarely happens, but it can happen.
To reproduce it, remove the DAG file of a running DAG, mark the task instances as failed, then delete the DAG from the UI. You'll observe the scheduler crashlooping. The DAG must have max_active_tasks_per_dag set.

As for the change, it doesn't stop a running task from completing; it only logs an error that the DAG is missing.
I could actually raise the error only when checking concurrency, because that's where we access the dagbag and use it to look up the task, which can cause the scheduler to crash with a NoneType attribute error.

And sorry for the misleading comment at the concurrency point; the task would still get there and print the error log.

I'm considering only having the log there. WDYT?

Comment on lines 405 to 416
            # If the dag is missing, continue to the next task.
            if not serialized_dag:
                self.log.error(
                    "DAG '%s' for taskinstance %s not found in serialized_dag table",
                    dag_id,
                    task_instance,
                )
                continue
Member:

If we can't find them, shouldn't these be set to failed or something? Otherwise these TIs will keep coming up every time around the loop.

ephraimbuddy (Contributor, Author):

From my testing, the scheduler continues to run the task instances even though the DAG is gone.
I think we should not fail them but allow them to finish, since they're not stuck or raising exceptions.

Member:

> allow it to finish

If the code is not there anymore, how can the task instance execute? What am I missing here 🤔

ephraimbuddy (Contributor, Author):

Sorry for the delay; I wanted to properly reproduce it before responding.
Here are the reproduction steps:

Run this DAG:

from airflow import DAG
from datetime import datetime

dag = DAG(
    "airflow_bug",
    schedule_interval="0 1 * * *",
    start_date=datetime(2021, 1, 1),
    max_active_runs=1,
    concurrency=1,
)

for i in range(100):
    @dag.task(task_id=f'mytasrk{i}')
    def sleeping():
        import time
        time.sleep(60)
    sleeping()

Once you unpause the DAG and the tasks start running, remove the file from the DAG folder.
Watch the scheduler logs; after some time it'll start crashing and won't recover (until all the tasks have failed, I think).

[2021-12-24 05:55:22,877] {scheduler_job.py:623} ERROR - Executor reports task instance <TaskInstance: airflow_bug.mytasrk97 scheduled__2021-01-01T01:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-12-24 05:55:22,881] {scheduler_job.py:630} ERROR - Marking task instance <TaskInstance: airflow_bug.mytasrk97 scheduled__2021-01-01T01:00:00+00:00 [queued]> as failed
Traceback (most recent call last):
  File "/opt/airflow/airflow/jobs/scheduler_job.py", line 628, in _process_executor_events
    task = dag.get_task(ti.task_id)
AttributeError: 'NoneType' object has no attribute 'get_task'

cc: @kaxil
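
For context, the crash happens because `dag` is None at the point shown in the traceback. A defensive guard at that spot might look roughly like the sketch below (illustrative only, not necessarily what this PR changes; `ti`, `dagbag`, `session`, and `log` are assumed from the surrounding _process_executor_events code):

dag = dagbag.get_dag(ti.dag_id, session=session)
if dag is None:
    # The DAG was deleted after the task instance was queued; log instead of
    # hitting "'NoneType' object has no attribute 'get_task'".
    log.error("DAG %s for task instance %s no longer exists", ti.dag_id, ti)
else:
    task = dag.get_task(ti.task_id)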

Member:

This error/reproduction step is not quite right, but the same idea can trigger this behaviour: if the DAG is deleted at the “right” time, this bit of the scheduler will fail.

I think in that case, though, we should fail the task instances, as the DAG doesn't exist anymore and, as TP said, they can't run successfully.

Member:

Yeah, if the DAG is missing entirely, we should stop the entire run from continuing because nothing afterwards would run. If I understand correctly, skipping the task instance (as this PR currently implements) means the ti would stay queued and unnecessarily be tried again (and again…), which seems suboptimal.

tests/jobs/test_scheduler_job.py (outdated review comment, resolved)
tests/jobs/test_scheduler_job.py (outdated review comment, resolved)
                        dag_id,
                        task_instance,
                    )
                    task_instance.set_state(State.FAILED, session=session)
Contributor:

Is it possible to set the entire DAG to FAILED here? Depending on the trigger rules, there could be downstream tasks which still attempt to execute, which will then be marked failed by the same check.

Member:

> there could be downstream tasks which still attempt to execute, which will then be marked failed by the same check

I think this is a good thing in this case. This code is reached because the DAG declaring those tasks is gone, so it doesn’t make sense to execute those tasks IMO.

ephraimbuddy (Contributor, Author):

I will opt for setting all scheduled tasks to None. I doubt that failing the DAG here would really fail it while there are tasks being executed in the executor; there's a bug where a DAG marked as failed comes up again as running (#16078).

So I propose setting all scheduled tasks to None. The scheduler will no longer move task instances to scheduled when the DAG can no longer be found, and it makes sense to set them to None instead of failing them, since the task instances won't have logs.
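
For illustration, a single bulk update along those lines might look like this sketch (it mirrors the diff shown a bit further below; `TI` is airflow.models.TaskInstance, and `session` and `dag_id` are assumed from the surrounding scheduler code):

from airflow.utils.state import State

# Reset every still-SCHEDULED task instance of the missing DAG in one pass,
# so the scheduler stops picking them up again on every loop.
session.query(TI).filter(
    TI.dag_id == dag_id,
    TI.state == State.SCHEDULED,
).update({TI.state: State.NONE}, synchronize_session='fetch')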

Contributor:

> I think this is a good thing in this case. This code is reached because the DAG declaring those tasks is gone, so it doesn’t make sense to execute those tasks IMO.

Yeah, I agree that they won't be executable; I'm just wondering if we can mark them all as failed in a single pass.

ephraimbuddy (Contributor, Author):

Done

github-actions (bot):

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

github-actions bot added the full tests needed (We need to run full set of tests for this PR to merge) label on Jan 13, 2022
                        task_instance,
                    )
                    session.query(TI).filter(TI.dag_id == dag_id, TI.state == State.SCHEDULED).update(
                        {TI.state: State.NONE}, synchronize_session='fetch'
Member:

I don't think we should set it to None! We should mark it failed for sure. There is also TaskInstanceState.REMOVED, which partially applies and partially doesn't, as it is not the task alone that is removed but the entire DAG.

kaxil merged commit 9871576 into apache:main on Jan 13, 2022
kaxil deleted the fix-scheduler-crash branch on January 13, 2022, 21:23
kaxil added this to the Airflow 2.2.4 milestone on Jan 13, 2022
ephraimbuddy (Contributor, Author):

I was testing something else with dynamic DAGs that have a lot of tasks. I noticed that the DAGs do have import errors at times, and the task instances get failed. After some time, the DAG appears again with failed task instances. I'm wondering if this PR will cause issues for users, because there's no log to tell what happened to the task instances that failed.
cc: @ashb @kaxil @uranusjr

jedcunningham (Member):

@ephraimbuddy, you're concerned that the message is in the scheduler logs, not something user facing (i.e. in the UI)?

ephraimbuddy (Contributor, Author):

I'm particularly concerned about dynamic DAGs. Sometimes they give import errors while the DAGs are still running, and within a few minutes the DAG parses fine and reappears.
This change would fail the task instances of such DAGs instead of the scheduler crashlooping for the entire period of the import errors. Not sure which is better...

potiuk (Member) commented on Jan 26, 2022:

Honestly, I think crashing on data entry is arguably never a good thing.

Why would the import errors actually appear? I am not sure I "feel" where this comes from. I thought such crashes on import might only come from not-yet-synced files (when you are using something other than git-sync; there you have atomic swaps for all files in a commit), but I cannot see how that could lead to a missing DAG in the serialized table and running tasks at the same time. I thought we can only have a running task when we also have a serialized DAG.

But maybe I am missing something :)

Do we actually have a case where the DAG is deleted while being re-generated? Maybe we should fix that instead?

BTW, AIP-43 will mean that such errors will not crash the scheduler anyway.

ephraimbuddy added a commit that referenced this pull request on Jan 31, 2022
(cherry picked from commit 9871576)
ephraimbuddy added the type:bug-fix (Changelog: Bug Fixes) label on Jan 31, 2022
jedcunningham pushed a commit that referenced this pull request on Feb 10, 2022
(cherry picked from commit 9871576)
jedcunningham pushed a commit that referenced this pull request on Feb 17, 2022
(cherry picked from commit 9871576)
Labels
area:Scheduler (Scheduler or dag parsing Issues), full tests needed (We need to run full set of tests for this PR to merge), type:bug-fix (Changelog: Bug Fixes)
Development

Successfully merging this pull request may close these issues.

Scheduler crashlooping when dag with task_concurrency is deleted
8 participants