
Fix Deadlock on refresh from DB by local task run #25266

Closed
potiuk wants to merge 1 commit from the attempt-to-fix-postgres-lock branch

Conversation

potiuk (Member) commented Jul 24, 2022

This PR attempts to fix the deadlock that occurs when a task instance
is run in parallel with the scheduler's _do_scheduling operation
executing get_next_dagruns_to_examine.

The whole scheduling is based on locking the DagRuns the scheduler
operates on, which means that the state of ANY task instance
belonging to such a DagRun should not change during the scheduling.

However, there are cases where a task instance is locked
FOR UPDATE without the DagRun row being locked first. This
happens, for example, when the local task job executes a task
and calls the "check_and_change_state_before_execution" method on the
task instance it runs. No earlier DagRun lock is taken,
and the "refresh_from_db" call with lock_for_update
acquires locks on both the TaskInstance row and the
DagRun row. The problem is that the locks are acquired in the reverse
order in this case:

  1. get_next_dagruns_to_examine locks the DagRun first and THEN
    tries to lock some task instances for that DagRun.

  2. "check_and_change_state_before_execution" effectively runs the
    query: select ... from task_instance join dag_run ... for update,
    which FIRST locks the TaskInstance row and only then the DagRun row.

This reverse sequence of locking is what causes the deadlock (see the sketch below).
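For illustration, here is a rough SQLAlchemy sketch of the two lock orders. This is a simplified approximation, not the literal Airflow code: the function names, filters, and query shapes are assumptions; only the ordering of the row locks reflects the description above.

```python
# Illustrative sketch only - approximates the lock order of the two paths.
from sqlalchemy import and_

from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance


def scheduler_path(session):
    # Path 1 (scheduler, get_next_dagruns_to_examine): lock DagRun rows first
    # (filters on state etc. omitted here) ...
    dag_runs = session.query(DagRun).with_for_update(skip_locked=True).all()
    # ... and only afterwards examine/lock TaskInstance rows of those runs.
    return dag_runs


def local_task_path(session, dag_id, task_id, run_id):
    # Path 2 (local task job, refresh_from_db with lock_for_update=True):
    # SELECT ... FROM task_instance JOIN dag_run ... FOR UPDATE locks the
    # TaskInstance row first and then the joined DagRun row - the reverse order.
    return (
        session.query(TaskInstance)
        .join(DagRun, and_(DagRun.dag_id == TaskInstance.dag_id,
                           DagRun.run_id == TaskInstance.run_id))
        .filter(TaskInstance.dag_id == dag_id,
                TaskInstance.task_id == task_id,
                TaskInstance.run_id == run_id)
        .with_for_update()
        .one()
    )
```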

The fix is to lock the DagRun row before running the task-instance
query that joins dag_run to task_instance, as sketched below.
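A minimal sketch of that ordering, assuming a hypothetical helper name (this is not the exact patch; the point is only that both paths now take the locks in the same DagRun-then-TaskInstance order):

```python
# Hypothetical sketch of the fix's lock ordering (not the exact patch).
from airflow.models.dagrun import DagRun


def refresh_ti_with_dagrun_locked_first(ti, session):
    # 1. Lock the parent DagRun row first, mirroring what
    #    get_next_dagruns_to_examine does on the scheduler side.
    session.query(DagRun).filter(
        DagRun.dag_id == ti.dag_id,
        DagRun.run_id == ti.run_id,
    ).with_for_update().one()

    # 2. Only then take the FOR UPDATE lock on the TaskInstance row.
    #    Both code paths now lock DagRun before TaskInstance, so the
    #    circular wait that produced the deadlock cannot form.
    ti.refresh_from_db(session=session, lock_for_update=True)
```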

Fixes: #23361


potiuk force-pushed the attempt-to-fix-postgres-lock branch from fe0e251 to 12df02b on July 25, 2022 10:25
potiuk added this to the Airflow 2.3.4 milestone on Jul 25, 2022
potiuk (Member, Author) commented Jul 25, 2022

Looks like all green. I think the analysis from #23361 leads to this being the proper fix. The deadlock scenarios are plausible, and we might also be able to confirm it by having @RNHTTR and @dstaple run this fix later and try to reproduce the deadlocks.

potiuk (Member, Author) commented Jul 26, 2022

@ashb since you are back from vacation :) you might want to take a look. It looks like it is a proper fix, and we can explain exactly the scenario that would have caused the nasty Postgres deadlock (so the REAL one this time).

When I implemented my solution, I assumed that the task_run state change should also be shielded by the DagRun row lock. I think that makes perfect sense, because we do not want any state change of any task instance belonging to a DagRun while we are scheduling and holding the lock on that DagRun.

An alternative solution (proposed initially by @dstaple, who did the bulk of the analysis) was to avoid the DagRun lock altogether when changing state during task_run. It seems to cure the deadlocks, but it has the nasty side effect that the scheduler and the task-run state change may end up updating the same task instance, reading and then potentially changing its state at the same time. This is a bit more scalable (less synchronisation on the DagRun) but quite a bit riskier IMHO, and possibly that is also the reason for what @RNHTTR reproduced in #23361, where he attempted to DELETE a DagRun while the tasks were running (also resulting in a deadlock). A rough sketch of that alternative follows.
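For comparison only, a hedged sketch of what that alternative could look like in SQLAlchemy terms (this is an illustration, not the change that was eventually merged; the helper name is hypothetical): restrict the FOR UPDATE to the task_instance table so the joined dag_run row is never locked.

```python
# Hypothetical illustration of the alternative: lock only the TaskInstance row.
# On PostgreSQL, `of=TaskInstance` renders as "FOR UPDATE OF task_instance",
# so the joined dag_run row stays unlocked.
from sqlalchemy import and_

from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance


def refresh_ti_without_dagrun_lock(session, dag_id, task_id, run_id):
    return (
        session.query(TaskInstance)
        .join(DagRun, and_(DagRun.dag_id == TaskInstance.dag_id,
                           DagRun.run_id == TaskInstance.run_id))
        .filter(TaskInstance.dag_id == dag_id,
                TaskInstance.task_id == task_id,
                TaskInstance.run_id == run_id)
        .with_for_update(of=TaskInstance)  # lock task_instance only
        .one()
    )
```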

The solution here should also prevent deletion from happening in parallel with the state change.

ashb (Member) commented Jul 26, 2022

Hmmmm, interesting.

So generally the fewer things you have to lock the better, and doubly so as we aren't actually making any changes to the DagRun, so taking out a lock on it as well seems like it should be unnecessary. Let me (re)read that other thread.

RNHTTR (Collaborator) commented Jul 26, 2022

@potiuk FYI this patch didn't seem to resolve the deletion deadlock that I'm able to reproduce 😞

potiuk (Member, Author) commented Jul 26, 2022

> So generally the fewer things you have to lock the better, and doubly so as we aren't actually making any changes to the DagRun, so taking out a lock on it as well seems like it should be unnecessary. Let me (re)read that other thread.

It changes the task instance state only, which means that the task instance state is not "frozen" during scheduler processing. If we remove the lock on the DagRun, the task state might change from "queued" to "running", for example, which could impact the scheduler's decisions. I do not think it would be a "disastrous" change, but generally it is something the scheduler does not expect while making decisions, so things like running more task instances than are allowed per DagRun might happen.

Yeah, fewer locks are better; the only problem with not having the DagRun lock is that it is far more difficult to reason about what could happen and what can go wrong. When the DagRun is "frozen", nothing wrong can happen. When it's not, well, I am not sure :)

potiuk (Member, Author) commented Jul 28, 2022

Closing this one in favour of Ash's #25312

potiuk closed this on Jul 28, 2022
potiuk deleted the attempt-to-fix-postgres-lock branch on July 29, 2022
ephraimbuddy removed this from the Airflow 2.3.4 milestone on Aug 15, 2022
Linked issue: Scheduler crashes with psycopg2.errors.DeadlockDetected exception