
Fix Deadlock on refresh from DB by local task run #25266

Closed
potiuk wants to merge 1 commit from the attempt-to-fix-postgres-lock branch

Conversation

potiuk (Member) commented Jul 24, 2022

This PR attempts to fix the deadlock that occurs when a task instance
is run in parallel with the scheduler's _do_scheduling operation
executing get_next_dagruns_to_examine.

The whole scheduling is based on locking the DagRuns the scheduler
operates on, which means that the state of ANY task instance
belonging to such a DagRun should not change during the scheduling.

However, there are cases where a task instance is locked
FOR UPDATE without the DagRun row being locked first. This
happens, for example, when the local task job executes a task
and calls the "check_and_change_state_before_execution" method on the
task instance it runs. No earlier DagRun lock is taken,
and the "refresh_from_db" call with lock_for_update
acquires locks on both the TaskInstance row and the
DagRun row. The problem is that the locks are acquired in the reverse
order in this case:

  1. get_next_dagruns_to_examine locks the DagRun first and THEN
    tries to lock some task instances for that DagRun.

  2. "check_and_change_state_before_execution" effectively runs the
    query: select ... from task_instance join dag_run ... for update,
    which FIRST locks the TaskInstance row and only then the DagRun row.

This reverse sequence of locking is what causes the deadlock (see the sketch below).
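For illustration, here is a rough SQLAlchemy sketch of the two lock orders. This is a simplified approximation, not the literal Airflow code: the function names, filters, and query shapes are assumptions; only the ordering of the row locks reflects the description above.

```python
# Illustrative sketch only - approximates the lock order of the two paths.
from sqlalchemy import and_

from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance


def scheduler_path(session):
    # Path 1 (scheduler, get_next_dagruns_to_examine): lock DagRun rows first
    # (filters on state etc. omitted here) ...
    dag_runs = session.query(DagRun).with_for_update(skip_locked=True).all()
    # ... and only afterwards examine/lock TaskInstance rows of those runs.
    return dag_runs


def local_task_path(session, dag_id, task_id, run_id):
    # Path 2 (local task job, refresh_from_db with lock_for_update=True):
    # SELECT ... FROM task_instance JOIN dag_run ... FOR UPDATE locks the
    # TaskInstance row first and then the joined DagRun row - the reverse order.
    return (
        session.query(TaskInstance)
        .join(DagRun, and_(DagRun.dag_id == TaskInstance.dag_id,
                           DagRun.run_id == TaskInstance.run_id))
        .filter(TaskInstance.dag_id == dag_id,
                TaskInstance.task_id == task_id,
                TaskInstance.run_id == run_id)
        .with_for_update()
        .one()
    )
```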

The fix is to lock the DagRun row before running the task-instance
query that joins dag_run to task_instance, as sketched below.
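A minimal sketch of that ordering, assuming a hypothetical helper name (this is not the exact patch; the point is only that both paths now take the locks in the same DagRun-then-TaskInstance order):

```python
# Hypothetical sketch of the fix's lock ordering (not the exact patch).
from airflow.models.dagrun import DagRun


def refresh_ti_with_dagrun_locked_first(ti, session):
    # 1. Lock the parent DagRun row first, mirroring what
    #    get_next_dagruns_to_examine does on the scheduler side.
    session.query(DagRun).filter(
        DagRun.dag_id == ti.dag_id,
        DagRun.run_id == ti.run_id,
    ).with_for_update().one()

    # 2. Only then take the FOR UPDATE lock on the TaskInstance row.
    #    Both code paths now lock DagRun before TaskInstance, so the
    #    circular wait that produced the deadlock cannot form.
    ti.refresh_from_db(session=session, lock_for_update=True)
```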

Fixes: #23361


potiuk force-pushed the attempt-to-fix-postgres-lock branch from fe0e251 to 12df02b on July 25, 2022 10:25
potiuk added this to the Airflow 2.3.4 milestone on Jul 25, 2022
potiuk (Member, Author) commented Jul 25, 2022

Looks like all green. I think the analysis from #23361 leads to this being the proper fix. The deadlock scenarios are plausible, and we might also be able to confirm it by having @RNHTTR and @dstaple run this fix later and try to reproduce the deadlocks.

potiuk (Member, Author) commented Jul 26, 2022

@ashb since you are back from vacation :) you might want to take a look. It looks like it is a proper fix, and we can explain exactly the scenario that would have caused the nasty Postgres deadlock (so the REAL one this time).

When I implemented my solution, I assumed that the task_run state change should also be shielded by the DagRun row lock. I think that makes perfect sense, because we do not want any state change of any task instance belonging to a DagRun while we are scheduling and holding the lock on that DagRun.

An alternative solution (proposed initially by @dstaple, who did the bulk of the analysis) was to avoid the DagRun lock altogether when changing state during task_run. It seems to cure the deadlocks, but it has the nasty side effect that the scheduler and the task-run state change may end up updating the same task instance, reading and then potentially changing its state at the same time. This is a bit more scalable (less synchronisation on the DagRun) but quite a bit riskier IMHO, and possibly that is also the reason for what @RNHTTR reproduced in #23361, where he attempted to DELETE a DagRun while the tasks were running (also resulting in a deadlock). A rough sketch of that alternative follows.
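For comparison only, a hedged sketch of what that alternative could look like in SQLAlchemy terms (this is an illustration, not the change that was eventually merged; the helper name is hypothetical): restrict the FOR UPDATE to the task_instance table so the joined dag_run row is never locked.

```python
# Hypothetical illustration of the alternative: lock only the TaskInstance row.
# On PostgreSQL, `of=TaskInstance` renders as "FOR UPDATE OF task_instance",
# so the joined dag_run row stays unlocked.
from sqlalchemy import and_

from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance


def refresh_ti_without_dagrun_lock(session, dag_id, task_id, run_id):
    return (
        session.query(TaskInstance)
        .join(DagRun, and_(DagRun.dag_id == TaskInstance.dag_id,
                           DagRun.run_id == TaskInstance.run_id))
        .filter(TaskInstance.dag_id == dag_id,
                TaskInstance.task_id == task_id,
                TaskInstance.run_id == run_id)
        .with_for_update(of=TaskInstance)  # lock task_instance only
        .one()
    )
```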

The solution here should also prevent deletion from happening in parallel with the state change.

ashb (Member) commented Jul 26, 2022

Hmmmm, interesting.

So generally the fewer things you have to lock the better, and doubly so as we aren't actually making any changes to the DagRun, so taking out a lock on it as well seems like it should be unnecessary. Let me (re)read that other thread.

RNHTTR (Collaborator) commented Jul 26, 2022

@potiuk FYI this patch didn't seem to resolve the deletion deadlock that I'm able to reproduce 😞

potiuk (Member, Author) commented Jul 26, 2022

> So generally the fewer things you have to lock the better, and doubly so as we aren't actually making any changes to the DagRun, so taking out a lock on it as well seems like it should be unnecessary. Let me (re)read that other thread.

It changes the task instance state only, which means that the task instance state is not "frozen" during scheduler processing. If we remove the lock on the DagRun, the task state might change from "queued" to "running", for example, which could impact the scheduler's decisions. I do not think it would be a "disastrous" change, but generally it is something the scheduler does not expect while making decisions, so things like running more task instances than are allowed per DagRun might happen.

Yeah, fewer locks are better; the only problem with not having the DagRun lock is that it is far more difficult to reason about what could happen and what can go wrong. When the DagRun is "frozen", nothing wrong can happen. When it's not, well, I am not sure :)

potiuk (Member, Author) commented Jul 28, 2022

Closing this one in favour of Ash's #25312

potiuk closed this on Jul 28, 2022
potiuk deleted the attempt-to-fix-postgres-lock branch on July 29, 2022
ephraimbuddy removed this from the Airflow 2.3.4 milestone on Aug 15, 2022
Linked issue: Scheduler crashes with psycopg2.errors.DeadlockDetected exception