New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Deadlock on refresh from DB by local task run #25266
Conversation
f8d905d
to
fe0e251
Compare
This PR attempts to fix the deadlock that occurs when task instance is being run in parallel to running _do_scheduling operation executing get_next_dagruns_to_examine. The whole scheduling is based on actually locking DagRuns scheduler operats on - and it basically means that state of ANY task instances for that DagRun should not change during the scheduling. However there are some cases where task instance is locked FOR UPDATE without prior locking of the DagRun table - this happens for example when local task job executes the task and runs "check_and_change_state_before_execution" method on the task instance it runs. There is no earlier DagRun locking happening and the "refresh_from_db" run with lock_for_update will get the lock on both TaskInstance row as well as on the DagRun row. The problem is this locking happens in reverse sequence in this case: 1) get_next_dagruns_to_examine - locks DagRun first and THEN tries to locks some task instances for that DagRun 2) "check_and_change_state_before_execution" runs effectively the query: select ... from task_instance join dag_run ... for update which FIRST locks TaskInstance and then DagRun table. This reverse sequence of locking is what causes the deadlock. The fix is to force locking the DagRun before running the task instance query that joins dag_run to task_instance. Fixes: apache#23361
fe0e251
to
12df02b
Compare
@ashb since you are back from vacation :) - you might want to take a look. It looks like it is a proper fix and we could explain exactly the scenario that would have caused the nastly Postgres deadlock (so the REAL one this time). When I implemented my solution, I assumed that task_run state change should also be shielded by the DagRun row-lock. I think it makes perfect sense, because we do not want any state change of any task instance belonging to DagRun when we are scheduling and keeping lock on the DagRun. Alternative solution (proposed initially by @dstaple who made the bulk of analysis) was to avoid the DagRun lock altogether when we are changing state during task_run. It seems to cure the deadlocks, but It has the nasty side effect that potentially scheduler and task state change from task run are updating the same task instance and reading and then potentilly changing it's state at the same time. This is a bit more scalable (less synchronisation on DagRun, but quite a bit risky IMHO - and possibly that's also the reaosn for what @RNHTTR reproduced in #23361 - where he attempted to DELETE DagRun while the tasks were running (also resulting with deadlock). The solution here should prevent from Deletion happening in parallel to changing state as well. |
Hmmmm, interesting. So generally the fewer things you have to lock the better, and doubly so as we aren't actually making any changes to DagRun, so taking out a lock on it as well seems like it should be unnecessary. Let me (re)read that other thead. |
@potiuk FYI this patch didn't seem to resolve the deletion deadlock that I'm able to reproduce 😞 |
It changes the task instance state only - which means that task instance state is not "frozen" during the scheduler processing. If we remove the lock from the dagrun it might happen that task state change from "queued" to "running" for example. Which might impact scheduler's decisions. I do not think it can be a "disastrous" change, but generally it is something that scheduler does not expect when making decisions. So things like running more task instances than are allowed per dagrun or similar might happen. Yeah, less locks is better, the only problem with not having DagRun lock is that is far more difficult to reason what could happen and what can go wrong. When DagRun is "frozen", nothing wrong can happen. When it's not, well, I am not sure :) |
Closing this one in favour of Ash's #25312 |
This PR attempts to fix the deadlock that occurs when task instance
is being run in parallel to running _do_scheduling operation
executing get_next_dagruns_to_examine.
The whole scheduling is based on actually locking DagRuns scheduler
operats on - and it basically means that state of ANY task instances
for that DagRun should not change during the scheduling.
However there are some cases where task instance is locked
FOR UPDATE without prior locking of the DagRun table - this
happens for example when local task job executes the task
and runs "check_and_change_state_before_execution" method on the
task instance it runs. There is no earlier DagRun locking
happening and the "refresh_from_db" run with lock_for_update
will get the lock on both TaskInstance row as well as on the
DagRun row. The problem is this locking happens in reverse sequence
in this case:
get_next_dagruns_to_examine - locks DagRun first and THEN
tries to locks some task instances for that DagRun
"check_and_change_state_before_execution" runs effectively the
query: select ... from task_instance join dag_run ... for update
which FIRST locks TaskInstance and then DagRun table.
This reverse sequence of locking is what causes the deadlock.
The fix is to force locking the DagRun before running the task instance
query that joins dag_run to task_instance.
Fixes: #23361
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.