Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't mistakenly take a lock on DagRun via ti.refresh_from_fb #25312

Merged
merged 4 commits into from Aug 9, 2022

Conversation

ashb
Copy link
Member

@ashb ashb commented Jul 26, 2022

In 2.2.0 we made TI.dag_run be automatically join-loaded, which is fine
for most cases, but for refresh_from_db we don't need that (we don't
access anything under ti.dag_run) and it's possible that when
lock_for_update=True is passed we are locking more than we want to and
might cause deadlocks.

Even if it doesn't, selecting more than we need is wasteful.

Idea inspired by @dstaple, except rather than not locking it changed to not even select against the DagRun table in the first place.

Might fix #23361

@ashb ashb requested review from kaxil and XD-DENG as code owners July 26, 2022 15:57
@ashb
Copy link
Member Author

ashb commented Jul 26, 2022

/cc @potiuk Even if this doesn't fix the deadlock it's probably worth doing

@ashb
Copy link
Member Author

ashb commented Jul 26, 2022

@RNHTTR Can you see if this change does anything for the deadlock you are able to reproduce?

ashb added 2 commits July 28, 2022 11:30
In 2.2.0 we made TI.dag_run be automatically join-loaded, which is fine
for most cases, but for `refresh_from_db` we don't need that (we don't
access anything under ti.dag_run) and it's possible that when
`lock_for_update=True` is passed we are locking more than we want to and
_might_ cause deadlocks.

Even if it doesn't, selecting more than we need is wasteful.
airflow/models/taskinstance.py Outdated Show resolved Hide resolved
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jul 28, 2022
@jedcunningham jedcunningham added this to the Airflow 2.3.4 milestone Jul 28, 2022
@ashb ashb merged commit be2b53e into apache:main Aug 9, 2022
@ashb ashb deleted the possible-deadlock-fix branch August 9, 2022 14:17
ephraimbuddy pushed a commit that referenced this pull request Aug 15, 2022
In 2.2.0 we made TI.dag_run be automatically join-loaded, which is fine
for most cases, but for `refresh_from_db` we don't need that (we don't
access anything under ti.dag_run) and it's possible that when
`lock_for_update=True` is passed we are locking more than we want to and
_might_ cause deadlocks.

Even if it doesn't, selecting more than we need is wasteful.

(cherry picked from commit be2b53e)
@dstaple
Copy link
Contributor

dstaple commented Oct 18, 2023

/cc @potiuk Even if this doesn't fix the deadlock it's probably worth doing

For future reference, this PR did indeed fix the specific deadlock issue from #23361 , thanks again folks.

dstandish added a commit to astronomer/airflow that referenced this pull request Mar 30, 2024
Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in apache#25312 when we converted to query with the column attrs instead of the TI object (apache#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
dstandish added a commit that referenced this pull request Apr 2, 2024
Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in #25312 when we converted to query with the column attrs instead of the TI object (#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
idantepper pushed a commit to idantepper/airflow that referenced this pull request Apr 3, 2024
…#38571)

Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in apache#25312 when we converted to query with the column attrs instead of the TI object (apache#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
idantepper pushed a commit to idantepper/airflow that referenced this pull request Apr 3, 2024
…#38571)

Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in apache#25312 when we converted to query with the column attrs instead of the TI object (apache#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
mathiaHT pushed a commit to mathiaHT/airflow that referenced this pull request Apr 4, 2024
…#38571)

Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in apache#25312 when we converted to query with the column attrs instead of the TI object (apache#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
utkarsharma2 pushed a commit to astronomer/airflow that referenced this pull request Apr 22, 2024
…#38571)

Fundamentally what's going on here is we need a TaskInstance object instead of a Row object when sending over the wire in RPC call.  But the full story on this one is actually somewhat complicated.
It was back in 2.2.0 in apache#25312 when we converted to query with the column attrs instead of the TI object (apache#28900 only refactored this logic into a function).  The reason was to avoid locking the dag_run table since TI newly had a dag_run relationship attr.  Now, this causes a problem with AIP-44 because the RPC api does not know how to serialize a Row object.
This PR switches back to querying a TaskInstance object, but avoids locking dag_run by using lazy_load option.  Meanwhile, since try_number is a horrible attribute (which gives you a different answer depending on the state), we have to switch it back to look at the underlying private attr instead of the public accesor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduler crashes with psycopg2.errors.DeadlockDetected exception
4 participants