
Deadlock when updating DagRun last_scheduling_decision and TaskInstance state=scheduled #27473

Closed
dstaple opened this issue Nov 2, 2022 · 9 comments
Labels: area:core, kind:bug

Comments

@dstaple (Contributor) commented Nov 2, 2022

Apache Airflow version

Both Airflow 2.2.5 and 2.5.3; presumably other versions.

What happened

The Airflow scheduler occasionally exits with psycopg2.errors.DeadlockDetected, and several running tasks fail with SIGTERMs.

One of the queries involved in the deadlock originates from the scheduler and is of the form

UPDATE task_instance SET state='scheduled' ...

This can be seen in both the scheduler and database logs, and originates here:

  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/models/dagrun.py", line 909, in schedule_tis
    .update({TI.state: State.SCHEDULED}, synchronize_session=False)

Here line 909 corresponds to version 2.2.5.

The other query involved in the deadlock is shown in the database logs as follows:

UPDATE dag_run SET last_scheduling_decision='2022-11-01T00:00:22.409295+00:00'::timestamptz WHERE dag_run.id = 841582

From reading the Airflow source code, this seems likely to originate from airflow/models/dagrun.py in the function update_state().
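
A simplified sketch of the two competing writes (paraphrased, not verbatim from the 2.2.5 source; the update_state() attribution is presumed from reading the code, as noted above):

# DagRun.schedule_tis() issues the bulk UPDATE on task_instance seen above:
session.query(TI).filter(
    TI.dag_id == self.dag_id,
    TI.run_id == self.run_id,
    TI.task_id.in_(schedulable_ti_ids),
).update({TI.state: State.SCHEDULED}, synchronize_session=False)

# DagRun.update_state() modifies the dag_run row itself; when the session
# flushes, this emits the competing
# UPDATE dag_run SET last_scheduling_decision=... WHERE dag_run.id = ...
self.last_scheduling_decision = timezone.utcnow()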

Database logs look as follows:

2022-11-01 03:57:41.327 UTC [174772] ERROR:  deadlock detected
2022-11-01 03:57:41.327 UTC [174772] DETAIL:  Process 174772 waits for ShareLock on transaction 780275485; blocked by process 176738.
	Process 176738 waits for ExclusiveLock on tuple (15622,258) of relation 599796 of database 599710; blocked by process 176552.
	Process 176552 waits for ShareLock on transaction 780275254; blocked by process 174772.
	Process 174772: UPDATE task_instance SET state='scheduled' WHERE task_instance.dag_id = 'my_sanitized_dag_id_1' AND task_instance.run_id = 'my_sanitized_run_id_1' AND task_instance.task_id IN ('my_sanitized_task_id_1', 'my_sanitized_task_id_2', 'my_sanitized_task_id_3', 'my_sanitized_task_id_4', 'my_sanitized_task_id_5', 'my_sanitized_task_id_6', 'my_sanitized_task_id_7', 'my_sanitized_task_id_8', 'my_sanitized_task_id_9', 'my_sanitized_task_id_10', 'my_sanitized_task_id_11', 'my_sanitized_task_id_12', 'my_sanitized_task_id_13', 'my_sanitized_task_id_14', 'my_sanitized_task_id_15')
	Process 176738: UPDATE dag_run SET last_scheduling_decision='2022-11-01T00:00:22.409295+00:00'::timestamptz WHERE dag_run.id = 841582
	Process 176552: UPDATE dag_run SET last_scheduling_decision='2022-11-01T00:00:22.409295+00:00'::timestamptz WHERE dag_run.id = 841582
2022-11-01 03:57:41.327 UTC [174772] HINT:  See server log for query details.
2022-11-01 03:57:41.327 UTC [174772] CONTEXT:  while updating tuple (227875,18) in relation "task_instance"
2022-11-01 03:57:41.327 UTC [174772] STATEMENT:  UPDATE task_instance SET state='scheduled' WHERE task_instance.dag_id = 'my_sanitized_dag_id_1' AND task_instance.run_id = 'my_sanitized_run_id_1' AND task_instance.task_id IN ('my_sanitized_task_id_1', 'my_sanitized_task_id_2', 'my_sanitized_task_id_3', 'my_sanitized_task_id_4', 'my_sanitized_task_id_5', 'my_sanitized_task_id_6', 'my_sanitized_task_id_7', 'my_sanitized_task_id_8', 'my_sanitized_task_id_9', 'my_sanitized_task_id_10', 'my_sanitized_task_id_11', 'my_sanitized_task_id_12', 'my_sanitized_task_id_13', 'my_sanitized_task_id_14', 'my_sanitized_task_id_15')

The stack trace in the Airflow scheduler looks as follows:

[2022-11-01 03:57:51,140] {{scheduler_job.py:753}} INFO - Exited execute loop
Traceback (most recent call last):
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 609, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 174772 waits for ShareLock on transaction 780275485; blocked by process 176738.
Process 176738 waits for ExclusiveLock on tuple (15622,258) of relation 599796 of database 599710; blocked by process 176552.
Process 176552 waits for ShareLock on transaction 780275254; blocked by process 174772.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (227875,18) in relation "task_instance"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/airflow/venv-py3/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/__main__.py", line 48, in main
    args.func(args)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
    _run_scheduler_job(args=args)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
    job.run()
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 246, in run
    self._execute()
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 726, in _execute
    self._run_scheduler_loop()
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 807, in _run_scheduler_loop
    num_queued_tis = self._do_scheduling(session)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 890, in _do_scheduling
    callback_to_run = self._schedule_dag_run(dag_run, session)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1147, in _schedule_dag_run
    dag_run.schedule_tis(schedulable_tis, session)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/utils/session.py", line 67, in wrapper
    return func(*args, **kwargs)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/airflow/models/dagrun.py", line 909, in schedule_tis
    .update({TI.state: State.SCHEDULED}, synchronize_session=False)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 4063, in update
    update_op.exec_()
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1697, in exec_
    self._do_exec()
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1895, in _do_exec
    self._execute_stmt(update_stmt)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1702, in _execute_stmt
    self.result = self.query._execute_crud(stmt, self.mapper)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3568, in _execute_crud
    return conn.execute(stmt, self._params)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
    distilled_params,
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
    e, statement, parameters, cursor, context
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "/opt/airflow/venv-py3/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 609, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 174772 waits for ShareLock on transaction 780275485; blocked by process 176738.
Process 176738 waits for ExclusiveLock on tuple (15622,258) of relation 599796 of database 599710; blocked by process 176552.
Process 176552 waits for ShareLock on transaction 780275254; blocked by process 174772.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (227875,18) in relation "task_instance"

[SQL: UPDATE task_instance SET state=%(state)s WHERE task_instance.dag_id = %(dag_id_1)s AND task_instance.run_id = %(run_id_1)s AND task_instance.task_id IN (%(task_id_1)s, %(task_id_2)s, %(task_id_3)s, %(task_id_4)s, %(task_id_5)s, %(task_id_6)s, %(task_id_7)s, %(task_id_8)s, %(task_id_9)s, %(task_id_10)s, %(task_id_11)s, %(task_id_12)s, %(task_id_13)s, %(task_id_14)s, %(task_id_15)s)]
[parameters: {'state': <TaskInstanceState.SCHEDULED: 'scheduled'>, 'dag_id_1': 'my_sanitized_dag_id_1', 'run_id_1': 'my_sanitized_run_id_1', 'task_id_1': 'my_sanitized_task_id_1', 'task_id_2': 'my_sanitized_task_id_2', 'task_id_3': 'my_sanitized_task_id_3', 'task_id_4': 'my_sanitized_task_id_4', 'task_id_5': 'my_sanitized_task_id_5', 'task_id_6': 'my_sanitized_task_id_6', 'task_id_7': 'my_sanitized_task_id_7', 'task_id_8': 'my_sanitized_task_id_8', 'task_id_9': 'my_sanitized_task_id_9', 'task_id_10': 'my_sanitized_task_id_10', 'task_id_11': 'my_sanitized_task_id_11', 'task_id_12': 'my_sanitized_task_id_12', 'task_id_13': 'my_sanitized_task_id_13', 'task_id_14': 'my_sanitized_task_id_14', 'task_id_15': 'my_sanitized_task_id_15'}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

What you think should happen instead

If locks on multiple tables are needed, they should be acquired in a consistent order so that deadlocks are impossible; a sketch of this idea follows below. Alternatively, we may be locking more than we need to, similar to the situation in taskinstance.py prior to #25312.
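
A minimal sketch of the ordered-locking idea (hypothetical code, not current Airflow source; it reuses the TI/session names from dagrun.py for illustration):

# Lock the task_instance rows in primary-key order before updating, so
# concurrent transactions cannot acquire the same row locks in opposite orders.
tis = (
    session.query(TI)
    .filter(
        TI.dag_id == self.dag_id,
        TI.run_id == self.run_id,
        TI.task_id.in_(schedulable_ti_ids),
    )
    .order_by(TI.dag_id, TI.task_id, TI.run_id)  # deterministic lock order
    .with_for_update()
    .all()
)
for ti in tis:
    ti.state = State.SCHEDULED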

How to reproduce

The problem is not easily reproducible. It occurs approximately once every two weeks when operating at scale (50-150 DAGs, several of which have hundreds of tasks).

Operating System

CentOS 7

Versions of Apache Airflow Providers

Used these providers to repro with Airflow 2.2.5:

apache-airflow-providers-cncf-kubernetes==2.0.3
apache-airflow-providers-elasticsearch==2.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-postgres==2.3.0
apache-airflow-providers-sqlite==2.1.3

Used these providers to repro with Airflow 2.5.3:

apache-airflow-providers-cncf-kubernetes==5.1.1
apache-airflow-providers-common-sql==1.4.0
apache-airflow-providers-elasticsearch==4.3.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-postgres==5.4.0
apache-airflow-providers-sqlite==2.1.3

Deployment

Virtualenv installation

Deployment details

Anything else

This issue is distinct from the previous deadlock issue reported in #23361 and fixed by #25312.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

@dstaple added the area:core and kind:bug labels Nov 2, 2022
@boring-cyborg (bot) commented Nov 2, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@ephraimbuddy (Contributor) commented:

Is this still happening in the latest Airflow (2.4.2)?

@dstaple (Contributor, Author) commented Nov 8, 2022

> Is this still happening in the latest Airflow (2.4.2)?

Unfortunately this only occurs at scale, so it's not something we can quickly check in a test environment. We will upgrade Airflow again in January. However, the code involved is the same between 2.2.5 and 2.4.2, so we don't expect any difference.

@ashwinshankar77 (Contributor) commented:

Hi, we are seeing this deadlock error at Slack, at scale, causing disruptions. Is there any fix that's being worked on? cc: @potiuk

Details: Airflow 2.2.5, Celery executor, 2 schedulers with row-level locking enabled, MySQL 8 metadata DB.

  raise exception
  File "/opt/airflow/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/opt/airflow/.venv/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/airflow/.venv/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/opt/airflow/.venv/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/opt/airflow/.venv/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.run_id = %s AND task_instance.task_id IN (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]

@potiuk (Member) commented Jan 10, 2023

The easiest way you can check it (and without losing too much time of others trying to investigate) is to upgrade to the latest version of Airflow and see if the problem is fixed there @ashwinshankar77. If it happens for you frequently, this is by far the best contribution you can make to the project, and a way to give back for the free project you get. Because then we can confirm that either the issue has been fixed already or that it is still there. We've implemented many fixes since 2.2.5 to various parts of Airflow that are likely to fix it.

And even if we find a fix, the only way you will be able to get it is to ... upgrade. Because we only release fixes in the latest line of Airflow (so even if we find the problem now, the fix will come in 2.5.1 and you will have to upgrade anyway).

@ashwinshankar77 (Contributor) commented:

> The easiest way you can check it (and without losing too much time of others trying to investigate) is to upgrade to the latest version of Airflow and see if the problem is fixed there.

Sounds good. We do have plans to upgrade next month 👍

@potiuk (Member) commented Jan 10, 2023

> Sounds good. We do have plans to upgrade next month 👍

Cool. Let us know what you find out. And in case you find it happening, providing logs, including server-side logs from the DB containing lock logs, would be really helpful. No guarantees, but it might bring us closer to diagnosing and fixing it.
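
For Postgres, one way to capture such server-side lock logs (an illustrative postgresql.conf sketch; adjust to your deployment):

# postgresql.conf -- log lock waits and deadlock details
log_lock_waits = on          # log whenever a session waits longer than deadlock_timeout
deadlock_timeout = 1s        # how long to wait before running the deadlock check
log_line_prefix = '%m [%p] ' # timestamp and backend PID, matching the logs above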

@eladkal (Contributor) commented Feb 6, 2023

Closing, as there is no actionable task on this issue at this point.

@eladkal closed this as not planned Feb 6, 2023
@dstaple (Contributor, Author) commented May 12, 2023

We have upgraded to Airflow 2.5.3 and verified this issue is still present.

This issue occurs intermittently and is not easy to repro -- as reported in the issue it sometimes does not occur for weeks.

I see Airflow 2.6.0 has been released, but upgrading Airflow is a lot of work and there is no reason to expect this is fixed on main, so I think it's fair to say this is an ongoing issue.

This issue was moved to a discussion. You can continue the conversation there.
