
Deadlock when airflow try to update 'k8s_pod_yaml' in 'rendered_task_instance_fields' table #29687

Closed
amagr opened this issue Feb 22, 2023 · 6 comments · Fixed by #32341


amagr commented Feb 22, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow 2.4.2
We ran into a problem where HttpSensor fails because of a deadlock. We are running 3 different DAGs with max_active_runs=12; each calls an API and checks the response to decide whether to reschedule or move on to the next task. All of these sensors have a 1-minute poke interval, so 36 of them run at the same time. Sometimes (roughly once in 20 runs) we get the following deadlock error:

```
Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1457, in _run_raw_task
    self._execute_task_with_callbacks(context, test_mode)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1579, in _execute_task_with_callbacks
    RenderedTaskInstanceFields.write(rtif)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 75, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 36, in create_session
    session.commit()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1428, in commit
    self._transaction.commit(_to_root=self.future)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 829, in commit
    self._prepare_impl()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 808, in _prepare_impl
    self.session.flush()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3345, in flush
    self._flush(objects)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3485, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3445, in _flush
    flush_context.execute()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 633, in execute
    uow,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 241, in save_obj
    update,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1001, in _emit_update_statements
    statement, multiparams, execution_options=execution_options
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1614, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 326, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1491, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2027, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE rendered_task_instance_fields SET k8s_pod_yaml=%s WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND rendered_task_instance_fields.run_id = %s AND rendered_task_instance_fields.map_index = %s]
[parameters: ('{"metadata": {"annotations": {"dag_id": "bidder-joiner", "task_id": "capitest", "try_number": "1", "run_id": "scheduled__2023-02-15T14:15:00+00:00"}, ... (511 characters truncated) ... e": "AIRFLOW_IS_K8S_EXECUTOR_POD", "value": "True"}], "image": "artifactorymaster.outbrain.com:5005/datainfra/airflow:8cbd2a3d8c", "name": "base"}]}}', 'bidder-joiner', 'capitest', 'scheduled__2023-02-15T14:15:00+00:00', -1)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Failed to execute job 3966 for task capitest ((MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: UPDATE rendered_task_instance_fields SET k8s_pod_yaml=%s WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND rendered_task_instance_fields.run_id = %s AND rendered_task_instance_fields.map_index = %s] [parameters: ('{"metadata": {"annotations": {"dag_id": "bidder-joiner", "task_id": "capitest", "try_number": "1", "run_id": "scheduled__2023-02-15T14:15:00+00:00"}, ... (511 characters truncated) ... e": "AIRFLOW_IS_K8S_EXECUTOR_POD", "value": "True"}], "image": "artifactorymaster.outbrain.com:5005/datainfra/airflow:8cbd2a3d8c", "name": "base"}]}}', 'bidder-joiner', 'capitest', 'scheduled__2023-02-15T14:15:00+00:00', -1)] (Background on this error at: https://sqlalche.me/e/14/e3q8); 68)
```
I checked the MySQL logs, and the deadlock is caused by this query:

```sql
DELETE FROM rendered_task_instance_fields
WHERE rendered_task_instance_fields.dag_id = 'bidder-joiner-raw_data_2nd_pass_delay'
  AND rendered_task_instance_fields.task_id = 'is_data_ready'
  AND ((rendered_task_instance_fields.dag_id, rendered_task_instance_fields.task_id, rendered_task_instance_fields.run_id) NOT IN
    (SELECT subq2.dag_id, subq2.task_id, subq2.run_id
     FROM (SELECT subq1.dag_id AS dag_id, subq1.task_id AS task_id, subq1.run_id AS run_id
           FROM (SELECT DISTINCT rendered_task_instance_fields.dag_id AS dag_id, rendered_task_instance_fields.task_id AS task_id, rendered_task_instance_fields.run_id AS run_id, dag_run.execution_date AS execution_date
                 FROM rendered_task_instance_fields INNER JOIN dag_run ON rendered_task_instance_fields.dag_id = dag_run.dag_id AND rendered_task_instance_fields.run_id = dag_run.run_id
                 WHERE rendered_task_instance_fields.dag_id = 'bidder-joiner-raw_data
```

(The query is cut off here in the log.)
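That DELETE is Airflow's cleanup of old rendered fields: for each (dag_id, task_id) it removes every row except the newest `max_num_rendered_ti_fields_per_task` runs. A minimal sketch of the same keep-latest-N deletion pattern, using Python's `sqlite3` purely for illustration (the real query runs against MySQL via SQLAlchemy, and the table here is simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rendered_task_instance_fields "
    "(dag_id TEXT, task_id TEXT, run_id TEXT, execution_date TEXT)"
)
# Five runs of the same task, oldest to newest.
rows = [("dag_a", "t1", f"run_{i}", f"2023-02-{i:02d}") for i in range(1, 6)]
conn.executemany(
    "INSERT INTO rendered_task_instance_fields VALUES (?, ?, ?, ?)", rows
)

# Keep only the newest N runs for this (dag_id, task_id); delete the rest.
num_to_keep = 2
conn.execute(
    """
    DELETE FROM rendered_task_instance_fields
    WHERE dag_id = ? AND task_id = ?
      AND run_id NOT IN (
          SELECT run_id FROM rendered_task_instance_fields
          WHERE dag_id = ? AND task_id = ?
          ORDER BY execution_date DESC
          LIMIT ?
      )
    """,
    ("dag_a", "t1", "dag_a", "t1", num_to_keep),
)
remaining = conn.execute(
    "SELECT run_id FROM rendered_task_instance_fields ORDER BY run_id"
).fetchall()
print(remaining)  # [('run_4',), ('run_5',)]
```

One reason the real query looks so convoluted: MySQL does not support `LIMIT` inside an `IN`/`NOT IN` subquery, which is why Airflow wraps the inner `SELECT` in the extra `subq1`/`subq2` derived tables visible above.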

What you think should happen instead

I found a similar issue open on GitHub (#25765), so I think this should be resolved the same way: by adding the @retry_db_transaction decorator to the function that executes this query.
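For reference, the idea behind such a decorator can be sketched in a few lines. This is a simplified stand-in, not Airflow's actual implementation (the real `retry_db_transaction` lives in `airflow.utils.retries`, rolls back the session between attempts, and matches SQLAlchemy's `OperationalError`; the exception class and retry budget below are illustrative):

```python
import functools
import time

class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError in this sketch."""

MAX_DB_RETRIES = 3  # illustrative retry budget

def retry_db_transaction(func):
    """Retry the wrapped function when the DB reports a transient error
    such as MySQL's 1213 'Deadlock found' (simplified sketch)."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        for attempt in range(1, MAX_DB_RETRIES + 1):
            try:
                return func(*args, **kwargs)
            except OperationalError:
                if attempt == MAX_DB_RETRIES:
                    raise  # give up after the last attempt
                time.sleep(0.05 * attempt)  # small backoff before retrying
    return wrapped

# Simulate a DELETE that deadlocks twice, then succeeds.
attempts = {"n": 0}

@retry_db_transaction
def delete_old_rendered_fields():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OperationalError("(1213, 'Deadlock found when trying to get lock')")
    return "ok"

result = delete_old_rendered_fields()
print(result, attempts["n"])  # ok 3
```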

How to reproduce

Create 3 DAGs with max_active_runs=12 that use HttpSensor at the same time, with the same poke interval and mode="reschedule".

Operating System

Ubuntu 20

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql>=1.2.0
mysql-connector-python>=8.0.11
mysqlclient>=1.3.6
apache-airflow-providers-mysql==3.2.1
apache-airflow-providers-http==4.0.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-apache-spark==3.0.0

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


@amagr added the area:core, kind:bug and needs-triage labels on Feb 22, 2023

boring-cyborg bot commented Feb 22, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@Taragolis (Contributor) commented:

⚠️ This will lead to an increased DB size ⚠️ You could try setting [core] max_num_rendered_ti_fields_per_task to 0.
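For anyone looking for where that lives: the option goes in the [core] section of airflow.cfg (a sketch; the trade-off is discussed in this thread):

```ini
# airflow.cfg
[core]
# 0 = never delete old rendered task instance fields
# (avoids the cleanup DELETE that can deadlock on MySQL, but the
# rendered_task_instance_fields table will then grow unbounded)
max_num_rendered_ti_fields_per_task = 0
```

The same option can also be set via the AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK environment variable, following Airflow's standard AIRFLOW__{SECTION}__{KEY} mapping.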


amagr commented Feb 23, 2023

⚠️ This will lead to an increased DB size ⚠️ You could try setting [core] max_num_rendered_ti_fields_per_task to 0.

What do you mean by "this will lead to an increased DB size"?
Regarding your proposal: I guess changing this config would fix it, but then we would not see rendered fields in the UI? That is pretty significant, as it's very helpful for debugging.

@Taragolis (Contributor) commented:

Setting it to 0 disables deleting "old" rendered templates from the DB:

  • The DELETE query would never run. In Postgres (⚠️ my opinion) this query is merely slow, but in MySQL it can also cause deadlocks.
  • Rendered fields would stay accessible in the UI "forever" (until the setting is changed back to some value greater than 0).
  • As a result, the size of rendered_task_instance_fields would grow.

The relevant guard in the cleanup code:

```python
if num_to_keep <= 0:
    return
```

There is also a discussion about this on the dev list. Unfortunately I do not have time to check the long-term behaviour of the Airflow DB with MySQL; I would appreciate it if you shared your results.

@Taragolis added the mysql-deadlock label and removed the needs-triage label on Feb 23, 2023

Taragolis commented Feb 23, 2023

BTW @amagr, which version of MySQL do you use, 5.7 or 8?


amagr commented Feb 27, 2023

Hey @Taragolis, awesome, thank you for your help! I think we will change the setting to 0 and do manual deletion of old rendered fields. I will try to check the discussion you linked this week and get back to you.
As for the MySQL version, we are using 8.0.21-12.

potiuk added a commit to potiuk/airflow that referenced this issue Jul 4, 2023
This is a follow-up on apache#18616 where we introduced retries on the
occasional deadlocks when rendered task fields have been deleted
by parallel threads (this is not a real deadlock; it happens because
MySQL locks too many rows when these queries are executed and will
deadlock when one of them waits too long).

Adding a retry - while not perfect - should allow us to handle the
problem and significantly decrease the likelihood of such
deadlocks.

We can probably think about a different approach for rendered fields,
but for now retrying is - I think - an acceptable short-term fix.

Fixes: apache#32294
Fixes: apache#29687
potiuk added a commit that referenced this issue Jul 5, 2023
(Same commit message as above.)
Fixes: #32294
Fixes: #29687
ephraimbuddy pushed a commit that referenced this issue Jul 6, 2023
(Same commit message as above.)
Fixes: #32294
Fixes: #29687
(cherry picked from commit c8a3c11)