Deadlock when airflow try to update 'k8s_pod_yaml' in 'rendered_task_instance_fields' table #29687
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
What do you mean by "This lead increase size of DB?"
airflow/airflow/models/renderedtifields.py Lines 206 to 209 in 78115c5
There is also a discussion on the dev list. Unfortunately, I do not have time to check the long-term behaviour of the Airflow DB with MySQL; I would appreciate it if you shared your results.
BTW, @amagr, which version of MySQL do you use, 5.7 or 8?
Hey @Taragolis, awesome, thank you for your help! I think we will change the setting to 0 and manually delete old rendered fields. I will try to check the discussion you listed this week and get back to you.
This is a follow-up to #18616, where we introduced retries on the occasional deadlocks that occur when rendered task fields are deleted by parallel threads (this is not a real deadlock; it happens because MySQL locks too many things when queries are executed and reports a deadlock when one of those queries waits too long). Adding a retry, while not perfect, should handle the problem and significantly decrease the likelihood of such deadlocks. We can probably think about a different approach for rendered fields, but for now retrying is, I think, an acceptable short-term fix. Fixes: #32294 Fixes: #29687 (cherry picked from commit c8a3c11)
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Airflow 2.4.2
We ran into a problem where HttpSensor fails because of a deadlock. We are running 3 different DAGs with max_active_runs set to 12; each calls an API and checks the response to decide whether to reschedule or go on to the next task. All these sensors have a 1-minute poke interval, so 36 of them are running at the same time. Sometimes (about once in 20 runs) we get the following deadlock error:
Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1457, in _run_raw_task
    self._execute_task_with_callbacks(context, test_mode)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1579, in _execute_task_with_callbacks
    RenderedTaskInstanceFields.write(rtif)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 75, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 36, in create_session
    session.commit()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1428, in commit
    self._transaction.commit(_to_root=self.future)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 829, in commit
    self._prepare_impl()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 808, in _prepare_impl
    self.session.flush()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3345, in flush
    self._flush(objects)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3485, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 3445, in _flush
    flush_context.execute()
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 633, in execute
    uow,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 241, in save_obj
    update,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1001, in _emit_update_statements
    statement, multiparams, execution_options=execution_options
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1614, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 326, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1491, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2027, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context
    cursor, statement, parameters, context
  File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE rendered_task_instance_fields SET k8s_pod_yaml=%s WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND rendered_task_instance_fields.run_id = %s AND rendered_task_instance_fields.map_index = %s]
[parameters: ('{"metadata": {"annotations": {"dag_id": "bidder-joiner", "task_id": "capitest", "try_number": "1", "run_id": "scheduled__2023-02-15T14:15:00+00:00"}, ... (511 characters truncated) ... e": "AIRFLOW_IS_K8S_EXECUTOR_POD", "value": "True"}], "image": "artifactorymaster.outbrain.com:5005/datainfra/airflow:8cbd2a3d8c", "name": "base"}]}}', 'bidder-joiner', 'capitest', 'scheduled__2023-02-15T14:15:00+00:00', -1)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Failed to execute job 3966 for task capitest ((MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: UPDATE rendered_task_instance_fields SET k8s_pod_yaml=%s WHERE rendered_task_instance_fields.dag_id = %s AND rendered_task_instance_fields.task_id = %s AND rendered_task_instance_fields.run_id = %s AND rendered_task_instance_fields.map_index = %s] [parameters: ('{"metadata": {"annotations": {"dag_id": "bidder-joiner", "task_id": "capitest", "try_number": "1", "run_id": "scheduled__2023-02-15T14:15:00+00:00"}, ... (511 characters truncated) ... e": "AIRFLOW_IS_K8S_EXECUTOR_POD", "value": "True"}], "image": "artifactorymaster.outbrain.com:5005/datainfra/airflow:8cbd2a3d8c", "name": "base"}]}}', 'bidder-joiner', 'capitest', 'scheduled__2023-02-15T14:15:00+00:00', -1)] (Background on this error at: https://sqlalche.me/e/14/e3q8); 68)
I checked the MySQL logs, and the deadlock is caused by this query:
What you think should happen instead
I found a similar issue open on GitHub (#25765), so I think this one should be resolved the same way: by adding the @retry_db_transaction decorator to the function that executes this query.
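To illustrate the idea, here is a minimal, standard-library-only sketch of a retry-on-deadlock decorator. It is not Airflow's @retry_db_transaction; the names `DeadlockError`, `retry_on_deadlock`, and `write_rendered_fields` are hypothetical, and the deadlock is simulated rather than raised by MySQL.

```python
import functools
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error such as MySQL error 1213."""


def retry_on_deadlock(retries=3, delay=0.0):
    """Retry the wrapped function when it raises DeadlockError.

    Illustrative helper only; the name and parameters are assumptions.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == retries:
                        raise  # out of attempts: surface the error
                    time.sleep(delay)  # let the competing transaction finish
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_deadlock(retries=3)
def write_rendered_fields():
    """Simulated write that hits a deadlock twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "written"


result = write_rendered_fields()
```

In a real database setting, the failed transaction would also need to be rolled back before each retry, and the retry count would likely be configurable; both are left out of this sketch.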
How to reproduce
Create 3 DAGs with max_active_runs set to 12 that use HttpSensor at the same time, with the same poke interval and mode set to reschedule.
Operating System
Ubuntu 20
Versions of Apache Airflow Providers
apache-airflow-providers-common-sql>=1.2.0
mysql-connector-python>=8.0.11
mysqlclient>=1.3.6
apache-airflow-providers-mysql==3.2.1
apache-airflow-providers-http==4.0.0
apache-airflow-providers-slack==6.0.0
apache-airflow-providers-apache-spark==3.0.0
Deployment
Docker-Compose
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct