New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deadlock in Scheduler Loop when Updating Dag Run #25765
Comments
Please search for and provide extract of a detailed log of what happens at the server at the time when deadlock happens. You should be able to find it in your Mysql log - unfortunately, client deadlock does not give enough information on what deadlocks with what. The log will show the detailed query and what query held deadlock with. |
@potiuk Hello, here's the latest detected deadlock info:
|
Is this reproducible without Smart Sensors? |
@eladkal we tried turning off smart sensors and the deadlocks actually occurred much more frequently. The traceback for the deadlock error was exactly the same as when we had smart sensors enabled as well. |
I'm not 100% certain, but perhaps this is coming from this PR? https://github.com/apache/airflow/pull/22004/files Looks like the deadlock is happening with the query that's trying to delete old rendered TI fields for mysql (specifically, the Apparently it's known this queries in that method could deadlock, but for whatever reason the decorator is not properly catching+retrying in a way that keeps the scheduler marked as healthy. |
Fixed in #26347 |
Apache Airflow version
2.3.3
What happened
We have been getting occasional deadlock errors in our main scheduler loop that is causing the scheduler to error out of the main scheduler loop and terminate. The full stack trace of the error is below:
It appears the issue occurs when attempting to update the
last_scheduling_decision
field of thedag_run
table, but we are unsure why this would cause a deadlock. This issue has only been occurring when we upgrade to version 2.3.3, this was not an issue with version 2.2.4.What you think should happen instead
The scheduler loop should not have any deadlocks that cause it to exit out of its main loop and terminate. I would expect the scheduler loop to always be running constantly, which is not the case if a deadlock occurs in this loop.
How to reproduce
This is occurring for us when we run a
LocalExecutor
with smart sensors enabled (2 shards). We only have 3 other daily DAGs which run at different times, and the error seems to occur right when the start time comes for one DAG to start running. After we restart the scheduler after that first deadlock, it seems to run fine the rest of the day, but the next day when it comes time to start the DAG again, another deadlock occurs.Operating System
Ubuntu 18.04.6 LTS
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==4.1.0
apache-airflow-providers-common-sql==1.0.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-mysql==3.1.0
apache-airflow-providers-sftp==4.0.0
apache-airflow-providers-sqlite==3.1.0
apache-airflow-providers-ssh==3.1.0
Deployment
Other
Deployment details
We deploy airflow to 2 different ec2 instances. The scheduler lives on one ec2 instances and the webserver lives on a separate ec2 instance. We only run a single scheduler.
Anything else
This issue occurs once a day when the first of our daily DAGs gets triggered. When we restart the scheduler after the deadlock, it works fine for the rest of the day typically.
We use a
LocalExecutor
with aPARALLELISM
of 32, smart sensors enabled using 2 shards.Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: