I commented on that issue, but I'm not sure it's related: no one there mentions that the jobs are completed (apart from me). To me this one is more relevant -> #33402
This issue is related to the watcher not being able to scale and process the events on time. That leads to many completed pods accumulating over time.
related: #22612
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.4
What happened?
Airflow schedules tasks to run on a Kubernetes cluster. However, for some reason, when a task completes its pod is not cleaned up (as would normally happen). I am not sure whether the bug is on the Airflow side or in the Kubernetes provider.
core.parallelism is set to 32. The total number of tasks (running and completed) above is 32, which matches core.parallelism and our single default pool. Next, those 2 running jobs complete and no more tasks are running. This causes the system to malfunction because Airflow still thinks the tasks are running even though they have completed. It looks like Kubernetes is not reporting the state back to Airflow, so the Airflow executor runs out of open slots.
Also, tasks had queued up in the scheduled state and could not be promoted to the queued state.
At 10:24 the airflow-scheduler was restarted.
The marked execution shows that the airflow-scheduler catches up on those tasks after the restart. The 5th column displays the start date/time and the 6th the end date/time. From the graph one can see that the job usually takes up to 2 minutes rather than an hour.
We enabled debug logs on the scheduler, so when it happens next time we will hopefully know more.
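For reference, this is roughly how we enabled the debug logs, assuming the standard airflow.cfg option (the environment-variable form would be AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG):

```ini
[logging]
# Verbose scheduler/executor logging until the issue reproduces again.
logging_level = DEBUG
```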
What you think should happen instead?
Airflow tasks should continue running.
How to reproduce
I could not reproduce the issue, but in the last 3 weeks it has happened 5 times on our production system. We suspect that it started breaking for us when we upgraded apache-airflow-providers-cncf-kubernetes from 7.13.0 to 8.0.1, and it keeps breaking on 8.1.1 as well.
Probably related to:
#36998
#33402
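For context on the cleanup behavior that stopped working: as we understand it, worker-pod deletion is controlled by the kubernetes_executor section of airflow.cfg. A sketch of what we believe the defaults are (please verify against your Airflow/provider version):

```ini
[kubernetes_executor]
# When True (the default), the executor deletes a worker pod once its
# task finishes -- this is the behavior that stopped working for us.
delete_worker_pods = True
# Pods of failed tasks are kept by default so they can be debugged.
delete_worker_pods_on_failure = False
```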
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 8.20.0 Amazon integration (including Amazon Web Services (AWS)).
apache-airflow-providers-cncf-kubernetes 8.1.1 Kubernetes
apache-airflow-providers-common-io 1.3.0 Common IO Provider
apache-airflow-providers-common-sql 1.11.1 Common SQL Provider
apache-airflow-providers-databricks 6.2.0 Databricks
apache-airflow-providers-ftp 3.7.0 File Transfer Protocol (FTP)
apache-airflow-providers-github 2.5.1 GitHub
apache-airflow-providers-google 10.17.0 Google services including: - Google Ads - Google Cloud (GCP) - Google Firebase - Google LevelDB - Google Marketing Platform - Google Workspace (formerly Google Suite)
apache-airflow-providers-hashicorp 3.6.4 Hashicorp including Hashicorp Vault
apache-airflow-providers-http 4.10.0 Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap 3.5.0 Internet Message Access Protocol (IMAP)
apache-airflow-providers-mysql 5.5.4 MySQL
apache-airflow-providers-postgres 5.10.2 PostgreSQL
apache-airflow-providers-sftp 4.9.1 SSH File Transfer Protocol (SFTP)
apache-airflow-providers-smtp 1.6.1 Simple Mail Transfer Protocol (SMTP)
apache-airflow-providers-snowflake 5.4.0 Snowflake
apache-airflow-providers-sqlite 3.7.1 SQLite
apache-airflow-providers-ssh 3.10.1 Secure Shell (SSH)
Deployment
Official Apache Airflow Helm Chart
Deployment details
We deployed Airflow on a Kubernetes cluster using the KubernetesExecutor setting in the Helm chart.
Anything else?
No response
Are you willing to submit PR?
Code of Conduct