Kubernetes is not reporting back workers status to Airflow #39200

Open · aru-trackunit opened this issue Apr 23, 2024 · 5 comments

Labels: kind:bug (This is clearly a bug), provider:cncf-kubernetes (Kubernetes provider related issues)

aru-trackunit (Contributor) commented Apr 23, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.4

What happened?

Airflow schedules tasks to run on a Kubernetes cluster. However, for some reason, when a task completes its pod is not cleared out (as would normally happen). I am not sure whether the bug is on the Airflow side or in apache-airflow-providers-cncf-kubernetes.

core.parallelism is set to 32.
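For reference, a minimal sketch of how that setting maps to configuration (assuming the standard airflow.cfg / environment-variable mapping; our actual values are managed through the Helm chart):

```ini
# airflow.cfg (equivalently, set AIRFLOW__CORE__PARALLELISM=32 as an env var)
[core]
parallelism = 32
```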

[Screenshot 2024-04-22 at 11:48:03]

The total number of tasks (running and completed) above is 32, which matches core.parallelism and our single default pool. What happens next is that those 2 running jobs complete and no more tasks run.

This causes the system to malfunction: Airflow still thinks the tasks are running even though they have completed. It looks like the Kubernetes side is not reporting the state back to Airflow, and the Airflow executor then runs out of open slots.

[Screenshot 2024-04-22 at 13:58:24] [Screenshot 2024-04-22 at 13:58:58]

Tasks also queued up in the scheduled state and could not be promoted to the queued state.
[Screenshot 2024-04-22 at 11:49:51]

At 10:24 the airflow-scheduler was restarted.

The marked execution shows that the airflow-scheduler catches up on those tasks after the restart. The 5th column displays the start datetime and the 6th the end datetime. From the graph one can see that the job usually takes up to 2 minutes rather than an hour.
[Screenshot 2024-04-22 at 14:00:07]

We enabled debug logs on the scheduler, so when it happens next time we will hopefully know more.
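For anyone wanting to do the same, a minimal sketch of how the log level can be raised (assuming the standard logging config; the env-var form is what gets passed through the Helm chart):

```ini
# airflow.cfg (equivalently, AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG as an env var)
[logging]
logging_level = DEBUG
```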

What you think should happen instead?

Airflow tasks should continue running.

How to reproduce

I could not reproduce the issue, but in the last 3 weeks it happened 5 times on our production system. We suspect it started breaking when we upgraded apache-airflow-providers-cncf-kubernetes from 7.13.0 to 8.0.1, and it keeps breaking on 8.1.1 as well.
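If the provider upgrade really is the trigger, pinning the provider back would be one way to test that hypothesis (an untested mitigation sketch, not a confirmed fix):

```sh
pip install "apache-airflow-providers-cncf-kubernetes==7.13.0"
```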

Probably related to:
#36998
#33402

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 8.20.0 Amazon integration (including Amazon Web Services (AWS)).
apache-airflow-providers-cncf-kubernetes 8.1.1 Kubernetes
apache-airflow-providers-common-io 1.3.0 Common IO Provider
apache-airflow-providers-common-sql 1.11.1 Common SQL Provider
apache-airflow-providers-databricks 6.2.0 Databricks
apache-airflow-providers-ftp 3.7.0 File Transfer Protocol (FTP)
apache-airflow-providers-github 2.5.1 GitHub
apache-airflow-providers-google 10.17.0 Google services including: - Google Ads - Google Cloud (GCP) - Google Firebase - Google LevelDB - Google Marketing Platform - Google Workspace (formerly Google Suite)
apache-airflow-providers-hashicorp 3.6.4 Hashicorp including Hashicorp Vault
apache-airflow-providers-http 4.10.0 Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap 3.5.0 Internet Message Access Protocol (IMAP)
apache-airflow-providers-mysql 5.5.4 MySQL
apache-airflow-providers-postgres 5.10.2 PostgreSQL
apache-airflow-providers-sftp 4.9.1 SSH File Transfer Protocol (SFTP)
apache-airflow-providers-smtp 1.6.1 Simple Mail Transfer Protocol (SMTP)
apache-airflow-providers-snowflake 5.4.0 Snowflake
apache-airflow-providers-sqlite 3.7.1 SQLite
apache-airflow-providers-ssh 3.10.1 Secure Shell (SSH)

Deployment

Official Apache Airflow Helm Chart

Deployment details

We deployed Airflow on a Kubernetes cluster using the KubernetesExecutor setting in the Helm chart.
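Concretely, the relevant excerpt of the Helm values looks roughly like this (a sketch, assuming the official chart's top-level executor key):

```yaml
# values.yaml (official Apache Airflow Helm chart, relevant excerpt)
executor: "KubernetesExecutor"
```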

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

aru-trackunit added the area:core, kind:bug (This is clearly a bug), and needs-triage (label for new issues that we didn't triage yet) labels on Apr 23, 2024
RNHTTR (Collaborator) commented Apr 23, 2024

I think this is a duplicate of #36998. BTW restarting the scheduler temporarily solves this.
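For anyone hitting this in the meantime, a restart sketch (assuming the official chart's default <release>-scheduler deployment name; adjust the release name and namespace to your install):

```sh
kubectl rollout restart deployment/<release>-scheduler -n <namespace>
```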

RNHTTR added the provider:cncf-kubernetes (Kubernetes provider related issues) label and removed the area:core label on Apr 23, 2024
aru-trackunit (Contributor, Author) commented Apr 24, 2024

I commented on that issue, but I'm not sure it's related; no one there mentions that the jobs are completed (apart from me). To me this one looks more relevant: #33402

aru-trackunit (Contributor, Author) commented

Using Airflow metrics, I also observed that when the issue is happening, two metrics that usually track each other report different numbers:

At 12:17 the airflow-scheduler was restarted.

[Screenshot 2024-04-30 at 14:17:44]
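For anyone wanting to cross-check on their own install, these are documented executor StatsD metrics that should normally agree with the observed pod count (a sketch of what to compare; which exact pair diverged for us is shown in the screenshot above):

```text
executor.open_slots      # slots the executor believes are free
executor.queued_tasks    # tasks the executor believes are queued
executor.running_tasks   # tasks the executor believes are running
```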

RNHTTR (Collaborator) commented May 7, 2024

@aru-trackunit are you able to reproduce? Are you sure this isn't #36998? If it is, this patch might resolve it.

dirrao (Collaborator) commented May 11, 2024

This issue is related to the watcher not being able to scale and process events on time, which leads to many completed pods accumulating over time.
related: #22612
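For context, a minimal sketch of the watch pattern such a watcher relies on (a simplified illustration using the kubernetes Python client, not the provider's actual code; the namespace and label selector are assumptions):

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # assumes in-cluster credentials
v1 = client.CoreV1Api()

w = watch.Watch()
# If this single event loop cannot keep up with the rate of pod events,
# terminal pod states reach the scheduler late (or not at all), completed
# pods accumulate, and their executor slots stay occupied.
for event in w.stream(v1.list_namespaced_pod,
                      namespace="airflow",             # assumed namespace
                      label_selector="airflow-worker"  # assumed worker label
                      ):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```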

dirrao added the kind:bug (This is clearly a bug) label and removed the kind:bug and needs-triage (label for new issues that we didn't triage yet) labels on May 11, 2024
dirrao self-assigned this on May 11, 2024