Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks stuck in queued state forever after upgrade from 2.2.0 to 2.3.4 #26773

Closed
1 of 2 tasks
vilozio opened this issue Sep 29, 2022 · 3 comments
Closed
1 of 2 tasks

Tasks stuck in queued state forever after upgrade from 2.2.0 to 2.3.4 #26773

vilozio opened this issue Sep 29, 2022 · 3 comments
Labels
area:core kind:bug This is a clearly a bug
Milestone

Comments

@vilozio
Copy link

vilozio commented Sep 29, 2022

Apache Airflow version

Other Airflow 2 version

What happened

We have Airflow (2.2.0) deployed to Kubernetes (1.21.14-gke.700) with the official Airflow Helm Chart (1.6.0). Database is PostgreSQL.

We use KubernetesExecutor.

Recently we upgraded Airflow to the recent (at that point of time) version 2.3.4 (I was waiting for this version because of the fix #24356).

Initially the upgrade went fine, but then we noticed that DAGs just stops running new tasks. The tasks are stuck in queued state. Actions like clear, delete of a run or manual run doesn't help. Only restart of Airflow scheduler helps, but then after one day (we have only daily jobs) tasks are stuck again.

After restart of scheduler, the next run of tasks finishes successfully, but on the 3rd day they are stuck again.

I have logs and some ideas why it could happen, I will write it below.

What you think should happen instead

Tasks run successfully on the scheduled time.

How to reproduce

Upgrade Airflow 2.2.0 with DAGs to the version 2.3.4.

Operating System

Windows 11

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

  • Airflow 2.2.0
  • Official Airflow Helm Chart 1.6.0
  • Kubernetes 1.21.14-gke.700
  • PostgreSQL 13
  • executor: KubernetesExecutor

Anything else

This problem occurs every 3rd day. We downgraded the Airflow back to 2.2.0 until we find out the reason.

In logs the scheduler writes something like this (left mention of only one dag task, there are other 27 dag tasks)

[2022-09-14T08:01:30.656+0000] {dagbag.py:472} DEBUG - Loaded DAG <DAG: MY_DAG_1>
[2022-09-14T08:01:30.746+0000] {processor.py:650} INFO - DAG(s) dict_keys([MY_DAG_1']) retrieved from /opt/airflow/dags/create_dags_from_configs.py
[2022-09-14T08:01:30.747+0000] {dagbag.py:619} DEBUG - Running dagbag.sync_to_db with retries. Try 1 of 3
[2022-09-14T08:01:30.747+0000] {dagbag.py:624} DEBUG - Calling the DAG.bulk_sync_to_db method
[2022-09-14T08:01:30.966+0000] {scheduler_job.py:354} INFO - 28 tasks up for execution:
 <TaskInstance: MY_DAG_1.main scheduled__2022-09-13T04:00:00+00:00 [scheduled]>
 ...
[2022-09-14T08:01:31.037+0000] {scheduler_job.py:547} INFO - Sending TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-09-13T04:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-09-14T08:01:31.038+0000] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'MY_DAG_1', 'main', 'scheduled__2022-09-13T04:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/create_dags_from_configs.py']
...
[2022-09-14T08:01:31.054+0000] {base_executor.py:159} DEBUG - 28 running task instances
[2022-09-14T08:01:31.055+0000] {base_executor.py:160} DEBUG - 28 in queue
[2022-09-14T08:01:31.055+0000] {base_executor.py:161} DEBUG - 36 open slots

Then for every DAG task there appears these error messages in logs

[2022-09-12T16:32:45.654+0000] {kubernetes_executor.py:469} INFO - Found 2 queued task instances
[2022-09-12T16:32:45.673+0000] {kubernetes_executor.py:510} INFO - TaskInstance: <TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [queued]> found in queued state but was not launched, rescheduling
[2022-09-12T16:32:46.336+0000] {scheduler_job.py:419} INFO - DAG MY_DAG_1 has 1/16 running and queued tasks
[2022-09-12T16:32:46.337+0000] {scheduler_job.py:505} INFO - Setting the following tasks to queued state:
<TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [scheduled]>
[2022-09-12T16:32:49.649+0000] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-08-03T07:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

This repeats until I restart the scheduler and delete all tasks to reschedule them.

At first I thought that this situation is related to #21225, but in the issue Celery is used instead of KubernetesExecutor.

I also have a suspicion this error may be caused by the changes in #23016 as there the trigger logic was changed for Celery, but not for Kubernetes Executor.

All the symptoms are similar to the issue #25728. It seems that Task Key was never removed from self.running after it initially rescheduled itself. This also happens in our case with KubernetesExecutor.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vilozio vilozio added area:core kind:bug This is a clearly a bug labels Sep 29, 2022
@flaviussn
Copy link

The same thing happened to me related with this issue. The bug is fixed in version 2.4.0 and 2.4.1 for me

@potiuk
Copy link
Member

potiuk commented Oct 10, 2022

Closng as fixed it 2.4.0. We can always reaopened if it is not. Please double-check @vilozio if this is fixed after migrating to 2.4.0 and comment here.

@potiuk potiuk closed this as completed Oct 10, 2022
@potiuk potiuk added this to the Airflow 2.4.0 milestone Oct 10, 2022
@vilozio
Copy link
Author

vilozio commented Feb 20, 2023

Sorry for a late reply. This issue was resolved for me after migrating to 2.4.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

3 participants