FIX Orphaned tasks stuck in executor as running #16550
Merged
ashb merged 1 commit into apache:main from Jorricks:fix-tasks-stuck-in-executor-running on Jun 22, 2021
Conversation
Jorricks requested review from ashb, kaxil, turbaszek and XD-DENG as code owners on June 20, 2021 17:29
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Jorricks force-pushed the fix-tasks-stuck-in-executor-running branch from 21a176f to 28236f4 on June 21, 2021 08:57
ashb approved these changes on Jun 21, 2021
Awesome work, congrats on your first merged pull request!
ashb pushed a commit that referenced this pull request on Jun 22, 2021
(cherry picked from commit 90f0088)
kaxil pushed a commit to astronomer/airflow that referenced this pull request on Jun 22, 2021
(cherry picked from commit 90f0088)
Jorricks added a commit to Jorricks/airflow that referenced this pull request on Jun 24, 2021
kaxil pushed a commit that referenced this pull request on Jul 2, 2021
Celery executor is currently adopting anything that has ever run before and has been cleared since then.

**Example of the issue:** We have a DAG that runs over 150 sensor tasks and 50 ETL tasks while having a concurrency of 3 and max_active_runs of 16. This setup is required because we want to divide the resources and we don't want this DAG to take up all of them. What happens is that many tasks sit in the scheduled state for a while, because the concurrency of 3 prevents them from being queued. However, because of the current implementation, if these tasks had ever run before, they would get adopted by the scheduler's executor instance and become stuck forever [without this PR](#16550). They should never have been adopted in the first place.

**Contents of the PR:**
1. Tasks in the scheduled state should never have arrived at an executor. Hence, we remove the scheduled task state from the states eligible for adoption.
2. Given that a task instance's `external_executor_id` is quite important in deciding whether it is adopted, we also reset it when we reset the state of the TaskInstance.
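The two changes in this commit message amount to small state rules. Below is a minimal, self-contained sketch of both; `TaskInstanceState`, `TaskInstance`, `filter_adoptable` and `reset_state` are illustrative stand-ins, not Airflow's actual classes or signatures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskInstanceState(str, Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"


@dataclass
class TaskInstance:
    task_id: str
    state: TaskInstanceState
    # Set once the task has actually been handed to a Celery worker.
    external_executor_id: Optional[str] = None


# 1. A task still in SCHEDULED has never reached an executor, so it is
#    excluded from adoption; only QUEUED and RUNNING instances qualify.
ADOPTABLE_STATES = {TaskInstanceState.QUEUED, TaskInstanceState.RUNNING}


def filter_adoptable(tis: list[TaskInstance]) -> list[TaskInstance]:
    """Return only the task instances an executor may adopt."""
    return [ti for ti in tis if ti.state in ADOPTABLE_STATES]


# 2. Resetting a task instance's state must also clear its
#    external_executor_id; otherwise a later adoption pass would treat
#    the stale id as proof that an executor still owns the task.
def reset_state(ti: TaskInstance) -> None:
    ti.state = TaskInstanceState.SCHEDULED
    ti.external_executor_id = None
```

The invariant worth noting: a non-null `external_executor_id` is read as evidence that some executor owns the task, which is why it must be cleared together with the state.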
jhtimmins pushed a commit that referenced this pull request on Aug 9, 2021
(cherry picked from commit 554a239)
jhtimmins pushed a commit that referenced this pull request on Aug 13, 2021
(cherry picked from commit 554a239)
kaxil pushed a commit that referenced this pull request on Aug 17, 2021
(cherry picked from commit 554a239)
jhtimmins pushed a commit that referenced this pull request on Aug 17, 2021
(cherry picked from commit 554a239)
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Nov 27, 2021
(cherry picked from commit 554a23928efb4ff1d87d115ae2664edec3a9408c)
GitOrigin-RevId: 4d436182194b6d79d5a0c040d4dd07310ba74faf
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Mar 10, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Jun 4, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Jul 10, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Aug 27, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Oct 4, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
aglipska pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Oct 7, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Dec 7, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request on Jan 27, 2023
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
Related: #13542

The issue discussed here was caused by multiple things. One of them is that when the scheduler picks up 'assumed to be' orphaned tasks, those tasks might never have made it to Celery. When a task's execution never happens, it is automatically cleaned up, but only partially. Once the scheduler then tries to queue the task again, it can't, because a reference to the task still sits in the executor's `running` set. This PR should fix the described issue.
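To make the failure mode concrete, here is a minimal sketch under the assumption of a much-simplified executor; `SimpleExecutor` and its methods are hypothetical stand-ins, not the real CeleryExecutor API. Partial cleanup leaves the task key in `running`, so every later attempt to queue the same task is silently refused; the fix is to discard the key during cleanup.

```python
# Hypothetical, much-simplified executor; illustrative only, not
# Airflow's CeleryExecutor API.
class SimpleExecutor:
    def __init__(self) -> None:
        self.queued: dict[str, str] = {}  # task key -> command
        self.running: set[str] = set()    # keys handed off for execution

    def queue_task(self, key: str, command: str) -> None:
        # Re-queueing is refused while any reference to the key remains.
        if key in self.queued or key in self.running:
            return
        self.queued[key] = command

    def trigger_task(self, key: str) -> None:
        command = self.queued.pop(key)
        self.running.add(key)
        # ... hand `command` to the broker here. If that step silently
        # fails, the task never runs, yet `key` stays in self.running.

    def cleanup_orphan(self, key: str) -> None:
        # Before the fix: the task's DB state was reset, but `key` was
        # left in self.running, so queue_task() refused it forever after.
        self.running.discard(key)  # the missing cleanup step
```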