
FIX Orphaned tasks stuck in executor as running #16550

Merged
merged 1 commit into apache:main from fix-tasks-stuck-in-executor-running on Jun 22, 2021

Conversation

Jorricks (Contributor)

Related: #13542

The issue discussed there was caused by multiple things. One of them is that when the scheduler picks up tasks it assumes to be orphaned, those tasks may never have made it to Celery. When a task's execution never happens, it is cleaned up automatically, but only partially: when the scheduler later retries to queue the task, it cannot, because the executor still holds a reference to the task in its running set. This PR fixes the described issue.
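To illustrate the failure mode only (this is not the actual Airflow patch): the sketch below, with hypothetical names such as `SimplisticExecutor` and `is_known_to_celery`, shows how a stale key left in an executor's `running` set blocks re-queuing, and how discarding the key for a task that never reached Celery lets the scheduler queue it again.

```python
# Illustrative sketch only, not the Airflow code changed by this PR.
from typing import Set, Tuple

# Simplified task instance key: (dag_id, task_id, run_id)
TaskInstanceKey = Tuple[str, str, str]


class SimplisticExecutor:
    def __init__(self) -> None:
        # Keys the executor believes are currently running.
        self.running: Set[TaskInstanceKey] = set()

    def try_adopt(self, key: TaskInstanceKey, is_known_to_celery: bool) -> None:
        # Adopting an orphaned task adds it to `running`, even if the task
        # never actually made it to the Celery broker.
        self.running.add(key)
        if not is_known_to_celery:
            # Without this cleanup, queue_task() below would refuse to
            # re-queue the task forever: the stale reference described above.
            self.running.discard(key)

    def queue_task(self, key: TaskInstanceKey) -> bool:
        # A key already in `running` is treated as "already running" and is
        # therefore never queued again.
        if key in self.running:
            return False
        self.running.add(key)
        return True
```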

@boring-cyborg boring-cyborg bot added the area:Scheduler Scheduler or dag parsing Issues label Jun 20, 2021
boring-cyborg bot commented Jun 20, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@Jorricks force-pushed the fix-tasks-stuck-in-executor-running branch from 21a176f to 28236f4 on June 21, 2021 08:57
@ashb ashb merged commit 90f0088 into apache:main Jun 22, 2021
boring-cyborg bot commented Jun 22, 2021

Awesome work, congrats on your first merged pull request!

@jhtimmins jhtimmins added this to the Airflow 2.1.1 milestone Jun 22, 2021
ashb pushed a commit that referenced this pull request Jun 22, 2021
kaxil pushed a commit to astronomer/airflow that referenced this pull request Jun 22, 2021
kaxil pushed a commit to astronomer/airflow that referenced this pull request Jun 23, 2021
(cherry picked from commit 90f0088)
(cherry picked from commit 83fb4bf)
kaxil pushed a commit to astronomer/airflow that referenced this pull request Jun 23, 2021
(cherry picked from commit 90f0088)
(cherry picked from commit 83fb4bf)
(cherry picked from commit d44c223)
Jorricks added a commit to Jorricks/airflow that referenced this pull request Jun 24, 2021
kaxil pushed a commit that referenced this pull request Jul 2, 2021
The Celery executor currently adopts any task that has ever run before and has since been cleared.

**Example of the issue:**
We have a DAG that runs over 150 sensor tasks and 50 ETL tasks with a concurrency of 3 and max_active_runs of 16. This setup is required because we want to divide the resources and don't want this DAG to take up all of them. What happens is that many tasks stay in the scheduled state for a while, because the scheduler can't queue them due to the concurrency of 3. However, because of the current implementation, if these tasks had ever run before, they would get adopted by the scheduler's executor instance and become stuck forever [without this PR](#16550). They should never have been adopted in the first place.
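For reference, a minimal sketch of a DAG configured with the limits from the example above, assuming an Airflow 2.1-era API; only `concurrency=3` and `max_active_runs=16` come from the example, while the DAG id, schedule, and placeholder task are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_concurrency_limited",  # placeholder id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    concurrency=3,        # at most 3 task instances of this DAG run at once
    max_active_runs=16,   # up to 16 DAG runs may be active concurrently
) as dag:
    # The real DAG has ~150 sensors and ~50 ETL tasks; a single stand-in here.
    DummyOperator(task_id="placeholder_task")
```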

**Contents of the PR**:
1. Tasks in the scheduled state should never have arrived at an executor, so we remove the scheduled state from the set of states eligible for adoption.
2. Since a task instance's `external_executor_id` is quite important in deciding whether it is adopted, we also reset it when we reset the state of the TaskInstance (see the sketch after this list).
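A minimal sketch of the two changes described above, using hypothetical names (`adoptable_states`, `split_adoptable`, `reset_orphaned`) and simplified types; it is not the actual Airflow diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(str, Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"


@dataclass
class TaskInstance:
    task_id: str
    state: State
    external_executor_id: Optional[str] = None


# 1. Scheduled tasks never reached an executor, so they are not adoptable.
adoptable_states = {State.QUEUED, State.RUNNING}


def split_adoptable(tis: List[TaskInstance]) -> List[TaskInstance]:
    """Return only the task instances an executor may try to adopt."""
    return [ti for ti in tis if ti.state in adoptable_states]


# 2. When an orphaned task is reset back to scheduled, also clear its
#    external_executor_id so a later adoption pass does not pick it up again.
def reset_orphaned(ti: TaskInstance) -> None:
    ti.state = State.SCHEDULED
    ti.external_executor_id = None
```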
jhtimmins pushed a commit that referenced this pull request Aug 9, 2021
(cherry picked from commit 554a239)
jhtimmins pushed a commit that referenced this pull request Aug 13, 2021
(cherry picked from commit 554a239)
kaxil pushed a commit that referenced this pull request Aug 17, 2021
(cherry picked from commit 554a239)
jhtimmins pushed a commit that referenced this pull request Aug 17, 2021
(cherry picked from commit 554a239)
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 27, 2021
(cherry picked from commit 554a23928efb4ff1d87d115ae2664edec3a9408c)
GitOrigin-RevId: 4d436182194b6d79d5a0c040d4dd07310ba74faf
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Mar 10, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 4, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 10, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Aug 27, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 4, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
aglipska pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 7, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Dec 7, 2022
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jan 27, 2023
GitOrigin-RevId: 554a23928efb4ff1d87d115ae2664edec3a9408c
Labels
area:Scheduler Scheduler or dag parsing Issues

3 participants