Scheduler enters crash loop in certain cases with dynamic task mapping #25681
CC: @ashb @uranusjr @ephraimbuddy - might be worth at least checking before we attempt to release 2.3.4
Thanks for the amazingly detailed repro steps -- this should make our job much easier!
Task log I saw (just a note for debugging):
@ephraimbuddy That error appears to be different from the one the OP reported, FWIW.
Yes. It's different.
I was able to reproduce this in main by running the above dag after reducing the default_pool slots to 5.
It seems like a race condition: expand_mapped_task is being called multiple times.

```diff
diff --git a/airflow/models/mappedoperator.py b/airflow/models/mappedoperator.py
index 2a93e442b7..b1d9693d32 100644
--- a/airflow/models/mappedoperator.py
+++ b/airflow/models/mappedoperator.py
@@ -655,6 +655,15 @@ class MappedOperator(AbstractOperator):
         else:
             # Otherwise convert this into the first mapped index, and create
             # TaskInstance for other indexes.
+            if session.query(TaskInstance).filter(
+                TaskInstance.dag_id == self.dag_id,
+                TaskInstance.run_id == run_id,
+                TaskInstance.task_id == self.task_id,
+                TaskInstance.map_index == 0
+            ).one_or_none():
+                self.log.error("Checked ")
+                unmapped_ti.state = TaskInstanceState.REMOVED
+                return all_expanded_tis, total_length
             unmapped_ti.map_index = 0
             self.log.debug("Updated in place to become %s", unmapped_ti)
             all_expanded_tis.append(unmapped_ti)
```

With the above, some other tasks are being marked as removed too.
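The race the patch above guards against can be illustrated with a plain-Python stand-in (hypothetical names, not Airflow internals): an expansion routine with no check for already-created indexes will, when entered twice by overlapping scheduler runs, create every map index a second time.

```python
# Plain-Python sketch of the suspected race (hypothetical stand-in,
# not Airflow code): nothing stops a second overlapping call from
# re-creating map indexes the first call already made.
created = []

def expand_mapped_task(total_length):
    # No check for already-existing indexes -- the heart of the race.
    for i in range(total_length):
        created.append(i)

expand_mapped_task(3)  # first caller
expand_mapped_task(3)  # overlapping second caller
print(created)  # [0, 1, 2, 0, 1, 2] -- duplicate map indexes
```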
One issue with the provided dag is that it uses a dynamic start_date. With a dynamic start date, each time the file processor processes the file, the dag hashes won't match. This leads to the scheduler calling DagRun.verify_integrity.

```diff
diff --git a/airflow/models/dagrun.py b/airflow/models/dagrun.py
index 8dcbfcfc60..c55be069e3 100644
--- a/airflow/models/dagrun.py
+++ b/airflow/models/dagrun.py
@@ -1039,7 +1039,8 @@ class DagRun(Base, LoggingMixin):
         # Check if we have some missing indexes to create ti for
         missing_indexes: Dict["MappedOperator", Sequence[int]] = defaultdict(list)
         for k, v in existing_indexes.items():
-            missing_indexes.update({k: list(set(expected_indexes[k]).difference(v))})
+            if len(v):  # If list is empty, don't record
+                missing_indexes.update({k: list(set(expected_indexes[k]).difference(v))})
         return task_ids, missing_indexes

     def _get_task_creator(
@@ -1199,7 +1200,8 @@ class DagRun(Base, LoggingMixin):
             new_indexes[task] = range(new_length)
         missing_indexes: Dict[MappedOperator, Sequence[int]] = defaultdict(list)
         for k, v in existing_indexes.items():
-            missing_indexes.update({k: list(set(new_indexes[k]).difference(v))})
+            if len(v):
+                missing_indexes.update({k: list(set(new_indexes[k]).difference(v))})
         return missing_indexes

     @staticmethod
```

I will make a PR to fix updating the missing_indexes even when the value is empty; then we can work on the dynamic start_date issue for 2.4.0.
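The effect of the missing_indexes fix can be reproduced in isolation with hypothetical task names: for a task whose existing-index list is empty, `set(expected).difference([])` yields the full expected range, so every one of its TaskInstances gets flagged as missing (and re-created) even though another code path may be creating them concurrently. Skipping empty lists sidesteps that.

```python
# Sketch of the missing-index computation from the diff above, with
# hypothetical task names. "task_b" has no existing TaskInstances yet.
from collections import defaultdict

expected_indexes = {"task_a": range(3), "task_b": range(3)}
existing_indexes = {"task_a": [0, 1, 2], "task_b": []}

# Before the fix: task_b's empty list means ALL of its indexes are
# reported missing, so verify_integrity re-creates them.
missing = defaultdict(list)
for k, v in existing_indexes.items():
    missing.update({k: list(set(expected_indexes[k]).difference(v))})
print(sorted(missing["task_b"]))  # [0, 1, 2]

# After the fix: tasks with empty existing-index lists are skipped.
missing_fixed = defaultdict(list)
for k, v in existing_indexes.items():
    if len(v):
        missing_fixed.update({k: list(set(expected_indexes[k]).difference(v))})
print("task_b" in missing_fixed)  # False
```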
Will that fix be a part of 2.3.4?
No. It won't. Sorry.
Apache Airflow version
2.3.3
What happened
The scheduler crashed when attempting to queue a dynamically mapped task which is directly downstream of, and only dependent on, another dynamically mapped task.
scheduler.log
What you think should happen instead
The scheduler does not crash and the dynamically mapped task executes normally
How to reproduce
Setup
Steps to reproduce
Operating System
macOS and Dockerized Linux on macOS
Versions of Apache Airflow Providers
None
Deployment
Other
Deployment details
I have tested and confirmed this bug is present in three separate deployments:
1. airflow standalone
2. DaskExecutor on docker compose with Postgres 12
3. KubernetesExecutor on Docker Desktop builtin Kubernetes cluster with Postgres 11
All three of these deployments were executed locally on a MacBook Pro.
1. airflow standalone
I created a new Python 3.9 virtual environment, installed Airflow 2.3.3, configured a few environment variables, and executed `airflow standalone`. Here is a bash script that completes all of these tasks: airflow_standalone.sh
Here is the DAG code that can be placed in a `dags` directory in the same location as the above script. Note that this DAG code triggers the bug in all environments I tested: bug_test.py
Once set up, simply activating the DAG will demonstrate the bug.
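The attached bug_test.py is not reproduced here. Purely as an illustration, a minimal DAG of the shape this report describes (a mapped task directly downstream of another mapped task, with the dynamic start_date flagged later in the thread) might look like the following sketch; the task names and fan-out width are made up.

```python
# Hypothetical sketch of a DAG matching the reported shape -- NOT the
# OP's attached bug_test.py. Assumes Airflow 2.3's TaskFlow API.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule_interval=timedelta(minutes=5),
    # Dynamic start_date: re-evaluated on every parse, which the thread
    # identifies as one trigger for repeated verify_integrity calls.
    start_date=datetime.now() - timedelta(hours=1),
    catchup=True,
)
def bug_test():
    @task
    def make_values():
        return list(range(20))  # wide enough fan-out to exceed a small pool

    @task
    def first(value):
        return value * 2

    @task
    def second(value):
        return value + 1

    # A mapped task directly downstream of another mapped task's output.
    second.expand(value=first.expand(value=make_values()))


bug_test()
```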
2. DaskExecutor on docker compose with Postgres 12
I cannot provide a full replication of this setup as it is rather in-depth. The Docker image starts from `python:3.9-slim`, then installs Airflow with the appropriate constraints. It has a lot of additional packages installed, both system and Python. It also has a custom entrypoint that can run the Dask scheduler in addition to regular Airflow commands. Here are the applicable Airflow configuration values: airflow.cfg
Here is a docker-compose file that is nearly identical to the one I use (I just removed unrelated bits): docker-compose.yml
I also had to manually change the default pool size to 15 in the UI in order to trigger the bug. With the default pool set to 128 the bug did not trigger.
3. KubernetesExecutor on Docker Desktop builtin Kubernetes cluster with Postgres 11
This uses the official Airflow Helm Chart with the following values overrides:
values.yaml
The Docker image is the official `airflow:2.3.3-python3.9` image with a single environment variable modified:
Anything else
This is my understanding of the timeline that produces the crash:
It seems that if some, but not all, subtasks for the second task have been created when the scheduler attempts to queue the mapped task, then the scheduler tries to create all of the subtasks again, which causes a unique constraint violation.
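The final step of that timeline can be sketched with sqlite3 standing in for the Airflow metadata database (schema simplified here; in Airflow 2.3 the task_instance table is in fact keyed on dag_id, task_id, run_id, and map_index): re-inserting a subtask row that already exists raises an integrity error, which is what aborts the scheduler loop. The dag/task/run identifiers below are hypothetical.

```python
# Sketch of the unique-constraint violation, with sqlite3 standing in
# for the metadata DB and a simplified task_instance table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE task_instance (
           dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
           PRIMARY KEY (dag_id, task_id, run_id, map_index)
       )"""
)

row = ("bug_test", "second", "manual_run_1", 0)  # hypothetical identifiers
conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?)", row)

try:
    # The scheduler re-creates a subtask that already exists:
    conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?)", row)
except sqlite3.IntegrityError as exc:
    print("scheduler would crash here:", exc)
```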
NOTES
The provided test case succeeded on the DaskExecutor deployment when the default pool was 128; however, when I reduced that pool to 15, this bug occurred.
Are you willing to submit PR?
Code of Conduct