Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update SLA wording to reflect it is relative to Dag Run start #27111

Merged
merged 1 commit into from Oct 27, 2022

Conversation

notatallshaw-gts
Copy link
Contributor

Update Docs to reflect reality of SLA

closes: #26566

@notatallshaw-gts
Copy link
Contributor Author

notatallshaw-gts commented Oct 18, 2022

I don't think the build doc error is related to this change? At least as I'm interpreting the log output.

Edit: Yup, repushing cleared the test failures.

@notatallshaw-gts
Copy link
Contributor Author

@potiuk Whenever you get a chance, any objections?

@potiuk potiuk merged commit 639210a into apache:main Oct 27, 2022
@ephraimbuddy ephraimbuddy added the type:doc-only Changelog: Doc Only label Nov 9, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.3 milestone Nov 9, 2022
ephraimbuddy pushed a commit that referenced this pull request Nov 9, 2022
ephraimbuddy pushed a commit that referenced this pull request Nov 9, 2022
@rom1mouret
Copy link

I am not advocating for re-opening this issue now, but I wonder if it might still lack clarity?

Contrary to what one would intuitively expect, isn't the SLA relative to the next DAG run?
The code calls next_dagrun_info() twice before any SLA check. The first time to calculate the end of the current interval, and the second time to calculate the end of the next interval, which could be one day in the future if the DAG is scheduled to run once a day.

next_info = dag.next_dagrun_info(dag.get_run_data_interval(ti.dag_run), restricted=False)

@notatallshaw-gts
Copy link
Contributor Author

I am not advocating for re-opening this issue now, but I wonder if it might still lack clarity?

Contrary to what one would intuitively expect, isn't the SLA relative to the next DAG run? The code calls next_dagrun_info() twice before any SLA check. The first time to calculate the end of the current interval, and the second time to calculate the end of the next interval, which could be one day in the future if the DAG is scheduled to run once a day.

next_info = dag.next_dagrun_info(dag.get_run_data_interval(ti.dag_run), restricted=False)

That looks like it's calling next only once? next_dagrun_info is the outer call and get_run_data_interval is the inner call.

And isn't the fact it calls next_dagrun_info an artifact of Airflow's quirky execution date behavior? Traditionally you needed to calculate the execution date of the of the next dag run to understand the "real" time that current dag starts at. I wonder if this could now be replaced with using data_interval_end from the current dag? Which if so might be less error prone?

I'm happy for another PR to be opened for either even better clarity or updating this logic to be more consistent.

@rom1mouret
Copy link

Sorry, I should have quoted the entire code section:

 next_info = dag.next_dagrun_info(dag.get_run_data_interval(ti.dag_run), restricted=False)  # 1st call
 while next_info and next_info.logical_date < ts:
     next_info = dag.next_dagrun_info(next_info.data_interval, restricted=False)  # 2nd call
     <SLA miss detection here>

I agree that you need to call next_dagrun_info at least once to get the actual scheduled execution time, as opposed to the start of the interval, which is one day in the past for the current DAG run if the DAG runs daily.
The loop is also justified, since we want SLAs to be checked for past DAG runs as well.

Actually, I'm running an older version of Airflow that detects SLA misses as follows:

dttm = ti.execution_date
dttm = dag.following_schedule(dttm)  # 1st call
while dttm < timezone.utcnow():
     following_schedule = dag.following_schedule(dttm)  # 2nd call
     if following_schedule + task.sla < timezone.utcnow():
         <SLA miss>

I added some debugging print calls and I confirm that the first SLA check is counter-intuitively relative to a date in the future.
Perhaps it makes sense if we consider SLA to be relative to the execution date + the theoretical max duration of the DAG? After all, it's fair to assume that all tasks should finish before the next DAG run.

I haven't tested with the new version though, but the logic doesn’t seem to have changed.

@notatallshaw-gts
Copy link
Contributor Author

notatallshaw-gts commented Nov 11, 2022

I confirm that the first SLA check is counter-intuitively relative to a date in the future.

I'm not sure I completely follow this wording, do you have a concrete example where this differs from the wording in my PR? The wording I have given matches the motivating example I give in #26566.

And I'm less concerned about if the wording matches the steps of the code as long as the description matches the end result.

After all, it's fair to assume that all tasks should finish before the next DAG run.

I'm not sure I follow what you're getting at but no this is not a fair assumption. I have worked with many DAGs that run daily but each DAG run may be for longer than one day (sometimes weeks).

@rom1mouret
Copy link

With my Airflow (1.14.15), the SLA miss was not detected in your scenario (#26566) until the second DAG run.

I might have found the reason. Airflow 1.14.15 does check for the SLA one interval ahead of each task, but it does so for SUCCESS/SKIPPED tasks only, not the currently running tasks. When a new run is triggered, the SUCCESS/SKIPPED tasks from the previous run will be one interval behind, so looking one interval ahead works for these tasks. That’s when the SLA miss is triggered. Definitely later than I expected.

@notatallshaw-gts
Copy link
Contributor Author

With my Airflow (1.14.15), the SLA miss was not detected in your scenario (#26566) until the second DAG run.

I might have found the reason. Airflow 1.14.15 does check for the SLA one interval ahead of each task, but it does so for SUCCESS/SKIPPED tasks only, not the currently running tasks. When a new run is triggered, the SUCCESS/SKIPPED tasks from the previous run will be one interval behind, so looking one interval ahead works for these tasks. That’s when the SLA miss is triggered. Definitely later than I expected.

Do you mean 1.10.15? Regardless this fits with the experience of SLAs I had with Airflow 1.x era, that is they came much later than I expected and I didn't find them useful

Fortunately I currently only have to work with Airflow 2.x instances, and I have been setting up a lot of SLAs recently and my practical experience of them is they work as I've described in this PR.

@potiuk
Copy link
Member

potiuk commented Nov 14, 2022

Yep. There is currently probably few (certainly less than 20) percent code left from Airflow 1.10 in Airflow 2. Airflow 1.10 is end of life, end of support, end of security fixes for 1.5 year. If you are using it still @rom1mouret - you are shooting yourself in the foot, because not only maintainers do not remember too much about those days (and most of all they want to forget), but also you are one of the few last people on Earth who use Airflow 1.10.

It's not only dangerous for you and your employer (multiple Security CVEs that were already handled in Airfow 2 but they were not in Airlfow 1.10) but it also removes the biggest reason why you might want to use an open-source software - a community of users and maintainers that help you to solve your problems.

You can spend your time in searching for reasons, but the whole effort there is only useful for you, no-one else really wants to have anything to do with 1.10 any more.

You are bascally in this situation:

image

@notatallshaw-gts notatallshaw-gts deleted the sla-docs branch November 14, 2022 19:40
@rom1mouret
Copy link

Thanks.
When I wrote my first comment, I assumed that the current Airflow version would behave the same as my old version with regards to SLA, based on what I saw in the code.

Can’t wait to upgrade!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Have SLA docs reflect reality
4 participants