Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically register DAGs that are used in a context manager #23592

Merged
merged 6 commits into from Sep 7, 2022

Conversation

ashb
Copy link
Member

@ashb ashb commented May 9, 2022

This has caused a bit of confusion to committers (missing off the as dag) and is just more user friendly

This is on by default but can be disabled by passing auto_register=False do a DAG

@ashb ashb requested review from kaxil and XD-DENG as code owners May 9, 2022 16:05
@ashb ashb force-pushed the automatic-dag-registration branch from 4c2bdff to a62b484 Compare May 9, 2022 16:05
@ashb
Copy link
Member Author

ashb commented May 9, 2022

This works for the current example sub-dags we have in the tests -- is there any other case I should test?

@ashb ashb requested review from malthe and potiuk May 9, 2022 16:09
@ashb ashb added this to the Airflow 2.4.0 milestone May 9, 2022
@malthe
Copy link
Contributor

malthe commented May 9, 2022

@ashb there are at least two things to consider here which I stumbled upon playing around with this:

  1. There's nothing special about with except that tasks are automatically assigned to this dag (i.e., dag=dag). Tying registration of DAGs to with is perhaps a bit confusing in this sense – also, you can use with on a particular dag instance multiple times!

  2. What about DAGs declared in functions?

    def test():
        with DAG():
            ...
    test()

    I don't mind but it is a little magic perhaps, that this automatically declares the DAG.

I ended up instead with this pattern:

dag = DAG(...)

with dag:
    ...

– as a recommended pattern, simply because then you can't forget to give the DAG instance a top-level variable.

And then as I mention in https://lists.apache.org/thread/xndt3zklxbzo5nlmjz10dtfm2pp4t4wq, we could instead warn the user if they'd created a DAG which isn't exposed/registered.

(The gist of that linked reply is that there are two different ways to do that.)

@kaxil
Copy link
Member

kaxil commented May 9, 2022

Tying registration of DAGs to with is perhaps a bit confusing in this sense – also, you can use with on a particular dag instance multiple times!

@malthe Is it the param name that is confusing? I do not understand I follow what you are saying. The use-case for this PR is that we don't need to needlessly do a with DAG(..) as dag as there is no need to have the as dag in it unless someone wants to actually use the dag variable itself

What about DAGs declared in functions?

Isn't it the same behaviour as the one with with DAG(..) as dag ?

@malthe
Copy link
Contributor

malthe commented May 9, 2022

@kaxil I understand the use-case – what I am somewhat hesitant to vote for is the conflation of the two orthogonal concerns which are both suggested for with here:

  1. As a scope for tasks created which automatically get the dag parameter set;
  2. As a mechanism to automatically register a DAG with the loading mechanism.

I think you could argue then that any DAG instance created during the loading of a module should be automatically registered – after all, why would a user create a DAG instance only to throw it away?

The problem with that is that we do have a test case that prevents this:

def test():
    dag = DAG(...)
    # Add tasks to dag and forget returning the DAG instance

test()

# or dag = test(), same result because `test` has no return value

The point is, should we require the use of with or should auto_register work regardless?

@ashb
Copy link
Member Author

ashb commented May 17, 2022

We could extend the auto-register -- but that makes it slightly more complex for SubDag (we'd have to "unregister" a dag when it gets turned in to a sub dag) but we could do it just as easily if we want to.

Edit: actually I guess that with subdag: is just as likely to be used as with dag

@ashb ashb force-pushed the automatic-dag-registration branch 2 times, most recently from c1a7796 to ac66a51 Compare May 24, 2022 09:21
@@ -3022,24 +3029,28 @@ class DagContext:

"""

_context_managed_dag: Optional[DAG] = None
_previous_context_managed_dags: List[DAG] = []
_context_managed_dags: Deque[DAG] = deque()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason we need a deque here? From what I can tell a plain list is enough (with the ordering reversed, i.e. we append to / pop from the end and use -1 to access).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason no -- mostly thinking that we only ever need to access the head value, or push a new head value, and deque is more optimized for that case.

But it hardly matters as this is not on the critical path for performance.

@ashb ashb force-pushed the automatic-dag-registration branch from ac66a51 to 301c20a Compare June 17, 2022 11:47
@ashb ashb force-pushed the automatic-dag-registration branch from 301c20a to bb5d2de Compare July 29, 2022 11:33
@@ -937,8 +953,7 @@ def test_get_dag_with_dag_serialization(self):
def test_collect_dags_from_db(self):
"""DAGs are collected from Database"""
db.clear_db_dags()
example_dags_folder = airflow.example_dags.__path__[0]
dagbag = DagBag(example_dags_folder)
dagbag = DagBag(str(example_dags_folder))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just allow DagBag to take Path, but this doesn’t need to be done now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was my thought -- it was orthogonal to this PR

@ashb ashb force-pushed the automatic-dag-registration branch from bb5d2de to ed6a98f Compare August 2, 2022 10:27
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ashb
Copy link
Member Author

ashb commented Aug 5, 2022

I still need to update the docs/tutorial etc to reflect this.

@ashb
Copy link
Member Author

ashb commented Aug 12, 2022

Reminder to self: finish this before 2.4

@ashb ashb force-pushed the automatic-dag-registration branch from ed6a98f to 6cdd00e Compare August 15, 2022 11:20
@@ -306,6 +314,8 @@ def _load_modules_from_file(self, filepath, safe_mode):
if mod_name in sys.modules:
del sys.modules[mod_name]

DagContext.current_autoregister_module_name = mod_name
Copy link
Member

@uranusjr uranusjr Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in-place operation feels a bit weird to me… Perhaps a context manager would be easier to understand and maintain? Something like

with DagContext.enable_autoregister(mod_name):
    parse(...)

This probably needs some refactoring to work though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, This is essentially used as a global variable.

It's not very easy to refactor given it's use across three or so different functions, so a context manager would be tricky. (I.e. I can't see an easy refactor to make it work right now)

@ashb ashb merged commit 51c5d12 into apache:main Sep 7, 2022
@ashb ashb deleted the automatic-dag-registration branch September 7, 2022 17:54
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants