We generate the tasks for our DAGs using an external database. This has been working for us since 2020. We have over 150 DAGs running and each one uses an "objects" table filtered by a "dag" field to determine what it needs to do:
As an example:
| DAG  | Object  | IsActive | Batch |
|------|---------|----------|-------|
| Dag1 | User    | True     | 1     |
| Dag1 | Student | False    | 2     |
| Dag1 | Courses | True     | 3     |
| Dag2 | Books   | True     | null  |
| Dag2 | Authors | True     | null  |
Here are some examples of the DAG graphs we use:
In the above DAG, the file contains a loop that generates the tasks from the database records. The "batch" field determines which batch each object runs in, so the number of objects in each column of the image can be changed dynamically as requirements change.
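To make the pattern concrete, here is a stripped-down sketch of what one of these DAG files looks like. The column names match the table above, but the query is simplified and the rows are hard-coded so the sketch runs standalone:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator


def load_object(name: str) -> None:
    # Real task logic would live here.
    print(f"processing {name}")


def fetch_rows(dag_name: str):
    # In the real setup this runs something like:
    #   SELECT object, batch FROM objects
    #   WHERE dag = %s AND is_active = true ORDER BY batch
    # against the external database. Rows are hard-coded here
    # so the sketch is self-contained.
    return [("User", 1), ("Courses", 3)]


with DAG("Dag1", start_date=datetime(2020, 1, 1), schedule=None) as dag:
    start = EmptyOperator(task_id="start")
    previous_batch: list = [start]
    current_batch: list = []
    current_batch_no = None

    # Rows arrive ordered by batch; every task in batch N depends on
    # every task in the previous batch, producing the column layout
    # shown in the image.
    for obj, batch in fetch_rows("Dag1"):
        if batch != current_batch_no and current_batch:
            previous_batch, current_batch = current_batch, []
        current_batch_no = batch
        task = PythonOperator(
            task_id=f"load_{obj}",
            python_callable=load_object,
            op_args=[obj],
        )
        for upstream in previous_batch:
            upstream >> task
        current_batch.append(task)
```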
Or this one:
In the above case, the DAG file contains the list of tasks and whatever logic is needed to run them, but each row, representing a separate entity, is pulled from the database. So any time we need to add or remove an object, or disable one for a little while, we can do it in the database.
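A similarly simplified sketch of this second layout, where the chain of tasks is fixed in the file and one chain is stamped out per active database row:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(entity: str) -> None:
    print(f"extract {entity}")


def transform(entity: str) -> None:
    print(f"transform {entity}")


def load(entity: str) -> None:
    print(f"load {entity}")


def fetch_entities(dag_name: str):
    # Stand-in for: SELECT object FROM objects
    #               WHERE dag = %s AND is_active = true
    return ["Books", "Authors"]


with DAG("Dag2", start_date=datetime(2020, 1, 1), schedule=None) as dag:
    for entity in fetch_entities("Dag2"):
        # One fixed extract -> transform -> load row per database entity;
        # flipping IsActive in the table adds or removes the whole row.
        steps = [
            PythonOperator(
                task_id=f"{step.__name__}_{entity}",
                python_callable=step,
                op_args=[entity],
            )
            for step in (extract, transform, load)
        ]
        steps[0] >> steps[1] >> steps[2]
```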
I was wondering what the community thinks about structuring DAGs from a database as shown above. The database is hit in top-level code, since it determines what needs to run as part of the DAG. Obviously this is not recommended in the documentation, because the database gets hit every time the DAG file is parsed. But given the number of DAGs and their complexity, we have not seen any performance problems or locking against the external database we use for this purpose. We could build a more complex solution that generates the .py file from the database records, regenerating it only when a record changes. But given how well our system has been working, we wonder whether that is needed.
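For context, a lighter-weight variant of that idea would be to snapshot the table to a local JSON file that a separate job refreshes when records change, and have the top-level code read the file at parse time instead of the database. A rough sketch; the path and key names are illustrative, not our current implementation:

```python
import json
from pathlib import Path

# Hypothetical path; a separate job (or a maintenance DAG) would rewrite
# this file whenever the objects table changes.
SNAPSHOT = Path("/opt/airflow/dags/config/objects.json")


def fetch_rows(dag_name: str):
    # Parse-time read of the local snapshot; the external database is
    # only touched by whatever process refreshes the file.
    rows = json.loads(SNAPSHOT.read_text())
    return [
        (r["object"], r["batch"])
        for r in rows
        if r["dag"] == dag_name and r["is_active"]
    ]
```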
Would love to get the community's thoughts on the above solution.