-
Notifications
You must be signed in to change notification settings - Fork 13.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124
create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
… gitSync is enabled regardless the persistence conf
611faac
to
01f1585
Compare
Stupid question. Why do you need gitSync AND dagPersitence @hussein-awala ? It always bothered me, because it never made any sense to me. What is the reason you need persistence when you have git Sync? What does it give you more than just gitSync? What is your expectation by enabling it ? |
The main reason is pulling dags from different sources (some dags from git repo and other dags from CI/CD for ex). To avoid losing the other dags after a pod restart, we need to activate the persistence. airflow/chart/templates/_helpers.yaml Lines 413 to 419 in c1a9cf8
in this case, I could add the other dags to a new folder Later, I will propose a new feature: add support to sync the dags from different git repos and adding dags from other sources to separate sub folders in dags folder, and if we don't want to add git dags to the persistent volume, we can just change the volume mount path and add an empty volume for git dags:
|
I think you are complicating Airflow's DAG syncing needlessly. IMHO It should not be Airflow's job to combine syncing data from multiple sources - this is not Airflow's job really and I would be the first one to strongly oppose such feature. It has a number of problems (includiing managing access to write different DAGs, knowing when to sync which dirs, knowing what are relationships between different subfolders etc. that should be solved externally to Airflow). We certainly do not want to solve all those problems at the level of Airflow. And we do not have to. This problem can be solved way better. I believe you should combine all of your DAG sources outside of Airflow and let Airflow use single GitSync from single repository (by using submodules and recursive update you should be able to achieve pulling DAGs from multiple repositories). You can achieve what you want by:
This (and even way more complex) setup have been successfully deployed and works very well for a number of companies. You can watch this - fantastic - talk from JAGEX where they explained how they manage to run 170+ github repos synced by a single GitSync and effectively manage dependencies between different parts of they DAG folder using git. This enables you setting up your own conventions, CI checks, switch different branches of different repos manually for all your sub-folders and generally keep full history of all "valid" combinations of all repos in the single "umbrella" git repo where all the sub-dags are kept as submodules. It can't get better than that - all without even touching any of Airflow complexity - either in Airflow itself or in the Helm chart. The talk is here https://airflowsummit.org/sessions/2022/manage-dags-at-scale/ Please watch it and let me know why you think such solution will not work for you. I think you will need extremely strong justifications if you want to increase complexity and add such feature to Airflow - because you can solve it way simpler following good and working practices. |
I proposed a PR which would disable the combo #28822 |
Git Sync and Persistence for DAGs makes very little sense together and is largely misleading our users on what it does. Git Sync provides atomicity of DAG folder synchronisation via checking out a complete copy of the DAGs folder and swapping symbolic link pointing to it. It does not play well with networked persistence. It makes it super-easy by users unaware how git-sync and persistence work under-the-hood to walk into several traps: * git sync on persistent remote volumes such as EFS generate a LOT of extra traffic due to the way how git sync works (it creates second working folder for dags and replaces symbolic link to folders which effectively forces full sync of whole DAG folder for all involved instances with every commit * due to that sync that gets distributed over multiple clients of persistent volumes it looses the atomicity property of git sync and the above case where there are burst of synchronisation betwween multiple nodes, it is very likely to trigger inconsistent DAG parsing * the problem amplifies when the network volumes are distributed among multiple nodes and there are some networking limits (for example not provisioned IOPS in EFS). The amount of traffic generated at sync might cause even more inconsistencies - only solvable by paying extra IOPS (where it would not be needed normally) * users might be tricked into trying to use gitSync and also update DAGs using persistence (so basically combine the development friendly dag distribution over persistent volumes and production-ready git-sync - without being aware that git-sync will override the manually synced DAGS when swapping the symbolic links On top of it, the current status is that it does not work. There are several issues where volumes are missing when the combo is used in certain situations and better than fixing those is to disable it. Closes: apache#27545 Closes: apache#27476 Closes: apache#27080 Related: apache#27124
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
closes: #27080
Before this PR which added the standalone Dag Processors to the helm chart, the git sync logic was:
dags.gitSync.enabled
then:git_sync_container-init
andgit_sync_container
to scheduler pods containersdags
from:dags.persistence.enabled
: claim existing PVC ifdags.persistence.existingClaim
is defined or create a new PVC if notemptyDir
volume if notdags
to scheduler container as readOnlydags
togit_sync_container
dags.persistence.enabled
then:dags
from an existing PVC ifdags.persistence.existingClaim
is defined or create a new PVC if notSo when
dags.gitSync.enabled
we always had agit_sync_container
running in the schedulers pods, and thedags.persistence.enabled
is used to specify the volume typeAfter the PR, we create the
git_sync_container
in the scheduler pods only if:gitSync
is enableddagProcessor
is not enabled andgitSync
is enabledwhich I find logical (code)
But we create the
git_sync_container
in thedag-processor
only ifgitSync
is enabled anddags.persistence
is not enabled (code)I think we can always create
git_sync_container
in thedag-processor
pod when it is enabled andgitSync
is enabled, and we usedags.persistence
to switch between PVC and emptyDir