create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

hussein-awala · 2022-10-18T22:14:12Z

Before this PR which added the standalone Dag Processors to the helm chart, the git sync logic was:

if dags.gitSync.enabled then:
1. add git_sync_container-init and git_sync_container to scheduler pods containers
2. load the volume dags from:
  - a PVC if dags.persistence.enabled: claim existing PVC if dags.persistence.existingClaim is defined or create a new PVC if not
  - emptyDir volume if not
3. mount the volume dags to scheduler container as readOnly
4. mount the volume dags to git_sync_container
elif dags.persistence.enabled then:
- load the volume dags from an existing PVC if dags.persistence.existingClaim is defined or create a new PVC if not
- add this volume to schedulers container

So when dags.gitSync.enabled we always had a git_sync_container running in the schedulers pods, and the dags.persistence.enabled is used to specify the volume type

After the PR, we create the git_sync_container in the scheduler pods only if:

executor is one of the Local Executors and gitSync is enabled
or dagProcessor is not enabled and gitSync is enabled

which I find logical (code)

But we create the git_sync_container in the dag-processor only if gitSync is enabled and dags.persistence is not enabled (code)

I think we can always create git_sync_container in the dag-processor pod when it is enabled and gitSync is enabled, and we use dags.persistence to switch between PVC and emptyDir

boring-cyborg · 2022-10-18T22:14:15Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

… gitSync is enabled regardless the persistence conf

potiuk · 2022-10-24T02:26:02Z

Stupid question. Why do you need gitSync AND dagPersitence @hussein-awala ? It always bothered me, because it never made any sense to me.

What is the reason you need persistence when you have git Sync? What does it give you more than just gitSync?

What is your expectation by enabling it ?

hussein-awala · 2022-10-24T22:27:05Z

What is the reason you need persistence when you have git Sync? What does it give you more than just gitSync?

The main reason is pulling dags from different sources (some dags from git repo and other dags from CI/CD for ex). To avoid losing the other dags after a pod restart, we need to activate the persistence.
To achieve my goal, I need also to replace this block

airflow/chart/templates/_helpers.yaml

Lines 413 to 419 in c1a9cf8

    
           {{ define "airflow_dags" -}} 
        
           {{- if .Values.dags.gitSync.enabled -}} 
        
           {{ (printf "%s/dags/repo/%s" .Values.airflowHome .Values.dags.gitSync.subPath) }} 
        
           {{- else -}} 
        
           {{ (printf "%s/dags" .Values.airflowHome) }} 
        
           {{- end -}} 
        
           {{- end -}}

by:

{{ define "airflow_dags" -}}
{{ (printf "%s/dags" .Values.airflowHome) }}
{{- end -}}

in this case, I could add the other dags to a new folder /dags/other_dags, the git-sync could pull the git repo dags to /dags/repo, and the dag-processor will process all the dags in the folder /dags.
(I will test it tomorrow before commit my code)

Later, I will propose a new feature: add support to sync the dags from different git repos and adding dags from other sources to separate sub folders in dags folder, and if we don't want to add git dags to the persistent volume, we can just change the volume mount path and add an empty volume for git dags:

/dags:
  /persistent_dags -> persitent volume:
    /folder1
    /folder2
    ...
  /git_dags -> empty dir
    /repo1
    /repo2
    ...

potiuk · 2022-10-25T21:36:45Z

I think you are complicating Airflow's DAG syncing needlessly. IMHO It should not be Airflow's job to combine syncing data from multiple sources - this is not Airflow's job really and I would be the first one to strongly oppose such feature. It has a number of problems (includiing managing access to write different DAGs, knowing when to sync which dirs, knowing what are relationships between different subfolders etc. that should be solved externally to Airflow). We certainly do not want to solve all those problems at the level of Airflow. And we do not have to. This problem can be solved way better.

I believe you should combine all of your DAG sources outside of Airflow and let Airflow use single GitSync from single repository (by using submodules and recursive update you should be able to achieve pulling DAGs from multiple repositories).

You can achieve what you want by:

Creating single git repo which will have mutliple sub-repos - each git subrepo in different directory
Making sure your CI/CD and other sources push their DAGs to their respective Git repos

This (and even way more complex) setup have been successfully deployed and works very well for a number of companies. You can watch this - fantastic - talk from JAGEX where they explained how they manage to run 170+ github repos synced by a single GitSync and effectively manage dependencies between different parts of they DAG folder using git. This enables you setting up your own conventions, CI checks, switch different branches of different repos manually for all your sub-folders and generally keep full history of all "valid" combinations of all repos in the single "umbrella" git repo where all the sub-dags are kept as submodules. It can't get better than that - all without even touching any of Airflow complexity - either in Airflow itself or in the Helm chart.

The talk is here https://airflowsummit.org/sessions/2022/manage-dags-at-scale/

Please watch it and let me know why you think such solution will not work for you. I think you will need extremely strong justifications if you want to increase complexity and add such feature to Airflow - because you can solve it way simpler following good and working practices.

potiuk · 2023-01-10T08:10:31Z

I proposed a PR which would disable the combo #28822

Git Sync and Persistence for DAGs makes very little sense together and is largely misleading our users on what it does. Git Sync provides atomicity of DAG folder synchronisation via checking out a complete copy of the DAGs folder and swapping symbolic link pointing to it. It does not play well with networked persistence. It makes it super-easy by users unaware how git-sync and persistence work under-the-hood to walk into several traps: * git sync on persistent remote volumes such as EFS generate a LOT of extra traffic due to the way how git sync works (it creates second working folder for dags and replaces symbolic link to folders which effectively forces full sync of whole DAG folder for all involved instances with every commit * due to that sync that gets distributed over multiple clients of persistent volumes it looses the atomicity property of git sync and the above case where there are burst of synchronisation betwween multiple nodes, it is very likely to trigger inconsistent DAG parsing * the problem amplifies when the network volumes are distributed among multiple nodes and there are some networking limits (for example not provisioned IOPS in EFS). The amount of traffic generated at sync might cause even more inconsistencies - only solvable by paying extra IOPS (where it would not be needed normally) * users might be tricked into trying to use gitSync and also update DAGs using persistence (so basically combine the development friendly dag distribution over persistent volumes and production-ready git-sync - without being aware that git-sync will override the manually synced DAGS when swapping the symbolic links On top of it, the current status is that it does not work. There are several issues where volumes are missing when the combo is used in certain situations and better than fixing those is to disable it. Closes: apache#27545 Closes: apache#27476 Closes: apache#27080 Related: apache#27124

github-actions · 2023-02-25T00:13:08Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

hussein-awala requested review from dstandish and jedcunningham as code owners October 18, 2022 22:14

boring-cyborg bot added the area:helm-chart Airflow Helm Chart label Oct 18, 2022

hussein-awala added 2 commits October 19, 2022 12:25

create gitSync container and init container in dag-processor pod when…

fd7e2c5

… gitSync is enabled regardless the persistence conf

update helm unit tests

01f1585

hussein-awala force-pushed the fix/gitSync_with_persistence branch from 611faac to 01f1585 Compare October 19, 2022 10:25

eladkal added this to the Airflow Helm Chart 1.8.0 milestone Oct 21, 2022

Merge branch 'main' into fix/gitSync_with_persistence

c1a9cf8

jedcunningham removed this from the Airflow Helm Chart 1.8.0 milestone Dec 8, 2022

potiuk mentioned this pull request Jan 10, 2023

Git Sync containers are missing when Dag Processor is enabled #27080

Closed

2 tasks

potiuk mentioned this pull request Jan 10, 2023

Disable GitSync + Persistence combo in the Helm Chart #28822

Closed

github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 25, 2023

github-actions bot closed this Mar 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

hussein-awala commented Oct 18, 2022 •

edited by eladkal

boring-cyborg bot commented Oct 18, 2022

potiuk commented Oct 24, 2022

hussein-awala commented Oct 24, 2022

potiuk commented Oct 25, 2022

potiuk commented Jan 10, 2023

github-actions bot commented Feb 25, 2023

create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

Conversation

hussein-awala commented Oct 18, 2022 • edited by eladkal

boring-cyborg bot commented Oct 18, 2022

potiuk commented Oct 24, 2022

hussein-awala commented Oct 24, 2022

potiuk commented Oct 25, 2022

potiuk commented Jan 10, 2023

github-actions bot commented Feb 25, 2023

hussein-awala commented Oct 18, 2022 •

edited by eladkal