Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create gitSync container in dag-processor pod when gitSync is enabled regardless the persistence conf #27124

Closed

Conversation

hussein-awala
Copy link
Member

@hussein-awala hussein-awala commented Oct 18, 2022

closes: #27080

Before this PR which added the standalone Dag Processors to the helm chart, the git sync logic was:

  • if dags.gitSync.enabled then:
    1. add git_sync_container-init and git_sync_container to scheduler pods containers
    2. load the volume dags from:
      • a PVC if dags.persistence.enabled: claim existing PVC if dags.persistence.existingClaim is defined or create a new PVC if not
      • emptyDir volume if not
    3. mount the volume dags to scheduler container as readOnly
    4. mount the volume dags to git_sync_container
  • elif dags.persistence.enabled then:
    • load the volume dags from an existing PVC if dags.persistence.existingClaim is defined or create a new PVC if not
    • add this volume to schedulers container

So when dags.gitSync.enabled we always had a git_sync_container running in the schedulers pods, and the dags.persistence.enabled is used to specify the volume type

After the PR, we create the git_sync_container in the scheduler pods only if:

  • executor is one of the Local Executors and gitSync is enabled
  • or dagProcessor is not enabled and gitSync is enabled

which I find logical (code)

But we create the git_sync_container in the dag-processor only if gitSync is enabled and dags.persistence is not enabled (code)

I think we can always create git_sync_container in the dag-processor pod when it is enabled and gitSync is enabled, and we use dags.persistence to switch between PVC and emptyDir

@boring-cyborg
Copy link

boring-cyborg bot commented Oct 18, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented Oct 24, 2022

Stupid question. Why do you need gitSync AND dagPersitence @hussein-awala ? It always bothered me, because it never made any sense to me.

What is the reason you need persistence when you have git Sync? What does it give you more than just gitSync?

What is your expectation by enabling it ?

@hussein-awala
Copy link
Member Author

What is the reason you need persistence when you have git Sync? What does it give you more than just gitSync?

The main reason is pulling dags from different sources (some dags from git repo and other dags from CI/CD for ex). To avoid losing the other dags after a pod restart, we need to activate the persistence.
To achieve my goal, I need also to replace this block

{{ define "airflow_dags" -}}
{{- if .Values.dags.gitSync.enabled -}}
{{ (printf "%s/dags/repo/%s" .Values.airflowHome .Values.dags.gitSync.subPath) }}
{{- else -}}
{{ (printf "%s/dags" .Values.airflowHome) }}
{{- end -}}
{{- end -}}
by:

{{ define "airflow_dags" -}}
{{ (printf "%s/dags" .Values.airflowHome) }}
{{- end -}}

in this case, I could add the other dags to a new folder /dags/other_dags, the git-sync could pull the git repo dags to /dags/repo, and the dag-processor will process all the dags in the folder /dags.
(I will test it tomorrow before commit my code)

Later, I will propose a new feature: add support to sync the dags from different git repos and adding dags from other sources to separate sub folders in dags folder, and if we don't want to add git dags to the persistent volume, we can just change the volume mount path and add an empty volume for git dags:

/dags:
  /persistent_dags -> persitent volume:
    /folder1
    /folder2
    ...
  /git_dags -> empty dir
    /repo1
    /repo2
    ...

@potiuk
Copy link
Member

potiuk commented Oct 25, 2022

I think you are complicating Airflow's DAG syncing needlessly. IMHO It should not be Airflow's job to combine syncing data from multiple sources - this is not Airflow's job really and I would be the first one to strongly oppose such feature. It has a number of problems (includiing managing access to write different DAGs, knowing when to sync which dirs, knowing what are relationships between different subfolders etc. that should be solved externally to Airflow). We certainly do not want to solve all those problems at the level of Airflow. And we do not have to. This problem can be solved way better.

I believe you should combine all of your DAG sources outside of Airflow and let Airflow use single GitSync from single repository (by using submodules and recursive update you should be able to achieve pulling DAGs from multiple repositories).

You can achieve what you want by:

  1. Creating single git repo which will have mutliple sub-repos - each git subrepo in different directory
  2. Making sure your CI/CD and other sources push their DAGs to their respective Git repos

This (and even way more complex) setup have been successfully deployed and works very well for a number of companies. You can watch this - fantastic - talk from JAGEX where they explained how they manage to run 170+ github repos synced by a single GitSync and effectively manage dependencies between different parts of they DAG folder using git. This enables you setting up your own conventions, CI checks, switch different branches of different repos manually for all your sub-folders and generally keep full history of all "valid" combinations of all repos in the single "umbrella" git repo where all the sub-dags are kept as submodules. It can't get better than that - all without even touching any of Airflow complexity - either in Airflow itself or in the Helm chart.

The talk is here https://airflowsummit.org/sessions/2022/manage-dags-at-scale/

Please watch it and let me know why you think such solution will not work for you. I think you will need extremely strong justifications if you want to increase complexity and add such feature to Airflow - because you can solve it way simpler following good and working practices.

@potiuk
Copy link
Member

potiuk commented Jan 10, 2023

I proposed a PR which would disable the combo #28822

potiuk added a commit to potiuk/airflow that referenced this pull request Jan 10, 2023
Git Sync and Persistence for DAGs makes very little sense together
and is largely misleading our users on what it does.

Git Sync provides atomicity of DAG folder synchronisation via
checking out a complete copy of the DAGs folder and swapping
symbolic link pointing to it. It does not play well with
networked persistence.

It makes it super-easy by users unaware how git-sync and
persistence work under-the-hood to walk into several traps:

* git sync on persistent remote volumes such as EFS generate a LOT
  of extra traffic due to the way how git sync works (it creates
  second working folder for dags and replaces symbolic link to folders
  which effectively forces full sync of whole DAG folder for all
  involved instances with every commit
* due to that sync that gets distributed over multiple clients of
  persistent volumes it looses the atomicity property of git sync
  and the above case where there are burst of synchronisation betwween
  multiple nodes, it is very likely to trigger inconsistent DAG parsing
* the problem amplifies when the network volumes are distributed among
  multiple nodes and there are some networking limits (for example
  not provisioned IOPS in EFS). The amount of traffic generated at
  sync might cause even more inconsistencies - only solvable by paying
  extra IOPS (where it would not be needed normally)
* users might be tricked into trying to use gitSync and also update
  DAGs using persistence (so basically combine the development friendly
  dag distribution over persistent volumes and production-ready
  git-sync - without being aware that git-sync will override the
  manually synced DAGS when swapping the symbolic links

On top of it, the current status is that it does not work. There
are several issues where volumes are missing when the combo is used
in certain situations and better than fixing those is to disable it.

Closes: apache#27545
Closes: apache#27476
Closes: apache#27080

Related: apache#27124
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 25, 2023
@github-actions github-actions bot closed this Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:helm-chart Airflow Helm Chart stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Git Sync containers are missing when Dag Processor is enabled
4 participants