
Deploy repo-server as StatefulSet instead of Deployment #7927

Open
kadamanas93 opened this issue Dec 14, 2021 · 33 comments
Labels
enhancement New feature or request

Comments

@kadamanas93

Summary

The High Availability document (https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/) mentions that argocd-repo-server uses the /tmp directory to store git repositories, and recommends using a persistent volume if disk space becomes an issue. However, using a persistent volume in a Deployment prevents the Deployment from being scaled up.

Motivation

Our git repository is a monorepo and contains 10 years of history. As such, the git fetch process sometimes takes more than 5 minutes. I have used ARGOCD_EXEC_TIMEOUT to increase this timeout, and the pod is able to retrieve the repo most of the time. However, the pod eventually exceeds the emptyDir volume mounted on /tmp and k8s evicts it. This triggers another download of the git repo, and the cycle continues.
The way I have worked around this problem is by adding a PVC to the Deployment and mounting it on /tmp. The volume is big enough to hold all the data and it survives pod restarts. But a new issue arises: I can't scale up the repo-server and have to run with a single pod.
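
A minimal sketch of this workaround (the claim name, size, and namespace are illustrative; the exact /tmp volume to replace in the repo-server Deployment differs between Argo CD versions):

# Hypothetical example: create a PVC big enough for the repo cache, then
# swap the Deployment's emptyDir volume backing /tmp for this claim
# (e.g. via a kustomize patch). With a ReadWriteOnce claim the
# repo-server has to stay at a single replica, which is exactly the
# limitation described above.
cat <<'EOF' | kubectl apply -n argocd -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: argocd-repo-server-tmp
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF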

Proposal

Provide manifests where the repo-server can be deployed as a StatefulSet. An ephemeral volume is not a good solution, since its contents are deleted and the repo then has to be re-fetched.
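
A rough sketch of what such a manifest might look like, assuming the container spec is otherwise copied from the upstream argocd-repo-server Deployment (names, image tag, and sizes below are illustrative, not a proposed final manifest):

# Illustrative only: a StatefulSet whose volumeClaimTemplates give each
# replica its own persistent /tmp instead of the emptyDir used today.
cat <<'EOF' | kubectl apply -n argocd -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-repo-server
spec:
  serviceName: argocd-repo-server
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-repo-server
  template:
    metadata:
      labels:
        app.kubernetes.io/name: argocd-repo-server
    spec:
      containers:
      - name: argocd-repo-server
        image: quay.io/argoproj/argocd:v2.3.3  # match your installed version
        # ports, env, probes, and securityContext from the upstream
        # Deployment are omitted here for brevity
        volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumeClaimTemplates:
  - metadata:
      name: tmp
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
EOF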

@kadamanas93 kadamanas93 added the enhancement New feature or request label Dec 14, 2021
@kadamanas93 kadamanas93 changed the title [argo-cd] Deploy repo-server as StatefulSet instead of Deployment Deploy repo-server as StatefulSet instead of Deployment Dec 14, 2021
@diranged

@kadamanas93 I agree with this feature request... we have the same basic problem that you do. Would the maintainers accept a PR to optionally use a StatefulSet for the repo-server, I wonder?

@pierluigilenoci

@mkilchhofer do you accept PRs from external people?

@mkilchhofer
Member

mkilchhofer commented May 27, 2022

For sure, we accept PRs from external people in the argo-helm repository.

But our (argo-helm) intent is to stay in line with the upstream manifests in this repo. E.g. converting repo-server to a StatefulSet in argo-helm only would not comply with the intent stated in the chart README:

The default installation is intended to be similar to the provided Argo CD releases.

@pierluigilenoci

pierluigilenoci commented May 27, 2022

@mkilchhofer what you say makes sense, but then how can we fix it?
Is there any way forward?

Many charts (for example Prometheus, see below) let you choose between a Deployment and a StatefulSet.
By adding this option, the chart can stay aligned with upstream (using Deployment as the default) while still giving users the choice.
Could that be an option?

Ref:
https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/templates/server/deploy.yaml#L2

and
https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/templates/server/sts.yaml#L2

@crenshaw-dev
Collaborator

@kadamanas93 I'm not opposed to the PVC route, but I'd love to make that option unnecessary for you if possible.

As such the git fetch process takes more than 5mins sometimes.

I'm not a git expert, but I think there are ways to decrease the size of a large repo, e.g. repacking. I believe git fetch also has options to only fetch the N most recent commits. Would something like that be helpful for your use case?
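
For illustration, a couple of hedged examples of what those two ideas could look like when run manually against a cached copy of the repo (plain git, not an existing Argo CD option; the directory path is a placeholder):

# Repack and prune to shrink an existing on-disk copy of the repo.
git -C /path/to/cached/repo repack -a -d --depth=50 --window=50
git -C /path/to/cached/repo gc --prune=now

# Fetch only recent history instead of all 10 years, e.g. the last
# 100 commits of each ref being fetched:
git -C /path/to/cached/repo fetch origin --depth=100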

@diranged

diranged commented Jun 8, 2022

@crenshaw-dev I'm sure it would help... but our mono-repo is a ~10-15GB checkout and gets thousands of commits per day. I really think that from an efficiency standpoint, there is a point at which the repo-server should be a StatefulSet with a persistent volume behind it. Not all orgs will need this of course... but some will.

@crenshaw-dev
Collaborator

What version of Argo CD are you running? Recent versions should be fetching a relatively small amount of data for each commit. If those small fetches are adding up to something huge over time, we could consider a gc or repack cron to mitigate the issue.

But yeah @diranged I agree a PVC does make sense at some point. It's complicated by the fact that repo paths are randomized to protect against directory traversal. We'd have to persist a repo-to-directory map in redis or something to facilitate recovery after a restart.
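
Sketching what such a gc cron might do inside the repo-server container (hypothetical, not a current Argo CD feature; the /tmp/_argocd-repo path matches what later comments in this thread show):

# Run periodically (e.g. from a CronJob or sidecar) to keep each cached
# repo's object store compacted.
for repo in /tmp/_argocd-repo/*/; do
    git -C "$repo" gc --auto --quiet || true
done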

@diranged

diranged commented Jun 9, 2022

What version of Argo CD are you running? Recent versions should be fetching a relatively small amount of data for each commit. If those small fetches are adding up to something huge over time, we could consider a gc or repack cron to mitigate the issue.

But yeah @diranged I agree a PVC does make sense at some point. It's complicated by the fact that repo paths are randomized to protect against directory traversal. We'd have to persist a repo-to-directory map in redis or something to facilitate recovery after a restart.

{
    "Version": "v2.3.3+07ac038",
    "BuildDate": "2022-03-30T00:06:18Z",
    "GitCommit": "07ac038a8f97a93b401e824550f0505400a8c84e",
    "GitTreeState": "clean",
    "GoVersion": "go1.17.6",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KsonnetVersion": "v0.13.1",
    "KustomizeVersion": "v4.4.1 2021-11-11T23:36:27Z",
    "HelmVersion": "v3.8.0+gd141386",
    "KubectlVersion": "v0.23.1",
    "JsonnetVersion": "v0.18.0"
}

@crenshaw-dev
Collaborator

Yeah, that's got the patch which is supposed to keep repo size down. But an occasional gc could help. How much bigger than the source repo is that directory getting on disk?

@diranged

diranged commented Jun 9, 2022

Yeah, that's got the patch which is supposed to keep repo size down. But an occasional gc could help. How much bigger than the source repo is that directory getting on disk?

Hard to say... it's really hard to get into that container. The only real information I've got is that we have had to significantly increase our checkout timeouts because we were hitting context-exceeded alarms. Is there a reasonable way for me to exec into the container and get you any useful data? I know it's been locked down recently (good).

@crenshaw-dev
Collaborator

we have had to significantly increase our checkout timeouts because we were hitting context-exceeded alarms

Yeah that's the super weird part. It shouldn't be pulling that much data for just a few commits (I assume the fetch is happening a fairly high number of times daily).

I know it's been locked down recently (good).

If you have k8s exec RBAC privileges, you should be able to get in. Just kubectl get po -n argocd to get the repo-server pod name and then kubectl exec -it <pod name> -- bash. Once you're in, start chmod-ing stuff to grant rx on directories in /tmp. If there are a bunch of directories, it may take a while to find the randomly-named dir that contains the right repo.

Once you've found the right one, du -sh should give a nice-to-read directory size.

I wouldn't worry too much about setting permissions back. The repo-server locks down permissions after every manifest refresh. If you're super worried, restart the Deployment so it starts fresh.

@diranged

diranged commented Jun 9, 2022

🤦

I had gotten this far and then gave up thinking that I wouldn't have the permissions to go further:

$ k exec -ti argocd-repo-server-b6d5ff99d-9592m -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
argocd@argocd-repo-server-b6d5ff99d-9592m:~$ cd /tmp
argocd@argocd-repo-server-b6d5ff99d-9592m:/tmp$ ls
_argocd-repo  reposerver-ask-pass.sock
argocd@argocd-repo-server-b6d5ff99d-9592m:/tmp$ ls _argocd-repo/
ls: cannot open directory '_argocd-repo/': Permission denied

Here's our big ass repo:

argocd@argocd-service-repo-server-64d7587bd8-82ml9:/tmp/_argocd-repo$ du -sch a9588946-38c5-48db-bda9-e8aa0fee4b92
8.0G	a9588946-38c5-48db-bda9-e8aa0fee4b92
8.0G	total

@crenshaw-dev
Collaborator

@diranged yeah, we don't want to allow root on the repo-server, but that keeps us from using user/group access to limit repo filesystem permissions. So if an attacker can get chmod access, they can undo the protection. But it's better than nothing. :-)

Sweet, so the repo isn't especially large on-disk. I was worried it was gonna be 50GB or something due to all the fetches.

This makes me wonder why each fetch is so large.

Could you pop into that repo and run GIT_TRACE=1 git fetch origin --tags --force? It should print the transfer size.

@diranged

diranged commented Jun 9, 2022

GIT_TRACE=1 git fetch origin --tags --force

Unfortunately we use the GitHub App authentication model, which isn't really supported from within the container with git... that said, I ran it from my own laptop. I've stripped out some of our own information here, but I am not sure what you're looking for. Are you looking for the "average amount of change" between each argocd git pull?

06:41:46.939377 git.c:455               trace: built-in: git config --get-all hub.host
06:41:46.947786 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Library/Developer/CommandLineTools/usr/bin/git
06:41:46.948010 exec-cmd.c:238          trace: resolved executable dir: /Library/Developer/CommandLineTools/usr/bin
06:41:46.948251 git.c:455               trace: built-in: git fetch origin --tags --force
06:41:46.966377 run-command.c:667       trace: run_command: unset GIT_PREFIX; GIT_PROTOCOL=version=2 ssh -o SendEnv=GIT_PROTOCOL org-...@github.com 'git-upload-pack '\''/.../....'\'''
remote: Enumerating objects: 19908, done.
remote: Counting objects: 100% (6985/6985), done.
remote: Compressing objects: 100% (900/900), done.
06:42:00.948757 run-command.c:667       trace: run_command: git index-pack --stdin -v --fix-thin '--keep=fetch-pack 10738 on MacBook-Air....' --pack_header=2,19908
06:42:00.961928 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Library/Developer/CommandLineTools/usr/libexec/git-core/git
06:42:00.963214 exec-cmd.c:238          trace: resolved executable dir: /Library/Developer/CommandLineTools/usr/libexec/git-core
06:42:00.964391 git.c:455               trace: built-in: git index-pack --stdin -v --fix-thin '--keep=fetch-pack 10738 on MacBook-Air....' --pack_header=2,19908
remote: Total 19908 (delta 6386), reused 6457 (delta 6061), pack-reused 12923
Receiving objects: 100% (19908/19908), 55.32 MiB | 6.77 MiB/s, done.
Resolving deltas: 100% (11997/11997), completed with 1276 local objects.
06:42:10.568856 run-command.c:667       trace: run_command: git rev-list --objects --stdin --not --all --quiet --alternate-refs
06:42:10.575086 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Library/Developer/CommandLineTools/usr/libexec/git-core/git
06:42:10.575803 exec-cmd.c:238          trace: resolved executable dir: /Library/Developer/CommandLineTools/usr/libexec/git-core
06:42:10.576647 git.c:455               trace: built-in: git rev-list --objects --stdin --not --all --quiet --alternate-refs
From git+ssh://github.com/.../...
...
06:42:12.533248 run-command.c:1628      run_processes_parallel: preparing to run up to 1 tasks
06:42:12.534034 run-command.c:1660      run_processes_parallel: done
06:42:12.534631 run-command.c:667       trace: run_command: git maintenance run --auto --no-quiet
06:42:12.537641 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Library/Developer/CommandLineTools/usr/libexec/git-core/git
06:42:12.537934 exec-cmd.c:238          trace: resolved executable dir: /Library/Developer/CommandLineTools/usr/libexec/git-core
06:42:12.538226 git.c:455               trace: built-in: git maintenance run --auto --no-quiet

@crenshaw-dev
Collaborator

Are you looking for the "average amount of change" between each argocd git pull?

Exactly! Just trying to figure out why the fetches take so long. I get why the initial one is slow. But then it should be pulling only incremental changes.

@crenshaw-dev
Collaborator

crenshaw-dev commented Jun 9, 2022

Receiving objects: 100% (19908/19908), 55.32 MiB | 6.77 MiB/s, done.

If that's typical, we should expect 10s fetches. Still a bit slow, but nowhere near the timeout. I wonder what's going wrong on the repo-server vs. your local machine.

I see that OP is the one who had to increase the timeout. Maybe the symptoms for your mono-repo aren't nearly as severe as what they're facing.

@diranged

diranged commented Jun 9, 2022

@crenshaw-dev Ok, for us the fetches aren't the end of the world. I can definitely see, though, how that could change as the repo grows. For the time being, our core issue is that we do not want the repo-server pod going away and losing its storage, because it is "difficult" for it to recover due to the size. I appreciate the issue you pointed out about how it's not just an STS issue; there also needs to be some mapping/state storage. It would be great if you would consider at least the StatefulSet/mapping-saving as an issue for improvement.

@crenshaw-dev
Collaborator

crenshaw-dev commented Jun 9, 2022

Gotcha, makes sense. At some point we can't optimize the repo-server any more, and it's just time for an STS. Will keep it in mind for future work. Thanks for the help debugging!

@kadamanas93
Author

Running the command in the git repo on the repo-server yields this:

16:04:48.109817 git.c:444               trace: built-in: git fetch origin --tags --force
16:04:48.115427 run-command.c:664       trace: run_command: unset GIT_PREFIX; GIT_PROTOCOL=version=2 ssh -o SendEnv=GIT_PROTOCOL git@github.com 'git-upload-pack '\''*****.git'\'''
Warning: Permanently added the ECDSA host key for IP address '********' to the list of known hosts.
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (20/20), done.
16:04:49.258010 run-command.c:664       trace: run_command: git unpack-objects --pack_header=2,88
remote: Total 88 (delta 55), reused 71 (delta 55), pack-reused 10
16:04:49.260674 git.c:444               trace: built-in: git unpack-objects --pack_header=2,88
Unpacking objects: 100% (88/88), 25.20 KiB | 108.00 KiB/s, done.
16:04:49.536971 run-command.c:664       trace: run_command: git rev-list --objects --stdin --not --all --quiet --alternate-refs
16:04:49.539423 git.c:444               trace: built-in: git rev-list --objects --stdin --not --all --quiet --alternate-refs
From github.com:**********
...
16:04:49.626349 run-command.c:1625      run_processes_parallel: preparing to run up to 1 tasks
16:04:49.626787 run-command.c:1657      run_processes_parallel: done
16:04:49.628868 run-command.c:664       trace: run_command: git maintenance run --auto --no-quiet
16:04:49.630759 git.c:444               trace: built-in: git maintenance run --auto --no-quiet

This process was really fast. The reason I had to increase the timeout was really the first repo pull, which sometimes exceeded 10 minutes. Just FYI, we are deploying ArgoCD v2.2.4.

@kadamanas93
Author

@kadamanas93 I'm not opposed to the PVC route, but I'd love to make that option unnecessary for you if possible.

As such the git fetch process takes more than 5mins sometimes.

I'm not a git expert, but I think there are ways to decrease the size of a large repo, e.g. repacking. I believe git fetch also has options to only fetch the N most recent commits. Would something like that be helpful for your use case?

Would setting git fetch to fetch only N most recent commits apply to the initial git fetch? If so, I think it would be extremely useful.

@crenshaw-dev
Collaborator

Would setting git fetch to fetch only N most recent commits apply to the initial git fetch? If so, I think it would be extremely useful.

I think that would work. I'm not sure if there are any negative side-effects to getting a partial repo and then fetching additional commits. I wonder if it still stores things efficiently on disk that way. I'd be interested in offering this as an opt-in feature when adding a new repo.

ArgoCD v2.2.4

Aside: I'd bump to 2.2.9 for all the security fixes. :-)

@kadamanas93
Author

Would setting git fetch to fetch only N most recent commits apply to the initial git fetch? If so, I think it would be extremely useful.

I think that would work. I'm not sure if there are any negative side-effects to getting a partial repo and then fetching additional commits. I wonder if it still stores things efficiently on disk that way. I'd be interested in offering this as an opt-in feature when adding a new repo.

Would pulling a single commit that wasn't part of the initial git clone be problematic?

I would also suggest something like a cron that pulls the git repo in batches (e.g. pull the last 1000 commits, then the 1000 commits before that). Is that something that is possible?

@kadamanas93
Author

I would also suggest something like a cron that pulls the git repo in batches (e.g. pull the last 1000 commits, then the 1000 commits before that). Is that something that is possible?

Looking into this, there is a --deepen option that could be run in a loop until the whole repo is fetched. I am not sure what the end condition would be, but at least ArgoCD could start with a partial repo.
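
A rough sketch of that batch-fetch loop, assuming a shallow initial clone (the repository URL is a placeholder, and git rev-parse --is-shallow-repository is one possible end condition):

# Start with a shallow clone, then deepen in batches of 1000 commits
# until the repository is no longer shallow.
git clone --depth=1000 git@github.com:example/monorepo.git repo
cd repo
while [ "$(git rev-parse --is-shallow-repository)" = "true" ]; do
    git fetch --deepen=1000
done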

@crenshaw-dev
Collaborator

I guess batch-fetching would allow you to avoid making the timeout super high.

An alternative would be to have a separate timeout for the initial fetch vs subsequent fetches. It wouldn't save disk space, but it would avoid making the timeout unnecessarily high for normal fetches.

@kadamanas93
Author

I guess batch-fetching would allow you to avoid making the timeout super high.

An alternative would be to have a separate timeout for the initial fetch vs subsequent fetches. It wouldn't save disk space, but it would avoid making the timeout unnecessarily high for normal fetches.

The issue with the alternative is that the initial fetch has to succeed. Otherwise ArgoCD ends up with all the Applications in an Unknown state. At least with batch fetching, ArgoCD can partially run.

@pierluigilenoci

pierluigilenoci commented Jun 10, 2022

@crenshaw-dev @kadamanas93 interesting discussion, but I would like to point out another issue that is at the heart of the matter.

The repo-server cache also creates problems on node disks, with pods evicted with the message: Pod The node had condition: [DiskPressure].

Quoting myself:

My point is just that, in the presence of large caches, a situation aggravated by some bugs (#4002 and helm/helm#10583), the Repo Server pods get evicted or, even worse, other pods on the same node get evicted

Ref: argoproj/argo-helm#438 (comment)

It might make sense to link the (stalled) CNCF Slack discussion as well:
https://cloud-native.slack.com/archives/C020XM04CUW/p1653511733823829

@pierluigilenoci

@kadamanas93 any update on this?

@crenshaw-dev
Collaborator

@pierluigilenoci none from me. Glancing back over the thread, it looks like we have a few potential workarounds. The STS change would require preserving some index of the cache.

Either way (workaround or persisted cache index), I don't currently have time to work on it. If you're up for contributing one of these, I'd be happy to help with the dev env and/or review.

@pierluigilenoci

@crenshaw-dev sorry for the delay in answering.

The workarounds that exist are either poor (only one replica?) or do not solve some problems (i.e. disk pressure on the node's disk). Unless there is a cleverer one, I find that a StatefulSet is the only solution.

@crenshaw-dev
Collaborator

@pierluigilenoci there have been a number of bugs causing disk pressure, but I think it's preferable to tackle those bugs rather than introducing another potentially-buggy feature. Disk usage should be fairly flat after repos are initially configured. If it's not, we should work to tune cache settings and potentially offer features to run git gc or configure fetches to be lighter-weight.

@crenshaw-dev
Collaborator

Again, I'm not completely opposed to PVC. I just haven't seen a use case yet that I think is impossible to solve with less-drastic changes.

@blakepettersson
Member

blakepettersson commented May 19, 2023

This seems like something which would be better addressed by adding support for sparse checkout / partial clones IMO (see #11198)

@mkilchhofer
Member

Maybe something for sig-scalability?
