Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make git fetch behavior configurable #8938

Open
yujunz opened this issue Mar 30, 2022 · 14 comments
Open

Make git fetch behavior configurable #8938

yujunz opened this issue Mar 30, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@yujunz
Copy link
Contributor

yujunz commented Mar 30, 2022

Summary

In the latest release v2.3.3, git fetch behavior was changed to default refspecs first by #8897. This causes performance hit for gerrit server which keeps in-review commits in refs/changes/* outside default refspec.

Motivation

The 👍s in #5605 shows the use case of fetching commit outside default ref spec is not neglect-able.

Our deployment model depends heavily on unmerged commit especially for frequently deployed stage environments. Fetch default then checkout will fail almost on every deploy while fetching specific ref will succeed no matter it is in default spec or not. The disk usage is a fair tradeoff for performance to us.

Checked some of our repository using the fetch commit first mechanism. The frequent used repository does cost more disk usage (553M / 11 files vs 6.4G / 41 files). Despite the high ratio (~8x by size, ~4x by files), the absolute disk usage is not huge. And git gc would effectively reduce it to 540M / 3 files in a few seconds.

Proposal

  1. Make git fetch behavior configurable. It doesn't have be to exposed in UI. A boolean field in the spec would be good enough as a feature flag since it is a one time configuration and most admins should be comfortable to set it via CLI or declarative configuration.
  2. Run git gc periodically in repo server to reduce the disk usage if that is a concern. In this way, we get both the performance on checkout and efficient disk usage with a relatively low price of a background job when idle.
@yujunz yujunz added the enhancement New feature or request label Mar 30, 2022
@crenshaw-dev
Copy link
Collaborator

@yujunz thanks for following up on this! A couple questions:

  1. Can you quantify the performance hit?
  2. I think the fetch refspec is configurable. What if we just added a fetchRefSpec field so that git checkout won't fail after the initial fetch?

@yujunz
Copy link
Contributor Author

yujunz commented Mar 31, 2022

@crenshaw-dev thanks for getting back to this topic

Gerrit model is much different from GitHub/GitLab. The default checkout command for gerrit is

git fetch ssh://yujunz@git/repo refs/changes/08/34308/4 && git checkout FETCH_HEAD

Gerrit creates a new ref for every commit uploaded for review, i.e. refs/change/<shard>/<change_num>/<patch_num>. So a single branch repository with only one head in default refspec may have ~100k refs in total. Fetching all of them is not only slow (in walking all refs and pack the delta in remote) but also cost more disk usage (and most of them are not used). So changing ref spec is really not a feasible solution.

The performance hit of fetching default first is an extra git fetch and checkout which basically doubles the time cost for a single fetch and checkout.

Do you think running git gc in reposerver background would solve the disk usage problem without sacrificing the performance of checkout?

@crenshaw-dev
Copy link
Collaborator

The performance hit of fetching default first is an extra git fetch and checkout which basically doubles the time cost for a single fetch and checkout.

Can you quantify that doubling in absolute terms? It'll be easier to advocate for this change if we can say "checkout is taking 1000ms instead of 500ms, and the extra 500ms are a problem because of ____." Without that analysis, we risk pushing for something that looks like premature optimization.

Do you think running git gc in reposerver background would solve the disk usage problem without sacrificing the performance of checkout?

There are a few options I've thought of for gc:

  1. gc after every fetch
  2. gc after every N fetches
  3. gc every M seconds

A gc after every fetch would 1) slow down GenerateManifest and 2) might prevent parallel repo access (slowing down other GenerateManifest calls).

A gc after every N fetches would 1) require the user to research their repo's packfile sizes and configure N to keep disk usage in a tolerable range and 2) pick "loser" GenerateManifest requests to take the gc performance hit.

A gc every M seconds would lock the repo at potentially-inconvenient times (like times with high load), and it wouldn't offer any disk usage guarantees (a burst of commits could exceed tolerations).

I think all of those are problematic enough to justify looking for alternatives.

changing ref spec is really not a feasible solution.

Yep, that makes sense! I think your initial suggestion is the most compelling: add a field to the Application spec to configure whether we fetch specific commit SHAs or just do a revisionless fetch.

@yujunz
Copy link
Contributor Author

yujunz commented Apr 1, 2022

Can you quantify that doubling in absolute terms? It'll be easier to advocate for this change if we can say "checkout is taking 1000ms instead of 500ms, and the extra 500ms are a problem because of ____." Without that analysis, we risk pushing for something that looks like premature optimization.

Good point. I just made a test in my local machine

Fetch default

time git fetch origin
From ssh://git:29418/repo
   0b92f49d5f..683c914127  main       -> origin/main
git fetch origin  0.05s user 0.05s system 1% cpu 6.682 total

❯ time git fetch origin --verbose
From ssh://git:29418/repo
 = [up to date]            main                                         -> origin/main
...
 = [up to date]            wip/yujunz/topic                             -> origin/wip/yujunz/topic
 
git fetch origin --verbose  0.05s user 0.04s system 1% cpu 6.825 total

Then fetch revision

time git fetch origin 1b6309b0538f610cf42706b7e45745215e78f6bb
remote: Counting objects: 35576, done
remote: Finding sources: 100% (8/8)
remote: Total 8 (delta 6), reused 8 (delta 6)
Unpacking objects: 100% (8/8), 724 bytes | 2.00 KiB/s, done.
From ssh://git:29418/repo
 * branch                  1b6309b0538f610cf42706b7e45745215e78f6bb -> FETCH_HEAD
git fetch origin 1b6309b0538f610cf42706b7e45745215e78f6bb  0.16s user 0.19s system 2% cpu 11.820 total

❯ time git fetch origin 1b6309b0538f610cf42706b7e45745215e78f6bb
From ssh://git:29418/repo
 * branch                  1b6309b0538f610cf42706b7e45745215e78f6bb -> FETCH_HEAD
git fetch origin 1b6309b0538f610cf42706b7e45745215e78f6bb  0.10s user 0.07s system 1% cpu 11.798 total

This is a typical latency of a cross region fetch. It is a surprise to me that the second fetch is not faster than the first and fetching default is not helping on fetching revision. IIUC, the latency comes from local, remote, roundtrip and network transfer. The performance could be much better on the server side since it is co-located with git repository and much powerful than my Mac. The latency should be no more than a few seconds I suppose.

There are a few options I've thought of for gc:

I was thinking about a weekly job during weekend or daily job over night, i.e. during idle period. It is not a time critical job anyway.

Yep, that makes sense! I think your initial suggestion is the most compelling: add a field to the Application spec to configure whether we fetch specific commit SHAs or just do a revisionless fetch.

Good to know!

@crenshaw-dev
Copy link
Collaborator

I was thinking about a weekly job during weekend or daily job over night, i.e. during idle period. It is not a time critical job anyway.

I guess there could be a /gc API endpoint, so the user could configure a CronJob that matches their needs.

Are you facing significant enough performance issues to justify writing a PR with the new Application config field? Unless it's become a problem for a large number of users, I don't know if I'll be able to dedicate dev time to it any time soon.

@yujunz
Copy link
Contributor Author

yujunz commented Apr 2, 2022

Not yet, we are still on the previous release that fetches revision then fallback to default. So not an urgent issue for now.

@MichaelSp
Copy link

MichaelSp commented Jun 7, 2022

For us this is also an issue. And since we have to run v2.3.4 because of the security fixes, there is no way back to pre 2.3.3 releases.
In our use-case we need to fetch a specific SHA from GitHub.

From the proposed options 2 and 3 would work for us (2. gc after every N fetches, 3. gc every M seconds). FMPOV the default setup should work for regular cases. If you have a very specific demand when to run GC on demand, the /gc endpoint sounds like a good customisable solution.

@crenshaw-dev
Copy link
Collaborator

Would anyone be up for contributing a PR? I'd be happy to review.

@MichaelSp
Copy link

Can we please revert this performance improvement, because for us now it's not working at all.

@crenshaw-dev
Copy link
Collaborator

crenshaw-dev commented Jun 23, 2022

@MichaelSp unfortunately I don't think reversion is an option. The change wasn't only a performance improvement, it resolved issues where folks' disks were filling up and crashing the repo-server. At Intuit we had to downgrade an Argo CD instance until this patch was released, because it was causing issues for an internal customer.

for us now it's not working at all

My understanding from the conversation above is that the change is only an issue for high-latency (typically cross-origin) fetches, because the second call introduces more round trips. I'm not sure what you mean when you say that "it's not working at all." Are fetches completely broken, or is the additional latency causing some issue?

@MichaelSp
Copy link

MichaelSp commented Jun 23, 2022

Some repos we fetch by branch and some we fetch with a specific SHA. ALL SHA fetches do no longer work.

Example by branch: ✅

project: service
source:
  repoURL: 'https://github/org/service'
  path: service/dev
  targetRevision: main

Example by SHA: ❌

project: operator
source:
  repoURL: 'https://github/org/service'
  path: operator/one
  targetRevision: 8b5b0099a0298f3e95320b78cd4bd6e54af43f85

Error message

ComparisonError: rpc error: code = Internal desc = Failed to checkout 8b5b0099a0298f3e95320b78cd4bd6e54af43f85: `git checkout --force 8b5b0099a0298f3e95320b78cd4bd6e54af43f85` failed exit status 128: fatal: reference is not a tree: 8b5b0099a0298f3e95320b78cd4bd6e54af43f85

but cloning the repo locally it is very well possible to git checkout --force 8b5b0099a0298f3e95320b78cd4bd6e54af43f85.

Conclusion: The error message is correct, there is no ref like that in the tree, because it has not been fetched.

@crenshaw-dev
Copy link
Collaborator

@MichaelSp can you confirm that this worked before #8897 ?

The exact same actions should be taken, just in a different order.

Before (abbreviated):

	err = gitClient.Fetch(revision)
	if err != nil {
		err = gitClient.Fetch("")
		err = gitClient.Checkout(revision, submoduleEnabled)
	}

After (abbreviated):

	err = gitClient.Fetch("")
	err = gitClient.Checkout(revision, submoduleEnabled)
	if err != nil {
		// Looks like this is where it's failing for you. But this would have been the _first_ thing called before.
		err = gitClient.Fetch(revision)
	}

@MichaelSp
Copy link

I can't run particular builds, but only releases:

v2.4.2+c6d0c8b
v2.3.4+ac8b7df
v2.3.3+07ac038 ❌ (introduces #8845 via PR-#8897)
v2.3.2+ecc2af9 ❌ 😱
v2.3.1+b65c169 ❌ 😡

I guess I have some debugging to do. Thanks @crenshaw-dev for putting my on the right track. 😢 I'll come back with my findings.

@MichaelSp
Copy link

Apologies 🤦 I was running an outdated repo-server image regardless. It works! ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants