Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent git commands on the same repo fail #211

Open
mattwynne opened this issue Sep 10, 2020 · 13 comments · Fixed by #578
Open

Concurrent git commands on the same repo fail #211

mattwynne opened this issue Sep 10, 2020 · 13 comments · Fixed by #578
Labels
debt Improvement of software/operational architecture (e.g. infrastructure, automation, refactoring) pinned

Comments

@mattwynne
Copy link
Contributor

Related to #210, if you run a manual fetch immediately after a connect, you'll see a failure because the background fetch triggered by the domain rule seems to be clashing with the manual fetch.

The actual error I see is this:

[server]   <-- POST /repos/smoke-test-423955FC-68A3-4A72-B4ED-B9FCBEED12D8
[server] (node:7496) UnhandledPromiseRejectionWarning: Error: Git command `config gc.pruneExpire never` returned exit code 255:
[server] error: could not lock config file config: File exists
[server]
[server]     at GitDirectory.exec (/Users/matt/projects/git-en-boite/packages/local-git/src/git_directory.ts:27:13)
[server]     at processTicksAndRejections (internal/process/task_queues.js:97:5)
[server]     at exports.handleInit (/Users/matt/projects/git-en-boite/packages/local-git/src/handlers/handleInit.ts:9:3)
[server]     at RepoFactory.open (/Users/matt/projects/git-en-boite/packages/local-git/src/repo_factory.ts:49:5)
[server]     at Function.openGitRepo (/Users/matt/projects/git-en-boite/packages/local-git/src/dugite_git_repo.ts:17:22)
[server]     at BackgroundGitRepos.openGitRepo (/Users/matt/projects/git-en-boite/packages/local-git/src/background_git_repos.ts:64:21)
[server]     at DiskRepoIndex.find (/Users/matt/projects/git-en-boite/packages/repo-index/src/disk_repo_index.ts:20:29)
[server]     at LaBoîte.fetchFromRemote (/Users/matt/projects/git-en-boite/packages/app/src/la_boîte.ts:43:18)

So this looks like at the moment it's just caused the config commands we run in handleInit which we're blindly running each time you open a repo folder. We can probably avoid this, but there might be other problems with the other git commands when we get past this. Needs more investigation.

@mattwynne mattwynne changed the title Concurrent fetch commands fail Concurrent git commands fail Sep 11, 2020
@mattwynne mattwynne changed the title Concurrent git commands fail Concurrent git commands on the same repo fail Sep 11, 2020
@mattwynne mattwynne changed the title Concurrent git commands on the same repo fail Concurrent git init commands on the same repo fail Sep 28, 2020
mattwynne added a commit that referenced this issue Sep 28, 2020
Use it in the appropriate places in the tests
@mattwynne mattwynne changed the title Concurrent git init commands on the same repo fail Concurrent git commands on the same repo fail Sep 28, 2020
@stale
Copy link

stale bot commented Oct 28, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙏.

@stale stale bot added the wontfix label Oct 28, 2020
@stale stale bot closed this as completed Nov 5, 2020
@mattwynne mattwynne added the debt Improvement of software/operational architecture (e.g. infrastructure, automation, refactoring) label Nov 11, 2020
@mattwynne mattwynne reopened this Feb 25, 2021
@mattwynne
Copy link
Contributor Author

We saw this again today when testing #545

@mattwynne
Copy link
Contributor Author

mattwynne commented Mar 4, 2021

OK I've caught this today with a new acceptance test. You can reproduce it with:

show_logs=1 yarn acceptance test:wip

Here's the error we see:

fatal: Unable to create '/private/var/folders/l9/95tdbmtd3_s0jh3m9hjt6sk80000gn/T/tmp-84916-93ii628SIl64/7265706f2d38323034333935352d383137302d343661302d613234302d316536353434313064626237/shallow.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

at GitDirectory.exec (/Users/matt/projects/git-en-boite/packages/local-clones/src/git_directory.ts:24:30)

@mattwynne
Copy link
Contributor Author

It's interesting that it's using shallow.lock in this instance. Most of the stuff I can find about git's lockfiles talks about index.lock though even that I can't find much documentation about.

mattwynne added a commit that referenced this issue Mar 4, 2021
Relates-to #211

It's debatable whether this deserves to be surfaced in an acceptance
test, but I'm not sure yet which layer is the right place to solve this,
so it makes sense to put it in our full-stack tests in any case. And
it's probably useful to document this capabilty.
@mattwynne
Copy link
Contributor Author

Moved this into a branch: 211-support-concurrent-git-commands so we don't have too many WIP scenarios in the main branch.

@mattwynne
Copy link
Contributor Author

I've resolved this for now by returning a 503 error with a 60-second retry-after header. This is better than a dumb 500 error, but much quicker than trying to handle the concurrency issue properly. I think we should monitor it and see how much of a problem it becomes.

We could also do more proactive exploratory testing on it. I can imagine that there could be other errors coming from git than the ones it currently handles if we try doing several things at the same time, such as making a commit during a fetch, or fetching during a push.

@mattwynne mattwynne reopened this Mar 5, 2021
@mattwynne
Copy link
Contributor Author

mattwynne commented Mar 5, 2021

I've had a think about the options for resolving this "properly". Largely, this comes down to choices about which layer we want to handle this problem at. Is it something for the local-clones git adapter to handle, or should the domain be aware that a repo could be busy? Or should the application itself be responsible for queuing commands for each repo so only one can happen at the same time.

One thing I'm pretty sure of is that we should avoid trying to solve this by bending the bullmq infrastructure to use one queue per repo. We tried this before (see ADR 0011) and it didn't work out well.

My hunch is that the problem could be solved by domain events, once we've implemented #242 so that any node can see all the events.

mattwynne added a commit that referenced this issue Mar 8, 2021
Ref: #211

Co-authored by: Romain Gérard <romaingweb@gmail.com>
@mattwynne
Copy link
Contributor Author

I wonder whether this is another instance of this same problem:

InvalidRepoUrl: warning: unable to access './config': Stale file handle
fatal: unable to access 'config': Stale file handle

    at handleRemoteError (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:9:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async handleValidateRemote (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:12:5)
    at async handleConnect (/app/packages/local-clones/dist/handlers/handleConnect.js:6:5)
    at async Worker.processJob [as processFn] (/app/packages/local-clones/dist/background_worker_local_clones.js:123:28)
    at async Worker.processJob (/app/node_modules/bullmq/dist/classes/worker.js:234:28)
    at async Worker.run (/app/node_modules/bullmq/dist/classes/worker.js:99:34)

@mattwynne
Copy link
Contributor Author

We're doing #572 first, which should cut down a lot of the problems by only doing fetches when absolutely neccesary.

@mattwynne
Copy link
Contributor Author

I have been thinking about this today, inspired by the GitHub Desktop blog post shared by @romaingweb on Slack.

If we want to get an exclusive lock on a Repo before working with it, I think the right place to do that would be in the InventoryOfRepos which is where we get hold of a Repo domain model instance before calling methods to do work through our local clones.

We can change this interface to offer either an ExclusiveRepo or a ConcurrentRepo depending on which method you call, and put the different methods on Repo into either interface, depending on whether we believe the operation in question needs to be done exclusively or not. For calls to the ExclusiveRepo we can drop a lock file on disk until the transaction has completed, and any other calls to ask for an ExclusiveRepo can block until that lock file has been removed.

@stale
Copy link

stale bot commented Apr 17, 2021

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙏.

@stale stale bot added the wontfix Things we've decided not to do label Apr 17, 2021
@stale stale bot closed this as completed Apr 24, 2021
@mattwynne
Copy link
Contributor Author

We could use something like https://www.npmjs.com/package/redis-semaphore to hold a mutex on the repo while working on it.

@mattwynne mattwynne reopened this Apr 26, 2021
@stale stale bot removed the wontfix Things we've decided not to do label Apr 26, 2021
@mattwynne mattwynne added pinned wontfix Things we've decided not to do labels Apr 26, 2021
@stale stale bot removed the wontfix Things we've decided not to do label Apr 26, 2021
@cbohiptest
Copy link
Contributor

The concurrency problems does not happen anymore since only one container is used for workers.
With the deployment on rancher the number of container has been reduced to 1 that prevents parallel call on the same repo.

The problem will need to be resolved when more containers will be needed.

An ADR has been written in 382a2d0.

mattwynne referenced this issue May 5, 2021
Co-authored-by: Julien Biezemans <jb@jbpros.com>
Co-authored-by: cbohiptest <celine.bon@hiptest.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt Improvement of software/operational architecture (e.g. infrastructure, automation, refactoring) pinned
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants