Fetch and clone support (bare) #450

Byron · 2022-07-01T03:23:49Z

We want shallow clones and this issue tracks what needs to be done to get there.

Prerequisite tasks for bare clones

Follow-ups of `ditch naive implementation`

Most of these are optional, but represent opportunities to make gix better, so shouldn't be lost.

nicer API for gix-config overrides #883
64bit dates #892
traverse-with-commitgraph #887 ~~see if commit_graph() can return our own type connected to Repo, or if the graph can be made to be more convenient to use with gix::Id~~ - not really, but getting traversal with commitgraph support would be great. Probably it can simply be retro-fitted to the existing traversal. But then again, it would speed up generating ids, but most people using that kind of traversal would just want to access commits plainly, which forces loading them anyway. So it's probably OK to keep it as is. - retro-fitted commit-graph support, because it will be useful to some
visualize commit-graphs as SVG #893
gix corpus MVP #897 (initial version with tracing)
gix corpus with a little more to do

Additional tasks

These are for correctness, but don't block cargo integration as no cargo tests depend on them.

allow downgrading connections like git does, which should be no problem. Maybe find a way to let the user enforce protocol versions; let's see how git does it.
make it possible to not send streaming posts - streaming is only needed for posting packs, and some git servers can't handle the 'chunked' encoding that results from it. Git itself uses content-length as the buffer is pre-created in memory.
additional HTTP configuration as per cargo configuration
correctly re-obtain credential helper configuration for each URL (but don't rewrite, it's Remote's only)
make pack tempfiles appear like they do in git to help with cleanup in case of SIGKILL.
ability to turn off 'is currently checked out' sanity check to emulate git fetch --update-head-ok. Cargo passes it to the CLI and maybe it's something we will need too just to make its updates work.
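
The streaming-post item above comes down to how an HTTP/1.1 request body is framed. The following is a minimal sketch using only the standard library, not gitoxide's transport code; the function names `frame_chunked` and `frame_with_content_length` are made up for illustration. It contrasts chunked transfer encoding (no total size needed up front) with a fully buffered body announced via `Content-Length`.

```rust
/// Frame a request body using HTTP/1.1 chunked transfer encoding:
/// each chunk is `<size in hex>\r\n<data>\r\n`, and the body is
/// terminated by a zero-sized chunk, `0\r\n\r\n`.
/// (Hypothetical helper, for illustration only.)
pub fn frame_chunked(chunks: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in chunks {
        if chunk.is_empty() {
            // A zero-sized chunk would signal the end of the body.
            continue;
        }
        out.extend_from_slice(format!("{:x}\r\n", chunk.len()).as_bytes());
        out.extend_from_slice(chunk);
        out.extend_from_slice(b"\r\n");
    }
    out.extend_from_slice(b"0\r\n\r\n");
    out
}

/// Buffer the whole body in memory first so that its size can be
/// announced up front. Returns the header line and the unframed body.
/// (Hypothetical helper, for illustration only.)
pub fn frame_with_content_length(chunks: &[&[u8]]) -> (String, Vec<u8>) {
    let body: Vec<u8> = chunks.concat();
    (format!("Content-Length: {}\r\n", body.len()), body)
}
```

The second variant costs memory proportional to the body size, which is the trade-off the item describes: it is acceptable for small negotiation requests, while streaming matters mostly when posting packs.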

Tasks for proper transport configuration

try to implement complex http.<url>.* based option overrides

Tasks for shallow cloning

Research needed, but the libgit2 issue might be helpful for more hints.

Research

a nice overview document
packs are forced non-thin when .git/shallow is present (containing the commits that are the shallow boundary, present, but without parents)
shallow repositories can be cloned from and remotes send that information along, making the clone shallow, too.
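
To make the role of `.git/shallow` more concrete: the file lists the boundary commits, one hex object id per line. Below is a minimal sketch of reading it, assuming that plain format; `parse_shallow` and `is_shallow_boundary` are hypothetical names and not part of gitoxide's API.

```rust
/// Parse the contents of a `.git/shallow` file, assuming one hex object
/// id per line. Blank lines are skipped.
/// (Hypothetical helper, for illustration only.)
pub fn parse_shallow(content: &str) -> Result<Vec<String>, String> {
    let mut boundary = Vec::new();
    for (idx, line) in content.lines().enumerate() {
        let id = line.trim();
        if id.is_empty() {
            continue;
        }
        let is_hex = id.bytes().all(|b| b.is_ascii_hexdigit());
        // SHA-1 ids have 40 hex characters, SHA-256 ids have 64.
        if !is_hex || (id.len() != 40 && id.len() != 64) {
            return Err(format!("line {}: not an object id: {:?}", idx + 1, id));
        }
        boundary.push(id.to_ascii_lowercase());
    }
    Ok(boundary)
}

/// A commit on the boundary is present locally, but its parents are not,
/// so history traversal has to stop there.
pub fn is_shallow_boundary(boundary: &[String], commit_id: &str) -> bool {
    boundary.iter().any(|id| id.eq_ignore_ascii_case(commit_id))
}
```

A traversal would consult `is_shallow_boundary` before following parent links, treating boundary commits as if they had no parents.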

Watch out

Much of this work is happening in git-repository, which is tracked in git-repository/gix towards 1.0 #470.
subsequent fetches must not accidentally change the depth of the repository, but only fetch what changed in between. See point 2 in this comment. Note that I believe that pathological CPU usage in shallow clones on the server has been fixed by now.
Ed Page states that according to GitHub employees, shallow clones are only expensive if depth > 1 or when converting them back to having full history.


chazer · 2022-07-24T04:10:56Z

I recently ran into problems cloning a large repository over an extremely slow data link. After a certain timeout, the server (or an intermediate proxy) terminated the connection.

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all the objects, even those that were made on the lower commits). Git gets an error and doesn't unpack the truncated response. It has to be unpacked manually, and the list of 'have' directives in the protocol request didn't help me.
I had to learn the low-level protocol and recursively fetch each tree object (one at a time, until disconnect) and then the missing blob objects (in batches).

Please, when implementing this feature, optimize the algorithm so that already transmitted data is not thrown away when the connection breaks.

(P.S. this shallow clone took me 24 GB over 1 week.)

Byron · 2022-07-25T01:26:28Z

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all the objects, even those that were made on the lower commits).

Did you try the `--depth 1` option? With that, git would prepare only the objects that are relevant to the commit at the requested reference. In conjunction with the `--filter` option, this allows splitting a clone into first receiving only trees and then filling in the blobs in a separate step. That way it's even possible to obtain the entire history as commits only, plus trees and blobs for just the most recent commit.
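
As a rough illustration of the options mentioned here, the sketch below assembles the argument list for a `git clone` invocation. `clone_args` is a hypothetical helper and the URL in the usage note is a placeholder; `--depth` and `--filter=<spec>` are real `git clone` options, with `blob:none` (omit blobs) and `tree:0` (omit trees and blobs) being standard filter specs. The server has to support partial clone for filters to take effect.

```rust
/// Build the argument list for a `git clone` invocation.
/// `depth` maps to `--depth <n>`, `filter` to `--filter=<spec>`.
/// (Hypothetical helper, for illustration only.)
pub fn clone_args(url: &str, dir: &str, depth: Option<u32>, filter: Option<&str>) -> Vec<String> {
    let mut args = vec!["clone".to_string()];
    if let Some(depth) = depth {
        args.push("--depth".to_string());
        args.push(depth.to_string());
    }
    if let Some(filter) = filter {
        args.push(format!("--filter={}", filter));
    }
    args.push(url.to_string());
    args.push(dir.to_string());
    args
}
```

The result can be handed to `std::process::Command::new("git").args(...)`. For example, `clone_args("https://example.com/repo.git", "repo", Some(1), Some("blob:none"))` would request only the tip commit and its trees, leaving blobs to be fetched later.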

Git gets an error and doesn't unpack the truncated response.

That's true - the reason might be that it is unable to validate the received objects as the trailing hash of the received pack would be missing. However, I also have been burned by this which is why there is a special restore mode when receiving a pack. It salvages the received objects at least.

However, the way the git protocol works, the server may still send all the objects the next time the reference is requested, as the algorithm's granularity is only per commit. With partial packs, it's entirely unclear which objects are present and which aren't unless they are all traversed and verified. So, in order to actually benefit from keeping a partial pack, one would have to see which commits are completely available (while handling `--filter` correctly, I presume), to then be able to avoid having these complete commits resent. Of course there is no guarantee that any commit is actually complete due to the way objects are sorted into a pack, which is to optimize compression as opposed to 'distance to the owning commit'.
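
The check for "completely available" commits could look roughly like the sketch below. It uses a toy in-memory object model with string ids, which is an assumption made purely for illustration and not how gitoxide or git represent objects; `Object` and `is_commit_complete` are hypothetical names. Filters are not handled.

```rust
use std::collections::{HashMap, HashSet};

/// A toy object model; ids are plain strings for illustration.
pub enum Object {
    Commit { tree: String },
    Tree { entries: Vec<String> },
    Blob,
}

/// Returns true if the commit, its tree and everything reachable from
/// that tree is present in `store`. Parent commits are deliberately not
/// followed, as each commit is judged on its own.
pub fn is_commit_complete(commit_id: &str, store: &HashMap<String, Object>) -> bool {
    let mut todo = vec![commit_id.to_string()];
    let mut seen = HashSet::new();
    while let Some(id) = todo.pop() {
        if !seen.insert(id.clone()) {
            continue; // already checked, trees may share entries
        }
        match store.get(&id) {
            None => return false, // a single missing object makes the commit incomplete
            Some(Object::Commit { tree }) => todo.push(tree.clone()),
            Some(Object::Tree { entries }) => todo.extend(entries.iter().cloned()),
            Some(Object::Blob) => {}
        }
    }
    true
}
```

Only commits for which this returns true could safely be advertised to the server as already present; everything else in the salvaged pack would still need to be requested again.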

That said, there is a bunch of things one could implement to help this case if the client and server would implement some custom extensions.

I had to learn the low-level protocol and recursively fetch each tree object (one at a time, until disconnect) and then the missing blob objects (in batches).

Awesome, I love it! I would have given up for sure!

Please, when implementing this feature, optimize the algorithm so that already transmitted data is not thrown away when the connection breaks.

It would certainly be interesting to learn more about the algorithm you used to split up big clones into many smaller ones, as it wouldn't require a server and client extension to the protocol. Such a client-side-only algorithm could possibly be implemented in gitoxide then, and I am open to that, too.


It's the simplest possible one, but it shows the test framework is up to the task now so it can be test-driven. We should be able to construct a test for each possible instruction and eventually pass all tests, including the baseline ones.


…e::Name`. (#450) That way it's made clear the remote can also be a URL, while rejecting ill-formed UTF-8. The latter isn't valid for remote names anyway as these only support a very limited character set. Note that this error is currently degenerated, making it appear as if the remote name doesn't exist if ill-formed UTF-8 is found in what appears to be a symbolic ref.


However, it's not yet refreshed in the repository we create, so that needs fixing. Implementing `repo.config()` would be too much effort for now, so let's continue forcing it in another way.

gitoxide integration: fetch

This PR is the first step towards resolving #1171.

In order to get there, we integrate `gitoxide` into `cargo` in such a way that one can control its usage in nightly via `-Zgitoxide` or `-Zgitoxide=<feature>[,featureN]`.

Planned features are:

* **fetch** - all fetches are done with `gitoxide` (this PR)
* **shallow_index** - the crates index will be a shallow clone (_planned_)
* **shallow_deps** - git dependencies will be a shallow clone (_planned_)
* **checkout** - plain checkouts with `gitoxide` (_planned_)

The above list is a prediction and might change as we understand the requirements better.

### Testing and Transitioning

By default, everything stays as is. However, relevant tests can be re-run with `gitoxide` using

```
RUSTFLAGS='--cfg always_test_gitoxide' cargo test git
```

There are about 200 tests with 'git' in their name and I plan to enable them one by one. That way the costs for CI stay manageable (my first measurement with one test was 2min 30s), while allowing to take one step at a time.

Custom tests shall be added once we realize that more coverage is needed.

That way we should be able to maintain running `git2` and `gitoxide` side by side until we are willing to switch over to `gitoxide` entirely on stable cargo. Then turning on `git2` might be a feature toggle for a while until we finally remove it from the codebase.

_Please see the above paragraph as invitation for discussion, it's merely a basis to explore from and improve upon._

### Tasks

* [x] add feature toggle
* [x] setup test system with one currently successful test
* [x] implement fetch with `gitoxide` (MVP)
* [x] fetch progress
* [x] detect spurious errors
* [x] enable as many git tests as possible (and ignore what's not possible)
* [x] fix all git-related test failures (except for 1: built-in upload-pack, skipped for now)
* [x] validate that all HTTP handle options that come from `cargo` specific values are passed to `gitoxide`
* [x] a test to validate `git2` code can handle crates-index clones created with `gitoxide` and vice-versa
* [x] remove patches that enabled `gitoxide` enabled testing - it's not used anymore
* [x] ~~remove all TODOs and use crates-index version of `git-repository`~~ The remaining 2 TODOs are more like questions for the reviewer.
* [x] run all tests with gitoxide on the fastest platform as another parallel task
* [x] switch to released version
* [x] [Tasks from first review round](#11448 (comment))
* [x] create a new `gitoxide` release and refer to the latest version from crates.io (instead of git-dependency)
* [x] [address 2nd review round comments](#11448 (comment))

### Postponed Tasks

I suggest to go breadth-first and implement the most valuable features first, and then aim for a broad replacement of `git2`. What's left is details and improved compatibility with the `git2` implementation that will be required once `gitoxide` should become the default implementation on stable to complete the transition.

* **built-in support for serving the `file` protocol** (i.e. without using `git`). Simple cases like `clone` can probably be supported quickly, `fetch` needs more work though due to negotiation.
* SSH name fallbacks via a native (probably ~~libssh~~ (avoid LGPL) `libssh2` based) transport. Look at [this issue](#2399) for some history.
* additional tasks from [this tracking issue](Byron/gitoxide#450 (comment))

### Proposed Workflow

I am now using [stacked git](https://stacked-git.github.io) to keep commits meaningful during development. This will also mean that before reviews I will force-push a lot as changes will be bucketed into their respective commits.

Once review officially begins I will stop force-pushing and create small commits to address review comments. That way it should be easier to understand how things change over time.

Those review-comments can certainly be squashed into one commit before merging.

_Please let me know if this is feasible or if there are other ways of working you prefer._

### Development notes

* unrelated: [this line](https://github.com/rust-lang/cargo/blob/9827412fee4f5a88ac85e013edd954b2b63f399b/src/cargo/ops/registry.rs#L620) refers to an issue that has since been resolved in `curl`.
* Additional tasks related to a correct fetch implementation are collected in this [tracking issue](Byron/gitoxide#450). **These affect how well the HTTP transport can be configured, needs work**
* _authentication_ [is quite complex](https://github.com/rust-lang/cargo/blob/37cad5bd7f7dcd2f6d3e45312a99a9d3eec1e2a0/src/cargo/sources/git/utils.rs#L490) and centred around making SSH connections work. This feature is currently the weakest in `gitoxide` as it simply uses `ssh` (the program) and calls it a day. No authentication flows are supported there yet and the goal would be to match `git` there at least (which it might already do by just calling `ssh`). Needs investigation. Once on par with `git` I think `cargo` can restart the whole fetch operation to try different user names like before.
    - the built-in `ssh`-program based transport can now understand permission-denied errors, but the capability isn't used after all since a builtin ssh transport is required.
* It would be possible to implement `git::Progress` and just ignore most of the calls, but that's known to be too slow as the implementation assumes a `Progress::inc()` call is as fast as an atomic increment and makes no attempt to reduce its calls to it.
* learning about [a way to get custom traits in `thiserror`](dtolnay/thiserror#212) could make spurious error checks nicer and less error prone during maintenance. It's not a problem though.
* I am using `RUSTFLAGS=--cfg` to influence the entire build and unit-tests as environment variables didn't get through to the binary built and run for tests.

### Questions

* The way `gitoxide` is configured the user has the opportunity to override these values using more specific git options, for example using url specific http settings. This looks like a feature to me, but if it's not, `gitoxide` needs to provide a way to disable applying these overrides. Please let me know what's desired here - my preference is to allow overrides.
* `gitoxide` currently opens repositories similar to how `git` does which respects git specific environment variables. This might be a deviation from how it was before and can be turned off. My preference is to see it as a feature.

### Prerequisite PRs

* #11602

ofek · 2024-01-20T15:58:18Z

Is this actually complete despite the unfinished tasks in the OP?

Byron · 2024-01-20T16:44:57Z

It works for all intents and purposes but isn't perfect with regard to some details. These are still tracked here; maybe they can be moved into a follow-up issue.

Byron added the C-tracking-issue label Jul 1, 2022

Byron added this to In progress in Collaboration Board Jul 1, 2022

Byron mentioned this issue Jul 1, 2022

non-bare clone (with worktrees) #303

Open

30 tasks

Byron added a commit that referenced this issue Aug 5, 2022

empty git-refspec crate for name reservation prior to implementation (#450)

871a3c0

Byron added a commit that referenced this issue Aug 5, 2022

prepare git-refspec changelog prior to release (#450)

3383408

Byron added a commit that referenced this issue Aug 5, 2022

all baseline test cases from git's test-suite (#450)

362bd46

Byron added a commit that referenced this issue Aug 5, 2022

sketch data structure that should do the trick (#450)

5c823dc

That way the most complex thing will be the validation along with the matching.

Byron added a commit that referenced this issue Aug 5, 2022

frame for basic parsing (#450)

b9a4bdc

Byron added a commit that referenced this issue Aug 5, 2022

first few bits of error handling in parser (#450)

9c5fed2

Byron added a commit that referenced this issue Aug 6, 2022

run the baseline test and gather some information (#450)

5e4ee9b

Byron added a commit that referenced this issue Aug 6, 2022

sketch all possible instructions of fetch and push refspecs (#450)

0ba1b73

Byron added a commit that referenced this issue Aug 6, 2022

refactor (#450)

6713793

Byron added a commit that referenced this issue Aug 6, 2022

sort out how expectations can be expressed in test suite (#450)

3f264af

Some possible states are still missing though, like deletion in pushes.

Byron added a commit that referenced this issue Aug 6, 2022

add include directive (#450)

701d46f

Byron added a commit that referenced this issue Aug 6, 2022

support for deletion (#450)

966a9e9

Even though for now everything is without validation

Byron added a commit that referenced this issue Aug 6, 2022

handle colon and empty on the right side (#450)

7afebb7

Byron added a commit that referenced this issue Aug 6, 2022

basic validation and detection of patterns (#450)

e4227d6

Byron added a commit that referenced this issue Aug 6, 2022

Better handling of special cases (#450)

c99f575

Byron added a commit that referenced this issue Aug 6, 2022

tests for handling exclusions (#450)

c4499ce

Byron added a commit that referenced this issue Aug 6, 2022

tests causing all instrucitons (#450)

c23a21d

Byron added a commit that referenced this issue Aug 6, 2022

refactor (#450)

e8c072e

Byron added a commit that referenced this issue Aug 6, 2022

handle ref-name validation mostly correctly (#450)

d7c2789

Byron added a commit that referenced this issue Aug 6, 2022

and the entire test-suite passes (#450)

3fa52f8

Byron added a commit that referenced this issue Aug 7, 2022

support for @ shortcut. (#450)

32d98e9

Byron added a commit that referenced this issue Aug 7, 2022

negative must not be empty (#450)

79e0eaf

Byron added a commit that referenced this issue Aug 7, 2022

don't allow object hashes in excludes (#450)

b889953

Byron added a commit that referenced this issue Aug 7, 2022

disallow excludes in push mode (#450)

9c280b2

It's not documented in `git-push`, even though git parses it fine for some reason.

Byron added a commit that referenced this issue Nov 1, 2022

refactor (#450)

7bd5263

Byron added a commit that referenced this issue Nov 1, 2022

keep track of several branch related keys (#450)

443a75a

Byron added a commit that referenced this issue Nov 1, 2022

update docs (#450)

c788b51

Byron added a commit that referenced this issue Nov 1, 2022

failing test for us setting up remote information after cloning (#450)

07efbce

Byron added a commit that referenced this issue Nov 1, 2022

fix: support for proper identification of '.' remote paths in `reference::remote::Name` (#450)

b219033

Byron added a commit that referenced this issue Nov 1, 2022

feat!: Support for unborn ls-refs capability if server supports it. (#450)

02e37f0

We can also parse it, adding yet another variant to `fetch::Refs`.

Byron added a commit that referenced this issue Nov 1, 2022

prepare test for handling the 'unborn' lsrefs extension (#450)

547e450

Byron added a commit that referenced this issue Nov 1, 2022

adjust to changes in git-protocol (#450)

ffefe88

Byron added a commit that referenced this issue Nov 2, 2022

change!: fetch::Ref::unpack() returns Option<oid>. (#450)

cd867ad

That way the caller has to be aware of the possibility of an unborn branch (probably the only unborn branch) on the remote.
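
The effect of returning an `Option` can be shown with a simplified stand-in. The `AdvertisedRef` type below is hypothetical and much smaller than the real type in git-protocol; ids are plain strings for illustration. The point is only that an unborn ref has a name but no object, so the caller is forced by the type to handle that case.

```rust
/// A hypothetical, simplified stand-in for a ref advertised by a server.
/// This is NOT the actual git-protocol type.
pub enum AdvertisedRef {
    /// A ref pointing directly at an object.
    Direct { name: String, object: String },
    /// A symbolic ref whose target branch has no commit yet.
    Unborn { name: String, target: String },
}

impl AdvertisedRef {
    /// Return the ref name and, if there is one, the object it points to.
    pub fn unpack(&self) -> (&str, Option<&str>) {
        match self {
            AdvertisedRef::Direct { name, object } => (name.as_str(), Some(object.as_str())),
            AdvertisedRef::Unborn { name, .. } => (name.as_str(), None),
        }
    }
}
```

A caller would typically `match` on the second tuple element and, on `None`, set up the local `HEAD` to point at the unborn branch name instead of trying to look up a commit.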

Byron added a commit that referenced this issue Nov 2, 2022

adapt to changes in git-protocol (#450)

179ccd7

Byron added a commit that referenced this issue Nov 2, 2022

adapt to changes in git-protocol (#450)

64db0b2

Byron added a commit that referenced this issue Nov 2, 2022

Support unborn remotes and pick up their default branch name. (#450)

619fd61

Byron added a commit that referenced this issue Nov 2, 2022

Don't deviate by creating strange reflogs (with null-source & null-destination) (#450)

f1b5570

That's exactly what git does, so it's probably the right thing to do if in doubt.

Byron added a commit that referenced this issue Nov 2, 2022

don't degenerate information about the unborn fetch ref's path. (#450)

42c977f

Previously we assumed this could only happen for `HEAD`, but in fact dangling symrefs are possible and they might end up in the server response that way.

Byron added a commit that referenced this issue Nov 2, 2022

adapt to changes in git-protocol (#450)

d61eb2c

Byron added a commit that referenced this issue Nov 2, 2022

adapt to changes in git-protocol (#450)

f409e4a

Byron added a commit that referenced this issue Nov 2, 2022

refactor (#450)

4dd67db

Don't use `static` unless it's really needed

Byron added a commit that referenced this issue Nov 2, 2022

fix build (#450)

f5155e0

Byron added a commit that referenced this issue Nov 2, 2022

make last test work to allow us to clone properly (#450)

3890f1a

This was referenced Dec 2, 2022

[integration] Shallow clones for cargo #449

Open

gitoxide integration: fetch rust-lang/cargo#11448

Merged

Byron mentioned this issue Dec 15, 2022

auto-tag support #651

Merged

9 tasks

Byron mentioned this issue Jan 4, 2023

Shallow clone #686

Closed

Byron mentioned this issue Jun 5, 2023

integrate gix-negotiate #861

Merged

13 tasks

NobodyXu mentioned this issue Jun 23, 2023

feat: Implement --branch, --tag and --rev cargo-bins/cargo-binstall#1165

Open
