Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch and clone support (bare) #450

Open
23 of 27 tasks
Tracked by #449
Byron opened this issue Jul 1, 2022 · 4 comments
Open
23 of 27 tasks
Tracked by #449

Fetch and clone support (bare) #450

Byron opened this issue Jul 1, 2022 · 4 comments
Labels
C-tracking-issue An issue to track to track the progress of multiple PRs or issues

Comments

@Byron
Copy link
Owner

Byron commented Jul 1, 2022

We want shallow clones and this issue tracks what needs to be done to get there.

Prerequisite tasks for bare clones

Follow-ups of ditch naive implementation

Most of these are optional, but represent opportunities to make gix better, so shouldn't be lost.

  • nicer API for gix-config overrides #883
  • 64bit dates  #892
  • traverse-with-commitgraph #887 see if commit_graph() can return our own type connected to Repo, or if the graph can be made to be more convenient to use with gix::Id - not really, but getting traversal with commitgraph support would be great. Probably it can simply be retro-fitted to the existing traversal. But then again, it would speed up generating ids, but most people using that kind of traversal would just want to access commits plainly, which forces loading them anyway. So it's probably OK to keep it as is. - retro-fitted commit-graph support, because it will be useful to some
  • visualize commit-graphs as SVG #893
  • gix corpus MVP #897 (initial version with tracing)
  • gix corpus with a little more to do

Additional tasks

These are for correctness, but don't block cargo integration as no cargo tests depend on them.

  • allow to downgrade connections like git does, should be no problem. Maybe find a way to let the user enforce protocol versions, let's see how git does it.
  • make it possible to not send streaming posts - that is only needed for posting packs and some git servers can't handle 'chunked' encoding that results from it. Lastly, git itself uses content-length as the buffer is pre-created in memory.
  • additional HTTP configuration as per cargo configuration
  • correctly re-obtain credential helper configuration for each URL (but don't rewrite, it's Remote's only)
  • make pack tempfiles appear like they do in git to help with cleanup in case of SIGKILL.
  • ability to turn off 'is currently checked out' sanity check to emulate git fetch --update-head-ok. Cargo passes it to the CLI and maybe it's something we will need too just to make its updates work.

Tasks for proper transport configuration

  • try to implement complex http.<url>.* based option overrides

Tasks for shallow cloning

Research needed, but the libgit2 issue might be helpful for more hints.

Research

  • a nice overview document
  • packs are forced non-thin when .git/shallow is present (containing the commits that are the shallow boundary, present, but without parents)
  • shallow repositories can be cloned from and remotes send that information along, making the clone shallow, too.

Watch out

  • Much of this work is happening in git-repository, which is tracked in git-repository/gix towards 1.0 #470 .
  • subsequent fetches must not accidentally change the depths of the repository, but only fetch what changed inbetween. See point 2 in this comment. Note that I believe that pathological CPU usage in shallow clones on the server has been fixed by now.
  • Ed Page states that according to GitHub employees, shallow clones are only expensive if depth > 1 or converting it back to having full history.
@Byron Byron added the C-tracking-issue An issue to track to track the progress of multiple PRs or issues label Jul 1, 2022
@Byron Byron added this to In progress in Collaboration Board Jul 1, 2022
@chazer
Copy link

chazer commented Jul 24, 2022

I recently encountered an problems to clone a large repository over a extremely slow data link. After a certain timeout, the server (or intermediate proxy) terminated the connection.

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all the objects, even those that were made on the lower commits). Git gets an error and doesn't unpack the truncated response. Need unpack it manually. And list of 'have'-directives in protocol request didn't help for me.
I had to learn the low protocol and recursively fetch each tree-objects (in single, until disconnect) and then missing blob-objects (in batches).

Please, to implement the feature, optimize the algorithm so that already transmitted data is not thrown out when the connection is broken.

(p.s. this shallow cloning took me 24 GB over 1 week)

@Byron
Copy link
Owner Author

Byron commented Jul 25, 2022

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all the objects, even those that were made on the lower commits).

Did you try the --depth 1 option? With that git would prepare only the objects that are relevant to the commit at the requested reference. This in conjunction with the --filter option allows to split clones into receiving only trees and then filling in the blobs in a separate commit. That way it's even possible to obtain the entire history, commits only, and trees and blobs for the most recent commit.

Git gets an error and doesn't unpack the truncated response.

That's true - the reason might be that it is unable to validate the received objects as the trailing hash of the received pack would be missing. However, I also have been burned by this which is why there is a special restore mode when receiving a pack. It salvages the received objects at least.

However, the way the git protocol works the server still may send all the objects the next time the reference is requested as the algorithm's granularity is only per commit. With partial packs, it' entirely unclear which objects are present and which aren't unless they are all traversed and verified. So, in order to actually have a benefit from keeping a partial pack one would have to see which commits are completely available (while handling --filter correctly, I presume), to then be able to avoid having these complete commits being resent. Of course there is no guarantee that any commit is actually complete due to the way objects are sorted into a pack, which is to optimize compression as opposed to 'distance to the owning commit'.

That said, there is a bunch of things one could implement to help this case if the client and server would implement some custom extensions.

I had to learn the low protocol and recursively fetch each tree-objects (in single, until disconnect) and then missing blob-objects (in batches).

Awesome, I love it! I would have given up for sure!

Please, to implement the feature, optimize the algorithm so that already transmitted data is not thrown out when the connection is broken.

It would certainly be interesting to learn more about the algorithm you used to split up big clones into many smaller ones as it wouldn't require a server and client extension to the protocol. Such client-side only algorithm could possibly be implemented in gitoxide then, and I am open to that, too.

Byron added a commit that referenced this issue Aug 5, 2022
That way the most complex thing will be the validation along with the
matching.
Byron added a commit that referenced this issue Aug 5, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Some possible states are still missing though, like deletion in pushes.
Byron added a commit that referenced this issue Aug 6, 2022
It's the simplest possible one, but it shows the test framework is up to
the task now so it can be test-driven.
We should be able to construct a test for each possible instruction and
eventually pass all tests, including the baseline ones.
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Even though for now everything is without validation
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 6, 2022
Byron added a commit that referenced this issue Aug 7, 2022
Byron added a commit that referenced this issue Aug 7, 2022
Byron added a commit that referenced this issue Aug 7, 2022
It's not documented in `git-push`, even though git parses it fine for
some reason.
Byron added a commit that referenced this issue Nov 1, 2022
Byron added a commit that referenced this issue Nov 1, 2022
Byron added a commit that referenced this issue Nov 1, 2022
…e::Name`. (#450)

That way it's made clear the remote can also be a URL, while rejecting
illformed UTF8. The latter isn't valid for remote names anyway as these
only support a very limited character set.

Note that this error currently is degenerated, making it appear if the
remote name doesn't exists if illformed UTF-8 is found in what appears
to be a symbolic ref.
Byron added a commit that referenced this issue Nov 1, 2022
Byron added a commit that referenced this issue Nov 1, 2022
…#450)

We can also parse it, adding yet another variant to `fetch::Refs`.
Byron added a commit that referenced this issue Nov 1, 2022
Byron added a commit that referenced this issue Nov 2, 2022
That way the caller has to be aware of the possibility of an unborn
branch (probably the only unborn branch) on the remote.
Byron added a commit that referenced this issue Nov 2, 2022
Byron added a commit that referenced this issue Nov 2, 2022
Byron added a commit that referenced this issue Nov 2, 2022
…stination) (#450)

That's exactly what git does, so it's probably the right thing to do if
in doubt.
Byron added a commit that referenced this issue Nov 2, 2022
Previously we assumed this could only happen for `HEAD`, but in fact
dangling symrefs are possible and they might end up in the server
response that way.
Byron added a commit that referenced this issue Nov 2, 2022
Byron added a commit that referenced this issue Nov 2, 2022
Byron added a commit that referenced this issue Nov 2, 2022
Don't use `static` unless it's really needed
Byron added a commit that referenced this issue Nov 2, 2022
However, it's not yet refreshed in the repository we create, so
that needs fixing.

Implementing `repo.config()` would be too much effort for now, so
let's continue forcing it in another way.
Byron added a commit that referenced this issue Nov 2, 2022
@Byron Byron mentioned this issue Dec 15, 2022
9 tasks
@Byron Byron mentioned this issue Jan 4, 2023
bors added a commit to rust-lang/cargo that referenced this issue Mar 2, 2023
gitoxide integration: fetch

This PR is the first step towards resolving #1171.

In order to get there, we integrate `gitoxide` into `cargo` in such a way that one can control its usage in nightly via `-Zgitoxide` or `Zgitoxide=<feature>[,featureN]`.

Planned features are:

* **fetch** - all fetches are done with `gitxide` (this PR)
* **shallow_index** - the crates index will be a shallow clone (_planned_)
* **shallow_deps** - git dependencies will be a shallow clone (_planned_)
* **checkout** - plain checkouts with `gitoxide` (_planned_)

The above list is a prediction and might change as we understand the requirements better.

### Testing and Transitioning

By default, everything stays as is. However, relevant tests can be re-runwith `gitoxide` using

```
RUSTFLAGS='--cfg always_test_gitoxide' cargo test git
```

There are about 200 tests with 'git' in their name and I plan to enable them one by one. That way the costs for CI stay managable (my first measurement with one test was 2min 30s), while allowing to take one step at a time.

Custom tests shall be added once we realize that more coverage is needed.

That way we should be able to maintain running `git2` and `gitoxide` side by side until we are willing to switch over to `gitoxide` entirely on stable cargo. Then turning on `git2` might be a feature toggle for a while until we finally remove it from the codebase.

_Please see the above paragraph as invitation for discussion, it's merely a basis to explore from and improve upon._

### Tasks

* [x] add feature toggle
* [x] setup test system with one currently successful test
* [x] implement fetch with `gitoxide` (MVP)
* [x] fetch progress
* [x] detect spurious errors
* [x] enable as many git tests as possible (and ignore what's not possible)
* [x] fix all git-related test failures (except for 1: built-in upload-pack, skipped for now)
* [x] validate that all HTTP handle options that come from `cargo` specific values are passed to `gitoxide`
* [x] a test to validate `git2` code can handle crates-index clones created with `gitoxide` and vice-versa
* [x] remove patches that enabled `gitoxide` enabled testing - it's not used anymore
* [x] ~~remove all TODOs and use crates-index version of `git-repository`~~ The remaining 2 TODO's are more like questions for the reviewer.
* [x] run all tests with gitoxide on the fastest platform as another parallel task
* [x] switch to released version
* [x] [Tasks from first review round](#11448 (comment))
* [x] create a new `gitoxide` release and refer to the latest version from crates.io (instead of git-dependency)
* [x] [address 2nd review round comments](#11448 (comment))

### Postponed Tasks

I suggest to go breadth-first and implement the most valuable features first, and then aim for a broad replacement of `git2`. What's left is details and improved compatibility with the `git2` implementation that will be required once `gitoxide` should become the default implementation on stable to complete the transition.

* **built-in support for serving the `file` protocol** (i.e. without using `git`). Simple cases like `clone` can probably be supported quickly, `fetch` needs more work though due to negotiation.
* SSH name fallbacks via a native (probably ~~libssh~~ (avoid LGPL) `libssh2` based) transport. Look at [this issue](#2399) for some history.
* additional tasks from [this tracking issue](Byron/gitoxide#450 (comment))

### Proposed Workflow

I am now using [stacked git](https://stacked-git.github.io) to keep commits meaningful during development. This will also mean that before reviews I will force-push a lot as changes will be bucketed into their respective commits.

Once review officially begins I will stop force-pushing and create small commits to address review comments. That way it should be easier to understand how things change over time.

Those review-comments can certainly be squashed into one commit before merging.

_Please let me know if this is feasible or if there are other ways of working you prefer._

### Development notes

* unrelated: [this line](https://github.com/rust-lang/cargo/blob/9827412fee4f5a88ac85e013edd954b2b63f399b/src/cargo/ops/registry.rs#L620) refers to an issue that has since been resolved in `curl`.
* Additional tasks related to a correct fetch implementation are collected in this [tracking issue](Byron/gitoxide#450). **These affect how well the HTTP transport can be configured, needs work**
* _authentication_ [is quite complex](https://github.com/rust-lang/cargo/blob/37cad5bd7f7dcd2f6d3e45312a99a9d3eec1e2a0/src/cargo/sources/git/utils.rs#L490) and centred around making SSH connections work. This feature is currently the weakest in `gitoxide` as it simply uses `ssh` (the program) and calls it a day.  No authentication flows are supported there yet and the goal would be to match `git` there at least (which it might already do by just calling `ssh`). Needs investigation. Once en-par with `git` I think `cargo` can restart the whole fetch operation to try different user names like before.
   - the built-in `ssh`-program based transport can now understand permission-denied errors, but the capability isn't used after all since a builtin ssh transport is required.
* It would be possible to implement `git::Progress` and just ignore most of the calls, but that's known to be too slow as the implementation assumes a `Progress::inc()` call is as fast as an atomic increment and makes no attempt to reduce its calls to it.
* learning about [a way to get custom traits in `thiserror`](dtolnay/thiserror#212) could make spurious error checks nicer and less error prone during maintenance. It's not a problem though.
* I am using `RUSTFLAGS=--cfg` to influence the entire build and unit-tests as environment variables didn't get through to the binary built and run for tests.

### Questions

* The way `gitoxide` is configured the user has the opportunity to override these values using more specific git options, for example using url specific http settings. This looks like a feature to me, but if it's not `gitoxide` needs to provide a way to disable applying these overrides. Please let me know what's desired here - my preference is to allow overrides.
* `gitoxide` currently opens repositories similar to how `git` does which respects git specific environment variables. This might be a deviation from how it was before and can be turned off. My preference is to see it as a feature.

### Prerequisite PRs

* #11602
@Byron Byron mentioned this issue Jun 5, 2023
13 tasks
@ofek
Copy link

ofek commented Jan 20, 2024

Is this actually complete despite the unfinished tasks in the OP?

@Byron
Copy link
Owner Author

Byron commented Jan 20, 2024

It works for all intents and purposes but isn’t perfect related to some details. These are still tracked here, maybe they can be moved into a follow-up issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue An issue to track to track the progress of multiple PRs or issues
Projects
Development

No branches or pull requests

3 participants