
Cache dependencies between crates published by the same owner #1757

Open · jyn514 opened this issue Jul 11, 2022 · 21 comments
Labels
A-builds Area: Building the documentation for a crate E-medium Effort: This requires a fair amount of work

Comments

jyn514 (Member) commented Jul 11, 2022

Right now, when a large workspace publishes its crates, we spend a lot of time rebuilding the same dependencies over and over again (e.g. https://github.com/aptos-labs/aptos-core). To reduce the load on docs.rs and speed up builds, we can cache those dependencies to avoid rebuilding them.

For security reasons, we are only planning to do this between crates owned by the same crates.io owner; see https://discord.com/channels/442252698964721669/541978667522195476/996125244026794134 for a discussion of how sharing a cache between unrelated crates causes issues. In practice, this should still be a giant speed-up, since most of our build time comes from large workspaces.

In order to have deterministic time limits, we are going to count the time spent to compile a dependency against the time limit, even if we already have the dependency cached locally. To avoid running out of disk space we'll use an LRU cache that deletes crates when we have less than ~20 GB free on the production server. For reproducibility we are going to make the cache read-only, and wipe the target/doc directory after each root crate we build.
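A minimal sketch of what such an eviction pass could look like, assuming one cache directory per dependency, the directory mtime as a proxy for last use, and a `free_disk_bytes` value obtained elsewhere; none of these names or the layout are the actual docs.rs implementation:

```rust
use std::fs;
use std::path::Path;

const MIN_FREE_BYTES: u64 = 20 * 1024 * 1024 * 1024; // keep ~20 GB free

/// Delete the least recently used cache entries until enough disk is free.
/// `free_disk_bytes` would come from a statvfs-style call; `meta.len()` is a
/// rough stand-in for the real on-disk size of a cache entry (a real
/// implementation would sum the entry's contents).
fn evict_lru(cache_root: &Path, mut free_disk_bytes: u64) -> std::io::Result<()> {
    let mut entries: Vec<_> = fs::read_dir(cache_root)?
        .filter_map(Result::ok)
        .filter_map(|e| {
            let meta = e.metadata().ok()?;
            Some((meta.modified().ok()?, meta.len(), e.path()))
        })
        .collect();
    // Oldest (least recently touched) entries first.
    entries.sort_by_key(|(modified, _, _)| *modified);

    for (_, size, path) in entries {
        if free_disk_bytes >= MIN_FREE_BYTES {
            break;
        }
        fs::remove_dir_all(&path)?;
        free_disk_bytes += size;
    }
    Ok(())
}
```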

Here is the rough approach we're imagining (a rough code sketch follows the list):

  1. Run `cargo check --timings=json -Zunstable-options -p {dep}` for each dependency of the crate we're building.
  2. If we can parse the JSON output, record the timings in a database. Otherwise, make a note to wipe the cache for everything we were unable to parse after this build finishes.
  3. Run `chmod -w` on everything in `target` (except for the top-level directory, so cargo can create `target/doc`).
  4. Subtract the timings of all dependencies from the time limit (even if some were already cached). If we were unable to parse the JSON, subtract the entire time needed in step 2.
  5. Run `cargo doc` and upload to S3.
  6. Delete `target/doc`.
  7. Repeat 1-6 for all new crates.io crates published by the same owner.
  8. Wipe the cache and the database table the next time we call `rustwide.update_toolchain`, since the new nightly will be unable to reuse the cache.
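A minimal sketch of steps 1-6 for a single root crate. The helper functions at the bottom are placeholders, not docs.rs code; only the control flow and the time accounting mirror the plan above.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

fn build_root_crate(deps: &[&str], mut time_limit: Duration) -> std::io::Result<()> {
    for &dep in deps {
        // Step 1: `cargo check` the dependency with JSON timings (nightly flag).
        let start = Instant::now();
        let out = Command::new("cargo")
            .args(["check", "--timings=json", "-Zunstable-options", "-p", dep])
            .output()?;
        // Steps 2 and 4: subtract the reported compile time from the limit,
        // even if the dependency was already cached; fall back to wall time
        // (and remember to wipe this cache entry) if the JSON can't be parsed.
        match parse_total_compile_time(&out.stdout) {
            Some(spent) => time_limit = time_limit.saturating_sub(spent),
            None => {
                mark_cache_entry_for_wipe(dep);
                time_limit = time_limit.saturating_sub(start.elapsed());
            }
        }
    }
    // Step 3: make everything under target/ read-only, except the top level,
    // so cargo can still create target/doc.
    make_target_read_only("target")?;
    // Steps 5 and 6: build the docs within the remaining budget, upload, and
    // wipe target/doc so the next root crate starts from a clean slate.
    run_with_time_limit(time_limit, || Command::new("cargo").arg("doc").status())?;
    upload_to_s3("target/doc")?;
    std::fs::remove_dir_all("target/doc")?;
    Ok(())
}

// Placeholder helpers so the sketch stays focused on the flow above.
fn parse_total_compile_time(_timings_json: &[u8]) -> Option<Duration> { todo!() }
fn mark_cache_entry_for_wipe(_dep: &str) { todo!() }
fn make_target_read_only(_dir: &str) -> std::io::Result<()> { todo!() }
fn run_with_time_limit(
    _limit: Duration,
    _build: impl FnOnce() -> std::io::Result<std::process::ExitStatus>,
) -> std::io::Result<()> { todo!() }
fn upload_to_s3(_path: &str) -> std::io::Result<()> { todo!() }
```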
jyn514 added the A-builds (Area: Building the documentation for a crate) label on Jul 11, 2022
syphar (Member) commented Jul 16, 2022

> Run chmod -w on everything in target

This sounds like cargo check for each dependency ends up collecting everything we need for rustdoc. It also assumes we can make rustdoc work with the /target/ dependencies read-only, except for the output directory.

Shouldn't this prevent any malicious changes a crate author could make?
So could we actually cache the dependencies for all owners?

Also, the --timings=json logic and calculation feels like quite a lot of complexity "only" for deterministic time limits.

Nemo157 (Member) commented Jul 16, 2022

The most aggressive attack that @pietroalbini mentioned was (a rough sketch in code follows the list):

  • We build some crate foo under the control of the attacker.
    • This has a build script that writes a bar_macro.rlib and associated files into the target dir, making it look like bar-macro has already been built.
  • We build docs for another crate bar, which depends on a proc-macro bar-macro.
    • Cargo sees that (the attacker-generated) bar_macro.rlib is up to date and uses it; now the attacker has code running in the rustc process during the build of another crate's docs and can do essentially anything it wants to those docs.
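As an illustration only: a real attack would also have to forge cargo's fingerprint metadata and the hash in the file name, so this build script just shows the basic idea of planting a fake "already built" artifact in the shared target dir.

```rust
// build.rs of the attacker-controlled crate `foo` (illustration only).
use std::env;
use std::fs;
use std::path::PathBuf;

fn main() {
    // OUT_DIR points inside the shared target directory, e.g.
    // target/debug/build/foo-<hash>/out, so walking up a few levels
    // reaches the target dir itself.
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let target_dir = out_dir.ancestors().nth(4).unwrap();

    let deps = target_dir.join("debug/deps");
    fs::create_dir_all(&deps).unwrap();
    // Plant something that looks like `bar_macro` has already been compiled.
    fs::write(deps.join("libbar_macro-0123456789abcdef.rlib"), b"attacker bytes").unwrap();
}
```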

syphar (Member) commented Jul 16, 2022

Wouldn't that be prevented by only having /target/doc writable?

Or does cargo check not produce everything cargo doc needs, so we would add files?

Nemo157 (Member) commented Jul 16, 2022

Oh, I forgot one indirection in the first step: the crate foo depends on another crate foo2 under the control of the attacker. The foo2 build script will be run during the cargo check step, while the whole target dir is writable.

syphar (Member) commented Jul 16, 2022

(Sorry if I'm missing something; these build specifics are new to me.)

> Oh, I forgot one indirection in the first step: the crate foo depends on another crate foo2 under the control of the attacker. The foo2 build script will be run during the cargo check step, while the whole target dir is writable.

How would that be different with or without caching?

Nemo157 (Member) commented Jul 16, 2022

Without caching: if foo2 writes libbar_macro.rlib into the target directory, that will be wiped before we build the docs for bar, so there's no way for one build-tree to affect another build-tree (you only have to trust your dependencies, not everything else that has been built by the builder).

syphar (Member) commented Jul 17, 2022

Ah, then I get it. Thank you for spending the time explaining!

Couldn't we keep the cache by top-level dependency and version? Of course, this would mean duplicate storage of the output binaries, but the implementation could be quite simple.

syphar (Member) commented Jul 17, 2022

That means the possible process could be (without timings):

for each top-level dependency:

  1. check if we have a cached target directory. If yes, copy it to the future doc-build target, skipping any duplicate files.
  2. if we don't have a cached target, run `cargo check dep@vers` in an isolated env. Put the output into the cache, and copy it over to the doc-build target.

When we have done this for all top-level dependencies, we can run cargo doc without any permission changes, since we don't put any output files into the cache. Also, if we missed something from the cache, cargo doc will just build it.
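A rough sketch of that flow. The cache layout, the use of `CARGO_TARGET_DIR`, and shelling out to GNU `cp -r -n` to skip files that already exist are all assumptions for illustration; how exactly the isolated `cargo check` would be invoked is glossed over here.

```rust
use std::path::Path;
use std::process::Command;

/// Make one top-level dependency's build artifacts available to the doc build,
/// either from the cache or by building it first. Illustration only.
fn prepare_dependency(
    dep: &str,
    version: &str,
    cache_root: &Path,
    doc_target: &Path,
) -> std::io::Result<()> {
    let cached = cache_root.join(format!("{dep}-{version}"));
    if !cached.exists() {
        // Step 2: not cached yet, so build the dependency in an isolated
        // environment and keep its target directory as the cache entry.
        let spec = format!("{dep}@{version}");
        Command::new("cargo")
            .args(["check", "-p", spec.as_str()])
            .env("CARGO_TARGET_DIR", &cached)
            .status()?;
    }
    // Step 1 (and the tail of step 2): copy the cached artifacts into the
    // doc-build target, skipping files that already exist there (`-n`).
    Command::new("cp")
        .args(["-r", "-n"])
        .arg(cached.join("."))
        .arg(doc_target)
        .status()?;
    Ok(())
}
```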

pietroalbini (Member) commented:
I think the most effective way would be to just keep the same target directory only when the previous crate was published by the same person as the new crate. So, for example:

  • foo by @alice, uses a fresh target directory as no build happened before
  • bar by @alice, reuses the target directory of foo as it's also published by @alice
  • baz by @bob, removes old target dir and uses a fresh one as it's not published by @alice
  • foobar by @alice, removes old target dir and uses a fresh one as it's not published by @bob

That would solve the problem of big workspaces being published, as I assume they would all be published by the same user, and it removes the chance of someone sneaking something into someone else's docs.
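A minimal sketch of that rule, assuming we can look up which account published the previous build (how `last_publisher` is persisted is out of scope here):

```rust
use std::fs;
use std::path::Path;

/// Keep the existing target dir only if the previous crate was published by
/// the same account; otherwise start from a fresh one. Sketch only.
fn prepare_target_dir(
    target: &Path,
    publisher: &str,
    last_publisher: Option<&str>,
) -> std::io::Result<()> {
    if last_publisher != Some(publisher) && target.exists() {
        fs::remove_dir_all(target)?;
    }
    fs::create_dir_all(target)
}
```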

syphar (Member) commented Jul 17, 2022

Ah, so this would also depend on the queued order of the crates?

syphar (Member) commented Jul 17, 2022

So it would only help specifically for these bulk crates that are quickly released after each other, and not for any other crates.

Nemo157 (Member) commented Jul 17, 2022

We have a hard-limit of max 1 day of caching because of updating the compiler. There are probably plenty of cases where a few related crates are published throughout the day which we would miss, but the really bad situations are when we have the mass publishes of workspaces with very deep dependency trees.

syphar (Member) commented Jul 17, 2022

How would that look with multiple owners? Treat them as one (unique) new owner/team?

Nemo157 (Member) commented Jul 17, 2022

We can get the actual publisher of each release from the crates.io API (even for crates with multiple owners; if they're publishing a workspace at once, I'd assume a single account is used for all the crates).
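A hedged sketch of that lookup: the endpoint and the `published_by.login` field are from memory of the public crates.io API and should be verified, and the string scraping below only keeps the sketch free of a JSON dependency.

```rust
use std::process::Command;

/// Ask crates.io which account actually published `{krate} {version}`.
fn published_by(krate: &str, version: &str) -> Option<String> {
    let url = format!("https://crates.io/api/v1/crates/{krate}/{version}");
    let out = Command::new("curl")
        // crates.io asks API consumers to identify themselves via User-Agent.
        .args(["-s", "-A", "docsrs-cache-sketch (hypothetical)", url.as_str()])
        .output()
        .ok()?;
    let body = String::from_utf8(out.stdout).ok()?;

    // Crude extraction of `"published_by": { ... "login": "<name>" ... }`.
    let start = body.find("\"published_by\"")?;
    let login = body[start..].find("\"login\":\"")? + start + "\"login\":\"".len();
    let end = body[login..].find('"')? + login;
    Some(body[login..end].to_string())
}
```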

jyn514 (Member, Author) commented Mar 9, 2023

FWIW, this is painful enough that I would be OK with omitting the "record the timings in a database" bit for now if it lets us get the caching in sooner. Skipping it can only cause false positives (your crate intermittently builds when otherwise it would have always failed), not false negatives (your crate intermittently fails when it should have always succeeded).

Verifying the crates are owned by the same owner is not something we can omit though, for security reasons.

syphar (Member) commented Mar 9, 2023

I'm with you on both points; the first one we also already talked about, and the second was clear to me.

What's slowing me down is mostly the iteration speed / debugging with Docker on my Mac, since I have to go through the web container for everything.
On top of that, the build process & rustdoc are definitely the parts I know the least, so a lot of trial & error is needed.

While improving the dev experience is certainly possible, right now I'm trying to invest time into artifact caching here, now that the CPU load problem has been worked around and a possible fix is in a PR.

jyn514 (Member, Author) commented Mar 9, 2023

@syphar yup, definitely, I wasn't trying to imply you weren't doing enough work ❤️ For context, @alice-i-cecile was asking how she could help speed up the build queue, and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

alice-i-cecile commented:
Yep, the Bevy community was curious about how we could help, and I was pointed here <3

syphar (Member) commented Mar 9, 2023

> @syphar yup, definitely, I wasn't trying to imply you weren't doing enough work ❤️

I wasn't trying to imply that you implied ;)

I myself feel slow when I'm working on the build process; that's also visible when I want to dig into the other bigger build-process topic (adding builds into the DB before the build).

> For context, @alice-i-cecile was asking how she could help speed up the build queue, and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

ah, thanks for the explanation!

If needed, I can create a WIP PR plus some explanation of where I am right now, if @alice-i-cecile wants to help out.

alice-i-cecile commented:
Sure, that sounds lovely :) I don't personally have a ton of time or expertise here, but I'll take a look and share it around the community: we have a lot of talented folks and care about improving things for the whole ecosystem.

syphar (Member) commented Mar 9, 2023

I created the draft PR with what I already had: #2079
