
Cache dependencies between crates published by the same owner #1757

Open · jyn514 opened this issue Jul 11, 2022 · 21 comments
Labels
A-builds Area: Building the documentation for a crate E-medium Effort: This requires a fair amount of work

Comments

jyn514 (Member) commented Jul 11, 2022

Right now, when a large workspace publishes its crates, we spend a lot of time rebuilding the same dependencies over and over again (e.g. https://github.com/aptos-labs/aptos-core). To reduce the load on docs.rs and speed up builds, we can cache those dependencies to avoid rebuilding them.

For security reasons, we are only planning to do this between crates owned by the same crates.io owner; see https://discord.com/channels/442252698964721669/541978667522195476/996125244026794134 for a discussion of how sharing a cache between unrelated crates causes issues. In practice, this should still be a giant speed-up, since most of our build time comes from large workspaces.

In order to have deterministic time limits, we are going to count the time spent to compile a dependency against the time limit, even if we already have the dependency cached locally. To avoid running out of disk space we'll use an LRU cache that deletes crates when we have less than ~20 GB free on the production server. For reproducibility we are going to make the cache read-only, and wipe the target/doc directory after each root crate we build.
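A minimal sketch of what such an eviction pass could look like, assuming one cache directory per dependency, the directory mtime as a proxy for last use, and a `free_disk_bytes` value obtained elsewhere; none of these names or the layout are the actual docs.rs implementation:

```rust
use std::fs;
use std::path::Path;

const MIN_FREE_BYTES: u64 = 20 * 1024 * 1024 * 1024; // keep ~20 GB free

/// Delete the least recently used cache entries until enough disk is free.
/// `free_disk_bytes` would come from a statvfs-style call; `meta.len()` is a
/// rough stand-in for the real on-disk size of a cache entry (a real
/// implementation would sum the entry's contents).
fn evict_lru(cache_root: &Path, mut free_disk_bytes: u64) -> std::io::Result<()> {
    let mut entries: Vec<_> = fs::read_dir(cache_root)?
        .filter_map(Result::ok)
        .filter_map(|e| {
            let meta = e.metadata().ok()?;
            Some((meta.modified().ok()?, meta.len(), e.path()))
        })
        .collect();
    // Oldest (least recently touched) entries first.
    entries.sort_by_key(|(modified, _, _)| *modified);

    for (_, size, path) in entries {
        if free_disk_bytes >= MIN_FREE_BYTES {
            break;
        }
        fs::remove_dir_all(&path)?;
        free_disk_bytes += size;
    }
    Ok(())
}
```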

Here is the rough approach we're imagining (a rough code sketch follows the list):

  1. Run `cargo check --timings=json -Zunstable-options -p {dep}` for each dependency of the crate we're building.
  2. If we can parse the JSON output, record the timings in a database. Otherwise, make a note to wipe the cache for everything we were unable to parse after this build finishes.
  3. Run `chmod -w` on everything in `target` (except for the top-level directory, so cargo can create `target/doc`).
  4. Subtract the timings of all dependencies from the time limit (even if some were already cached). If we were unable to parse the JSON, subtract the entire time needed in step 2.
  5. Run `cargo doc` and upload to S3.
  6. Delete `target/doc`.
  7. Repeat 1-6 for all new crates.io crates published by the same owner.
  8. Wipe the cache and the database table the next time we call `rustwide.update_toolchain`, since the new nightly will be unable to reuse the cache.
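A minimal sketch of steps 1-6 for a single root crate. The helper functions at the bottom are placeholders, not docs.rs code; only the control flow and the time accounting mirror the plan above.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

fn build_root_crate(deps: &[&str], mut time_limit: Duration) -> std::io::Result<()> {
    for &dep in deps {
        // Step 1: `cargo check` the dependency with JSON timings (nightly flag).
        let start = Instant::now();
        let out = Command::new("cargo")
            .args(["check", "--timings=json", "-Zunstable-options", "-p", dep])
            .output()?;
        // Steps 2 and 4: subtract the reported compile time from the limit,
        // even if the dependency was already cached; fall back to wall time
        // (and remember to wipe this cache entry) if the JSON can't be parsed.
        match parse_total_compile_time(&out.stdout) {
            Some(spent) => time_limit = time_limit.saturating_sub(spent),
            None => {
                mark_cache_entry_for_wipe(dep);
                time_limit = time_limit.saturating_sub(start.elapsed());
            }
        }
    }
    // Step 3: make everything under target/ read-only, except the top level,
    // so cargo can still create target/doc.
    make_target_read_only("target")?;
    // Steps 5 and 6: build the docs within the remaining budget, upload, and
    // wipe target/doc so the next root crate starts from a clean slate.
    run_with_time_limit(time_limit, || Command::new("cargo").arg("doc").status())?;
    upload_to_s3("target/doc")?;
    std::fs::remove_dir_all("target/doc")?;
    Ok(())
}

// Placeholder helpers so the sketch stays focused on the flow above.
fn parse_total_compile_time(_timings_json: &[u8]) -> Option<Duration> { todo!() }
fn mark_cache_entry_for_wipe(_dep: &str) { todo!() }
fn make_target_read_only(_dir: &str) -> std::io::Result<()> { todo!() }
fn run_with_time_limit(
    _limit: Duration,
    _build: impl FnOnce() -> std::io::Result<std::process::ExitStatus>,
) -> std::io::Result<()> { todo!() }
fn upload_to_s3(_path: &str) -> std::io::Result<()> { todo!() }
```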
jyn514 added the A-builds (Area: Building the documentation for a crate) label on Jul 11, 2022
syphar (Member) commented Jul 16, 2022

> Run chmod -w on everything in target

This sounds like cargo check for each dependency ends up collecting everything we need for rustdoc. It also assumes we can make rustdoc work with the /target/ dependencies read-only, except for the output directory.

Shouldn't this prevent any malicious changes a crate author could make?
So could we actually cache the dependencies for all owners?

Also, the --timings=json logic and calculation feels like quite a lot of complexity "only" for deterministic time limits.

Nemo157 (Member) commented Jul 16, 2022

The most aggressive attack that @pietroalbini mentioned was (a rough sketch in code follows the list):

  • We build some crate foo under the control of the attacker.
    • This has a build script that writes a bar_macro.rlib and associated files into the target dir, making it look like bar-macro has already been built.
  • We build docs for another crate bar, which depends on a proc-macro bar-macro.
    • Cargo sees that (the attacker-generated) bar_macro.rlib is up to date and uses it; now the attacker has code running in the rustc process during the build of another crate's docs and can do essentially anything it wants to those docs.
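As an illustration only: a real attack would also have to forge cargo's fingerprint metadata and the hash in the file name, so this build script just shows the basic idea of planting a fake "already built" artifact in the shared target dir.

```rust
// build.rs of the attacker-controlled crate `foo` (illustration only).
use std::env;
use std::fs;
use std::path::PathBuf;

fn main() {
    // OUT_DIR points inside the shared target directory, e.g.
    // target/debug/build/foo-<hash>/out, so walking up a few levels
    // reaches the target dir itself.
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let target_dir = out_dir.ancestors().nth(4).unwrap();

    let deps = target_dir.join("debug/deps");
    fs::create_dir_all(&deps).unwrap();
    // Plant something that looks like `bar_macro` has already been compiled.
    fs::write(deps.join("libbar_macro-0123456789abcdef.rlib"), b"attacker bytes").unwrap();
}
```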

syphar (Member) commented Jul 16, 2022

Wouldn't that be prevented by only having /target/doc writable?

Or does cargo check not produce everything cargo doc needs, so we would add files?

Nemo157 (Member) commented Jul 16, 2022

Oh, I forgot one indirection in the first step: the crate foo depends on another crate foo2 under the control of the attacker. The foo2 build script will be run during the cargo check step, while the whole target dir is writable.

syphar (Member) commented Jul 16, 2022

(Sorry if I'm missing something; these build specifics are new to me.)

> Oh, I forgot one indirection in the first step: the crate foo depends on another crate foo2 under the control of the attacker. The foo2 build script will be run during the cargo check step, while the whole target dir is writable.

How would that be different with or without caching?

Nemo157 (Member) commented Jul 16, 2022

Without caching: if foo2 writes libbar_macro.rlib into the target directory, that will be wiped before we build the docs for bar, so there's no way for one build-tree to affect another build-tree (you only have to trust your dependencies, not everything else that has been built by the builder).

syphar (Member) commented Jul 17, 2022

Ah, then I get it. Thank you for spending the time explaining!

Couldn't we keep the cache by top-level dependency and version? Of course, this would mean duplicate storage of the output binaries, but the implementation could be quite simple.

syphar (Member) commented Jul 17, 2022

That means the possible process could be (without timings):

for each top-level dependency:

  1. check if we have a cached target directory. If yes, copy it to the future doc-build target, skipping any duplicate files.
  2. if we don't have a cached target, run `cargo check dep@vers` in an isolated env. Put the output into the cache, and copy it over to the doc-build target.

When we have done this for all top-level dependencies, we can run cargo doc without any permission changes, since we don't put any output files into the cache. Also, if we missed something from the cache, cargo doc will just build it.
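A rough sketch of that flow. The cache layout, the use of `CARGO_TARGET_DIR`, and shelling out to GNU `cp -r -n` to skip files that already exist are all assumptions for illustration; how exactly the isolated `cargo check` would be invoked is glossed over here.

```rust
use std::path::Path;
use std::process::Command;

/// Make one top-level dependency's build artifacts available to the doc build,
/// either from the cache or by building it first. Illustration only.
fn prepare_dependency(
    dep: &str,
    version: &str,
    cache_root: &Path,
    doc_target: &Path,
) -> std::io::Result<()> {
    let cached = cache_root.join(format!("{dep}-{version}"));
    if !cached.exists() {
        // Step 2: not cached yet, so build the dependency in an isolated
        // environment and keep its target directory as the cache entry.
        let spec = format!("{dep}@{version}");
        Command::new("cargo")
            .args(["check", "-p", spec.as_str()])
            .env("CARGO_TARGET_DIR", &cached)
            .status()?;
    }
    // Step 1 (and the tail of step 2): copy the cached artifacts into the
    // doc-build target, skipping files that already exist there (`-n`).
    Command::new("cp")
        .args(["-r", "-n"])
        .arg(cached.join("."))
        .arg(doc_target)
        .status()?;
    Ok(())
}
```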

pietroalbini (Member) commented:
I think the most effective way would be to just keep the same target directory only when the previous crate was published by the same person as the new crate. So, for example:

  • foo by @alice, uses a fresh target directory as no build happened before
  • bar by @alice, reuses the target directory of foo as it's also published by @alice
  • baz by @bob, removes old target dir and uses a fresh one as it's not published by @alice
  • foobar by @alice, removes old target dir and uses a fresh one as it's not published by @bob

That would solve the problem of big workspaces being published, as I assume they would all be published by the same user, and it removes the chance of someone sneaking something into someone else's docs.
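A minimal sketch of that rule, assuming we can look up which account published the previous build (how `last_publisher` is persisted is out of scope here):

```rust
use std::fs;
use std::path::Path;

/// Keep the existing target dir only if the previous crate was published by
/// the same account; otherwise start from a fresh one. Sketch only.
fn prepare_target_dir(
    target: &Path,
    publisher: &str,
    last_publisher: Option<&str>,
) -> std::io::Result<()> {
    if last_publisher != Some(publisher) && target.exists() {
        fs::remove_dir_all(target)?;
    }
    fs::create_dir_all(target)
}
```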

syphar (Member) commented Jul 17, 2022

Ah, so this would also depend on the queued order of the crates?

syphar (Member) commented Jul 17, 2022

So it would only help specifically for these bulk crates that are quickly released after each other, and not for any other crates.

Nemo157 (Member) commented Jul 17, 2022

We have a hard-limit of max 1 day of caching because of updating the compiler. There are probably plenty of cases where a few related crates are published throughout the day which we would miss, but the really bad situations are when we have the mass publishes of workspaces with very deep dependency trees.

syphar (Member) commented Jul 17, 2022

How would that look with multiple owners? Treat them as one (unique) new owner/team?

Nemo157 (Member) commented Jul 17, 2022

We can get the actual publisher of each release from the crates.io API (even for crates with multiple owners; if they're publishing a workspace at once, I'd assume a single account is used for all the crates).
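A hedged sketch of that lookup: the endpoint and the `published_by.login` field are from memory of the public crates.io API and should be verified, and the string scraping below only keeps the sketch free of a JSON dependency.

```rust
use std::process::Command;

/// Ask crates.io which account actually published `{krate} {version}`.
fn published_by(krate: &str, version: &str) -> Option<String> {
    let url = format!("https://crates.io/api/v1/crates/{krate}/{version}");
    let out = Command::new("curl")
        // crates.io asks API consumers to identify themselves via User-Agent.
        .args(["-s", "-A", "docsrs-cache-sketch (hypothetical)", url.as_str()])
        .output()
        .ok()?;
    let body = String::from_utf8(out.stdout).ok()?;

    // Crude extraction of `"published_by": { ... "login": "<name>" ... }`.
    let start = body.find("\"published_by\"")?;
    let login = body[start..].find("\"login\":\"")? + start + "\"login\":\"".len();
    let end = body[login..].find('"')? + login;
    Some(body[login..end].to_string())
}
```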

jyn514 (Member, Author) commented Mar 9, 2023

FWIW, this is painful enough that I would be OK with omitting the "record the timings in a database" bit for now if it lets us get the caching in sooner. Skipping it can only cause false positives (your crate intermittently builds when otherwise it would have always failed), not false negatives (your crate intermittently fails when it should have always succeeded).

Verifying the crates are owned by the same owner is not something we can omit though, for security reasons.

syphar (Member) commented Mar 9, 2023

I'm with you on both points; the first one we also already talked about, and the second was clear to me.

What's slowing me down is mostly the iteration speed / debugging with Docker on my Mac, since I have to go through the web container for everything.
On top of that, the build process & rustdoc are definitely the parts I know the least, so a lot of trial & error is needed.

While improving the dev experience is certainly possible, right now I'm trying to invest time into artifact caching here, now that the CPU load problem has been worked around and a possible fix is in a PR.

jyn514 (Member, Author) commented Mar 9, 2023

@syphar yup, definitely, I wasn't trying to imply you weren't doing enough work ❤️ For context, @alice-i-cecile was asking how she could help speed up the build queue, and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

alice-i-cecile commented:
Yep, the Bevy community was curious about how we could help, and I was pointed here <3

syphar (Member) commented Mar 9, 2023

> @syphar yup, definitely, I wasn't trying to imply you weren't doing enough work ❤️

I wasn't trying to imply that you implied ;)

I myself feel slow when I'm working on the build process; that's also visible when I want to dig into the other bigger build-process topic (adding builds into the DB before the build).

> For context, @alice-i-cecile was asking how she could help speed up the build queue, and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

ah, thanks for the explanation!

If needed, I can create a WIP PR plus some explanation of where I am right now, if @alice-i-cecile wants to help out.

alice-i-cecile commented:
Sure, that sounds lovely :) I don't personally have a ton of time or expertise here, but I'll take a look and share it around the community: we have a lot of talented folks and care about improving things for the whole ecosystem.

syphar (Member) commented Mar 9, 2023

I created the draft PR with what I already had: #2079
