fetch publisher data from crates.io, start build artifact caching #2079

Open
wants to merge 22 commits into base: master

Conversation

@syphar (Member) commented Mar 9, 2023

This is the first draft of my idea for artifact caching between builds, to solve #1757.

The basic idea is:

  • for each release, fetch the publisher from the API
  • let rustwide prepare all the needed folders, including the target folder
  • replace the target folder with a cached target folder, keyed by the publisher's GitHub ID.
  • run the build
  • clean the doc artifacts from the target directory
  • move the target back to the main cache folder.

also,

  • when the nightly toolchain changes, delete the whole cache.

I intentionally didn't compress the cache so I can just rename / move the folder (see the sketch below).
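A minimal sketch of that restore / store cycle, assuming anyhow for errors; the struct and method names are illustrative, not necessarily what caching.rs ends up using:

```rust
use std::fs;
use std::path::{Path, PathBuf};

use anyhow::Result;

/// Per-publisher target-directory cache (illustrative sketch, not the PR's code).
struct ArtifactCache {
    cache_root: PathBuf,
}

impl ArtifactCache {
    /// Swap the cached target dir (keyed by publisher ID) into the place
    /// rustwide prepared for this build.
    fn restore(&self, publisher_id: u64, build_target_dir: &Path) -> Result<()> {
        let cached = self.cache_root.join(publisher_id.to_string());
        if cached.exists() {
            // replace the empty target dir prepared by rustwide with the cached one
            if build_target_dir.exists() {
                fs::remove_dir_all(build_target_dir)?;
            }
            fs::rename(cached, build_target_dir)?;
        }
        Ok(())
    }

    /// After the build: drop the doc output, then move the target dir back
    /// into the cache so the next release by this publisher can reuse it.
    fn store(&self, publisher_id: u64, build_target_dir: &Path) -> Result<()> {
        let doc_dir = build_target_dir.join("doc");
        if doc_dir.exists() {
            fs::remove_dir_all(doc_dir)?;
        }
        fs::create_dir_all(&self.cache_root)?;
        let cached = self.cache_root.join(publisher_id.to_string());
        if cached.exists() {
            fs::remove_dir_all(&cached)?;
        }
        fs::rename(build_target_dir, cached)?;
        Ok(())
    }
}
```

Because fs::rename only works within the same filesystem, cache_root and the rustwide workspace would have to live on the same volume for this to stay cheap.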

My first short tests worked, but this needs validating by more people.

open questions

  • check doc cleanup technique
  • can we / should we use a symlink instead of the fs::rename? Someone with a Linux machine should test if it would also fall back to the copy mechanism, which could be too slow for caching. There is also a chance I'm missing some Docker specifics that make the move / symlink impossible.
  • I'm not sure how much of the permission-related parts of #1757 (Cache dependencies between crates published by the same owner) we would need here?

TODO

  • more manual tests
  • monitor free disk space & delete old artifacts if it gets too full
  • tests for the cache itself
  • tests for the builder-changes
  • globally enable/disable the cache via config

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Mar 9, 2023
@syphar (Member Author) commented Mar 9, 2023

@Nemo157 @jyn514 I would love to have your feedback, especially on the changes in the build process itself.

(Most of the changed LOC are fetching the publisher / a related refactor.)

More specifically: for a simple approach, ignoring timings, what is missing?
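Since most of the diff is about fetching the publisher, here is a rough sketch of what that lookup could look like against the public crates.io API; the endpoint and field names are my recollection and should be checked against the actual API (the PR may go through the configured registry host instead):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct VersionResponse {
    version: Version,
}

#[derive(Deserialize)]
struct Version {
    // `published_by` can be missing, e.g. for very old releases
    published_by: Option<Publisher>,
}

#[derive(Deserialize)]
struct Publisher {
    id: u64,
    login: String,
}

/// Fetch who published a specific release (illustrative only).
fn fetch_publisher(name: &str, version: &str) -> anyhow::Result<Option<Publisher>> {
    let url = format!("https://crates.io/api/v1/crates/{name}/{version}");
    let response: VersionResponse = reqwest::blocking::Client::new()
        .get(url)
        .header(reqwest::header::USER_AGENT, "docs.rs-builder-example")
        .send()?
        .error_for_status()?
        .json()?;
    Ok(response.version.published_by)
}
```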

@jyn514 (Member) left a comment

I'm not sure how much of the permission related parts of #1757 we would need here?

The main thing we're missing is that we're not marking the cache directory as read-only when we restore it. I would prefer to do that, but since we're only using builds by the same owner, it can only be a correctness issue, not a security issue, and in practice I doubt anything will go wrong.

monitor free disk space & delete old artifacts if it gets too full

We need to at least clear the cache every 24 hours when we update the toolchain, or we will run out of disk space fairly soon. (Whoops, I see you already did this.) IMO we should also clean the cache after every ~10 builds or so, but I feel less strongly about that.

I left comments on your other questions inline.

@Mark-Simulacrum (Member) commented

Can you expand on the rationale for going by same owner rather than by same crate? It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day, which seems like a rare use case compared to the complexity of getting this right.

Long-term (or really hopefully quite soon) I hope that the build environment will be an entirely ephemeral ec2 instance, which may get replaced even after every build eventually. That'll make caching here even harder.

@jyn514 (Member) commented Mar 10, 2023

@Mark-Simulacrum #1757 has a detailed rationale. The tldr is that crates in workspaces often have very large dependency trees that are the same between all the crates in the workspace.

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day,

Crates in a workspace are often published all at once, one after another.

@syphar (Member Author) commented Mar 10, 2023

Thank you for taking the time to look at it!

Can you expand on the rationale for going by same owner rather than by same crate?

We have some publishers that always release batches of crates (sometimes 50-100 releases at once), also visible in the current queue. Recent examples would be google-*, waffles-solana-*, swc-*, azure-*, bevy_, ...
We typically deprioritize these, but sharing the assets between them would vastly increase build speed for these.

It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

Could you elaborate on this? Were these build failures? In which cases did Crater share the target directory?

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

We're already doing this in this PR. When the nightly version switches, we purge the whole cache.
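As an illustration of that purge-on-toolchain-change behaviour (the marker-file approach here is an assumption for the example, not necessarily how the PR does it):

```rust
use std::{fs, io, path::Path};

// Wipe the whole artifact cache whenever the nightly toolchain changes,
// remembering the last-seen toolchain in a marker file.
fn invalidate_cache_on_toolchain_change(cache_root: &Path, toolchain: &str) -> io::Result<()> {
    let marker = cache_root.join(".toolchain-version");
    let cached_version = fs::read_to_string(&marker).unwrap_or_default();
    if cached_version != toolchain {
        if cache_root.exists() {
            fs::remove_dir_all(cache_root)?;
        }
        fs::create_dir_all(cache_root)?;
        fs::write(&marker, toolchain)?;
    }
    Ok(())
}
```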

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day, which seems like a rare use case compared to the complexity of getting this right.

See the rationale, this is mainly for bulk-publishes.

Long-term (or really hopefully quite soon) I hope that the build environment will be an entirely ephemeral ec2 instance, which may get replaced even after every build eventually. That'll make caching here even harder.

I believe one-instance-per-build is more mid- to long-term.
What is currently in progress with infra is that we split the web & build-server, run web on ECS, and have an easy way to recreate build-servers when we want to. This would then be used to scale the build-servers (horizontally & vertically) when the queue gets full. So for these cases the caching would still help.

@syphar (Member Author) commented Mar 10, 2023

I'm not sure how much of the permission related parts of #1757 we would need here?

The main thing we're missing is that we're not marking the cache directory as read-only when we restore it. I would prefer to do that, but since we're only using builds by the same owner, it can only be a correctness issue, not a security issue, and in practice I doubt anything will go wrong.

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

@jyn514 (Member) commented Mar 10, 2023

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

When it's restored. The idea is to prevent cargo (or build scripts) from modifying files it's already created in a previous build.

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

This is a correctness issue that does not have a security impact.

@syphar (Member Author) commented Mar 10, 2023

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

When it's restored. The idea is to prevent cargo (or build scripts) from modifying files it's already created in a previous build.

So the build will only ever create files, and never update or delete them?
Setting them to read-only should be possible then, but that would have to be file-by-file, right?

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

This is a correctness issue that does not have a security impact.

👍

@jyn514 (Member) commented Mar 10, 2023

So the build will only ever create files, and never update or delete them?
Setting them to read-only should be possible then, but that would have to be file-by-file, right?

that's what I was imagining, yeah.
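A minimal sketch of that file-by-file read-only marking, assuming the walkdir crate; this is the idea being discussed, not code from the PR:

```rust
use std::fs;
use std::path::Path;

use walkdir::WalkDir;

// Mark every existing file in the restored cache directory read-only, so cargo
// and build scripts can still add new files but not modify files from earlier builds.
fn mark_files_readonly(dir: &Path) -> anyhow::Result<()> {
    for entry in WalkDir::new(dir) {
        let entry = entry?;
        if entry.file_type().is_file() {
            let mut perms = fs::metadata(entry.path())?.permissions();
            perms.set_readonly(true);
            fs::set_permissions(entry.path(), perms)?;
        }
    }
    Ok(())
}
```

(As Nemo157 points out below, this breaks down for OUT_DIRs that build scripts expect to rewrite.)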

@Mark-Simulacrum (Member) commented

It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

Could you elaborate on this? Were these build failures? In which cases did Crater share the target directory?

Crater currently shares target directories across ~all (up to disk space limits, so no per-crate property) crates it builds for a single run and compiler version.

Historically Crater used a single target directory for both the baseline and new compiler versions, which regularly exposed us to errors. I didn't spend a lot of time tracking those errors down in terms of root cause, but they typically were of the shape "found artifact X compiled by different version of rustc" or something along those lines.

I think I have seen cases where a crate won't build or will fail tests due to stale files, e.g., because its build script assumes that it is only running once -- but with different features etc. that may not be the case -- but I don't have links to those off hand. I think my advice here comes down to: be prepared for trouble if you do this, and make sure you have a way to disable it. Consider only enabling it for known well-behaved crates, especially if you already have logic to automatically de-prioritize them in the queue.

This would also make me feel better because relying on crate owners as a security boundary doesn't feel right to me. For example, I have publishing rights to some rust-lang-owned crates, but I wouldn't want that to mean a security boundary is breached if I have a side-project auto-published crate from some random repository or whatever.

It seems to me that the primary benefit is to already large projects, and ones we already treat specially in terms of queues etc, so we can readily mark those as sharing a build directory while not using any crate property.

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

We're already doing this in this PR. When the nightly version switches, we purge the whole cache.

👍 -- this seems like it would prevent most of the issues Crater saw.

I believe one-instance-per-build is more mid- to long-term.

What is currently in progress with infra is that we split the web & build-server, run web on ECS, and have an easy way to recreate build-servers when we want to. This would then be used to scale the build-servers (horizontally & vertically) when the queue gets full. So for these cases the caching would still help.

Yes, that seems right in terms of current trajectory. I would like to move us eventually to having no or minimal sandboxing ourselves (in Crater, docs.rs, playground) but rather rely on boundaries made by others (e.g., EC2 instance isolation).

@Nemo157 (Member) commented Mar 13, 2023

You can't set OUT_DIRs to read-only: build scripts can rerun, and cargo doesn't clean the OUT_DIR between runs. I've had issues before when I set files within ~/.cargo/src/ to read-only and that attribute was copied into the OUT_DIR by some build script, so it failed on a second run.

We could use sccache to manage the caching for us. Just mount a per-owner directory into the build container for it to use. It will then handle copying the minimal final artifacts for each dependency into this directory, and grabbing the correct ones on the next build. That avoids all issues around copying things like doc/ that aren't cacheable, and reduces the disk usage by not caching other intermediate files that aren't necessary.

@syphar (Member Author) commented Mar 18, 2023

I think I have seen cases where a crate won't build or will fail tests due to stale files, e.g., because its build script assumes that it is only running once -- but with different features etc. that may not be the case -- but I don't have links to those off hand. I think my advice here comes down to: be prepared for trouble if you do this, and make sure you have a way to disable it. Consider only enabling it for known well-behaved crates, especially if you already have logic to automatically de-prioritize them in the queue.

I'll keep that in mind, thanks! The deprioritization is currently only based on name patterns, which means there are false positives where crates from other orgs might also be deprioritized only because they share the same prefix. So we would have to keep in mind that crates from different owners might be mixed up under the same prefix.

This would also make me feel better because relying on crate owners as a security boundary doesn't feel right to me. For example, I have publishing rights to some rust-lang-owned crates, but I wouldn't want to see that mean that there is a security boundary breached if I have a side-project auto-published crate from some random repository or whatever.

The current idea is that the security boundary is not the owner, but the publisher.

It seems to me that the primary benefit is to already large projects, and ones we already treat specially in terms of queues etc, so we can readily mark those as sharing a build directory while not using any crate property.

So this sounds like it boils down to a decision between a manual opt-in for caching artifacts, and automated caching based on some criteria (publisher, ... )

We could use sccache to manage the caching for us. Just mount a per-owner directory into the build container for it to use. It will then handle copying the minimal final artifacts for each dependency into this directory, and grabbing the correct ones on the next build. That avoids all issues around copying things like doc/ that aren't cacheable, and reduces the disk usage by not caching other intermediate files that aren't necessary.

I'll check out if / how that would work. I assume just setting RUSTC_WRAPPER should be good enough (rough sketch below).
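A conceptual sketch of those environment variables; the real build goes through rustwide's sandbox rather than std::process::Command, and the per-publisher cache mount path is hypothetical:

```rust
use std::process::Command;

// RUSTC_WRAPPER routes every rustc invocation through sccache, and SCCACHE_DIR
// points sccache at an on-disk cache, here keyed per publisher.
fn cargo_doc_with_sccache(publisher_id: u64) -> std::io::Result<std::process::ExitStatus> {
    Command::new("cargo")
        .args(["doc", "--no-deps"])
        .env("RUSTC_WRAPPER", "sccache")
        .env("SCCACHE_DIR", format!("/cache/sccache/{publisher_id}"))
        .status()
}
```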

@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Mar 26, 2023
@syphar (Member Author) commented May 14, 2023

That said landing sccache seems fine as a first step as long as you don't close out the tracking issue.

Hm, this sounds like you would prefer the current approach over using sccache?

@syphar (Member Author) commented May 14, 2023

I am dubious how much sccache will actually help in practice. I would like to see benchmarks on how much it helps with large workspace if you're planning to only switch to sccache without further changes.

I definitely wanted to run some tests before starting to implement this.

@syphar syphar changed the title RFC: fetch publisher data from crates.io, start build artifact caching fetch publisher data from crates.io, start build artifact caching Jun 2, 2023
@syphar syphar force-pushed the get-publisher branch 3 times, most recently from 4c3cc3c to 5e9a80e on June 4, 2023 06:07
@syphar syphar marked this pull request as ready for review June 19, 2023 20:13
@syphar (Member Author) commented Jun 19, 2023

@jyn514 @Nemo157 I think this is ready for another round of checks.

It would be awesome to have your feedback on what might be missing, from my side it's relatively complete.

I'm not 100% certain about the disk-space safeguard: cache limits are typically configured as a maximum cache size, but that would have meant running du over the whole cache directory each time, so this checks free disk space instead (see the sketch below).
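A sketch of what such a free-space-based safeguard could look like; available_space is a stand-in for whatever statvfs-style helper is actually used, and the eviction policy (oldest entries first) is only an assumption:

```rust
use std::{fs, io, path::Path};

// Delete the oldest per-publisher cache directories until enough disk space is free.
fn purge_old_cache_entries(
    cache_root: &Path,
    min_free_bytes: u64,
    available_space: impl Fn(&Path) -> io::Result<u64>,
) -> io::Result<()> {
    // collect cache entries sorted by modification time, oldest first
    let mut entries: Vec<_> = fs::read_dir(cache_root)?
        .filter_map(|entry| entry.ok())
        .filter_map(|entry| {
            let mtime = entry.metadata().ok()?.modified().ok()?;
            Some((mtime, entry.path()))
        })
        .collect();
    entries.sort();

    for (_mtime, path) in entries {
        if available_space(cache_root)? >= min_free_bytes {
            break;
        }
        fs::remove_dir_all(path)?;
    }
    Ok(())
}
```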

@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels Jun 19, 2023
@syphar syphar added S-blocked Status: marked as blocked ❌ on something else such as an RFC or other implementation work. S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Oct 4, 2023
@syphar (Member Author) commented Oct 4, 2023

In the meantime we got a much bigger EC2 machine for the docs.rs server, which improved build speed a lot. After also adding another build process, I'm not sure any more whether this change is worth the effort, since the potential improvement was always limited to workspace builds.

For now I would park this and focus on other topics I see as more critical, and revisit it when it becomes a problem again.
