fetch publisher data from crates.io, start build artifact caching #2079

Open
wants to merge 22 commits into base: master

Conversation

@syphar (Member) commented Mar 9, 2023

This is the first draft of my idea for artifact caching between builds, to solve #1757.

The basic idea is:

  • for each release, fetch the publisher from the API
  • let rustwide prepare all the needed folders, including the target folder
  • replace the target folder with a cached target folder, keyed by the publisher's GitHub ID.
  • run the build
  • clean the doc artifacts from the target directory
  • move the target back to the main cache folder.

also,

  • when the nightly toolchain changes, delete the whole cache.

I intentionally didn't compress the cache so I can just rename / move the folder (see the sketch below).
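A minimal sketch of that restore / store cycle, assuming anyhow for errors; the struct and method names are illustrative, not necessarily what caching.rs ends up using:

```rust
use std::fs;
use std::path::{Path, PathBuf};

use anyhow::Result;

/// Per-publisher target-directory cache (illustrative sketch, not the PR's code).
struct ArtifactCache {
    cache_root: PathBuf,
}

impl ArtifactCache {
    /// Swap the cached target dir (keyed by publisher ID) into the place
    /// rustwide prepared for this build.
    fn restore(&self, publisher_id: u64, build_target_dir: &Path) -> Result<()> {
        let cached = self.cache_root.join(publisher_id.to_string());
        if cached.exists() {
            // replace the empty target dir prepared by rustwide with the cached one
            if build_target_dir.exists() {
                fs::remove_dir_all(build_target_dir)?;
            }
            fs::rename(cached, build_target_dir)?;
        }
        Ok(())
    }

    /// After the build: drop the doc output, then move the target dir back
    /// into the cache so the next release by this publisher can reuse it.
    fn store(&self, publisher_id: u64, build_target_dir: &Path) -> Result<()> {
        let doc_dir = build_target_dir.join("doc");
        if doc_dir.exists() {
            fs::remove_dir_all(doc_dir)?;
        }
        fs::create_dir_all(&self.cache_root)?;
        let cached = self.cache_root.join(publisher_id.to_string());
        if cached.exists() {
            fs::remove_dir_all(&cached)?;
        }
        fs::rename(build_target_dir, cached)?;
        Ok(())
    }
}
```

Because fs::rename only works within the same filesystem, cache_root and the rustwide workspace would have to live on the same volume for this to stay cheap.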

My first short tests worked, but this needs validating by more people.

open questions

  • check doc cleanup technique
  • can we / should we use a symlink instead of the fs::rename? Someone with a Linux machine should test if it would also fall back to the copy mechanism, which could be too slow for caching. There is also a chance I'm missing some Docker specifics that make the move / symlink impossible.
  • I'm not sure how much of the permission-related parts of #1757 (Cache dependencies between crates published by the same owner) we would need here?

TODO

  • more manual tests
  • monitor free disk space & delete old artifacts if it gets too full
  • tests for the cache itself
  • tests for the builder-changes
  • globally enable/disable the cache via config

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Mar 9, 2023
@syphar (Member Author) commented Mar 9, 2023

@Nemo157 @jyn514 I would love to have your feedback, especially on the changes in the build process itself.

(Most of the changed LOC are fetching the publisher / a related refactor.)

More specifically: for a simple approach, ignoring timings, what is missing?
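Since most of the diff is about fetching the publisher, here is a rough sketch of what that lookup could look like against the public crates.io API; the endpoint and field names are my recollection and should be checked against the actual API (the PR may go through the configured registry host instead):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct VersionResponse {
    version: Version,
}

#[derive(Deserialize)]
struct Version {
    // `published_by` can be missing, e.g. for very old releases
    published_by: Option<Publisher>,
}

#[derive(Deserialize)]
struct Publisher {
    id: u64,
    login: String,
}

/// Fetch who published a specific release (illustrative only).
fn fetch_publisher(name: &str, version: &str) -> anyhow::Result<Option<Publisher>> {
    let url = format!("https://crates.io/api/v1/crates/{name}/{version}");
    let response: VersionResponse = reqwest::blocking::Client::new()
        .get(url)
        .header(reqwest::header::USER_AGENT, "docs.rs-builder-example")
        .send()?
        .error_for_status()?
        .json()?;
    Ok(response.version.published_by)
}
```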

@jyn514 (Member) left a comment

I'm not sure how much of the permission related parts of #1757 we would need here?

The main thing we're missing is that we're not marking the cache directory as read-only when we restore it. I would prefer to do that, but since we're only using builds by the same owner, it can only be a correctness issue, not a security issue, and in practice I doubt anything will go wrong.

monitor free disk space & delete old artifacts if it gets too full

We need to at least clear the cache every 24 hours when we update the toolchain, or we will run out of disk space fairly soon. (Whoops, I see you already did this.) IMO we should also clean the cache after every ~10 builds or so, but I feel less strongly about that.

I left comments on your other questions inline.

@Mark-Simulacrum (Member) commented

Can you expand on the rationale for going by same owner rather than by same crate? It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day, which seems like a rare use case compared to the complexity of getting this right.

Long-term (or really hopefully quite soon) I hope that the build environment will be an entirely ephemeral ec2 instance, which may get replaced even after every build eventually. That'll make caching here even harder.

@jyn514 (Member) commented Mar 10, 2023

@Mark-Simulacrum #1757 has a detailed rationale. The tldr is that crates in workspaces often have very large dependency trees that are the same between all the crates in the workspace.

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day,

Crates in a workspace are often published all at once, one after another.

@syphar (Member Author) commented Mar 10, 2023

Thank you for taking the time to look at it!

Can you expand on the rationale for going by same owner rather than by same crate?

We have some publishers that always release batches of crates (sometimes 50-100 releases at once), also visible in the current queue. Recent examples would be google-*, waffles-solana-*, swc-*, azure-*, bevy_, ...
We typically deprioritize these, but sharing the assets between them would vastly increase build speed for these.

It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

Could you elaborate on this? Were these build failures? In which cases did Crater share the target directory?

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

We're already doing this in this PR. When the nightly version switches, we purge the whole cache.
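As an illustration of that purge-on-toolchain-change behaviour (the marker-file approach here is an assumption for the example, not necessarily how the PR does it):

```rust
use std::{fs, io, path::Path};

// Wipe the whole artifact cache whenever the nightly toolchain changes,
// remembering the last-seen toolchain in a marker file.
fn invalidate_cache_on_toolchain_change(cache_root: &Path, toolchain: &str) -> io::Result<()> {
    let marker = cache_root.join(".toolchain-version");
    let cached_version = fs::read_to_string(&marker).unwrap_or_default();
    if cached_version != toolchain {
        if cache_root.exists() {
            fs::remove_dir_all(cache_root)?;
        }
        fs::create_dir_all(cache_root)?;
        fs::write(&marker, toolchain)?;
    }
    Ok(())
}
```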

Given that lack of sharing opportunity, it seems like this would only help if crates are published more than once a day, which seems like a rare use case compared to the complexity of getting this right.

See the rationale, this is mainly for bulk-publishes.

Long-term (or really hopefully quite soon) I hope that the build environment will be an entirely ephemeral ec2 instance, which may get replaced even after every build eventually. That'll make caching here even harder.

I believe one-instance-per-build is more mid- to long-term.
What is currently in progress with infra is that we split the web & build-server, run web on ECS, and have an easy way to recreate build-servers when we want to. This would then be used to scale the build-servers (horizontally & vertically) when the queue gets full. So for these cases the caching would still help.

@syphar (Member Author) commented Mar 10, 2023

I'm not sure how much of the permission related parts of #1757 we would need here?

The main thing we're missing is that we're not marking the cache directory as read-only when we restore it. I would prefer to do that, but since we're only using builds by the same owner, it can only be a correctness issue, not a security issue, and in practice I doubt anything will go wrong.

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

@jyn514 (Member) commented Mar 10, 2023

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

When it's restored. The idea is to prevent cargo (or build scripts) from modifying files it's already created in a previous build.

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

This is a correctness issue that does not have a security impact.

@syphar (Member Author) commented Mar 10, 2023

@jyn514 could you explain more? You mean making it readonly when it's in the cache? Or when it's restored for the build?

When it's restored. The idea is to prevent cargo (or build scripts) from modifying files it's already created in a previous build.

So the build will only ever create files, and never update or delete them?
Setting them to read-only should be possible then, but that would have to be file-by-file, right?

To me a correctness issue with security impact is worth digging into, if we can protect ourselves against ourselves ;)

This is a correctness issue that does not have a security impact.

👍

@jyn514 (Member) commented Mar 10, 2023

So the build will only ever create files, and never update or delete them?
Setting them to read-only should be possible then, but that would have to be file-by-file, right?

that's what I was imagining, yeah.
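A minimal sketch of that file-by-file read-only marking, assuming the walkdir crate; this is the idea being discussed, not code from the PR:

```rust
use std::fs;
use std::path::Path;

use walkdir::WalkDir;

// Mark every existing file in the restored cache directory read-only, so cargo
// and build scripts can still add new files but not modify files from earlier builds.
fn mark_files_readonly(dir: &Path) -> anyhow::Result<()> {
    for entry in WalkDir::new(dir) {
        let entry = entry?;
        if entry.file_type().is_file() {
            let mut perms = fs::metadata(entry.path())?.permissions();
            perms.set_readonly(true);
            fs::set_permissions(entry.path(), perms)?;
        }
    }
    Ok(())
}
```

(As Nemo157 points out below, this breaks down for OUT_DIRs that build scripts expect to rewrite.)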

@Mark-Simulacrum (Member) commented

It seems to me that any cross-crate sharing is very iffy, since crates typically expect to build in a distinct workspace rather than in the same target directories as other crates. Crater has encountered a long tail of problems here that we're gradually fixing or hiding under the rug, and I wouldn't want to repeat that for something user facing like docs.rs.

Could you elaborate on this? Were these build failures? In which cases did Crater share the target directory?

Crater currently shares target directories across ~all (up to disk space limits, so no per-crate property) crates it builds for a single run and compiler version.

Historically Crater used a single target directory for both the baseline and new compiler versions, which regularly exposed us to errors. I didn't spend a lot of time tracking those errors down in terms of root cause, but they typically were of the shape "found artifact X compiled by different version of rustc" or something along those lines.

I think I have seen cases where a crate won't build or will fail tests due to stale files, e.g., because its build script assumes that it is only running once -- but with different features etc. that may not be the case -- but I don't have links to those off hand. I think my advice here comes down to: be prepared for trouble if you do this, and make sure you have a way to disable it. Consider only enabling it for known well-behaved crates, especially if you already have logic to automatically de-prioritize them in the queue.

This would also make me feel better because relying on crate owners as a security boundary doesn't feel right to me. For example, I have publishing rights to some rust-lang-owned crates, but I wouldn't want that to mean a security boundary is breached if I have a side-project auto-published crate from some random repository or whatever.

It seems to me that the primary benefit is to already large projects, and ones we already treat specially in terms of queues etc, so we can readily mark those as sharing a build directory while not using any crate property.

One particular thing to call out is to make sure to completely get rid of the cache on each nightly bump, Crater used to share target directories for that but stopped as we were hitting too many "wrong rustc version" errors in a typical run.

We're already doing this in this PR. When the nightly version switches, we purge the whole cache.

👍 -- this seems like it would prevent most of the issues Crater saw.

I believe one-instance-per-build is more mid- to long-term.

What is currently in progress with infra is that we split the web & build-server, run web on ECS, and have an easy way to recreate build-servers when we want to. This would then be used to scale the build-servers (horizontally & vertically) when the queue gets full. So for these cases the caching would still help.

Yes, that seems right in terms of current trajectory. I would like to move us eventually to having no or minimal sandboxing ourselves (in Crater, docs.rs, playground) but rather rely on boundaries made by others (e.g., EC2 instance isolation).

@Nemo157 (Member) commented Mar 13, 2023

You can't set OUT_DIRs to read-only: build scripts can rerun, and cargo doesn't clean the OUT_DIR between runs. I've had issues before when I set files within ~/.cargo/src/ to read-only and that attribute was copied into the OUT_DIR by some build script, so it failed on a second run.

We could use sccache to manage the caching for us. Just mount a per-owner directory into the build container for it to use. It will then handle copying the minimal final artifacts for each dependency into this directory, and grabbing the correct ones on the next build. That avoids all issues around copying things like doc/ that aren't cacheable, and reduces the disk usage by not caching other intermediate files that aren't necessary.

@syphar (Member Author) commented Mar 18, 2023

I think I have seen cases where a crate won't build or will fail tests due to stale files, e.g., because its build script assumes that it is only running once -- but with different features etc. that may not be the case -- but I don't have links to those off hand. I think my advice here comes down to: be prepared for trouble if you do this, and make sure you have a way to disable it. Consider only enabling it for known well-behaved crates, especially if you already have logic to automatically de-prioritize them in the queue.

I'll keep that in mind, thanks! The deprioritization is currently only based on name patterns, which means there are false positives where crates from other orgs might also be deprioritized only because they share the same prefix. So we would have to keep in mind that crates from different owners might be mixed up under the same prefix.

This would also make me feel better because relying on crate owners as a security boundary doesn't feel right to me. For example, I have publishing rights to some rust-lang-owned crates, but I wouldn't want to see that mean that there is a security boundary breached if I have a side-project auto-published crate from some random repository or whatever.

The current idea is that the security boundary is not the owner, but the publisher.

It seems to me that the primary benefit is to already large projects, and ones we already treat specially in terms of queues etc, so we can readily mark those as sharing a build directory while not using any crate property.

So this sounds like it boils down to a decision between a manual opt-in for caching artifacts, and automated caching based on some criteria (publisher, ... )

We could use sccache to manage the caching for us. Just mount a per-owner directory into the build container for it to use. It will then handle copying the minimal final artifacts for each dependency into this directory, and grabbing the correct ones on the next build. That avoids all issues around copying things like doc/ that aren't cacheable, and reduces the disk usage by not caching other intermediate files that aren't necessary.

I'll check out if / how that would work. I assume just setting RUSTC_WRAPPER should be good enough (rough sketch below).
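A conceptual sketch of those environment variables; the real build goes through rustwide's sandbox rather than std::process::Command, and the per-publisher cache mount path is hypothetical:

```rust
use std::process::Command;

// RUSTC_WRAPPER routes every rustc invocation through sccache, and SCCACHE_DIR
// points sccache at an on-disk cache, here keyed per publisher.
fn cargo_doc_with_sccache(publisher_id: u64) -> std::io::Result<std::process::ExitStatus> {
    Command::new("cargo")
        .args(["doc", "--no-deps"])
        .env("RUSTC_WRAPPER", "sccache")
        .env("SCCACHE_DIR", format!("/cache/sccache/{publisher_id}"))
        .status()
}
```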

@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Mar 26, 2023
@syphar (Member Author) commented May 14, 2023

That said landing sccache seems fine as a first step as long as you don't close out the tracking issue.

Hm, this sounds like you would prefer the current approach over using sccache?

@syphar (Member Author) commented May 14, 2023

I am dubious how much sccache will actually help in practice. I would like to see benchmarks on how much it helps with large workspace if you're planning to only switch to sccache without further changes.

I definitely wanted to run some tests before starting to implement this.

@syphar syphar changed the title RFC: fetch publisher data from crates.io, start build artifact caching fetch publisher data from crates.io, start build artifact caching Jun 2, 2023
@syphar syphar force-pushed the get-publisher branch 3 times, most recently from 4c3cc3c to 5e9a80e on June 4, 2023 06:07
@syphar syphar marked this pull request as ready for review June 19, 2023 20:13
@syphar (Member Author) commented Jun 19, 2023

@jyn514 @Nemo157 I think this is ready for another round of checks.

It would be awesome to have your feedback on what might be missing, from my side it's relatively complete.

I'm not 100% certain about the disk-space safeguard: cache limits are typically configured as a maximum cache size, but that would have meant running du over the whole cache directory each time, so this checks free disk space instead (see the sketch below).
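A sketch of what such a free-space-based safeguard could look like; available_space is a stand-in for whatever statvfs-style helper is actually used, and the eviction policy (oldest entries first) is only an assumption:

```rust
use std::{fs, io, path::Path};

// Delete the oldest per-publisher cache directories until enough disk space is free.
fn purge_old_cache_entries(
    cache_root: &Path,
    min_free_bytes: u64,
    available_space: impl Fn(&Path) -> io::Result<u64>,
) -> io::Result<()> {
    // collect cache entries sorted by modification time, oldest first
    let mut entries: Vec<_> = fs::read_dir(cache_root)?
        .filter_map(|entry| entry.ok())
        .filter_map(|entry| {
            let mtime = entry.metadata().ok()?.modified().ok()?;
            Some((mtime, entry.path()))
        })
        .collect();
    entries.sort();

    for (_mtime, path) in entries {
        if available_space(cache_root)? >= min_free_bytes {
            break;
        }
        fs::remove_dir_all(path)?;
    }
    Ok(())
}
```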

@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels Jun 19, 2023
@syphar syphar added S-blocked Status: marked as blocked ❌ on something else such as an RFC or other implementation work. S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Oct 4, 2023
@syphar (Member Author) commented Oct 4, 2023

In the meantime we got a much bigger EC2 machine for the docs.rs server, which improved build speed a lot. After also adding another build process, I'm not sure any more whether this change is worth the effort, since the potential improvement was always limited to workspace builds.

For now I would park this and focus on other topics I see as more critical, and revisit it when it becomes a problem again.
