Cache own code with integrated cargo-sweep-like solution #43

Open
Madoshakalaka opened this issue Jan 8, 2022 · 10 comments
Open

Cache own code with integrated cargo-sweep-like solution #43

Madoshakalaka opened this issue Jan 8, 2022 · 10 comments

Comments

@Madoshakalaka

cargo-sweep is a tool meant to be used around any build actions: it timestamps the cache and then removes entries that were not visited. It does so by checking the atime (access time) of the files in the target directory.

```
cargo sweep -s

# <any number of cargo build, cargo test, etc...>

cargo sweep -f
```

With this method, incremental builds can be left on, and builds of one's own code can be cached for greater speed.
Unfortunately, cargo-sweep publishes neither prebuilt binaries nor a GitHub Action.
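Assuming cargo-sweep is already installed on the runner (e.g. via `cargo install cargo-sweep`), the pattern above could be sketched as workflow steps like these (the surrounding workflow is illustrative):

```yaml
# Hypothetical GitHub Actions steps wrapping builds with cargo-sweep
- run: cargo sweep --stamp   # record a timestamp file (same as -s)
- run: cargo build
- run: cargo test
- run: cargo sweep --file    # remove files older than the stamp (same as -f)
```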

#37 mentions the benefits of not caching one's own code:

  1. it would only cache the first invocation, when it generates a new cache key.
  2. with time the code will diverge more, causing longer builds.
  3. incremental compilation needs a lot of disk space, and thus cache space and time to fetch & extract, lowering the benefits.

Point 1 is no longer an issue with the kind of smart diffing that cargo-sweep does.
@Madoshakalaka
Author

Can we make this an option?

@Swatinem
Owner

The first problem still remains, as you cannot update the cache, so it will still be ineffective. But a better solution for cleaning the cache (based on access times, or whatever) would be nice for sure. I'm not quite up to date on what the current way of using binary assets in actions looks like; I would need to look that up.

@Madoshakalaka
Author

How about using the source code as part of the key too? With the vanilla actions/cache it would be something like:

```yaml
- uses: actions/cache@v2
  with:
    path: |
      # ...
    key: cargo-${{ runner.os }}-${{ hashFiles('my-crate-directory/src/**') }}-${{ hashFiles('Cargo.lock') }}
    restore-keys: |
      # ...
```

To make this work for workspaces, rust-cache could allow users to specify the member crates to be hashed.

@nipunn1313

I use a workspace with many (~20) crates in a larger project. I'd certainly benefit from having some of the lower-level dependencies among our own crates cached. We use rust-cache to cache all our external dependencies, but there's still 2-3 minutes' worth of build at the end to build our 20 crates.

I could imagine a solution to issue (1) by including, say, the timestamp of the week in the cache key, such that the cache is cleared every week (or day). I could also envision a solution where the src is part of the hash key, but it would be important for the restore-key to also be willing to try older ones, since src changes a lot.
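The week-based key idea could be sketched with vanilla actions/cache like this (the step id and key layout are made up for illustration):

```yaml
# Hypothetical: embed the ISO week in the key so the cache rolls over weekly
- id: week
  run: echo "week=$(date +%G-%V)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v4
  with:
    path: target
    key: cargo-${{ runner.os }}-${{ steps.week.outputs.week }}-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      cargo-${{ runner.os }}-${{ steps.week.outputs.week }}-
```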

It would be nice to be able to opt in to caching one's own workspace code, for projects willing to make a different tradeoff than the default.

Something like cargo sweep would be a cherry on top, but for us, saving 2-3 minutes in the common case would be huge.

@adam-azarchs

I think in this case the right answer is probably to have separate cache blobs for external vs. internal dependencies. External dependencies only change when (certain) changes are made to Cargo.toml or Cargo.lock, but internal dependencies change due to changes to source files (which, keep in mind, can absolutely include files other than .rs files, because e.g. include_str! is a thing).

However, there's a separate problem, which is that cargo will probably ignore the cache because its idea of a "hash" of the source files includes the file modification timestamp, and that timestamp will be different every time you check out the code. This is a problem in general, but particularly for such internal dependencies. See rust-lang/cargo#6529.

@polarathene

For reference (presently only available in nightly): https://blog.rust-lang.org/2023/12/11/cargo-cache-cleaning.html

@torokati44

torokati44 commented Feb 27, 2024

> We use rust-cache to cache all our external dependencies, but there's still 2-3 minutes worth of build at the end to build our 20 crates.

+1

> Something like cargo sweep would be a cherry on top, but for us, saving 2-3 minutes in the common case would be huge.

+1

> The first problem still remains, as you cannot update the cache, so it will still be ineffective.

By always pushing a new cache and specifying a different (shorter) restore key, the latest cache with a matching prefix will be fetched.
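In actions/cache terms, that pattern might look like this (the key layout is illustrative):

```yaml
# Hypothetical: a unique key forces a fresh cache save on every run,
# while the shorter restore-keys prefix pulls in the most recent match
- uses: actions/cache@v4
  with:
    path: target
    key: cargo-${{ runner.os }}-${{ github.run_id }}
    restore-keys: |
      cargo-${{ runner.os }}-
```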

In our (desired) model, workflow runs on master (after PRs got merged, or on a nightly schedule) would be the "source" of the cache - both for external dependencies; and for our own code (multiple large crates in the repo), with incremental compilation enabled - and the PR runs would be "consumers" of the cache, for speedy merge-ability checks.
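One way to sketch that producer/consumer split is with the actions/cache restore and save sub-actions (the branch name and keys are assumptions):

```yaml
# Hypothetical: PR runs only restore; runs on master also save
- uses: actions/cache/restore@v4
  with:
    path: target
    key: cargo-${{ runner.os }}-${{ github.sha }}
    restore-keys: |
      cargo-${{ runner.os }}-
# ... build & test steps ...
- if: github.ref == 'refs/heads/master'
  uses: actions/cache/save@v4
  with:
    path: target
    key: cargo-${{ runner.os }}-${{ github.sha }}
```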

@polarathene

polarathene commented Feb 29, 2024

> but there's still 2-3 minutes worth of build at the end to build our 20 crates.

Does that extra time not reproduce locally? (may be less on a local system that is more powerful than a runner)

Recently I was building Vector locally and exploring caching. Running cargo build again after the first build (without changing anything) still took quite a while. This was with 680+ crates compiled.

  • Some of it came from LTO (which is enabled by default, unless you have opt-level = 0 / codegen-units = 1, or set lto = "off").
  • The other part I'm not sure about (where that time was being spent), but incremental = true brought it down to 6 seconds.

I was still at 50s (with the implicit default lto = false, aka crate-local thin LTO?); 44s of that was due to the implicit LTO.

When building the full project (900+ crates, instead of my reduced feature set build):

  • A build (with the target cache) is still 3 minutes with that implicit LTO, or only 36s with lto = "off" (with incremental = true, otherwise 3m14s). So over 6 minutes total with implicit LTO and no incremental builds enabled.
  • A build (without the target cache) is 9m25s, or 7m30s with lto = "off".
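For reference, the profile settings being compared above would look roughly like this (a sketch; tune for your own project):

```toml
# Cargo.toml — hypothetical profile matching the timings above
[profile.release]
lto = "off"          # disable even the implicit crate-local thin LTO
incremental = true   # incremental compilation (off by default for release)
```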

You can also use linker-plugin-lto with Clang as the linker driver and Mold (the actual linker, via plugin) to cache LTO artifacts as well (you need to ensure your clang and Rust toolchain versions are compatible):

```toml
# .cargo/config.toml
# The `build` section applies to all targets:
[build]
rustflags = [
  # Set a linker like clang (affects link-arg compatibility):
  "-C", "linker=clang",
  # This one is required to enable the LTO caching option:
  "-C", "linker-plugin-lto",
  # This is important for linker-plugin-lto to work (note: `-flto=thin` / `-flto=full` have no effect here?):
  "-C", "link-arg=-flto",
  # By default this value is 0, which is 1 thread per physical core.
  # Use `all` to get 1 thread per logical core, using the full CPU like rust normally would:
  "-C", "link-arg=-flto-jobs=all",
  "-C", "link-arg=-Wl,--thinlto-cache-dir=lto-cache",
  # The actual linker to use:
  "-C", "link-arg=-fuse-ld=mold"
]
```

Some of the link-arg values vary based on the configured linker (-C link-arg=-fuse-ld=...); for mold, the link-arg options match those documented for lld. Worth keeping in mind if you use something else instead of mold (lld / ld), or don't use clang (gcc is used then, AFAIK, which also has different options).

This adds a little overhead, but with the cache available, you get faster rebuilds.

  • 13s total for my reduced feature set build (6s via incremental = true + 7s with -C linker-plugin-lto).
  • 56s for the full build (36s via incremental = true + 20s with -C linker-plugin-lto).

Over 70% reduction! 🎉


Give it a try, perhaps it'll make a difference for you? :)

> However, there's a separate problem, which is that cargo will probably ignore the cache because its idea of a "hash" of the source files includes the file modification timestamp, and that timestamp will be different every time you check out the code.

This concern has some workarounds:

Earthly has a feature to optionally preserve timestamps when copying from host to image, noting that it's important for the incremental = true cache to work correctly. The linked concern about git-restore-mtime may not apply to GitHub Actions, so it may be a valid workaround 😅
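If the mtime issue does apply, one sketch of a workaround is restoring timestamps from git history with the third-party git-restore-mtime script (the download URL is an assumption; check the MestreLion/git-tools repo for the current location):

```yaml
# Hypothetical: restore mtimes so cargo's fingerprints survive fresh checkouts
- uses: actions/checkout@v4
  with:
    fetch-depth: 0   # full history is needed to recover commit times
- run: |
    # NOTE: script path/branch below is an assumption, verify before use
    curl -sSfL -o /usr/local/bin/git-restore-mtime \
      "https://raw.githubusercontent.com/MestreLion/git-tools/main/git-restore-mtime"
    chmod +x /usr/local/bin/git-restore-mtime
    git restore-mtime
```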


> Unfortunately, cargo-sweep doesn't publish any binary builds nor GitHub Actions.

cargo binstall can work?:

```sh
# Grab and extract a static release of cargo-binstall:
curl -sSfL "https://github.com/cargo-bins/cargo-binstall/releases/latest/download/cargo-binstall-$(uname -m)-unknown-linux-musl.tgz" | tar -xz

# Install cargo-binstall as a proper `cargo binstall` subcommand (if useful afterwards), and install cargo-sweep.
# These are installed to the default $CARGO_HOME (e.g. /usr/local/cargo/bin/).
# If that location is part of the PATH env, the installed binaries remain usable even if $CARGO_HOME changes.
./cargo-binstall --quiet --no-confirm cargo-binstall cargo-sweep
rm ./cargo-binstall
```

@NobodyXu
Contributor

NobodyXu commented Mar 1, 2024

> cargo binstall can work?:

You could use taiki-e/install-action, which automatically falls back to cargo-binstall for crates it doesn't natively support.

It would install cargo-binstall to $CARGO_HOME/bin so that it can be reused by subsequent taiki-e/install-action invocations.
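Usage is a short step (sketch; the `tool` input accepts a comma-separated list):

```yaml
# Install cargo-sweep via taiki-e/install-action
- uses: taiki-e/install-action@v2
  with:
    tool: cargo-sweep
```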

@NobodyXu
Contributor

NobodyXu commented Mar 1, 2024

I totally had no idea that an LTO cache exists. It would be great if zig cc supported it, since we use it.

I wonder if rustc/cargo would expose such an option.


7 participants