Local Global Cache of Packages #133

Closed
JordanMartinez opened this issue Mar 12, 2019 · 4 comments · Fixed by #188

@JordanMartinez
Collaborator

Low priority.

Coming from this comment:

I'll also note that there's no technical issue against having a local global cache of packages, it's just that no one opened an issue about it yet 😄 (so please open one if you think it's a good thing to have)

Similar to bower's caching mechanism, spago could provide a cache of already-downloaded dependencies, so that they can be reused across multiple projects on the same computer.
Or something like that.

This would help lower the bandwidth my learning repo uses.

I'm not sure how hard this is to implement, nor whether it's a good idea once everything is taken into account. At the very least, this can be further discussed. I started looking into Nix, but still haven't given it the attention it deserves.

@f-f
Member

f-f commented Mar 15, 2019

I think this feature would be very much welcome!

Some design considerations:

  • I think trying to use Nix would cause more problems than benefits, given that it doesn't run on Windows (and I had issues with it on macOS too, but maybe I'm just not good at it)
  • Instead a simple approach with having a "global cache" directory in $HOME/.cache/spago could work well. The approach for "find a suitable cache directory" could be the same that Dhall uses for its cache
  • While we're at it we should really fix how the repos are downloaded.
    Right now we create a thread for every dependency, and each thread creates a directory (named after the tag) in which we put the repo. The problem is that if an exception happens we have to clean up, and this doesn't always happen (see the logic here)
    Instead we should fetch the repo to a temp directory (globally), and once that's done we move it to the correct place, so we don't have to do any cleanup and we can be interrupted
  • There should be a command to delete the global cache. Something like spago clean-global-cache
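
To make the "find a suitable cache directory" point concrete, here is a minimal sketch. It assumes an XDG-style lookup (`$XDG_CACHE_HOME` first, then `$HOME/.cache`), which is how I understand Dhall's cache lookup to work; the function name `global_cache_dir` and the `spago` subdirectory name are hypothetical, not anything spago actually ships.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def global_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory a global package cache could live in.

    Assumed lookup order (modelled on the XDG convention):
      1. $XDG_CACHE_HOME/spago, if XDG_CACHE_HOME is set and non-empty
      2. $HOME/.cache/spago
    """
    # Accepting the environment as a parameter keeps the function testable.
    env = os.environ if env is None else env

    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "spago"

    home = env.get("HOME")
    if home:
        return Path(home) / ".cache" / "spago"

    # Neither variable is set (e.g. some Windows setups): give up loudly
    # rather than guess a location.
    raise RuntimeError("no suitable cache directory: set XDG_CACHE_HOME or HOME")
```

A Windows-friendly version would need one more fallback (for example `%LOCALAPPDATA%`); that is left out here because nothing in this thread settles it.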

@f-f f-f changed the title Feature Request: Local Global Cache of Packages Local Global Cache of Packages Mar 15, 2019
@f-f f-f added this to the 1.0 milestone Mar 16, 2019
@JordanMartinez
Collaborator Author

Instead a simple approach with having a "global cache" directory in $HOME/.cache/spago could work well. The approach for "find a suitable cache directory" could be the same that Dhall uses for its cache

I think that makes the most sense.

While we're at it we should really fix how the repos are downloaded.
Right now we create a thread for every dependency, and each thread creates a directory (named after the tag) in which we put the repo. The problem is that if an exception happens we have to clean up, and this doesn't always happen (see the logic here)
Instead we should fetch the repo to a temp directory (globally), and once that's done we move it to the correct place, so we don't have to do any cleanup and we can be interrupted

Sounds like you should open a separate issue for this.

There should be a command to delete the global cache. Something like spago clean-global-cache

Rather than brainstorm ideas, let's start broader in our thinking here.

Context:

  • spago offers guarantees about build reproducibility. A caching mechanism could potentially screw this up (i.e. be like bower in some situations where one must call rm -rf bower_modules/ output/ to fix things)
  • this caching feature is still desirable because it would reduce bandwidth usage

Principles

  • we'd like to keep the command surface small, so that it's easy to know which commands to use to accomplish different things. AFAICT, spago commands break down into 3 groups right now:

    1. commands related to the project (build, test, repl, etc.)
    2. commands related to the package set (verify, freeze, etc.)
    3. commands related to psc-package compatibility (psc-package-* commands)

    Thus, this command would fall under the package set grouping.

  • it should be possible to cache multiple versions of the package set (i.e. versions of the package set for different compiler versions and versions of the package set of the same compiler versions)

  • it should be possible to cache overrides/additions

  • while the above two points are desirable, the higher priority should be on continuing to guarantee build reproducibility. In situations where the cache could screw this up, gains from build reproducibility are valued over gains obtained via caching. Thus, I think we should follow these principles:

    • only use a local "global cache" for the upstream package set
    • only use a project-local cache for overrides/additions (which is what is already done now)

    Now, one might wonder, "Why not cache additions as well?" To answer that, we'd argue: the package set is created from upstream, which is modified by overrides, which is then modified by additions. If an addition uses one of the overridden packages, it is no longer "safe" to reuse in a project whose local package set does not include the same override. To fix this, we could split additions into two kinds: additions that use only packages in the upstream package set, and additions that use packages that might appear in both upstream and overrides. Such a decision makes the user interface a bit harder to understand and use for very little benefit. How often are you adding the same package to a project-local package set? If you're doing it a lot, perhaps you should just add it to the upstream package set for everyone instead.
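
The layering argument above can be sketched roughly as follows. This is an illustration only: package sets are modelled as plain dictionaries, and both function names are hypothetical. It only checks direct dependencies, so it understates the problem for transitive ones.

```python
from typing import Any, Dict, Set

Package = Dict[str, Any]          # e.g. {"version": "v4.1.0", "dependencies": [...]}
PackageSet = Dict[str, Package]   # package name -> package


def build_package_set(
    upstream: PackageSet, overrides: PackageSet, additions: PackageSet
) -> PackageSet:
    """Apply overrides on top of upstream, then add the additions."""
    # Copy so the caller's upstream set is never mutated.
    merged = {name: dict(pkg) for name, pkg in upstream.items()}
    for name, patch in overrides.items():
        # An override patches an existing upstream package; overriding an
        # unknown package raises KeyError in this sketch.
        merged[name] = {**merged[name], **patch}
    for name, pkg in additions.items():
        merged[name] = dict(pkg)
    return merged


def additions_depending_on_overrides(
    overrides: PackageSet, additions: PackageSet
) -> Set[str]:
    """Names of additions that directly depend on an overridden package.

    These are the additions that would be unsafe to share with a project
    that does not carry the same overrides.
    """
    overridden = set(overrides)
    return {
        name
        for name, pkg in additions.items()
        if overridden & set(pkg.get("dependencies", []))
    }
```

With an override of `effect` and an addition `my-lib` that depends on `effect`, the second function reports `my-lib`, which is exactly the case where a shared cache entry could differ between projects.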

Outcome:

  • caching mechanism should work by default in the background without the user needing to use its command directly.
  • should the caching mechanism cause problems, the user is notified what could be causing the issue and possible ways to fix it.

Possible UX:

# Using a modified version of Fabrizio's command name,
# list all the package set versions we currently have cached
# in the local "global cache"
$ spago global-cache --list
# or maybe without the "--list" flag
$ spago global-cache

# When installing packages for the first time, spago should
# default to installing the `upstream` part to the global cache,
# not the project local `.spago/` folder
$ spago install

# to disable this, one could use the below flag
$ spago install --no-global-caching

# When we want to clean the global cache, we can clean
# a specific version:
$ spago global-cache --clean

# what appears below is an idea for what would be printed to the console
You currently have these versions of package sets cached on your computer.
Indicate which you would like to delete by their corresponding number.
To delete multiple ones at the same time, separate their numbers by a space

#    PureScript Compiler  Package Set Date
--   -------------------  -----------------
 1   0.12.1               2018-02-01
 2   0.12.3               2019-01-01
 3   0.12.3               2019-01-08
 4   0.12.3               2019-02-04
 5   0.12.3               2019-03-01
.....
10   0.13.4               2019-09-05
> 1 2 5
You have indicated that you want to delete the following package sets

#    PureScript Compiler  Package Set Date
--   -------------------  -----------------
 1   0.12.1               2018-02-01
 2   0.12.3               2019-01-01
 5   0.12.3               2019-03-01

Is this correct (y=yes, all else = "no")
>y
Deleting the caches of the selected package sets. This may take a while...
Finished.

Now, I don't know how much of the above idea is feasible or what complexities it would add to this program.

@f-f
Member

f-f commented Mar 17, 2019

@JordanMartinez great analysis!

Some clarifications:

Sounds like you should open a separate issue for this.

"Fixing how repos are downloaded" is definitely in the scope of this issue. It has to be changed anyway in order to implement this, so we might as well point out what the problems with the current solution are, so that the new one doesn't have them
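
For reference, the "fetch to a temp directory, then move it into place" idea could look roughly like this. It is a sketch under assumptions: `fetch_atomically` and the `download` callback are hypothetical names, and the temp directory is created next to the destination (rather than in a global temp location) so that the final rename stays on one filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def fetch_atomically(dest: Path, download: Callable[[Path], None]) -> Path:
    """Populate `dest` by downloading into a temp dir and renaming it.

    `dest` either does not exist or is complete; an interrupted download
    never leaves a half-filled `dest` behind.
    """
    if dest.exists():
        return dest  # already fetched

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = Path(tempfile.mkdtemp(prefix=".tmp-", dir=dest.parent))
    try:
        download(tmp)  # may raise or be interrupted at any point
        try:
            os.rename(tmp, dest)  # the only step that makes the package visible
        except OSError:
            # Another thread or process may have won the race; that is fine
            # as long as the destination is now there.
            if not dest.exists():
                raise
    finally:
        # Best-effort removal of the temp dir. A leftover one is harmless
        # because nothing ever reads from it.
        shutil.rmtree(tmp, ignore_errors=True)
    return dest
```

The cleanup in `finally` is optional for correctness: because readers only ever look at `dest`, a crash that skips it leaves stale temp directories but no broken packages.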

it should be possible to cache multiple versions of the package set (i.e. versions of the package set for different compiler versions and versions of the package set of the same compiler versions)

Downloads happen "by package tag" and not "by package set tag".
I.e. if the version of a package is the same across package sets, should we cache it twice?
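
A small sketch of what "by package tag" means for the cache layout, assuming the cache is keyed on the pair (package name, tag). The helper names and the `<root>/<name>/<tag>` layout are hypothetical.

```python
from pathlib import Path
from typing import Dict, Set, Tuple

# package set tag -> {package name -> package tag}
PackageSets = Dict[str, Dict[str, str]]


def cache_entries(package_sets: PackageSets) -> Set[Tuple[str, str]]:
    """Distinct (package, tag) pairs needed across all the package sets.

    A package whose tag is unchanged between two package sets shows up
    once, so it is downloaded and stored once.
    """
    return {
        (name, tag)
        for packages in package_sets.values()
        for name, tag in packages.items()
    }


def cache_path(cache_root: Path, name: str, tag: str) -> Path:
    """Location of one cached package: <root>/<name>/<tag>."""
    return cache_root / name / tag
```

Under this layout there is no per-package-set directory at all, which is why listing or deleting the cache "by package set" would need extra bookkeeping.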

  • only use a local "global cache" for the upstream package set
  • only use a project-local cache for overrides/additions

There are some aspects that make all of this really hard:

  • since we use Dhall to resolve packages, we don't distinguish between upstream and local packages (we could, but it's hard and fragile) - i.e. everything is just a package. I wouldn't recommend trying to distinguish between "upstream" and "local"
  • However, a distinction we can make is Local vs Remote package.
    So we could say "let's just cache the remotes", but then the problem you encounter is that git is not immutable. This matters because we support references to both "tags" and "branches". Tags are fairly immutable, but branches are not. So if you have an override that points to a branch and that gets globally cached, then your project won't pick up the changes when you add commits to that branch.

Given the above, I propose that the global cache should work only for packages that:

  1. are GitHub repos
  2. and point to tags

This is so that we can query their tags via an API call, so we can reliably know if a reference is a branch or a tag.

Implementation note: we should keep a small "GitHub index" in the global cache, that caches which tags are available for every package (so you don't have to look them up every time you install a project)
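
A rough sketch of the eligibility check, assuming the "GitHub index" is an in-memory mapping from (owner, repo) to the set of known tags. How that index gets filled (the API call) is deliberately left out, and all names here are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple
from urllib.parse import urlparse

# (owner, repo) -> tags known to exist for that repo
TagIndex = Dict[Tuple[str, str], Set[str]]


def github_owner_repo(repo_url: str) -> Optional[Tuple[str, str]]:
    """Extract (owner, repo) from an http(s) GitHub URL, else None.

    Local paths and other hosts return None. SSH-style URLs
    (git@github.com:owner/repo) also return None in this sketch, which
    errs on the side of not caching.
    """
    parsed = urlparse(repo_url)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc.lower() != "github.com":
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        return None
    owner, repo = parts
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def is_globally_cacheable(repo_url: str, ref: str, tag_index: TagIndex) -> bool:
    """True only for GitHub repos whose ref is a known tag."""
    key = github_owner_repo(repo_url)
    if key is None:
        return False
    # A ref that is not in the index is treated as a branch (or unknown),
    # so it stays in the project-local cache.
    return ref in tag_index.get(key, set())
```

Anything that fails the check simply falls back to today's project-local download, so a stale or incomplete index costs bandwidth but not correctness.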

Possible UX

I would prefer flags to alter command behaviour instead of interactive mode, mainly because:

  • interactive doesn't work for CI, so you have to also support the flag-based flow, doubling the implementation work
  • all of the information that the program gives to the user is "hidden" behind the interactive choices. Since you have to support flags anyways, they'll need to be documented, possibly duplicating information (and the problem with duplicated information is that you have to keep it in sync)

The reasons why I proposed spago clean-global-cache are that:

  • its inner workings are really simple: it nukes the global cache
  • its UX is really simple, users don't have to choose or know things
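
To show how little is involved, a "nuke the global cache" operation could be as small as the sketch below. The function name is hypothetical and the cache directory is passed in explicitly, so nothing here assumes where spago would actually keep it.

```python
import shutil
from pathlib import Path


def clean_global_cache(cache_dir: Path) -> bool:
    """Delete the whole global cache directory.

    Returns True if something was removed, False if there was no cache.
    """
    if not cache_dir.exists():
        return False
    # No selection and no prompts: everything under cache_dir goes away
    # and is re-downloaded on the next install.
    shutil.rmtree(cache_dir)
    return True
```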

@JordanMartinez
Collaborator Author

Downloads happen "by package tag" and not "by package set tag".
I.e. if the version of a package is the same across package sets, should we cache it twice?

Oh.... that's what was meant. Yeah, that makes a heck of a lot more sense.

Given the above, I propose that the global cache should work only for packages that:

  • are GitHub repos
  • and point to tags

This is so that we can query their tags via an API call, so we can reliably know if a reference is a branch or a tag.

That makes sense.

I would prefer flags to alter command behaviour instead of interactive mode

Also makes a lot of sense.

There's just a lot of sense-making going on in your comments.

@f-f f-f modified the milestones: 1.0, 0.8 Apr 18, 2019
@f-f f-f self-assigned this May 3, 2019
@f-f f-f mentioned this issue May 7, 2019
@f-f f-f modified the milestones: 0.8, 0.9 May 13, 2019
@f-f f-f closed this as completed in #188 May 28, 2019