Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer to use local command with pip dependency? #1073

Closed
benjaoming opened this issue Oct 16, 2019 · 9 comments
Closed

Prefer to use local command with pip dependency? #1073

benjaoming opened this issue Oct 16, 2019 · 9 comments

Comments

@benjaoming
Copy link

benjaoming commented Oct 16, 2019

How we install black for pre-commit hooks

CC: @asottile

As per current README, a pre-commit hook is shown that will fetch the full psf/black git repository and invoke setup.py in order to install the black executable for the pre-commit hook.

I would like to recommend that installing the .whl from PyPi is faster and requires less bandwidth.

Moreover, using a locally available black command saves the developer double-trouble, i.e. installing black in both the ~/.cache/pre-commit and in some other dev environments. This might not look good as the default option (mixing pre-commit environment with development/deployment environments), but it's a very nice option for people who prefer having few environments.

Summary

Using PyPi to install, takes 6 CPU seconds, whereas the repo-based install takes 10 CPU seconds.

According to nethogs, using the repo fetches about 3.7MB of data, invoking /usr/lib/git-core/git-remote-https. Using pip and a completely clean ~/.cache/pip, 309 KB is fetched (black and dependencies). With pip cache already populated, just 12 KB is fetched from the network.

Some advantages of using PyPi releases with pip:

  • Speedier bootstrapping of linting and development environments
  • Pre-commit hooks are often run in CIs for extra checks (because some developers don't install the hooks!) - this would mean faster CI
  • Invoking pip also uses pip's system cache, so installing the same version of black several times won't use network
  • The developer can easily and transparently customize the black command args directly by modifying entry: black --check.
  • Less internet traffic means reduced CO2 emissions etc.
  • Specifying black version number for consistent linting environments

Option 1: Current repo-based configuration

repos:
-   repo: https://github.com/psf/black
    rev: stable
    hooks:
    - id: black
      language_version: python3.6

Option 2: A local command with a pip dependency

-   repo: local
    hooks:
    -   id: black
        name: black
        entry: black --check
        language: python
        types: [python]
        additional_dependencies: ['black']
        language_version: python3.6

Option 3: Just use black from the current environment

This requires runnning pip install black before executing pre-commit hooks.

-   repo: local
    hooks:
    -   id: black
        name: black
        entry: black --check
        language: system
        types: [python]

Benchmarks

Option 1:

time gc         
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
black....................................................................Failed
hookid: black

error: cannot format test.py: Cannot parse: 1:7: syntax error
All done! 💥 💔 💥
1 file failed to reformat.

git commit -v  7.44s user 0.95s system 83% cpu 10.052 total

Option 2:

time gc         
[INFO] Initializing environment for local:black.
[INFO] Installing environment for local.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
black....................................................................Failed
hookid: black

error: cannot format test.py: Cannot parse: 1:7: syntax error
All done! 💥 💔 💥
1 file would fail to reformat.

git commit -v  5.38s user 0.53s system 97% cpu 6.084 total
@benjaoming
Copy link
Author

benjaoming commented Oct 16, 2019

I forgot to mention: This is intended as a documentation issue.

Updated: Added to the list of advantages that this method can pin the black version, using for instance additional_dependencies: ['black==19.3b0'] for consistent linting environments.

@asottile
Copy link
Contributor

first off, you're correct in that the current docs are incorrect -- this is covered in #420 (black shouldn't suggest rev: stable as this is not a supported workflow)

let's address Option 3 first, this misses the idea of the framework entirely so I'd like to address it and eliminate it first :) -- the whole idea of the framework is that you clone a repo and don't need to worry about installing N tools to deal with linting and code formatting -- the framework installs them for you (and can properly manage the versions etc.). The consistent versioning and tool availability is the selling point -- no need to futz with globally installed things, virtualenv activation, system configuration management, fiddly bash scripts, etc. etc.

next let's address the other (which works, but has some pretty major downsides)

the first is that the repository configuration provides reasonable and correct defaults. for instance your two local configurations both suffer from a pretty bad bug that was fixed in this repository -- they create resource starvation due to over-zealous multiprocessing. the repository inclusion received this fix and everyone received this fix automatically. Instead of sitting upon a well established and battle tested base you've got to copy paste your configuration (and you're unable to reap the benefits when improvements are made to the base!).

the second is update ergonomics, with the additional_dependencies approach pre-commit autoupdate cannot help you in upgrading versions automatically (whereas it can with rev based versions) -- this is because the strings in that list are passed directly to pip or other tools and may contain options / etc. (and not always or even necessarily dependencies). pre-commit autoupdate will neatly update your versions (here's an example) making it a snap to pre-commit autoupdate && pre-commit run --all-files && git commit -am 'updated linters / formatters'

the third is your comparison of speed and bandwidth, from my tests I'm not seeing as much of a difference (perhaps it's because I'm using a new enough version of git which can take advantage of shallow clones) -- but even then I don't think 3 seconds is really breaking the bank. Plus, in almost all situations you should be hitting a cache (pre-commit will reuse environments, and it's fairly trivial to add ~/.cache/pre-commit to a travis-ci cache). so ~usually installing a hook environment you've already used will not even leave the network! (pip still reaches out to pypi even if there is a local cache for """security""" purposes that I don't completely understand). I'd also suggest you compare apples to apples (and on a repository of significant size) -- you can compare the two directly by utilizing pre-commit install-hooks and then pre-commit run --all-files -- benchmark both with and without caches (pip, pre-commit, etc.).

the fourth I'd like to discount is the "easily customizable" part -- that's already a built in feature of the current form you simply override entry: or args: and you get what you want:

    -   id: black
        args: [--check]  # or `entry: black --check`

the last, and perhaps the most important advantage of the repository approach is that since it's git you can easily do git things like: use an unreleased version (simply point rev: at a git sha -- no longer need to pester a maintainer to make a release!), use a fork (simply point repo at your own fork of a repository)

note that pre-commit chooses git as a lowest common denominator because while it is a tool written in python and can run tools written in python it is not a python-specific tool. pre-commit aims to target many different programming languages and using repositories provides a good based to build upon without requiring language-specific tooling. repo: local and especially language: system are meant as escape hatches from the """normal way""" and have significant drawbacks (but are provided for those exceptional cases).


thanks for reading, hope this clears some things up

@zsol
Copy link
Collaborator

zsol commented Oct 16, 2019

thanks for the response @asottile, I tend to agree

@zsol zsol closed this as completed Oct 16, 2019
@benjaoming
Copy link
Author

benjaoming commented Oct 17, 2019

Hi @asottile

Thanks for taking time for the thorough response.

Maybe the issue of referencing tools through PyPi releases is a more fundamental upstream one (for instance pertaining autoupdate). I was assuming that it could be fixed in a project (black), but trying out an implementation (see below), I get the feeling that my final proposal is too hacky.

Notes and analysis

Noting that the issue is now closed, I wanted to clarify a few things and refine the proposal.

  1. You are right about Option 3, so scratch that 👍
  2. Git repos are cached locally across pre-commit environments, thanks for making me aware 👍
  3. This was initially meant as a documentation issue, but I now see that it would need a maintained version of .pre-commit-hooks.yaml in some repository in order to align with the goals of pre-commit (such as updates and changes to defaults) 👍
  4. Black and pre-commit are pieces in stacks. In order to bring down resource usage and environmental footprints, I think it's significant to consider things that are in magnitudes of seconds of runtime and megabytes of transfers. 3 seconds of CPU runtime is significant in the larger puzzle, so is saving 3.4 MB of transfer. The latter can take many seconds, depending on where in the world you are setting up a dev environment.
  5. The specific defaults in the entry: black --check wasn't meant to be scrutinized, everything should be customizable through args if configuration is maintained upstream 👍

New proposal (mix of Option 1 and 2)

So to synthesize all this, here's a final proposal:

Externalize .pre-commit-hooks.yaml to live in psf/black-pre-commit with this content:

-   id: black
    name: black
    description: 'Black: The uncompromising Python code formatter'
    entry: black
    language: python
    language_version: python3
    require_serial: true
    types: [python]
    additional_dependencies: ['black==19.3b0']

pre-commit autoupdate would pick up changes committed to the repo with newer versions pinned. An potential newfeature to pre-commit could perhaps mean that autoupdate would have a look at newer releases available on PyPi.

Problems

  • Getting rid of the old .pre-commit-hooks.yaml in the psf/black repo since it's referenced downstream.
  • Current mechanism is hacky, for instance it requires a dummy setup.py in the external repo
  • New releases would require updates to the pre-commit repo.. no one would like to do that manually, I presume.

New Benchmark

As advised, I've benchmarked the solution only with install-hooks so it's not cluttered by anything else. Here are the benchmarks of the previous Option 1 and 2.

New option (configuration stored in external repo)

➜  test git:(master) ✗ time pre-commit install-hooks
[INFO] Initializing environment for https://github.com/benjaoming/black-pre-commit-pip.
[INFO] Initializing environment for https://github.com/benjaoming/black-pre-commit-pip:black==19.3b0.
[INFO] Installing environment for https://github.com/benjaoming/black-pre-commit-pip.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pre-commit install-hooks  4.92s user 0.76s system 79% cpu 7.169 total

Option 1 (configuration stored in main repo)

➜  test git:(master) ✗ time pre-commit install-hooks
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pre-commit install-hooks  7.40s user 0.88s system 83% cpu 9.929 total

I tried installing it from a different git repo on the same system, and you are right that it's all cached.

I'm not sure about how shallow clones work, but I would expect this to get worse as the repository history grows.

Option 2 (local config referencing a pip dependency)

➜  test git:(master) ✗ time pre-commit install-hooks
[INFO] Initializing environment for local:black.
[INFO] Installing environment for local.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pre-commit install-hooks  5.11s user 0.72s system 95% cpu 6.107 total

@benjaoming
Copy link
Author

(updated after implementing an external test repository)

@asottile
Copy link
Contributor

I'm not sure about how shallow clones work, but I would expect this to get worse as the repository history grows.

yeah you're not really expecting right :) -- a shallow clone only downloads the contents of the revision you ask for -- the whole point is that it pulls no history. over time this path will be more and more common in pre-commit as people upgrade their OS / etc.

I think maintaining two repos when one is sufficient is a lot of work to ask someone to do when it's not really saving anyone significant time or bandwidth (it's like the equivalent of loading(or not) one webpage written in react)

@benjaoming
Copy link
Author

@asottile

So this is what I make of your response:

In the future, git shallow clones will greatly reduce the data fetched from git repositories. Thus, the performance gains from installing .whl releases from PyPi are unsubstantial.

@asottile
Copy link
Contributor

they're already unsubstantial but will be even more so, yes

@benjaoming
Copy link
Author

benjaoming commented Oct 17, 2019

I don't agree at all, but that can be discussed in another forum some other day :)

Am a happy user of pre-commit and intended to contribute with a positive improvement to gain speed and reduce resource usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants