Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

Open
jhenkens opened this issue Aug 20, 2020 · 24 comments
Open
Labels

Comments

@jhenkens
Copy link
Contributor

jhenkens commented Aug 20, 2020

diff_cmd = ('git', 'diff', '--no-ext-diff')
diff_before = cmd_output_b(*diff_cmd, retcode=None)
if not hook.pass_filenames:
filenames = ()
time_before = time.time()
language = languages[hook.language]
retcode, out = language.run_hook(hook, filenames, use_color)
duration = round(time.time() - time_before, 2) or 0
diff_after = cmd_output_b(*diff_cmd, retcode=None)
# if the hook makes changes, fail the commit

Hi all,

I am looking at implemented pre-commit to replace a harder-to-maintain custom bash script our company current uses. In my initial tests, pre-commit works great, except for the fact that it is slow. The majority of our hooks are readonly in nature, and run quite fast. However, on our repository, pre-commit is spending close to 1 second per hook performing a before/after diff.

I added some instrumentation, and in our test case with 10 executed hooks, the execution time of pre-commit is >7 seconds with the before/after diffs, and ~1 second with those diffs disabled. I will note this is an extremely large repository, with many thousands of files. Generally, git actions are somewhat slow on it.

In exploring issues related to perf, I had found #510, which seems to tackle a different problem - individual hooks that take a while, rather than the overhead of pre-commit itself. However, there was one comment that got me thinking, regarding a readonly attribute one could apply to a hook. I know - this metadata would only be as truthy as the author who wrote the hook, but if a hook was marked as readonly, pre-commit could theoretically skip the before/after hook diffs.

Any thoughts? Happy to contribute the PR if maintainers see this as valuable.

@asottile
Copy link
Member

this idea was contemplated at some point but was rejected as it would be difficult for two reasons:

  • plugin-based hooks wouldn't ever be able to set this accurately
  • if it was set innaccurately it would look like the framework's fault that it allowed through a commit
  • in tests it really only made a difference on windows (and even then, not a significant difference)

one thing you could pursue is reducing the amount of git diffs by half (instead of before/after, carry over the diff from the previous hook run). but even that wasn't pursued because it wasn't significant

@jhenkens
Copy link
Contributor Author

jhenkens commented Aug 20, 2020

I'm still converting more hooks for our use case, so our runtime is only going to deteriorate further. I'd argue that, for our case, it is substantial. Our current (hard to maintain and grok) hooks execute in the 1-2 second range.

I am running all tests on a Mac (though much of our development team is on Windows, so if it is even slower there than that is a compounding problem).

plugin-based hooks wouldn't ever be able to set this accurately

Perhaps I am mistaken on what a plugin-based hook is - I'm looking at https://pre-commit.com/#plugins, but it seems that it is entirely possible the author could incorrectly set it, but not that they wouldn't ever be able to set it correctly. If I write a validate-only hook, and publish it somewhere, I can, as the author, know it doesn't modify any files, and have it set as readonly.

if it was set innaccurately it would look like the framework's fault that it allowed through a commit

Could this be remedied by performing one "global" before/after diff for the entire run of pre-commit? For non-readonly hooks, we can still perform the before/after diffs, but then add one global check if any hooks were readonly, and fail/complain loudly if a change was improper? Perhaps run all readonly hooks first, with one set of diff, then continue to others?

one thing you could pursue is reducing the amount of git diffs by half (instead of before/after, carry over the diff from the previous hook run).

That would get us to roughly 4 seconds, versus 1 second. Definitely worth an attempt in my book.

@asottile
Copy link
Member

plugin-based hooks wouldn't ever be able to set this accurately

Perhaps I am mistaken on what a plugin-based hook is - I'm looking at https://pre-commit.com/#plugins, but it seems that it is entirely possible the author could incorrectly set it, but not that they wouldn't ever be able to set it correctly. If I write a validate-only hook, and publish it somewhere, I can, as the author, know it doesn't modify any files, and have it set as readonly.

things like flake8 where readonly would be most beneficial couldn't set this accurately as they are plugin systems themselves which could contain rewriting plugins

if it was set innaccurately it would look like the framework's fault that it allowed through a commit

Could this be remedied by performing one "global" before/after diff for the entire run of pre-commit? For non-readonly hooks, we can still perform the before/after diffs, but then add one global check if any hooks were readonly, and fail/complain loudly if a change was improper?

yeah that would be a potential trade off

@jhenkens
Copy link
Contributor Author

things like flake8 where readonly would be most beneficial couldn't set this accurately as they are plugin systems themselves which could contain rewriting plugins

Fair enough - I guess I need more coffee, given I couldn't think of that! In that case, my argument would be that the hook definition should not set readonly: true, but instead the per-repo configuration could override it, just as can be done with nearly every other hook property. If I know my configuration of flake8 should be readonly, I can set it as such.

@asottile
Copy link
Member

yeah the thing is nobody is going to do that, or worse everyone is going to cargo cult copy paste that everywhere :/

@asottile
Copy link
Member

I think for this to be a thing we'd need the following at least:

  • an --audit-readonly or something similar flag to run to debug when hooks set this incorrectly
  • overall before/after diffing to catchall incorrect readonlys
  • some smarter in-between diffing (rw hook followed by 3 ro hooks followed by rw hook should invoke ~4 diffs, validate readonly correctness at the 5th hook and skip the global hook) -- this is going to be very complex I imagine

hmm the other issue is how should readonly default, most things I believe are readonly, but the safest default is readonly: false to preserve the existing behaviour

ah and I just thought of another problem case, sometimes you can change the readonlyness by setting args, for example autopep8 -i vs just autopep8

@jhenkens
Copy link
Contributor Author

Yes - I agree. Can be quite complex in implementation. An additional global setting "run_readonly_hooks_first: true` could also benefit in reducing the number of in-between diffs. Unless I am mistaken, hooks run in the order they are defined, but given they may come from multiple repos, that can block manual ordering to obey without such a feature (unless there is some other sorting/ordering mechanism I am unaware of).

I think many tools/hooks will be left readonly: false by default, and up to the specific repo's implementation to set to true. However, some hooks, like check-added-large-files are always going to be readonly, and are easily set by the maintainer of said hook.

I definitely think this is a "power-user feature" to get the most benefit for any given client implementation, but that's not to say it isn't valuable. I think it will provide some benefit to all users of "stock hooks", once they are updated to mark readonly, but most benefit to those willing to put the effort in for best optimization.

@asottile
Copy link
Member

yeah the thing is, in most cases and for most repos this makes zero difference but would bring on a whole bunch of complexity

@jhenkens
Copy link
Contributor Author

I just tested with a small repo, and you are probably right. It ran all 10 checks in ~1 second.

I think this can be summed up as "pre-commit is not friendly with large repos". Unsure if it is spefically index size, or commit graph size.

Unfortunately for us, that means that pre-commit is unusable unless we fix or fork.

@asottile
Copy link
Member

actually, taking a step back it's very strange that git diff isn't fast though 🤔 -- when I worked at yelp we ran pre-commit on a 6MLOC+ code base with millions of commits and git diff returned nearly instantly

@jhenkens
Copy link
Contributor Author

jhenkens commented Aug 20, 2020

git --version
git version 2.27.0

Small repo

time git diff --no-ext-diff > /dev/null

real	0m0.018s
user	0m0.003s
sys	0m0.009s

Large repo:

time git diff --no-ext-diff > /dev/null

real	0m0.321s
user	0m0.042s
sys	0m1.634s

Both repos only had one change, staged, the .pre-commit-config.yaml.

The "big repo" is in the order of 10s of thousands of files, and hundreds of thousands of commits.

The wall time isn't that dramatic - when running it by itself, it is indistinguishably slower than on a small repo. The problem comes when you run it 20 times in a script. That adds up to 6 seconds of waiting for diffs, which aligns perfectly with my numbers determined by additional logging in run.py.

Edit: Updated git to 2.28:

10:30 $ time git diff --no-ext-diff > /dev/null

real	0m0.260s
user	0m0.039s
sys	0m1.657s

@asottile
Copy link
Member

anyway, I think the first step would be to try and cut the diffs calls in half if you'd like to approach that

@jhenkens
Copy link
Contributor Author

Thanks for helping with the review of #1566 . Is there anything else from our discussion here you think might be worth to add to the main fork?

@asottile
Copy link
Member

the readonly idea might be feasible, but might be best developed separately first and see how it functions (similar to how types was introduced)

@asottile
Copy link
Member

one other incremental improvement that can be done here is language: pygrep and language: fail are always readonly so some work can be skipped there

@jhenkens
Copy link
Contributor Author

jhenkens commented Sep 10, 2020 via email

@admorgan
Copy link

I haven't run the numbers, but since a list of files has already been generated would diffing only the involved files take less time?

@asottile
Copy link
Member

I don't think it would make it faster but you can certainly try that

in the worst case it'll be necessarily slower as it'll need to utilize xargs-like processing (multiple calls due to command line length limits)

@zackw
Copy link

zackw commented Aug 1, 2021

It occurred to me that maybe it would be faster to stat all the files from inside pre-commit, before and after each hook, and look for changes, rather than running git each time. Unfortunately, the overhead of doing the recursive stats and comparison in an interpreted language dominates. Testing on a checkout of GCC, which has 106,000 files:

$ time (cd gcc && git diff --no-ext-diff > /dev/null && true && git diff --no-ext-diff > /dev/null)
real	0m0.051s
user	0m0.087s
sys	0m0.297s

$ time python3 test.py gcc true
real	0m0.584s
user	0m0.398s
sys	0m0.186s

It is hitting the disk less, but it's hitting the CPU 4.5x more.

I'm appending the test program I wrote. Perhaps someone has a clever idea for speeding it up?

(N.B. I don't think it's a good idea to limit the check to only the files the hook was told to look at, as it coulda gone rogue and scribbled over others anyway.)

import os
import sys
import subprocess

def record_mtimes(root):
    mtimes = {}
    relpath = os.path.relpath
    lstat = os.lstat
    for dirpath, dirs, files, dirfd in os.fwalk(root):
        try:
            dirs.remove(".git")
        except ValueError:
            pass
        reldirpath = relpath(dirpath, root)
        for f in files:
            mtimes[(reldirpath, f)] = lstat(f, dir_fd=dirfd).st_mtime_ns
    return mtimes

def report_changed_files(before, after):
    for f, mt_before in before.items():
        mt_after = after.get(f)
        if mt_after is None:
            sys.stdout.write("deleted {!r}\n".format(os.path.join(*f)))
        elif mt_after != mt_before:
            sys.stdout.write("changed {!r}\n".format(os.path.join(*f)))

    for f in after.keys():
        if f not in before:
            sys.stdout.write("added   {!r}\n".format(os.path.join(*f)))

def main():
    workdir = sys.argv[1]
    before = record_mtimes(workdir)
    subprocess.run(sys.argv[2:], cwd=workdir)
    after = record_mtimes(workdir)
    report_changed_files(before, after)

if __name__ == "__main__":
    main()

@asottile
Copy link
Member

asottile commented Aug 1, 2021

mtime is also somewhat insufficient as a file can be changed without updating mtime (probably why git consumes more sys time)

you're also including files which aren't checked into git

@zackw
Copy link

zackw commented Aug 1, 2021

Good point about not noticing changes that don't cause mtime updates. However, scanning everything, whether or not checked in, was intentional: the only way to filter the list down to files that should, abstractly, be watched would be to invoke git for a list of checked-in files, which would cost nearly as much as running git diff, although I suppose it only needs to be done once at the beginning of the run. I figured this would just be a filter to detect any change to the tree, and avoid running git diff when there weren't any changes.

Another possibility occurred to me right after I posted the earlier comment: directory change notifications. More complicated and OS-specific but potentially much faster. I'm not volunteering to implement that, though. (I have no particular need for this feature myself, it just piqued my curiosity when I was looking through the issue list.)

@asottile
Copy link
Member

asottile commented Aug 1, 2021

asking git for a list of checked in files only reads the git index (a single file) -- and it already has to happen anyway

@hterik
Copy link

hterik commented Jan 25, 2022

As a suggestion, would it be possible for hooks to return the new reformatted file content instead of writing to files themselves, then let the framework write the file if it differs. With this approach you can essential consider all hooks as read-only. (Of course there is no 100% guarantee they don't violate this but there is no guarantee they don't do other funny business either...).

We have been running a in-house framework similar to pre-commit, using this strategy and it has been very successful.
Prerequisite for this is that each hook is run file-by-file, haven't seen many formatting-tools where this is much of an issue, they usually operate on single files anyway.

It also has other benefits like centralizing file write avoidance, for files that have identical content after formatting. Many other tools write the formatted file even if content is the same, changing the mtime and disturbing incremental builds. With each hook implementing their own file-write this is still possible of course, but more likely to be missed somewhere.
Readonly sounds like a good way to enable safer parallelism as well. Now hooks run in series if I'm not mistaken?

@asottile
Copy link
Member

pre-commit just runs tools, nothing else

running on a single file at a time would be worlds slower than batched execution -- any performance benefit you'd get from being able to run multiple tools at the same time would be instantly wiped out by the continual startup costs

also if you read the thread you'll note that it wouldn't make much of a performance difference since pre-commit is already running individual tools in parallel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

5 participants