Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

jhenkens · 2020-08-20T14:49:14Z

Lines 175 to 185 in 4f5cb99

    
           diff_cmd = ('git', 'diff', '--no-ext-diff') 
        
           diff_before = cmd_output_b(*diff_cmd, retcode=None) 
        
           if not hook.pass_filenames: 
        
               filenames = () 
        
           time_before = time.time() 
        
           language = languages[hook.language] 
        
           retcode, out = language.run_hook(hook, filenames, use_color) 
        
           duration = round(time.time() - time_before, 2) or 0 
        
           diff_after = cmd_output_b(*diff_cmd, retcode=None) 
        
           # if the hook makes changes, fail the commit

Hi all,

I am looking at implemented pre-commit to replace a harder-to-maintain custom bash script our company current uses. In my initial tests, pre-commit works great, except for the fact that it is slow. The majority of our hooks are readonly in nature, and run quite fast. However, on our repository, pre-commit is spending close to 1 second per hook performing a before/after diff.

I added some instrumentation, and in our test case with 10 executed hooks, the execution time of pre-commit is >7 seconds with the before/after diffs, and ~1 second with those diffs disabled. I will note this is an extremely large repository, with many thousands of files. Generally, git actions are somewhat slow on it.

In exploring issues related to perf, I had found #510, which seems to tackle a different problem - individual hooks that take a while, rather than the overhead of pre-commit itself. However, there was one comment that got me thinking, regarding a readonly attribute one could apply to a hook. I know - this metadata would only be as truthy as the author who wrote the hook, but if a hook was marked as readonly, pre-commit could theoretically skip the before/after hook diffs.

Any thoughts? Happy to contribute the PR if maintainers see this as valuable.

The text was updated successfully, but these errors were encountered:

asottile · 2020-08-20T16:10:52Z

this idea was contemplated at some point but was rejected as it would be difficult for two reasons:

plugin-based hooks wouldn't ever be able to set this accurately
if it was set innaccurately it would look like the framework's fault that it allowed through a commit
in tests it really only made a difference on windows (and even then, not a significant difference)

one thing you could pursue is reducing the amount of git diffs by half (instead of before/after, carry over the diff from the previous hook run). but even that wasn't pursued because it wasn't significant

jhenkens · 2020-08-20T16:22:06Z

I'm still converting more hooks for our use case, so our runtime is only going to deteriorate further. I'd argue that, for our case, it is substantial. Our current (hard to maintain and grok) hooks execute in the 1-2 second range.

I am running all tests on a Mac (though much of our development team is on Windows, so if it is even slower there than that is a compounding problem).

plugin-based hooks wouldn't ever be able to set this accurately

Perhaps I am mistaken on what a plugin-based hook is - I'm looking at https://pre-commit.com/#plugins, but it seems that it is entirely possible the author could incorrectly set it, but not that they wouldn't ever be able to set it correctly. If I write a validate-only hook, and publish it somewhere, I can, as the author, know it doesn't modify any files, and have it set as readonly.

if it was set innaccurately it would look like the framework's fault that it allowed through a commit

Could this be remedied by performing one "global" before/after diff for the entire run of pre-commit? For non-readonly hooks, we can still perform the before/after diffs, but then add one global check if any hooks were readonly, and fail/complain loudly if a change was improper? Perhaps run all readonly hooks first, with one set of diff, then continue to others?

one thing you could pursue is reducing the amount of git diffs by half (instead of before/after, carry over the diff from the previous hook run).

That would get us to roughly 4 seconds, versus 1 second. Definitely worth an attempt in my book.

asottile · 2020-08-20T16:25:18Z

plugin-based hooks wouldn't ever be able to set this accurately

Perhaps I am mistaken on what a plugin-based hook is - I'm looking at https://pre-commit.com/#plugins, but it seems that it is entirely possible the author could incorrectly set it, but not that they wouldn't ever be able to set it correctly. If I write a validate-only hook, and publish it somewhere, I can, as the author, know it doesn't modify any files, and have it set as readonly.

things like flake8 where readonly would be most beneficial couldn't set this accurately as they are plugin systems themselves which could contain rewriting plugins

if it was set innaccurately it would look like the framework's fault that it allowed through a commit

Could this be remedied by performing one "global" before/after diff for the entire run of pre-commit? For non-readonly hooks, we can still perform the before/after diffs, but then add one global check if any hooks were readonly, and fail/complain loudly if a change was improper?

yeah that would be a potential trade off

jhenkens · 2020-08-20T16:28:08Z

things like flake8 where readonly would be most beneficial couldn't set this accurately as they are plugin systems themselves which could contain rewriting plugins

Fair enough - I guess I need more coffee, given I couldn't think of that! In that case, my argument would be that the hook definition should not set readonly: true, but instead the per-repo configuration could override it, just as can be done with nearly every other hook property. If I know my configuration of flake8 should be readonly, I can set it as such.

asottile · 2020-08-20T16:30:46Z

yeah the thing is nobody is going to do that, or worse everyone is going to cargo cult copy paste that everywhere :/

asottile · 2020-08-20T16:35:17Z

I think for this to be a thing we'd need the following at least:

an --audit-readonly or something similar flag to run to debug when hooks set this incorrectly
overall before/after diffing to catchall incorrect readonlys
some smarter in-between diffing (rw hook followed by 3 ro hooks followed by rw hook should invoke ~4 diffs, validate readonly correctness at the 5th hook and skip the global hook) -- this is going to be very complex I imagine

hmm the other issue is how should readonly default, most things I believe are readonly, but the safest default is readonly: false to preserve the existing behaviour

ah and I just thought of another problem case, sometimes you can change the readonlyness by setting args, for example autopep8 -i vs just autopep8

jhenkens · 2020-08-20T16:42:26Z

Yes - I agree. Can be quite complex in implementation. An additional global setting "run_readonly_hooks_first: true` could also benefit in reducing the number of in-between diffs. Unless I am mistaken, hooks run in the order they are defined, but given they may come from multiple repos, that can block manual ordering to obey without such a feature (unless there is some other sorting/ordering mechanism I am unaware of).

I think many tools/hooks will be left readonly: false by default, and up to the specific repo's implementation to set to true. However, some hooks, like check-added-large-files are always going to be readonly, and are easily set by the maintainer of said hook.

I definitely think this is a "power-user feature" to get the most benefit for any given client implementation, but that's not to say it isn't valuable. I think it will provide some benefit to all users of "stock hooks", once they are updated to mark readonly, but most benefit to those willing to put the effort in for best optimization.

asottile · 2020-08-20T16:45:52Z

yeah the thing is, in most cases and for most repos this makes zero difference but would bring on a whole bunch of complexity

jhenkens · 2020-08-20T16:58:18Z

I just tested with a small repo, and you are probably right. It ran all 10 checks in ~1 second.

I think this can be summed up as "pre-commit is not friendly with large repos". Unsure if it is spefically index size, or commit graph size.

Unfortunately for us, that means that pre-commit is unusable unless we fix or fork.

asottile · 2020-08-20T17:02:20Z

actually, taking a step back it's very strange that git diff isn't fast though 🤔 -- when I worked at yelp we ran pre-commit on a 6MLOC+ code base with millions of commits and git diff returned nearly instantly

jhenkens · 2020-08-20T17:26:05Z

git --version
git version 2.27.0

Small repo

time git diff --no-ext-diff > /dev/null

real	0m0.018s
user	0m0.003s
sys	0m0.009s

Large repo:

time git diff --no-ext-diff > /dev/null

real	0m0.321s
user	0m0.042s
sys	0m1.634s

Both repos only had one change, staged, the .pre-commit-config.yaml.

The "big repo" is in the order of 10s of thousands of files, and hundreds of thousands of commits.

The wall time isn't that dramatic - when running it by itself, it is indistinguishably slower than on a small repo. The problem comes when you run it 20 times in a script. That adds up to 6 seconds of waiting for diffs, which aligns perfectly with my numbers determined by additional logging in run.py.

Edit: Updated git to 2.28:

10:30 $ time git diff --no-ext-diff > /dev/null

real	0m0.260s
user	0m0.039s
sys	0m1.657s

asottile · 2020-08-20T17:34:05Z

anyway, I think the first step would be to try and cut the diffs calls in half if you'd like to approach that

jhenkens · 2020-08-25T15:44:20Z

Thanks for helping with the review of #1566 . Is there anything else from our discussion here you think might be worth to add to the main fork?

asottile · 2020-08-25T17:25:06Z

the readonly idea might be feasible, but might be best developed separately first and see how it functions (similar to how types was introduced)

asottile · 2020-09-10T00:10:38Z

one other incremental improvement that can be done here is language: pygrep and language: fail are always readonly so some work can be skipped there

jhenkens · 2020-09-10T00:47:41Z

I feel that if we were to implement it for those two (as an always, based on language), we should go all out and allow the customization for any hook via repo level setting.

…

On Sep 9, 2020, at 5:11 PM, Anthony Sottile ***@***.***> wrote: one other incremental improvement that can be done here is language: pygrep and language: fail are always readonly so some work can be skipped there — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

admorgan · 2020-10-19T01:10:13Z

I haven't run the numbers, but since a list of files has already been generated would diffing only the involved files take less time?

asottile · 2020-10-19T01:12:39Z

I don't think it would make it faster but you can certainly try that

in the worst case it'll be necessarily slower as it'll need to utilize xargs-like processing (multiple calls due to command line length limits)

zackw · 2021-08-01T15:51:25Z

It occurred to me that maybe it would be faster to stat all the files from inside pre-commit, before and after each hook, and look for changes, rather than running git each time. Unfortunately, the overhead of doing the recursive stats and comparison in an interpreted language dominates. Testing on a checkout of GCC, which has 106,000 files:

$ time (cd gcc && git diff --no-ext-diff > /dev/null && true && git diff --no-ext-diff > /dev/null)
real	0m0.051s
user	0m0.087s
sys	0m0.297s

$ time python3 test.py gcc true
real	0m0.584s
user	0m0.398s
sys	0m0.186s

It is hitting the disk less, but it's hitting the CPU 4.5x more.

I'm appending the test program I wrote. Perhaps someone has a clever idea for speeding it up?

(N.B. I don't think it's a good idea to limit the check to only the files the hook was told to look at, as it coulda gone rogue and scribbled over others anyway.)

import os
import sys
import subprocess

def record_mtimes(root):
    mtimes = {}
    relpath = os.path.relpath
    lstat = os.lstat
    for dirpath, dirs, files, dirfd in os.fwalk(root):
        try:
            dirs.remove(".git")
        except ValueError:
            pass
        reldirpath = relpath(dirpath, root)
        for f in files:
            mtimes[(reldirpath, f)] = lstat(f, dir_fd=dirfd).st_mtime_ns
    return mtimes

def report_changed_files(before, after):
    for f, mt_before in before.items():
        mt_after = after.get(f)
        if mt_after is None:
            sys.stdout.write("deleted {!r}\n".format(os.path.join(*f)))
        elif mt_after != mt_before:
            sys.stdout.write("changed {!r}\n".format(os.path.join(*f)))

    for f in after.keys():
        if f not in before:
            sys.stdout.write("added   {!r}\n".format(os.path.join(*f)))

def main():
    workdir = sys.argv[1]
    before = record_mtimes(workdir)
    subprocess.run(sys.argv[2:], cwd=workdir)
    after = record_mtimes(workdir)
    report_changed_files(before, after)

if __name__ == "__main__":
    main()

asottile · 2021-08-01T15:55:13Z

mtime is also somewhat insufficient as a file can be changed without updating mtime (probably why git consumes more sys time)

you're also including files which aren't checked into git

zackw · 2021-08-01T16:03:02Z

Good point about not noticing changes that don't cause mtime updates. However, scanning everything, whether or not checked in, was intentional: the only way to filter the list down to files that should, abstractly, be watched would be to invoke git for a list of checked-in files, which would cost nearly as much as running git diff, although I suppose it only needs to be done once at the beginning of the run. I figured this would just be a filter to detect any change to the tree, and avoid running git diff when there weren't any changes.

Another possibility occurred to me right after I posted the earlier comment: directory change notifications. More complicated and OS-specific but potentially much faster. I'm not volunteering to implement that, though. (I have no particular need for this feature myself, it just piqued my curiosity when I was looking through the issue list.)

asottile · 2021-08-01T16:09:32Z

asking git for a list of checked in files only reads the git index (a single file) -- and it already has to happen anyway

hterik · 2022-01-25T07:14:44Z

As a suggestion, would it be possible for hooks to return the new reformatted file content instead of writing to files themselves, then let the framework write the file if it differs. With this approach you can essential consider all hooks as read-only. (Of course there is no 100% guarantee they don't violate this but there is no guarantee they don't do other funny business either...).

We have been running a in-house framework similar to pre-commit, using this strategy and it has been very successful.
Prerequisite for this is that each hook is run file-by-file, haven't seen many formatting-tools where this is much of an issue, they usually operate on single files anyway.

It also has other benefits like centralizing file write avoidance, for files that have identical content after formatting. Many other tools write the formatted file even if content is the same, changing the mtime and disturbing incremental builds. With each hook implementing their own file-write this is still possible of course, but more likely to be missed somewhere.
Readonly sounds like a good way to enable safer parallelism as well. Now hooks run in series if I'm not mistaken?

asottile · 2022-01-25T14:01:32Z

pre-commit just runs tools, nothing else

running on a single file at a time would be worlds slower than batched execution -- any performance benefit you'd get from being able to run multiple tools at the same time would be instantly wiped out by the continual startup costs

also if you read the thread you'll note that it wouldn't make much of a performance difference since pre-commit is already running individual tools in parallel

asottile added the question label Aug 20, 2020

jhenkens mentioned this issue Aug 20, 2020

Save diff between hook executions #1566

Merged

asottile mentioned this issue Nov 19, 2020

Slow on Windows with lots of submodules #1701

Closed

elagil mentioned this issue May 2, 2022

Suggestion: Make pre-commit output easier to parse by problem matchers #2369

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

jhenkens commented Aug 20, 2020 •

edited

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020 •

edited

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020 •

edited

asottile commented Aug 20, 2020

jhenkens commented Aug 25, 2020

asottile commented Aug 25, 2020

asottile commented Sep 10, 2020

jhenkens commented Sep 10, 2020 via email

admorgan commented Oct 19, 2020

asottile commented Oct 19, 2020

zackw commented Aug 1, 2021 •

edited

asottile commented Aug 1, 2021 •

edited

zackw commented Aug 1, 2021 •

edited

asottile commented Aug 1, 2021

hterik commented Jan 25, 2022

asottile commented Jan 25, 2022

Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

Idea: Improve hook speed by skipping before/after diff via readonly flag #1564

Comments

jhenkens commented Aug 20, 2020 • edited

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020 • edited

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020

asottile commented Aug 20, 2020

jhenkens commented Aug 20, 2020 • edited

asottile commented Aug 20, 2020

jhenkens commented Aug 25, 2020

asottile commented Aug 25, 2020

asottile commented Sep 10, 2020

jhenkens commented Sep 10, 2020 via email

admorgan commented Oct 19, 2020

asottile commented Oct 19, 2020

zackw commented Aug 1, 2021 • edited

asottile commented Aug 1, 2021 • edited

zackw commented Aug 1, 2021 • edited

asottile commented Aug 1, 2021

hterik commented Jan 25, 2022

asottile commented Jan 25, 2022

jhenkens commented Aug 20, 2020 •

edited

jhenkens commented Aug 20, 2020 •

edited

jhenkens commented Aug 20, 2020 •

edited

zackw commented Aug 1, 2021 •

edited

asottile commented Aug 1, 2021 •

edited

zackw commented Aug 1, 2021 •

edited