Idea: Improve hook speed by skipping before/after diff via readonly flag #1564
Comments
this idea was contemplated at some point but was rejected as it would be difficult for two reasons:
one thing you could pursue is halving the number of git diffs (instead of diffing before and after each hook, carry over the diff from the previous hook run). but even that wasn't pursued because the saving wasn't significant
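To make the "carry over the diff" idea concrete, here is a hypothetical sketch (the names are illustrative, not pre-commit's real internals): instead of taking a fresh before/after snapshot around every hook, the post-hook snapshot is reused as the next hook's pre-hook snapshot.

```python
# Hypothetical sketch of the carry-over idea; `snapshot` stands in for
# "compute a diff/fingerprint of the working tree", which is the costly
# git operation discussed in this thread.
def run_hooks(hooks, snapshot):
    """Run each (name, hook) pair; return (#snapshot calls, modified hooks)."""
    before = snapshot()          # one snapshot up front
    calls = 1
    modified = []
    for name, hook in hooks:
        hook()
        after = snapshot()       # one snapshot per hook...
        calls += 1
        if after != before:
            modified.append(name)
        before = after           # ...carried over, instead of two per hook
    return calls, modified
```

With N hooks this performs N + 1 snapshots instead of 2N, so for the 10-hook numbers quoted later in this thread it would roughly halve the diff overhead.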
I'm still converting more hooks for our use case, so our runtime is only going to deteriorate further. I'd argue that, for our case, it is substantial. Our current (hard to maintain and grok) hooks execute in the 1-2 second range. I am running all tests on a Mac (though much of our development team is on Windows, so if it is even slower there then that is a compounding problem).
Perhaps I am mistaken about what a plugin-based hook is - I'm looking at https://pre-commit.com/#plugins - but it seems entirely possible that an author could set this incorrectly, not that they could never set it correctly. If I write a validate-only hook and publish it somewhere, I can, as the author, know it doesn't modify any files, and set it as readonly.
Could this be remedied by performing one "global" before/after diff for the entire run of pre-commit?
That would get us to roughly 4 seconds, versus 1 second. Definitely worth an attempt in my book.
things like flake8, where readonly would be most beneficial, couldn't set this accurately, as they are plugin systems themselves which could contain rewriting plugins
yeah, that would be a potential trade-off
Fair enough - I guess I need more coffee, given I couldn't think of that! In that case, my argument would be that the hook definition should not set `readonly: true`, but instead the per-repo configuration could override it, just as can be done with nearly every other hook property. If I know my configuration of a given hook doesn't rewrite files, I could mark it readonly myself.
yeah, the thing is nobody is going to do that - or worse, everyone is going to cargo-cult copy-paste that everywhere :/
I think for this to be a thing we'd need the following at least:
hmm, the other issue is how readonly should default. Most things, I believe, are readonly, but the safest default is the opposite. Ah, and I just thought of another problem case: sometimes you can change the readonly-ness by setting args, for example
Yes - I agree, it can be quite complex in implementation. An additional global setting `run_readonly_hooks_first: true` could also help reduce the number of in-between diffs. Unless I am mistaken, hooks run in the order they are defined, but given they may come from multiple repos, that can make manual ordering hard to obey without such a feature (unless there is some other sorting/ordering mechanism I am unaware of). I think many tools/hooks will be left unmarked.

I definitely think this is a "power-user feature" to get the most benefit for any given client implementation, but that's not to say it isn't valuable. I think it will provide some benefit to all users of "stock hooks", once they are updated to mark readonly, but the most benefit to those willing to put the effort in for best optimization.
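For clarity, the per-repo override being discussed might look something like this. Note that `readonly` and `run_readonly_hooks_first` are purely hypothetical keys proposed in this thread, not real pre-commit options:

```yaml
# .pre-commit-config.yaml -- hypothetical; these keys do not exist today
run_readonly_hooks_first: true      # proposed global ordering setting
repos:
-   repo: https://github.com/pycqa/flake8
    rev: 3.8.3
    hooks:
    -   id: flake8
        readonly: true              # proposed per-repo override: "I know my
                                    # configuration of this hook doesn't
                                    # rewrite files"
```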
yeah, the thing is, in most cases and for most repos this makes zero difference, but it would bring on a whole bunch of complexity
I just tested with a small repo, and you are probably right. It ran all 10 checks in ~1 second. I think this can be summed up as "pre-commit is not friendly with large repos". Unsure if it is specifically index size or commit-graph size. Unfortunately for us, that means that pre-commit is unusable unless we fix or fork.
actually, taking a step back, it's very strange that the diff is this slow:
Small repo:

Large repo:

Both repos only had one change, staged. The "big repo" is on the order of tens of thousands of files and hundreds of thousands of commits. The wall time isn't that dramatic - when running the diff by itself, it is indistinguishably slower than on a small repo. The problem comes when you run it 20 times in a script: that adds up to 6 seconds of waiting for diffs, which aligns perfectly with the numbers I determined via the additional logging I added.

Edit: Updated git to 2.28:
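For anyone wanting to reproduce this kind of measurement, here is a rough sketch. It builds a throwaway repo for self-containment; on a real large repo you would point it at your checkout, and the exact flags pre-commit passes around each hook may differ from these:

```python
# Illustrative only: time one git diff of the kind pre-commit runs
# around each hook (--no-ext-diff skips external diff drivers).
import os
import subprocess
import tempfile
import time

def init_repo():
    """Create a throwaway repo with one staged file."""
    repo = tempfile.mkdtemp()
    subprocess.run(['git', 'init', '-q', repo], check=True)
    with open(os.path.join(repo, 'a.txt'), 'w') as f:
        f.write('hello\n')
    subprocess.run(['git', '-C', repo, 'add', 'a.txt'], check=True)
    return repo

def time_one_diff(repo):
    """Return the wall time of a single index-vs-worktree diff."""
    start = time.perf_counter()
    subprocess.run(
        ['git', '-C', repo, 'diff', '--no-ext-diff', '--name-only'],
        check=True, capture_output=True,
    )
    return time.perf_counter() - start

repo = init_repo()
elapsed = time_one_diff(repo)
print(f'one diff took {elapsed:.4f}s')
```

Multiplying the per-diff time by the number of diffs pre-commit performs (two per hook today) gives the overhead figures discussed above.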
anyway, I think the first step would be to try and cut the diff calls in half, if you'd like to approach that
Thanks for helping with the review of #1566. Is there anything else from our discussion here you think might be worth adding to the main fork?
one other incremental improvement that can be done here is that `language: pygrep` and `language: fail` are always readonly, so some work can be skipped there

I feel that if we were to implement it for those two (always on, based on language), we should go all out and allow the customization for any hook via a repo-level setting.
I haven't run the numbers, but since a list of files has already been generated, would diffing only the involved files take less time?
I don't think it would make it faster, but you can certainly try. In the worst case it'll be necessarily slower, as it'll need to utilize
It occurred to me that maybe it would be faster to stat all the files from inside pre-commit, before and after each hook, and look for changes, rather than running git each time. Unfortunately, the overhead of doing the recursive stats and comparison in an interpreted language dominates. Testing on a checkout of GCC, which has 106,000 files:
It is hitting the disk less, but it's hitting the CPU 4.5x more. I'm appending the test program I wrote. Perhaps someone has a clever idea for speeding it up? (N.B. I don't think it's a good idea to limit the check to only the files the hook was told to look at, as it could have gone rogue and scribbled over others anyway.)
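A hedged reconstruction of the approach described above (the author's actual test program wasn't preserved in this thread): snapshot every file's `(mtime, size)` with `os.stat`, then compare two snapshots to find files a hook may have modified.

```python
# Sketch: mtime/size-based change detection as an alternative to git diff.
import os

def snapshot(root='.'):
    """Map each file path under root to its (mtime_ns, size)."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != '.git']  # skip git metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished mid-walk
            state[path] = (st.st_mtime_ns, st.st_size)
    return state

def changed_files(before, after):
    """Paths added, removed, or with a different (mtime, size) pair."""
    return sorted(
        p for p in before.keys() | after.keys()
        if before.get(p) != after.get(p)
    )
```

As the next reply points out, mtime-based detection can miss changes that preserve both mtime and size, so this is strictly weaker than git's content comparison.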
mtime is also somewhat insufficient, as a file can be changed without updating its mtime (probably why git consumes more). You're also including files which aren't checked into git.
Good point about not noticing changes that don't cause mtime updates. However, scanning everything, whether or not checked in, was intentional: the only way to filter the list down to files that should, abstractly, be watched would be to invoke git for a list of checked-in files, which would cost nearly as much as running the diff itself.

Another possibility occurred to me right after I posted the earlier comment: directory change notifications. More complicated and OS-specific, but potentially much faster. I'm not volunteering to implement that, though. (I have no particular need for this feature myself; it just piqued my curiosity when I was looking through the issue list.)
asking git for a list of checked-in files only reads the git index (a single file) - and it already has to happen anyway
As a suggestion, would it be possible for hooks to return the new reformatted file content instead of writing to the files themselves, then let the framework write the file if it differs? With this approach you can essentially consider all hooks read-only. (Of course, there is no 100% guarantee they don't violate this, but there is no guarantee they don't do other funny business either...) We have been running an in-house framework similar to pre-commit using this strategy, and it has been very successful. It also has other benefits, like centralizing file-write avoidance for files that have identical content after formatting. Many other tools write the formatted file even if the content is the same, changing the mtime and disturbing incremental builds. With each hook implementing its own file-write this is still possible, of course, but more likely to be missed somewhere.
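A minimal sketch of the "hooks emit content, framework writes" strategy described above (illustrative, not pre-commit's actual model): run a formatter that prints the formatted file to stdout, and only write back when the content actually changed, preserving mtimes for incremental builds.

```python
# Sketch: centralize file writes in the framework rather than in each hook.
import subprocess

def apply_formatter(path, cmd):
    """Run `cmd + [path]` (a formatter that prints to stdout); write the
    result back only if it differs. Returns True if the file was rewritten."""
    result = subprocess.run(cmd + [path], check=True, capture_output=True)
    new = result.stdout
    with open(path, 'rb') as f:
        old = f.read()
    if new != old:
        with open(path, 'wb') as f:
            f.write(new)
        return True   # file rewritten
    return False      # identical content: mtime preserved
```

The write-avoidance check in one place is what gives every hook the "don't disturb mtimes" behavior, instead of hoping each tool implements it.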
pre-commit just runs tools, nothing else. Running on a single file at a time would be worlds slower than batched execution - any performance benefit you'd get from being able to run multiple tools at the same time would be instantly wiped out by the continual startup costs. Also, if you read the thread, you'll note that it wouldn't make much of a performance difference, since pre-commit is already running individual tools in parallel:
pre-commit/pre_commit/commands/run.py, lines 175 to 185 in 4f5cb99
Hi all,
I am looking at implementing pre-commit to replace a harder-to-maintain custom bash script our company currently uses. In my initial tests, pre-commit works great, except for the fact that it is slow. The majority of our hooks are readonly in nature and run quite fast. However, on our repository, pre-commit is spending close to 1 second per hook performing a before/after diff.

I added some instrumentation, and in our test case with 10 executed hooks, the execution time of pre-commit is >7 seconds with the before/after diffs, and ~1 second with those diffs disabled. I will note this is an extremely large repository, with many thousands of files. Generally, git actions are somewhat slow on it.

In exploring issues related to perf, I found #510, which seems to tackle a different problem - individual hooks that take a while, rather than the overhead of pre-commit itself. However, there was one comment that got me thinking, regarding a `readonly` attribute one could apply to a hook. I know - this metadata would only be as truthful as the author who wrote the hook, but if a hook was marked as readonly, `pre-commit` could theoretically skip the before/after hook diffs.

Any thoughts? Happy to contribute the PR if maintainers see this as valuable.