Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

dvandersluis · 2020-09-02T14:53:41Z

File.mtime is faster, but inconsistent in CI, because in each CI build the mtime changes, which means that the rubocop cache implicitly cannot be used as-is. It's obviously slower to calculate a hash for each file, but this is still more performant (both in memory consumption and iterations per second) than the original.

I wrote a benchmark to test memory and IPS: https://gist.github.com/dvandersluis/0ddbc936fc858fbbec9cf8fb796ec9a2

Calculating -------------------------------------
            original     1.770M memsize (   622.000  retained)
                       587.000  objects (     3.000  retained)
                        50.000  strings (     2.000  retained)
               mtime    27.343k memsize (     0.000  retained)
                       482.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
           file hash     1.586M memsize (     0.000  retained)
                       668.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
               crc32   940.981k memsize (     0.000  retained)
                       480.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)

Comparison:
               mtime:      27343 allocated
               crc32:     940981 allocated - 34.41x more
           file hash:    1585783 allocated - 58.00x more
            original:    1769735 allocated - 64.72x more
Warming up --------------------------------------
            original    19.000  i/100ms
               mtime    66.000  i/100ms
           file hash    21.000  i/100ms
               crc32    37.000  i/100ms
Calculating -------------------------------------
            original    246.123  (± 5.3%) i/s -      1.235k in   5.033696s
               mtime      1.143k (± 2.7%) i/s -      5.742k in   5.026275s
           file hash    276.138  (± 2.9%) i/s -      1.386k in   5.023864s
               crc32    406.651  (± 3.2%) i/s -      2.035k in   5.009418s

Comparison:
               mtime:     1143.3 i/s
               crc32:      406.7 i/s - 2.81x  slower
           file hash:      276.1 i/s - 4.14x  slower
            original:      246.1 i/s - 4.65x  slower

Obviously just accessing a file stat is going to be best, but that doesn't help when the results are inconsistent. I also tested some other options, such as other non-cryptographic hashes that are supposed to be fast (like mumble hash) and got worse results than using crc32 (plus those involve adding a gem which I think is better to avoid).

If this is too slow still, a thought I had would be to switch how to calculate the checksum based on ENV['CI']. Please let me know if you'd like me to make that change, or if there's another strategy to try.

Before submitting the PR make sure the following are checked:

Wrote good commit messages.
Commit message starts with [Fix #issue-number] (if the related issue exists).
Feature branch is up-to-date with master (if not - rebase it).
Squashed related commits together.
Added tests.
Added an entry to the Changelog if the new code introduces user-observable changes. See changelog entry format.
The PR relates to only one subject with a clear title and description in grammatically correct, complete sentences.
Run bundle exec rake default. It executes all tests and RuboCop for itself, and generates the documentation.

bbatsov · 2020-09-03T05:36:10Z

I'm on the fence about this, as I'm always wary to willingly impact the performance. I'm thinking that one middle ground would be to be able to specify the caching strategy with a command-line flag or a configuration option. How does this sound? //cc @p0deje

marcandre · 2020-09-03T12:49:34Z

I didn't realize we made a digest of the executable. I'm not familiar with this code, so I might be off track here, but when we are run with bundle exec, couldn't we do a checksum on:

contents of Gemfile.lock
crc32 on $LOADED_PATH filterered by any local dependencies in there, if any (i.e. PATH entries in the Gemfile.lock)
plus the changes to $LOADED_PATH when RuboCop does the require when loading the configs?

This way, a user using only official gem releases would have near instant checksum calculation (only Gemfile.lock + the config file itself), while users twiddling with their own cops will still not be impacted much.

I mean, it's a bit of a shame that users running RuboCop vX.Y.Z have to recalculate the checksum for the whole source, when "X.Y.Z" is a sufficient checksum.

dvandersluis · 2020-09-03T14:54:29Z

@bbatsov I totally agree about purposely making things less performant, but I think the tradeoff currently is real in that we're talking about milliseconds to calculate the checksum vs minutes to run without a cache. Obviously that's specific to my personal case, but I think it's still a valid point.

A command line flag sounds good to me, that's something that can be set up easily in CI without having to affect local development.

Also good with @marcandre's suggestions! Anything that simplifies the check here works well with me. I'm not 100% sure if I understand how to implement what you're suggesting (particularly how to determine what to check on $LOADED_PATH) so if you can give me some pointers I'm happy to try!

flanger001 · 2020-09-03T15:08:17Z

I want to second @dvandersluis's point that the cache build performance hit matters less than the very real performance hit of running Rubocop in CI without a cache.

marcandre · 2020-09-03T15:33:05Z

Here's what I'd do as an easy way to shave a bit of time:

# lib/rubocop.rb
require 'English'
before_us = $LOADED_FEATURES.dup
#...  rest of the gazillion requires
if we_are_a_gem
  RuboCop::ResultCache.rubocop_required_features = $LOADED_FEATURES - before_us
end

# lib/rubocop/result_cache.rb
         # add a `rubocop_required_features` singleton class read write attribute, `@api private`, initialized with `[]`
         # ...
         source_files = $LOADED_FEATURES  + Find.find(exe_root) - ResultCache.rubocop_required_features
         digest << RuboCop::Version::STRING << RuboCop::AST::Version::STRING

That should save going through more than 900 files...

Not sure what's the best we_are_a_gem test, maybe simply the presence of Gemfile?

Let me know if you'd prefer me to do it.

marcandre · 2020-09-03T15:36:44Z

BTW, on my "memory-choked-and-not-getting-any-younger" 2013 macbook air, the crc32 takes ~0.06 seconds (including RuboCop's own source)

dvandersluis · 2020-09-03T15:41:37Z

@marcandre yeah if you have an idea of how to fix this, go for it! I was playing around with Bundler::LockfileParser but it looks like you don't even need to go that way.

marcandre · 2020-09-03T15:45:05Z

@marcandre yeah if you have an idea of how to fix this, go for it! I was playing around with Bundler::LockfileParser but it looks like you don't even need to go that way.

Yeah, my initial idea was overly complex. Let me whip up a PR then.

marcandre · 2020-09-03T16:05:04Z

#8643 should be merged soon.

I think this present PR should be merged (although a Changelog entry would be nice).

dvandersluis · 2020-09-03T16:15:35Z

@marcandre great! Updated with a changelog entry.

…p#8633]

…for each file rather than using mtime. File.mtime is faster, but inconsistent in CI, because in each CI build the mtime changes, which means that the rubocop cache implicitly cannot be used as-is. It's obviously slower to calculate a hash for each file, but this is still more performant (both in memory consumption and iterations per second) than the original.

marcandre · 2020-09-03T17:14:26Z

@bbatsov still on the fence?

bbatsov · 2020-09-04T06:30:08Z

Guess not. 😄 Might be a good idea to add some comment around the code about the CI impact, so someone wouldn't optimize it again in the future. :-)

marcandre · 2020-09-04T12:59:27Z

Guess not. 😄 Might be a good idea to add some comment around the code about the CI impact, so someone wouldn't optimize it again in the future. :-)

Sure, I added a small comment 👍

bbatsov · 2020-09-04T14:31:33Z

Thanks!

dvandersluis · 2020-09-04T14:47:02Z

Thanks guys! 🙌

dvandersluis changed the title ~~Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime.~~ Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime Sep 2, 2020

dvandersluis force-pushed the fix-ci-caching branch from 83f6136 to 55be764 Compare September 3, 2020 16:15

marcandre added a commit to marcandre/rubocop that referenced this pull request Sep 3, 2020

Optimize cache invalidation when running RuboCop as a gem [See ruboco…

67462d6

…p#8633]

mergify bot pushed a commit that referenced this pull request Sep 3, 2020

Optimize cache invalidation when running RuboCop as a gem [See #8633]

2b71ae3

dvandersluis force-pushed the fix-ci-caching branch from 55be764 to a4b39be Compare September 3, 2020 16:41

marcandre approved these changes Sep 3, 2020

View reviewed changes

bbatsov merged commit 0904a72 into rubocop:master Sep 4, 2020

marcandre added a commit that referenced this pull request Sep 4, 2020

Comment to not rely on mtime [#8633]

8cc6391

dvandersluis deleted the fix-ci-caching branch January 18, 2021 20:34

ChrisBr mentioned this pull request Jul 28, 2022

Don't hash shared libraries for cache key #10841

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

dvandersluis commented Sep 2, 2020 •

edited

bbatsov commented Sep 3, 2020

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

flanger001 commented Sep 3, 2020

marcandre commented Sep 3, 2020 •

edited

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

marcandre commented Sep 3, 2020

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

marcandre commented Sep 3, 2020

bbatsov commented Sep 4, 2020

marcandre commented Sep 4, 2020

bbatsov commented Sep 4, 2020

dvandersluis commented Sep 4, 2020

Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

Conversation

dvandersluis commented Sep 2, 2020 • edited

bbatsov commented Sep 3, 2020

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

flanger001 commented Sep 3, 2020

marcandre commented Sep 3, 2020 • edited

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

marcandre commented Sep 3, 2020

marcandre commented Sep 3, 2020

dvandersluis commented Sep 3, 2020

marcandre commented Sep 3, 2020

bbatsov commented Sep 4, 2020

marcandre commented Sep 4, 2020

bbatsov commented Sep 4, 2020

dvandersluis commented Sep 4, 2020

dvandersluis commented Sep 2, 2020 •

edited

marcandre commented Sep 3, 2020 •

edited