Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime #8633

Merged
merged 1 commit into from Sep 4, 2020

Conversation

dvandersluis
Copy link
Member

@dvandersluis dvandersluis commented Sep 2, 2020

Fixes #8629.

File.mtime is faster, but inconsistent in CI, because in each CI build the mtime changes, which means that the rubocop cache implicitly cannot be used as-is. It's obviously slower to calculate a hash for each file, but this is still more performant (both in memory consumption and iterations per second) than the original.

I wrote a benchmark to test memory and IPS: https://gist.github.com/dvandersluis/0ddbc936fc858fbbec9cf8fb796ec9a2

Calculating -------------------------------------
            original     1.770M memsize (   622.000  retained)
                       587.000  objects (     3.000  retained)
                        50.000  strings (     2.000  retained)
               mtime    27.343k memsize (     0.000  retained)
                       482.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
           file hash     1.586M memsize (     0.000  retained)
                       668.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
               crc32   940.981k memsize (     0.000  retained)
                       480.000  objects (     0.000  retained)
                        50.000  strings (     0.000  retained)

Comparison:
               mtime:      27343 allocated
               crc32:     940981 allocated - 34.41x more
           file hash:    1585783 allocated - 58.00x more
            original:    1769735 allocated - 64.72x more
Warming up --------------------------------------
            original    19.000  i/100ms
               mtime    66.000  i/100ms
           file hash    21.000  i/100ms
               crc32    37.000  i/100ms
Calculating -------------------------------------
            original    246.123  (± 5.3%) i/s -      1.235k in   5.033696s
               mtime      1.143k (± 2.7%) i/s -      5.742k in   5.026275s
           file hash    276.138  (± 2.9%) i/s -      1.386k in   5.023864s
               crc32    406.651  (± 3.2%) i/s -      2.035k in   5.009418s

Comparison:
               mtime:     1143.3 i/s
               crc32:      406.7 i/s - 2.81x  slower
           file hash:      276.1 i/s - 4.14x  slower
            original:      246.1 i/s - 4.65x  slower

Obviously just accessing a file stat is going to be best, but that doesn't help when the results are inconsistent. I also tested some other options, such as other non-cryptographic hashes that are supposed to be fast (like mumble hash) and got worse results than using crc32 (plus those involve adding a gem which I think is better to avoid).

If this is too slow still, a thought I had would be to switch how to calculate the checksum based on ENV['CI']. Please let me know if you'd like me to make that change, or if there's another strategy to try.


Before submitting the PR make sure the following are checked:

  • Wrote good commit messages.
  • Commit message starts with [Fix #issue-number] (if the related issue exists).
  • Feature branch is up-to-date with master (if not - rebase it).
  • Squashed related commits together.
  • Added tests.
  • Added an entry to the Changelog if the new code introduces user-observable changes. See changelog entry format.
  • The PR relates to only one subject with a clear title and description in grammatically correct, complete sentences.
  • Run bundle exec rake default. It executes all tests and RuboCop for itself, and generates the documentation.

@dvandersluis dvandersluis changed the title Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime. Calculate rubocop_checksum by caculating a crc32 for each file rather than using mtime Sep 2, 2020
@bbatsov
Copy link
Collaborator

bbatsov commented Sep 3, 2020

I'm on the fence about this, as I'm always wary to willingly impact the performance. I'm thinking that one middle ground would be to be able to specify the caching strategy with a command-line flag or a configuration option. How does this sound? //cc @p0deje

@marcandre
Copy link
Contributor

I didn't realize we made a digest of the executable. I'm not familiar with this code, so I might be off track here, but when we are run with bundle exec, couldn't we do a checksum on:

  • contents of Gemfile.lock
  • crc32 on $LOADED_PATH filterered by any local dependencies in there, if any (i.e. PATH entries in the Gemfile.lock)
  • plus the changes to $LOADED_PATH when RuboCop does the require when loading the configs?

This way, a user using only official gem releases would have near instant checksum calculation (only Gemfile.lock + the config file itself), while users twiddling with their own cops will still not be impacted much.

I mean, it's a bit of a shame that users running RuboCop vX.Y.Z have to recalculate the checksum for the whole source, when "X.Y.Z" is a sufficient checksum.

@dvandersluis
Copy link
Member Author

@bbatsov I totally agree about purposely making things less performant, but I think the tradeoff currently is real in that we're talking about milliseconds to calculate the checksum vs minutes to run without a cache. Obviously that's specific to my personal case, but I think it's still a valid point.

A command line flag sounds good to me, that's something that can be set up easily in CI without having to affect local development.

Also good with @marcandre's suggestions! Anything that simplifies the check here works well with me. I'm not 100% sure if I understand how to implement what you're suggesting (particularly how to determine what to check on $LOADED_PATH) so if you can give me some pointers I'm happy to try!

@flanger001
Copy link
Contributor

I want to second @dvandersluis's point that the cache build performance hit matters less than the very real performance hit of running Rubocop in CI without a cache.

@marcandre
Copy link
Contributor

marcandre commented Sep 3, 2020

Here's what I'd do as an easy way to shave a bit of time:

# lib/rubocop.rb
require 'English'
before_us = $LOADED_FEATURES.dup
#...  rest of the gazillion requires
if we_are_a_gem
  RuboCop::ResultCache.rubocop_required_features = $LOADED_FEATURES - before_us
end

# lib/rubocop/result_cache.rb
         # add a `rubocop_required_features` singleton class read write attribute, `@api private`, initialized with `[]`
         # ...
         source_files = $LOADED_FEATURES  + Find.find(exe_root) - ResultCache.rubocop_required_features
         digest << RuboCop::Version::STRING << RuboCop::AST::Version::STRING

That should save going through more than 900 files...

Not sure what's the best we_are_a_gem test, maybe simply the presence of Gemfile?

Let me know if you'd prefer me to do it.

@marcandre
Copy link
Contributor

BTW, on my "memory-choked-and-not-getting-any-younger" 2013 macbook air, the crc32 takes ~0.06 seconds (including RuboCop's own source)

@dvandersluis
Copy link
Member Author

@marcandre yeah if you have an idea of how to fix this, go for it! I was playing around with Bundler::LockfileParser but it looks like you don't even need to go that way.

@marcandre
Copy link
Contributor

@marcandre yeah if you have an idea of how to fix this, go for it! I was playing around with Bundler::LockfileParser but it looks like you don't even need to go that way.

Yeah, my initial idea was overly complex. Let me whip up a PR then.

@marcandre
Copy link
Contributor

#8643 should be merged soon.

I think this present PR should be merged (although a Changelog entry would be nice).

@dvandersluis
Copy link
Member Author

@marcandre great! Updated with a changelog entry.

…for each file rather than using mtime.

File.mtime is faster, but inconsistent in CI, because in each CI build the mtime changes, which means that the rubocop cache implicitly cannot be used as-is. It's obviously slower to calculate a hash for each file, but this is still more performant (both in memory consumption and iterations per second) than the original.
@marcandre
Copy link
Contributor

@bbatsov still on the fence?

@bbatsov bbatsov merged commit 0904a72 into rubocop:master Sep 4, 2020
@bbatsov
Copy link
Collaborator

bbatsov commented Sep 4, 2020

Guess not. 😄 Might be a good idea to add some comment around the code about the CI impact, so someone wouldn't optimize it again in the future. :-)

marcandre added a commit that referenced this pull request Sep 4, 2020
@marcandre
Copy link
Contributor

Guess not. 😄 Might be a good idea to add some comment around the code about the CI impact, so someone wouldn't optimize it again in the future. :-)

Sure, I added a small comment 👍

@bbatsov
Copy link
Collaborator

bbatsov commented Sep 4, 2020

Thanks!

@dvandersluis
Copy link
Member Author

Thanks guys! 🙌

@dvandersluis dvandersluis deleted the fix-ci-caching branch January 18, 2021 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rubocop cache broken in CI
4 participants