Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow find_files does not use Include configuration #8646

Closed
tleish opened this issue Sep 4, 2020 · 8 comments
Closed

Slow find_files does not use Include configuration #8646

tleish opened this issue Sep 4, 2020 · 8 comments

Comments

@tleish
Copy link
Contributor

tleish commented Sep 4, 2020

A project with inline Rails engines can become slow with rubocops initial scan. For example, a project that looks like:

.
├── app
├── bin
├── config
├── db
├── engines
│   └── my_engine
│       ├── app
│       ├── bin
│       ├── config
│       ├── db
│       ├── lib
│       ├── log
│       ├── node_modules
│       ├── spec
│       ├── tmp
│       └── vendor
├── lib
├── log
├── node_modules
├── public
├── spec
├── tmp
└── vendor

With the following configuration file:

AllCops:
  Include:
    - 'app/**/*.rb'
    - 'config/**/*.rb'
    - 'lib/**/*.rb'
    - 'lib/**/*.rake'
    - 'spec/**/*.rb'
    - 'vendor/*/**/*.rb'
    - 'Rakefile*'
    - 'engines/*/app/**/*.rb'
    - 'engines/*/config/**/*.rb'
    - 'engines/*/lib/**/*.rb'
    - 'engines/*/lib/**/*.rake'
    - 'engines/*/spec/**/*.rb'
    - 'engines/*/vendor/*/**/*.rb'
    - 'engines/*/Rakefile*'
  Exclude:
    - 'bin/*'
    - 'app/assets/**/*'
    - 'app/javascript/**/*'
    - 'app/views/**/*'
    - 'tmp/**/*'
    - 'public/**/*'
    - 'node_modules/**/*'
    - 'vendor/**/*'
    - 'db/*'
    - 'engines/*/bin/*'
    - 'engines/*/app/assets/**/*'
    - 'engines/*/app/javascript/**/*'
    - 'engines/*/app/views/**/*'
    - 'engines/*/tmp/**/*'
    - 'engines/*/node_modules/**/*'
    - 'engines/*/db/**/*'

Expected behavior

I would expect that the file scanning would scan the "Include" directories, and "Exclude" the directories after before analyzing.

Actual behavior

What actually happens is rubocop scans top level root directories, then filters for Exclude patterns

  1. gets a list of all directories in the root folder at one level
  2. subtracts any one-level directories listed in the Exclude config
  3. scans all files
  4. filters based on Ruby patterns

The problem is, node_modules folder will often have thousands of files. The root node_modules can be skipped from the initial scan, but the engines/my_engine/node_modules cannot be skipped in the initial filescan (based on the configuration).

Because of this, in the project I'm testing the scanning takes 7 seconds just to get the list of files. If I scan by only

Steps to reproduce the problem

See description

RuboCop version

0.77.0 (using Parser 2.6.5.0, running on ruby 2.6.5 x86_64-darwin18)

@marcandre
Copy link
Contributor

Sorry about this.

There are multiple issues open about include and exclude... (e.g. #8004 ) 😢

@tleish
Copy link
Contributor Author

tleish commented Sep 4, 2020

Ah, thanks for the heads up. I briefly looked before but didn't dig deep enough. I found several now. I'll close this one.

@tleish tleish closed this as completed Sep 4, 2020
@tleish
Copy link
Contributor Author

tleish commented Sep 25, 2020

@marcandre - how would you feel about adding an optional parameter to fine-tune the initial search and loading of files for rubocop? Something like:

AllCops:
  GlobSearchPatterns:
    - '*'
    - 'engines/*/*'
    - '**/*.{rb,rake}'

Then, update TargetFinder#find_files with the following changes:

def find_files(base_dir, flags)
  pattern = files_config_pattern(base_dir) || files_default_pattern(base_dir)
  Dir.glob(pattern, flags | File::FNM_EXTGLOB).select { |path| FileTest.file?(path) }
end

def files_config_pattern(base_dir)
  all_cops_config = @config_store.for(base_dir).for_all_cops
  glob_search_patterns = all_cops_config['GlobSearchPatterns']
  if glob_search_patterns
    glob_search_patterns.map { |pattern| File.join(base_dir, pattern) }
  end
end

def files_default_pattern(base_dir)
  wanted_toplevel_dirs = toplevel_dirs(base_dir, flags) -
                         excluded_dirs(base_dir)
  wanted_toplevel_dirs.map! { |dir| dir << '/**/*' }
  
  if wanted_toplevel_dirs.empty?
    # We need this special case to avoid creating the pattern
    # /**/* which searches the entire file system.
    ["#{base_dir}/**/*"]
  else
    # Search the non-excluded top directories, but also add files
    # on the top level, which would otherwise not be found.
    wanted_toplevel_dirs.unshift("#{base_dir}/*")
  end
end

Testing this in a large project with multiple engines and files in node_modules directories.

#find_files BEFORE:

Found Files: 117,812
Target Files: 413
Benchmark: 2.65 sec

#find_files AFTER:

Found Files: 592
Target Files: 413
Benchmark: 0.33 sec

Using Glob in the name indicates it only takes glob parameters (and not regex).

Thoughts?

@bbatsov
Copy link
Collaborator

bbatsov commented Sep 25, 2020

I'm wary of adding more parameters without very careful consideration, as already the configuration is way more complicated than what I had in mind initially. :-) I can see the merit of what you're proposing, but I can only imagine it would add further confusion for some people. Might be best to create some meta-issue collecting the common problems with include/exclude, so we can discuss solutions in a more focused manner.

I would expect that the file scanning would scan the "Include" directories, and "Exclude" the directories after before analyzing.

We can also consider changing the behaviour in this manner, as I don't think it's going to break a lot of things. First we only had a notion of exclusion, but then added the inclusion to allow people to specify more precise filesets on which to operate. I'm assuming that part of the inefficiency in the current implementation might be related to this. @jonas054 might remember more.

@tleish
Copy link
Contributor Author

tleish commented Sep 25, 2020

Ruby Extension Strategy

So, why does rubocop search for all files initially, and not just files with known ruby extensions? (i.e. *.rb, *.rake, Rakefile, etc)

Gitignore Strategy

So, digging into this a little further, I believe Dir.glob is not what is taking so long, it's evaluating the results of Dir.glob that takes the majority of the time.

What are you thoughts on the following approach?

  1. Get a recursive list of all directories in the root_dir
  2. Remove any directories ignored by [root_dir]/.gitignore (if exists)
  3. Glob all files from filtered directories

Some initial experimentation of this shows this as promising. Often projects exclude node_modules and tmp directories, which can have thousands of files. Using the .gitignore (if in the root) can drastically decrease the number of files evaluated. In the project I tested, it took the files from 189,247 down to 2,344. Doing this approach added 1 ruby gem for evaluating the directory agains .gitignore and was overall faster. This could be done without adding additional options to .rubocop.yml.

@jonas054
Copy link
Collaborator

Remove any directories ignored by [root_dir]/.gitignore (if exists)

We have always stayed away from assuming any specific version handling system, which is why #7920 was added to fix #595. Is it possible to solve the current problem using only Exclude properties, possibly with ERB pre-processing, rather than parsing .gitignore?

@tleish
Copy link
Contributor Author

tleish commented Sep 27, 2020

@jonas054 - LOL! I was so deep into the weeds I didn't think to use Exclude to apply to directories?

This could work. In the process of mocking it up I discovered it should use glob/string patterns only and not regex. With regex, could could mistakenly leave out an entire directory.

For example, a pattern to say exclude all files in the config folder except config/routes.rb

  Exclude:
    - !ruby/regexp /\/config\/(?:(?!routes.rb).)*$/

When applied to a /path/to/config/ directory, the above pattern would exclude the entire directory.

Also, using just the glob exclude patterns (and not the regex patterns) is good enough. The benchmarks I have with this change are:

Before:

$ time bundle exec rubocop -L 
...
________________________________________________________
Executed in    6.41 secs   fish           external 
   usr time    5.16 secs   88.00 micros    5.16 secs 
   sys time    1.25 secs  476.00 micros    1.25 secs 

After:

$ time bundle exec rubocop -L 
...
________________________________________________________
Executed in    2.31 secs   fish           external 
   usr time    1.36 secs  131.00 micros    1.36 secs 
   sys time    0.91 secs  668.00 micros    0.91 secs 

Executed almost 1/3 of the time, same list of files.

@tleish
Copy link
Contributor Author

tleish commented Sep 28, 2020

Created an MR.

Github displays a diff a little oddly, a clearer diff would look like this.

def find_files(base_dir, flags)
-  wanted_toplevel_dirs = toplevel_dirs(base_dir, flags) -
-                            excluded_dirs(base_dir)
-
-  wanted_toplevel_dirs.map! { |dir| dir << "/**/*" }
-  pattern = if wanted_toplevel_dirs.empty?
-              # We need this special case to avoid creating the pattern
-              # /**/* which searches the entire file system.
-              ["#{base_dir}/**/*"]
-            else
-              # Search the non-excluded top directories, but also add files
-              # on the top level, which would otherwise not be found.
-              wanted_toplevel_dirs.unshift("#{base_dir}/*")
-            end
+  patterns = wanted_dir_patterns(base_dir, flags)
+  # We need this special case to avoid creating the pattern
+  # /**/* which searches the entire file system.
+  patterns = ["#{base_dir}/**/*"] if patterns.empty?

-  Dir.glob(pattern, flags | File::FNM_EXTGLOB).select { |path| FileTest.file?(path) }
+  Dir.glob(patterns, flags | File::FNM_EXTGLOB).select { |path| FileTest.file?(path) }
end

- def toplevel_dirs(base_dir, flags)
-   Dir.glob(File.join(base_dir, '*'), flags).select do |dir|
-     File.directory?(dir) && !dir.end_with?('/.', '/..')
-   end
- end
-
- def excluded_dirs(base_dir)
-   all_cops_config = @config_store.for(base_dir).for_all_cops
-   dir_tree_excludes = all_cops_config['Exclude'].select do |pattern|
-     pattern.is_a?(String) && pattern.end_with?("/**/*")
-   end
-   dir_tree_excludes.map { |pattern| pattern.sub(%r{/\*\*/\*$}, '') }
- end

+ def wanted_dir_patterns(base_dir, flags)
+   exclude_pattern = combined_exclude_glob_patterns(base_dir)
+   flags = flags | File::FNM_PATHNAME | File::FNM_EXTGLOB | File::FNM_DOTMATCH
+   Dir.glob("#{base_dir}/**/", flags)
+   .map { |dir| dir << '*' } # add file glob pattern to end of each dir
+   .reject { |dir| File.fnmatch?(exclude_pattern, dir, flags) }
+ end
+
+ def combined_exclude_glob_patterns(base_dir)
+   all_cops_config = @config_store.for(base_dir).for_all_cops
+   patterns = all_cops_config['Exclude'].select { |pattern| pattern.is_a? String }
+              .map { |pattern| pattern.sub("#{base_dir}/", '') }
+   "#{base_dir}/{#{patterns.join(',')}}"
+ end

@tleish tleish reopened this Sep 28, 2020
tleish added a commit to tleish/rubocop that referenced this issue Sep 29, 2020
…ies first and then apply Rubocop Exclude on directories before finding files
tleish added a commit to tleish/rubocop that referenced this issue Sep 29, 2020
…ies first and then apply Rubocop Exclude on directories before finding files
bbatsov pushed a commit that referenced this issue Sep 30, 2020
After sleeping on it, I realized there was a simpler (less code) and slightly faster version of the previous #8806.

By _recursively_ getting all directories and then at each directory level testing Rubocop Exclude on directories.  Once directories are found at all levels, we search for files.

In addition, I replaced many of the `base_dir` string concatenations with the safer `File#join`.

Benchmarks on a large project removes an additional 500+ milliseconds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants