Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster Find of hidden files #8650

Closed
tleish opened this issue Sep 4, 2020 · 5 comments
Closed

Faster Find of hidden files #8650

tleish opened this issue Sep 4, 2020 · 5 comments

Comments

@tleish
Copy link
Contributor

tleish commented Sep 4, 2020

In profiling the delay between rubocop startup and the beginning of scanning files I discovered that rubocop will traverse the project directories/files twice. Once for all_files and then again to find hidden_files.

# lib/rubocop/target_finder.rb
def target_files_in_dir(base_dir = Dir.pwd)
  # ...
  all_files = find_files(base_dir, File::FNM_DOTMATCH)
  hidden_files = Set.new(all_files - find_files(base_dir, 0))
  #...
end

#...
def find_files(base_dir, flags)
  # ...
  Dir.glob(pattern, flags).select { |path| FileTest.file?(path) }
end

In the code above, #find_files is called twice which calls Dir.glob. Each time with a different flag. This is ineficient as it's touching the local disk twice for ever non-hidden file and it's checking every entry0for FileTest.file?(path) twice .

Describe the solution you'd like

A more efficient solution would be to only call #find_files once and then filter for hidden files.

- hidden_files = Set.new(all_files - find_files(base_dir, 0))
+ hidden_files = all_files.select {|file|file =~ /[\/|\\]\./ }

The regex looks for any file paths which include a '/.' (Linux) or ''\.'' (Windows)

Benchmarks

I tested this code change in a medium size Rails project.

Project Stats Stats:

all_files: 123,727
hidden_files: 2,553
target_files: 405

Note: The majority of the files are in engines/my-engine/node_modules directory (which is a separate issue).

require "benchmark/ips"

Benchmark.ips(7) do |x|
  x.report("glob:") { hidden_files = Set.new(all_files - find_files(base_dir, 0)) }
  x.report("select:") { hidden_files = all_files.select {|file|file =~ /\/\./ } }
  x.compare!
end

Results:

Comparison:
              select:       23.6 i/s
                glob:        1.0 i/s - 23.99x  (± 0.00) slower

The above IPS benchmark shows that on this project the 'select' code can ran 23 times within 1 second, whereas the glob could only run once within a second. In other words, glob is 23.99x slower.

Additional context

If this code change is desired, I can create an MR.

@fatkodima
Copy link
Contributor

Please, open a PR.

hidden_files = all_files.select {|file|file =~ /[\/|\\]\./ }

can be also improved to

hidden_files = all_files.select {|file| file.include?("#{File::SEPARATOR}.") }

@tleish
Copy link
Contributor Author

tleish commented Sep 24, 2020

Adding to the suggestion from @fatkodima, I realized we are creating a new string and/or regex with each loop. Testing the following code change:

+ HIDDEN_FILE_PATTERN = "#{File::SEPARATOR}."

- hidden_files = Set.new(all_files - find_files(base_dir, 0))
+ hidden_files = all_files.select { |file| file.include?(HIDDEN_FILE_PATTERN) }

Benchmarks now show a 57x improvement in my project:

Comparison:
             select:       50.6 i/s
               glob:        0.9 i/s - 57.38x  (± 0.00) slower

I'll create the PR.

@tleish
Copy link
Contributor Author

tleish commented Sep 24, 2020

FYI, I found a slightly better refactoring for hidden files which (for my project) saves an additional 30% inside #target_files_in_dir (from 3.9 sec down to 2.7 sec).

lib/rubocop/path_util.rb

+ HIDDEN_FILE_PATTERN = "#{File::SEPARATOR}."

+ def self.hidden_path?(path)
+   path.include?(HIDDEN_PATH_PATTERN)
+ end

module_function

def hidden_file_in_not_hidden_dir?(pattern, path)
+    PathUtil.hidden_path?(path) &&
      File.fnmatch?(
        pattern, path,
        File::FNM_PATHNAME | File::FNM_EXTGLOB | File::FNM_DOTMATCH
      ) &&
      !hidden_dir?(path)
end

def hidden_dir?(path)
-  File.dirname(path).split(File::SEPARATOR).any? do |dir|
-    dir.start_with?('.')
-  end
+  PathUtil.hidden_path?(File.dirname(path))
end

lib/rubocop/target_finder.rb

- hidden_files = Set.new(all_files - find_files(base_dir, 0))
+ hidden_files = all_files.select { |file| PathUtil.hidden_path?(file) }

So, the overall gain on finding hidden files and directories is faster. Do you want me to update the PR to use this approach?

@bbatsov
Copy link
Collaborator

bbatsov commented Sep 24, 2020

Fine by me. I don't mind more speed improvements.

@tleish
Copy link
Contributor Author

tleish commented Sep 24, 2020

Nevermind with the larger change. After running tests, the changes were breaking... so leaving the existing changes as is.

tleish added a commit to tleish/rubocop that referenced this issue Sep 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants