Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip-gitignore: use allow list, not deny list #1900

Merged
merged 3 commits into from May 11, 2022

Conversation

bmalehorn
Copy link
Contributor

Improve the performance of --skip-gitignore by enumerating all tracked files instead of all ignored files.

Before:

  • call os.walk for every target directory
  • for each git repo it encounters:
    • call os.walk to walk over all tracked and ignored files
    • call git check-ignore on all those files to narrow it to only ignored files
  • for each file it encounters, skip it if it's in the list of ignored files for that repo

After:

  • call os.walk for every target directory
  • for each git repo it encounters:
    • call git ls-files to get a list of all tracked files
  • for each file it encounters, skip it if it's not in the list of tracked files for that repo

Closes #1895.

@richardebeling
Copy link

Struggeling with the same problem you described in #1895 (~90s run time an hour ago for ~10k lines of code across 107 files), thus this post: This reduces runtime a bit, but doesn't fully solve it for me.

On my codebase, with skip_gitignore enabled in a config file, after cleaning out some old venv directories, 5.10.1 takes ~16s for isort --show-files .
Your branch here takes ~7.2s
However, doing isort --show-files . > isort_files.txt and then cat isort_files.txt | time xargs isort only takes 1.1s, which is what I had hoped to be isort's overall performance.
For comparison, black runs in 3.3 seconds, and has arguably more stuff to do.

Using strace isort --show-files . 2>&1 | grep stat, I still get a huge amount of file status queries, mostly on files that are in folders ignored in .gitignore. I noticed, however, that I no longer get queries on node_modules, which is in the defaults for skip.

It looks to me as if the following execution flow doesn't ignore a .gitignored directory:

            # git_ls_files are good files you should parse. If you're not in the allow list, skip.
            if (
                git_folder
                and not file_path.is_dir()
                and str(file_path.resolve()) not in self.git_ls_files[git_folder]
            ):
                return True
        return False

so file_path.is_dir() is true, so this returns false, and os.walk will still descend into the directory.

@bmalehorn
Copy link
Contributor Author

bmalehorn commented Mar 22, 2022

Yup @he3lixxx you're right that this PR does not completely fix the problem.

I wrote this fix as a performance improvement that specifically targeted my use case:
a. tracked files in whole repo: ~30k
b. untracked files in whole repo: ~2m
c. tracked files in directory I'm running isort on: ~100
d. untracked files in directory I'm running isort on: ~1k

Before my PR, the runtime was O(a + b). With this PR it's O(a + c + d). That's a lot faster for me! But in an ideal world, it would only be O(c). And if you're running it on the whole repo like you are it's not much faster at all.


It looks to me as if the following execution flow doesn't ignore a .gitignored directory:

You're right. The reason I wrote it that way is that git does not have an easy way to get a list of ignored / tracked directories, only tracked files.

A more correct way would be:

  • save a sorted list of all tracked files
  • if you're checking a directory, binary search the list of tracked files
  • if the file right after this directory has the same prefix, descend:
   foo/c/bar.txt
-> foo/d/ would go here
   foo/d/buzz.txt
  • Otherwise, skip it:
   foo/c/bar.txt
-> foo/d/ would go here
   foo/e/buzz.txt

That would probably work & have good performance. It would skip ignored directories which is the main issue.

The most correct solution would be to remove os.walk entirely in favor of running git ls-files, but that might conflict with other custom rules around isort skipping.


@he3lixxx does that make sense? I could update this PR to include that. But I kind of want to get the maintainers to take a look at it & be ok with the general idea before trying to optimize (& obfuscate) it further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

skip-gitignore walks all files, including ignored ones
3 participants