Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip-gitignore walks all files, including ignored ones #1895

Closed
bmalehorn opened this issue Feb 11, 2022 · 2 comments · Fixed by #1900
Closed

skip-gitignore walks all files, including ignored ones #1895

bmalehorn opened this issue Feb 11, 2022 · 2 comments · Fixed by #1900

Comments

@bmalehorn
Copy link
Contributor

bmalehorn commented Feb 11, 2022

Hi, I'm looking into a performance issue with isort.

It seems that when you pass in --skip-ignore and a directory, it will:

  1. enumerate every file, including ignored files, in the repo
  2. query git to see which of those files should be ignored
  3. iterate through the folder
  4. ignore files from step 2

isort/isort/settings.py

Lines 555 to 586 in c6a4196

def _check_folder_gitignore(self, folder: str) -> Optional[Path]:
env = {**os.environ, "LANG": "C.UTF-8"}
try:
topfolder_result = subprocess.check_output( # nosec # skipcq: PYL-W1510
["git", "-C", folder, "rev-parse", "--show-toplevel"], encoding="utf-8", env=env
)
except subprocess.CalledProcessError:
return None
git_folder = Path(topfolder_result.rstrip()).resolve()
files: List[str] = []
# don't check symlinks; either part of the repo and would be checked
# twice, or is external to the repo and git won't know anything about it
for root, _dirs, git_files in os.walk(git_folder, followlinks=False):
if ".git" in _dirs:
_dirs.remove(".git")
for git_file in git_files:
files.append(os.path.join(root, git_file))
git_options = ["-C", str(git_folder), "-c", "core.quotePath="]
try:
ignored = subprocess.check_output( # nosec # skipcq: PYL-W1510
["git", *git_options, "check-ignore", "-z", "--stdin", "--no-index"],
encoding="utf-8",
env=env,
input="\0".join(files),
)
except subprocess.CalledProcessError:
return None
self.git_ignore[git_folder] = {Path(f) for f in ignored.rstrip("\0").split("\0")}
return git_folder

In my use case, my subdirectory has 8 files. But before checking those, isort will first enumerate all 8 million untracked files in the entire repo, which takes several minutes!

There are a couple potential fixes:

  1. stop calling os.walk, and instead run git ls-files. This gets all tracked files (which there's a lot less of) instead of getting all untracked files and filtering
  2. only call os.walk in the subdirectory you're running in, instead of the entire git repo

I'd be happy to prep a PR fixing this issue, but I want to see if that makes sense or if there's other considerations I'm not taking into account.

Reproduction

  1. modify isort to print each file it's walking in _check_folder_gitignore:
            for git_file in git_files:
                print("walking: ", os.path.join(root, git_file))
                files.append(os.path.join(root, git_file))
  1. run this script:
#!/bin/bash

set -xeuo pipefail

cd ~/tmp/
rm -rf isort-repro/
git init isort-repro
cd isort-repro
echo 'build/' > .gitignore
mkdir build/ src/
touch build/1.py build/2.py build/3.py # these should be ignored
touch src/main.py                      # this should be included
isort src/ --skip-gitignore

Actual output:

walking:  /Users/brian.malehorn/tmp/isort-repro/.gitignore
walking:  /Users/brian.malehorn/tmp/isort-repro/build/2.py
walking:  /Users/brian.malehorn/tmp/isort-repro/build/3.py
walking:  /Users/brian.malehorn/tmp/isort-repro/build/1.py
walking:  /Users/brian.malehorn/tmp/isort-repro/src/main.py

Expected output:

walking:  /Users/brian.malehorn/tmp/isort-repro/.gitignore
walking:  /Users/brian.malehorn/tmp/isort-repro/src/main.py

or maybe even this, since I only passed in src:

walking:  /Users/brian.malehorn/tmp/isort-repro/src/main.py

Related

These PRs are related to inclusion rules, but are specifically about symlinks: #1772, #1788

@anirudnits
Copy link
Collaborator

Hello @bmalehorn thanks for such a detailed ticket!

To be honest, I don't have much context on why this was done like this but if I have to argue for the present implementation it would be that get_files is a generic function which is being used with all configuration settings with _check_git_ignore being the extra check performed specifically for the skip-ignore flag. This seems to me a better design paradigm with base functions being used in the general case augmented with particular functionality when needed. The instance you shared with 8 million files will fall in the minority and then again I'll guess that symlinks would constitute a major chunk of these files which could be avoided using the dont-follow-links flag. Not to mention the extra dependency of the user having git installed in the system for git ls-files to work without which isort would anyways have to revert back to os.walk

Please raise if there's anything I might have gotten wrong and I would love to discuss this further and hear any other perspectives on this.

~ Aniruddha

@bmalehorn
Copy link
Contributor Author

Hey @anirudnits, let me elaborate a bit more now that I have a better understanding of the code.

First, ignoring symlinks is a useful heuristic for some repos, but common package managers like poetry and npm create a lot of files with no symlinks.

About os.walk vs git, my solution would still only require git if you ran --skip-gitignore. Some of the confusion stems from having two places where os.walk is called.

Here's how it works currently:

  • call os.walk for every target directory
  • for each git repo it encounters:
    • call os.walk to walk over all tracked and ignored files
    • call git check-ignore on all those files to narrow it to only ignored files
  • for each file it encounters, skip it if it's in the list of ignored files for that repo

The plan for how to implement it would be:

  • call os.walk for every target directory
  • for each git repo it encounters:
    • call git ls-files to get a list of all tracked files
  • for each file it encounters, skip it if it's not in the list of tracked files for that repo

I made a pull request here to show you what it would look like: #1900

Does that all make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants