Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checking directories via match_file() does not work on Path objects #65

Closed
yschroeder opened this issue Nov 15, 2022 · 5 comments
Closed

Comments

@yschroeder
Copy link

yschroeder commented Nov 15, 2022

I am currently investigating a problem in the black formatter which uses pathspec to figure out ignored files by parsing .gitignore files. black uses this library and tries to check if directories are ignored before checking their contents.

It boils down to this minimum example:

from pathlib import Path

import pathspec

spec = pathspec.PathSpec.from_lines('gitwildmatch', """
/important/
""".splitlines())

assert spec.match_file("important/bar.log")
assert spec.match_file("/important/")
assert spec.match_file("important/")
assert spec.match_file("important")  # does not match
assert spec.match_file(Path("important/"))  # does not match as pathlib removes trailing slash

It matches the file important/bar.log as expected, however, it does not match the folder when it has no trailing slash.

This becomes especially problematic when using Path objects, as the trailing slash is removed by pathlib.

I don't know where this needs to be fixed. What do you think? Is black using it wrong? Should the match_file() method check if the given Path object is a directory and add the trailing slash on its own before checking?

It would certainly be possible to implement a workaround in black and check with Path.is_dir(), convert to str and append the slash...

FYI, the place where the check happens in black is here. relative_path is a pathlib.Path and can be a directory.

@yschroeder
Copy link
Author

If you acknowledge that this needs to addressed by pathspec, I think the place to fix this would be util.normalize_file().

When it creates the str. It should check if it currently handles a Path and if yes, append the trailing slash if Path.is_dir().

@cpburnz
Copy link
Owner

cpburnz commented Nov 16, 2022

@yschroeder I'm undecided on this at the moment. I'm leaning towards supporting it but it's kind of inconsistent.

Supporting Path.is_dir() would be convenient but its behavior is not immediately apparent unless you're looking at the file-system. Path.is_dir() is only true when the directory exists on the file-system. It's false otherwise, even if constructed as Path('important/'). This makes sense when you think of it as an analog to os.path.isdir(). PurePath.is_dir() does not exist because it represents a virtual path that doesn't necessarily exist on the file-system. It would be nice if Path and PurePath maintained trailing slashes.

Does black always use Path with real file-system paths as opposed to PurePath or paths that don't exist?

@yschroeder
Copy link
Author

Yes, I am also not too sure, as Path.is_dir() requires that the path actually exists in the file system. pathspec only works on strings and doesn't care if the path exists or not, which is a good thing. I realized this problem when I tried to create a test case for pathspec which would require a mock for is_dir() as otherwise the test would need real files and folders.

Disclaimer: I am not a developer of black but only a user, so I might be completely wrong here.

What black does is:

  1. Find the root directory of the repository
  2. Create a PathSpec for the found .gitignore
  3. Iterate over all elements of the root directory via Path.iter_dir()
  4. Check the paths from step 3 with match_file() of the spec from step 2
  5. If step 4 did not match, format that file or decend into the directory
  6. Repeat above steps in the subdirectory

In step 4, the match_file() fails and we do not decend into the subdirectory in step 5 although we should.

TL;DR: black uses Path with real file-system paths which really exist (and tries to check directories with match_file()).

@cpburnz
Copy link
Owner

cpburnz commented Nov 22, 2022

@yschroeder

Thanks for the summary. After thinking about this, I don't think the right behavior is to treat Path with .is_dir() specially because its behavior is so tied to the file-system. What may be useful is I can add a utility function, say append_dir_sep(), which appends/adds the trailing path separator when Path.is_dir() is True. It would essentially be:

def append_dir_sep(path: pathlib.Path) -> str:
    out_path = str(path)
    if path.is_dir():
        out_path += os.sep
    return out_path

Then to use it you can do:

from pathspec.util import append_dir_sep

relative_path = append_dir_sep(relative_path)
if pattern.match_file(relative_path):
    pass

On a side note, looking the black's usage that you pointed out, here, it looks like relative_path is already a str when it's passed to pattern.match_file() because normalize_path_maybe_ignore() returns a str. With this in mind, the normalize_path_maybe_ignore() function would have to be modified to either use append_dir_sep() or add the dir separator itself.

cpburnz added a commit that referenced this issue Dec 10, 2022
@cpburnz
Copy link
Owner

cpburnz commented Dec 10, 2022

I've added the pathspec.util.append_dir_sep() function in the new release, v0.10.3. I believe that is all that I can address for this issue due to limitations with Path and the implementation of black.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants