black crashes when internal function get_gitignore tries to use gbk encoding to read a .gitignore file containing certain Chinese characters #1537

Closed
MapleCCC opened this issue Jul 7, 2020 · 5 comments
Labels
C: configuration CLI and configuration T: bug Something isn't working

Comments

@MapleCCC

MapleCCC commented Jul 7, 2020

Describe the bug
black crashes when internal function get_gitignore tries to use gbk encoding to read a .gitignore file containing certain Chinese characters.

To Reproduce
The issue is specific to Chinese-language platforms, so the reproduction may not work on systems configured for other languages. On a Chinese-language platform, the built-in open() function defaults to the gbk encoding, and that is where the problem originates.

  1. Create a new directory for bug reproduction.
mkdir playground
cd playground
  2. Initialize a Git repository. This is necessary because black reads the .gitignore file only when the directory containing it is a Git repository.
git init
  3. Create a .gitignore file and put the four Chinese characters "屏幕截图" inside it.
touch .gitignore
echo 屏幕截图 > .gitignore
  4. Run Black on the folder containing the .gitignore file.
black .
# Mind the dot at the end
  5. See error
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 8: illegal multibyte sequence
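The failure can be reproduced without black or a Chinese-language system by decoding the same bytes by hand. The sketch below assumes the .gitignore was saved as UTF-8 (which matches the byte 0xaa at position 8 in the traceback); the helper name try_decode is made up for illustration.

```python
# The four characters from the .gitignore above, encoded as UTF-8
# (assumption: the file on disk was written as UTF-8).
raw = "屏幕截图\n".encode("utf-8")


def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding UTF-8 bytes as gbk is what open() does implicitly when the
# locale's preferred encoding is gbk.
text, err = try_decode(raw, "gbk")
if err is not None:
    print(f"gbk failed at byte offset {err.start}: {err.reason}")

# Decoding with the encoding the file was written in succeeds.
text, err = try_decode(raw, "utf-8")
print(repr(text))
```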

Expected behavior
Black should not crash when encountering .gitignore files that contain Chinese characters.

Environment (please complete the following information):

  • Version: 19.10b0
  • OS and Python version: Windows CPython 3.8.3rc1

Does this bug also happen on master?
Yes.

Possible solution
black's internal function get_gitignore could explicitly use UTF-8 encoding when reading from .gitignore files. Then nothing should crash. This issue is similar to the one in Gita nosarthur/gita#74.
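A minimal sketch of that idea, assuming the patterns are read line by line before being handed to a pattern matcher. The function read_gitignore_lines is a hypothetical stand-in, not black's actual get_gitignore.

```python
from pathlib import Path
from typing import List


def read_gitignore_lines(root: Path) -> List[str]:
    """Read root/.gitignore as UTF-8, independent of the locale default."""
    gitignore = root / ".gitignore"
    if not gitignore.is_file():
        return []
    # Passing encoding explicitly stops open() from falling back to the
    # locale's preferred encoding (gbk on the reporter's machine).
    with gitignore.open(encoding="utf-8") as gf:
        return gf.read().splitlines()
```

This still raises UnicodeDecodeError if the file is not valid UTF-8.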

@MapleCCC MapleCCC added the T: bug Something isn't working label Jul 7, 2020
@JelleZijlstra
Collaborator

Even if we use UTF-8, we would still crash if the file contains invalid UTF-8. Is .gitignore specified to be UTF-8-encoded? https://git-scm.com/docs/gitignore doesn't say anything about encoding.

Maybe the solution should be to silently ignore the gitignore file if we can't decode it.
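One way that fallback might look, as a sketch only: try UTF-8 and treat an undecodable file as if it contained no patterns. The name read_gitignore_or_empty is hypothetical.

```python
from pathlib import Path
from typing import List


def read_gitignore_or_empty(gitignore: Path) -> List[str]:
    """Return the ignore patterns, or an empty list if the file is unusable."""
    if not gitignore.is_file():
        return []
    try:
        return gitignore.read_text(encoding="utf-8").splitlines()
    except UnicodeDecodeError:
        # Silently ignore a .gitignore that is not valid UTF-8.
        return []
```

The trade-off is that a file with a single bad byte disables every pattern in it without telling the user.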

@zsol
Collaborator

zsol commented Jul 7, 2020

I think to be fully correct we should be opening that file in binary mode, and using os.fsdecode on its contents (possibly after splitting it on \n) before passing it to PathSpec here: https://github.com/psf/black/blob/master/src/black/__init__.py#L5759
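A rough sketch of the binary-mode approach, stopping short of the PathSpec call. The function name read_gitignore_patterns is an assumption for illustration. os.fsdecode uses the filesystem encoding and its error handler, so the result depends on the platform.

```python
import os
from pathlib import Path
from typing import List


def read_gitignore_patterns(gitignore: Path) -> List[str]:
    """Read the file as bytes and decode each line like a filename."""
    raw = gitignore.read_bytes()
    # bytes.splitlines() handles \n and \r\n line endings and drops them.
    return [os.fsdecode(line) for line in raw.splitlines()]
```

On POSIX systems the filesystem error handler is surrogateescape, so undecodable bytes are preserved as lone surrogates rather than raising.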

@MapleCCC
Author

MapleCCC commented Jul 9, 2020

I agree with @JelleZijlstra . Even if black can't decode .gitignore, it should just keep going instead of crashing. This should not be a fatal exception.

@MapleCCC
Author

MapleCCC commented Jul 13, 2020

Another option to work around this might be to expose a command-line option, --encoding, that lets the user specify the encoding used for opening the file. This is how the pipreqs library handles the problem. bndr/pipreqs#125
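To make the proposal concrete, here is a sketch of such an option using the standard-library argparse. Everything in it is hypothetical: black exposes no --encoding flag, and the names build_parser, read_ignore_file, and example-tool are invented for illustration.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical tool with an --encoding option."""
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding used to read the .gitignore file",
    )
    parser.add_argument("root", type=Path, help="directory to process")
    return parser


def read_ignore_file(root: Path, encoding: str) -> List[str]:
    """Read root/.gitignore with the encoding chosen by the user."""
    gitignore = root / ".gitignore"
    if not gitignore.is_file():
        return []
    return gitignore.read_text(encoding=encoding).splitlines()
```

Usage would look like `args = build_parser().parse_args(["--encoding", "gbk", "."])` followed by `read_ignore_file(args.root, args.encoding)`.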

@ichard26 ichard26 added the C: configuration CLI and configuration label May 24, 2021
@ichard26
Collaborator

We ended up going the "use UTF-8 only" route in PR #2229 so this should have been fixed. Thank you @MapleCCC for reporting!

4 participants