Fix some encoding warnings in Python 3.10 (PEP 597) #1614

Merged 4 commits on Oct 7, 2022
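For context: PEP 597 (Python 3.10) adds an opt-in EncodingWarning whenever a text file is opened without an explicit encoding= argument. A minimal repro, assuming Python 3.10+:

# demo.py -- run with: python -X warn_default_encoding demo.py
# (or set PYTHONWARNDEFAULTENCODING=1)
with open("somefile.txt", "w") as f:  # EncodingWarning: 'encoding' argument not specified
    f.write("hello")

with open("somefile.txt", "w", encoding="utf-8") as f:  # no warning
    f.write("hello")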
piptools/cache.py (4 changes: 2 additions & 2 deletions)

@@ -41,7 +41,7 @@ def __str__(self) -> str:
 def read_cache_file(cache_file_path: str) -> CacheDict:
-    with open(cache_file_path) as cache_file:
+    with open(cache_file_path, encoding="ascii") as cache_file:
Contributor:
And why ascii and not utf-8?

Contributor Author:
This is because, by default, json.dump has ensure_ascii=True [1], which guarantees that all non-ASCII characters in the output will be escaped.

[1]: https://docs.python.org/3/library/json.html#json.dump

Member:
@atugushev WDYT about this? Sounds OK? I do think the alternative would be to change the dump, disable ensure_ascii, and enforce the use of utf-8.

I am personally a big supporter of utf-8 everywhere, but I am not sure which approach is better here.
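A minimal sketch of that alternative, for illustration only (this is not what the PR does; the function names write_cache_utf8/read_cache_utf8 are hypothetical, mirroring the write and read paths in this diff):

import json

def write_cache_utf8(cache_file: str, cache: dict) -> None:
    # Hypothetical utf-8-everywhere variant: keep non-ASCII characters
    # unescaped on disk by disabling ensure_ascii, and always pass an
    # explicit utf-8 encoding on both write and read.
    doc = {"__format__": 1, "dependencies": cache}
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(doc, f, sort_keys=True, ensure_ascii=False)

def read_cache_utf8(cache_file: str) -> dict:
    with open(cache_file, encoding="utf-8") as f:
        return json.load(f)["dependencies"]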

Member:
Ambivalent. The cache is for the legacy resolver, which will be deprecated and removed later, so I'm not sure whether we should improve the legacy code.

Member:
>>> json.dumps({'ok': '👌'}, ensure_ascii=False)
'{"ok": "👌"}'

>>> json.dumps({'ok': '👌'}, ensure_ascii=True)
'{"ok": "\\ud83d\\udc4c"}'

>>> json.dumps({'ok': '👌'})
'{"ok": "\\ud83d\\udc4c"}'

While I don't expect non-ASCII symbols in the dependency cache, it would still be nice to have an explicit encoding='utf-8' param.

Contributor Author:
OK, utf-8 also looks good enough to me.

OTOH, I worry about another problem: there are some codecs that are not compatible with ascii:

{'cp037',
 'cp1026',
 'cp1140',
 'cp273',
 'cp424',
 'cp500',
 'cp875',
 'hz',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_7',
 'utf_8_sig'}

If one of them is the default codec in the user's environment, both encoding='ascii' and encoding='utf-8' may raise a UnicodeDecodeError. Such setups are rare, but they exist.

Is it a real problem that we should worry about? Should we catch UnicodeDecodeError and raise a CorruptCacheError for this case, or use encoding='locale' for the best compatibility?

The script that produced this list:

# https://docs.python.org/3/library/codecs.html#standard-encodings
codec_list = [
    'ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500',
    'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
    'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866',
    'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026',
    'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254',
    'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004',
    'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp',
    'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3',
    'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3',
    'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8',
    'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14',
    'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
    'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2',
    'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004',
    'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16',
    'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig',
]

non_compatible = set()

# For every ASCII code point, flag codecs whose encoded bytes differ
# from the ASCII encoding of the same character.
for codec in codec_list:
    for i in range(128):
        c = chr(i)
        try:
            if c.encode(codec) != c.encode("ascii"):
                print(f"{codec=}, {i=}, {c=}, {c.encode(codec) = }")
                non_compatible.add(codec)
        except UnicodeEncodeError:
            pass

print(non_compatible)
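A minimal sketch of the first option, assuming the existing CorruptCacheError and CacheDict from piptools/cache.py:

def read_cache_file(cache_file_path: str) -> CacheDict:
    with open(cache_file_path, encoding="ascii") as cache_file:
        try:
            doc = json.load(cache_file)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Treat undecodable bytes the same as malformed JSON.
            raise CorruptCacheError(cache_file_path)
    # ...format check and return as in the existing implementation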

Member:
If the cache is corrupted, you'll see the following error:

$ pip-compile 
The dependency cache seems to have been corrupted.
Inspect, or delete, the following file:
  /Users/albert/Library/Caches/pip-tools/depcache-cp3.10.json

The workaround is to delete the file as suggested above, or to use the --rebuild flag, which clears any caches upfront.
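For example:

$ rm /Users/albert/Library/Caches/pip-tools/depcache-cp3.10.json
# or, clearing all caches upfront:
$ pip-compile --rebuild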

Contributor Author:
UnicodeDecodeError is not caught here, so it will crash and print a traceback, which I think is not good UX for users. (Try adding b"\xA2".decode("ascii") after doc = json.load(cache_file).)

try:
    doc = json.load(cache_file)
except json.JSONDecodeError:
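A minimal repro of that crash, assuming a cache file that contains raw non-ASCII bytes:

import json

# Write a cache-like file containing a raw (unescaped) non-ASCII character...
with open("depcache.json", "w", encoding="utf-8") as f:
    f.write('{"ok": "👌"}')

# ...then read it back with encoding="ascii", as read_cache_file does.
# json.load() fails while decoding the stream and raises UnicodeDecodeError,
# which the json.JSONDecodeError handler above does not catch.
with open("depcache.json", encoding="ascii") as f:
    json.load(f)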
@@ -108,7 +108,7 @@ def as_cache_key(self, ireq: InstallRequirement) -> CacheKey:
     def write_cache(self) -> None:
         """Writes the cache to disk as JSON."""
         doc = {"__format__": 1, "dependencies": self._cache}
-        with open(self._cache_file, "w") as f:
+        with open(self._cache_file, "w", encoding="ascii") as f:
             json.dump(doc, f, sort_keys=True)
 
     def clear(self) -> None: