Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some encoding warnings in Python 3.10 (PEP 597) #1614

Merged
merged 4 commits into from Oct 7, 2022

Conversation

GalaxySnail
Copy link
Contributor

@GalaxySnail GalaxySnail commented Apr 17, 2022

See:

[1] https://docs.python.org/3/whatsnew/3.10.html#optional-encodingwarning-and-encoding-locale-option

[2] https://peps.python.org/pep-0597/

Contributor checklist
  • Provided the tests for the changes.
  • Assure PR title is short, clear, and good to be included in the user-oriented changelog
Maintainer checklist
  • Assure one of these labels is present: backwards incompatible, feature, enhancement, deprecation, bug, dependency, docs or skip-changelog as they determine changelog listing.
  • Assign the PR to an existing or new milestone for the target version (following Semantic Versioning).

@@ -41,7 +41,7 @@ def __str__(self) -> str:


def read_cache_file(cache_file_path: str) -> CacheDict:
with open(cache_file_path) as cache_file:
with open(cache_file_path, encoding="ascii") as cache_file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and why ascii and not utf-8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because by default, json.dump has ensure_ascii=True1, which guarantees all non-ASCII characters in the output will be escaped.

Footnotes

  1. https://docs.python.org/3/library/json.html#json.dump

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atugushev WDYT about this? Sounds ok? I do thing that the alternative would be to change the dump and disable ensure_ascii and enforce use of utf8.

I am personally a big supporter of utf-8 everywhere, but I am not sure which approach is better here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ambivalent. The cache is for the legacy resolver which will be deprecated and removed later, so I'm not sure if should we improve the legacy code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> json.dumps({'ok': '👌'}, ensure_ascii=False)
'{"ok": "👌"}'

>>> json.dumps({'ok': '👌'}, ensure_ascii=True)
'{"ok": "\\ud83d\\udc4c"}'

>>> json.dumps({'ok': '👌'})
'{"ok": "\\ud83d\\udc4c"}'

While I don't expect non-ASCII symbols in the dependency cache, however, would be nice to have an explicit encoding='utf-8' param.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. utf-8 also looks good enough for me.

OTOH, I worry about another problem -- There are some codecs not compatible with ascii:

{'cp037',
 'cp1026',
 'cp1140',
 'cp273',
 'cp424',
 'cp500',
 'cp875',
 'hz',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_7',
 'utf_8_sig'}

If one of them is the default codec on the user's environment, both encoding='ascii' and encoding='utf-8' may raise a UnicodeDecodeError. They are rare, but they exist.

Is it a real problem that we should worry about? Should we catch UnicodeDecodeError and raise a CorruptCacheError for this case, or use encoding='locale' for the best compatibility?

code
# https://docs.python.org/3/library/codecs.html#standard-encodings
codec_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

non_compatible = set()

for codec in codec_list:
    for i in range(128):
        c = chr(i)
        try:
            if c.encode(codec) != c.encode("ascii"):
                print(f"{codec=}, {i=}, {c=}, {c.encode(codec) = }")
                non_compatible.add(codec)
        except UnicodeEncodeError:
            pass

print(non_compatible)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache is corrupted you'll see the following error:

$ pip-compile 
The dependency cache seems to have been corrupted.
Inspect, or delete, the following file:
  /Users/albert/Library/Caches/pip-tools/depcache-cp3.10.json

The workaround is to delete the file as suggested above or use --rebuild flag which clears any caches upfront.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeDecodeError is not caught here, it will crash and print a traceback, I think it's not a good UX for users. (Try adding b"\xA2".decode("ascii") after doc = json.load(cache_file).)

@ssbarnea ssbarnea added the bug Something is not working label Oct 6, 2022
@ssbarnea ssbarnea added this to the 6.10.0 milestone Oct 6, 2022
Copy link
Member

@atugushev atugushev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix some EncodingWarnings

Just out of curiosity, how to get those warnings? I've tried on 3.10 but didn't see any.

@GalaxySnail
Copy link
Contributor Author

Just out of curiosity, how to get those warnings? I've tried on 3.10 but didn't see any.

According to PEP 597 and the documentation, EncodingWarning is disabled by default on python 3.10, and can be enabled by a -X warn_default_encoding command-line option or a PYTHONWARNDEFAULTENCODING=1 environment variable.

@ssbarnea
Copy link
Member

ssbarnea commented Oct 7, 2022

Please force use of utf-8 everywhere and lets close this. There is a pylint rule that identifies the lack of explicit encoding, probably we should enable pylint.

We already did this everywhere as this ensures that the files are portable, regardless current locale of the system where they are running. I even run pip-compile from different platforms on the same network share/mount.

Copy link
Member

@atugushev atugushev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@atugushev atugushev enabled auto-merge (squash) October 7, 2022 13:02
@atugushev atugushev merged commit 7ecc951 into jazzband:master Oct 7, 2022
@atugushev atugushev changed the title Fix some EncodingWarnings in python 3.10 (PEP 597) Fix some EncodingWarnings in Python 3.10 (PEP 597) Nov 10, 2022
@atugushev atugushev changed the title Fix some EncodingWarnings in Python 3.10 (PEP 597) Fix some encoding warnings in Python 3.10 (PEP 597) Nov 10, 2022
@atugushev atugushev added the cache Related to dependency cache label Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working cache Related to dependency cache
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants