Python's gb18030 decoder is not the same as w3c's #76

HyperHCl · 2016-10-09T06:21:17Z

https://www.w3.org/TR/encoding/#gb18030-decoder specifies a single-byte special case 0x80 → U+20AC for gbk compatibility, but Python's decoder does not perform this translation.
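
For anyone who wants to reproduce this, here is a minimal sketch. The helper name `strict_gb18030_accepts` is made up for illustration and is not part of any library; it only checks whether Python's built-in gb18030 codec decodes a byte string without raising.

```python
def strict_gb18030_accepts(data: bytes) -> bool:
    """Return True if Python's built-in gb18030 codec decodes `data` in strict mode."""
    try:
        data.decode("gb18030")
    except UnicodeDecodeError:
        return False
    return True


# A lone 0x80 byte: the web standard maps it to U+20AC, Python's codec rejects it.
print(strict_gb18030_accepts(b"\x80"))
# The two-byte form of the euro sign is handled by Python's codec.
print(strict_gb18030_accepts(b"\xa2\xe3"))
```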

redapple · 2016-11-10T10:19:28Z

@HyperHCl , I'm not sure this is the right place to report this decoding issue.
Have you submitted the issue to the Python Core developers?

HyperHCl · 2016-11-10T14:07:42Z

Well, it's fairly clear that Python upstream will not accept this issue: they usually try to support the original national standard, not a W3C/WHATWG web standard. Python's codecs are quite pedantic, cf. ftfy's "sloppy" encodings. To Python this problem is just the world doing things The Wrong Way, but for the codecs to be useful on the web, someone has to make them as wrong as the rest of the world.

redapple · 2016-11-14T16:20:17Z

@HyperHCl , I see. But where does this fit w3lib?

HyperHCl · 2016-11-14T16:51:40Z

By Googling for "whatwg encoding python" I found an implementation for that standard called webencodings. ~~I haven't actually verified how well it works (or whether it works at all) though.~~ Uh oops... It only provides a table of aliases that still points to Python's windows-1252 and gb18030. Sounds like time to invent a wheel -- say, w3lib.codecs or just a separate w3codecs.

Implementations for each codec in question:

Single-byte Windows code pages: should be similar to ftfy's sloppy codecs.
gb18030 and gbk can be wrappers around Python's fast, native ones:
- gb18030 decoder:
  - valid GBK/GB18030 text never uses 0x80 as a lead byte, so a 0x80 in lead position can only be that single-byte euro sign. It can still occur as a trail byte, though, so blindly replacing every b'\x80' with b'\xA2\xE3' is not safe (and bytes.maketrans only maps one byte to one byte anyway). The lead-byte property may still be used to construct a stream decoder and finally a complete one. Alternatively,
  - wrap an error handler that handles 0x80 and carries on.
- gbk encoder:
  - wrap the gb18030 encoder and raise on any four-byte GB18030 sequence. Alternatively,
  - an error handler around the gbk encoder that handles u'\u20AC' → b'\x80'.

Haven't looked into other MBCSs yet.
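
The error-handler variants for the gb18030 decoder and the gbk encoder could look roughly like the sketch below. This is only an illustration under assumptions: the handler name `whatwg-euro-0x80` and the helpers `decode_gb18030_web` and `encode_gbk_web` are made up here and do not exist in w3lib or the stdlib, and it covers only the 0x80 special case, not the rest of the WHATWG algorithm.

```python
import codecs

EURO = "\u20ac"


def _euro_0x80_handler(exc):
    """Hypothetical handler: translate 0x80 <-> U+20AC, re-raise everything else."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == 0x80:
        # Python's decoder only fails here when 0x80 sits in lead position;
        # a 0x80 trail byte is consumed by the native decoder and never gets here.
        return EURO, exc.start + 1
    if isinstance(exc, UnicodeEncodeError) and exc.object[exc.start] == EURO:
        # Encode error handlers may return bytes, which are emitted as-is.
        return b"\x80", exc.start + 1
    raise exc


codecs.register_error("whatwg-euro-0x80", _euro_0x80_handler)


def decode_gb18030_web(data: bytes) -> str:
    """Decode with Python's gb18030 codec, plus the single-byte euro sign."""
    return data.decode("gb18030", errors="whatwg-euro-0x80")


def encode_gbk_web(text: str) -> bytes:
    """Encode with Python's gbk codec, emitting 0x80 for the euro sign."""
    return text.encode("gbk", errors="whatwg-euro-0x80")
```

Since the handler re-raises anything it does not recognise, other invalid input still fails as in strict mode. Going through the error path also keeps the native codec doing all the real work.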

openandclose · 2020-05-12T19:49:39Z

Since this thread is labeled as discussion...

I think many Python web applications face this problem.

That is, since Python codecs follow the unicode.org spec,
each developer has to invent their own way to support the web's 'sloppy' encodings.

ftfy solves part of the problem,
but just creating codecs that follow encoding.spec.whatwg.org seems the obvious solution,
and the ftfy author, @rspeer, actually proposed including them in the stdlib.
https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

But aside from the stdlib discussion,
I couldn't find any other third-party libraries or popular solutions,
nor any document or evidence saying it's not worth doing, if that is the case.
(At least w3lib doesn't do anything about it.)

What are people thinking and doing?

HyperHCl changed the title ~~Python~~ Python's gb18030 decoder is not the same as w3c's Oct 9, 2016

Gallaecio added the enhancement label May 9, 2019

Gallaecio added the discuss label Sep 24, 2019
