Python's gb18030 decoder is not the same as w3c's #76

Open
HyperHCl opened this issue Oct 9, 2016 · 5 comments

@HyperHCl

HyperHCl commented Oct 9, 2016

https://www.w3.org/TR/encoding/#gb18030-decoder specifies a single-byte special case 0x80 → U+20AC for gbk compatibility, but Python's decoder does not perform this translation.
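A minimal way to see the difference, assuming CPython's built-in gb18030 codec (the variable names are just for illustration):

```python
# Python's strict gb18030 codec has no mapping for a lone 0x80 byte.
data = b"\x80"

try:
    decoded = data.decode("gb18030")
except UnicodeDecodeError as exc:
    decoded = None
    print("Python rejects the input:", exc.reason)

# The W3C/WHATWG gb18030 decoder would instead return U+20AC (EURO SIGN).
expected_by_whatwg = "\u20ac"
```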

@HyperHCl HyperHCl changed the title from "Python" to "Python's gb18030 decoder is not the same as w3c's" Oct 9, 2016
@redapple
Contributor

@HyperHCl , I'm not sure this is the right place to report this decoding issue.
Have you submitted the issue to the Python Core developers?

@HyperHCl
Author

HyperHCl commented Nov 10, 2016

Well, it's fairly clear that Python upstream will not accept this issue: they usually try to support the original national standard, not a W3C/WHATWG web standard. Python's codecs are quite pedantic (cf. ftfy's "sloppy" encodings). To Python this problem is just the world doing things The Wrong Way, but for the codecs to be useful on real-world data, people have to make them as wrong as the rest of the world.

@redapple
Contributor

@HyperHCl , I see. But where does this fit w3lib?

@HyperHCl
Author

HyperHCl commented Nov 14, 2016

By Googling for "whatwg encoding python" I found an implementation of that standard called webencodings. I hadn't actually verified how well it works (or whether it works at all), and... oops: it only provides a table of aliases that still point to Python's own windows-1252 and gb18030 codecs. Sounds like time to invent a wheel: say, w3lib.codecs, or just a separate w3codecs.

Implementations for each codec in question:

  • Single-byte windows code pages: should be similar to ftfy's sloppy codecs.
  • gb18030 and gbk can be wrappers around Python's fast, native one:
    • gb18030 decoder:
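      • valid GBK/GB18030 text never uses 0x80 as a lead byte, only as that single-byte euro sign or as a trail byte, so a stray 0x80 can be rewritten to b'\xA2\xE3' before decoding. Note that inputbytes.translate(bytes.maketrans(b'\x80', b'\xA2\xE3')) will not do this: maketrans requires both arguments to have the same length, and a blind byte substitution would also corrupt characters whose trail byte is 0x80. The same property may be used to construct a stream decoder and finally a complete one. Alternatively,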
      • wrap an error handler that handles 0x80 and carries on.
    • gbk encoder:
      • wrap the gb18030 encoder and raise an error whenever it would emit a four-byte GB18030 sequence. Alternatively:
      • put an error handler around the gbk encoder that maps u'\u20AC' → b'\x80'
  • Haven't looked into other MBCS's yet.
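A rough sketch of two of the options above: the error-handler approach for decoding, and the gb18030-wrapper approach for gbk encoding. It is not a complete implementation of the WHATWG spec. The names web_gb18030_0x80, decode_web_gb18030 and encode_web_gbk are made up for illustration, and the encoder goes character by character for simplicity rather than speed.

```python
import codecs


def _euro_for_0x80(exc):
    """Error handler: turn a stray 0x80 byte into U+20AC and carry on."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == 0x80:
        # Consume only the 0x80 byte; decoding resumes right after it.
        return ("\u20ac", exc.start + 1)
    raise exc  # anything else stays a hard error


# Hypothetical handler name, registered once at import time.
codecs.register_error("web_gb18030_0x80", _euro_for_0x80)


def decode_web_gb18030(data):
    """Decode with Python's native gb18030 codec plus the 0x80 special case."""
    return data.decode("gb18030", errors="web_gb18030_0x80")


def encode_web_gbk(text):
    """Encode like a web 'gbk' encoder: euro sign as 0x80, no four-byte forms."""
    out = bytearray()
    for pos, ch in enumerate(text):
        if ch == "\u20ac":
            out.append(0x80)
            continue
        encoded = ch.encode("gb18030")
        if len(encoded) == 4:
            # Four-byte sequences exist only in GB18030, not in GBK.
            raise UnicodeEncodeError(
                "gbk", text, pos, pos + 1, "character not representable in GBK"
            )
        out += encoded
    return bytes(out)
```

Since the handler only fires where the native decoder fails, a 0x80 that is a legitimate trail byte (as in b'\x81\x80') is left to the native codec.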

@openandclose

Since this thread is labeled as discussion...

I think many Python web applications face this problem.

That is, since Python codecs follow the unicode.org spec,
each developer has to invent their own way to support the web's 'sloppy' encodings.

ftfy solves part of the problem,
but simply creating codecs that follow encoding.spec.whatwg.org seems the obvious solution,
and in fact the ftfy author himself, @rspeer, proposed including them in the stdlib.
https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

But aside from the stdlib discussion,
I couldn't find any other third-party libraries or popular solutions,
nor any document or evidence saying it's not worth doing, if that is the case.
(At least w3lib doesn't do anything about it.)

What are people thinking and doing?
