Fix encoding guesser for non-UTF-8 BOM-based files. Fixes #6595 #6596
DRY up handling of null encoding and our special UTF-8-BOM encoding
…#6595
- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding using refactored encoding guesser
This looks like something worth backporting to 3.8, perhaps?
Yes, I think so. Apparently the UTF-16LE encoding is not uncommon in the Windows ecosystem, even though most of the rest of the world principally uses UTF-8. The core fix for #6595 is simply adding the second parameter to |
* Refactor InputStreamReader handling
  DRY up handling of null encoding and our special UTF-8-BOM encoding
* Fix user override of encoding guessing
* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595
  - Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
  - Add tests for UTF-16LE and UTF-16BE with BOMs
  - Refactor encoding guesser so that it can be used for a single file
  - Enhanced format guessing tests to not assume UTF-8 and instead guess encoding using refactored encoding guesser
* Use UTF-8 instead of US-ASCII for test fixtures
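For readers unfamiliar with the idea behind "don't skip BOM for non-UTF-8 cases", here is a minimal sketch of peeking at a byte order mark without consuming it, so that a later encoding guesser still sees it. This is not the code from this pull request: the class `BomSniffer` and its methods `detect` and `peek` are hypothetical names chosen for illustration, and the sketch only covers the UTF-8 and UTF-16 BOMs mentioned above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of this project's API.
public class BomSniffer {

    /**
     * Returns the charset name implied by a leading BOM, or null if no known BOM is present.
     * Limitation: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and is not distinguished here.
     */
    public static String detect(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    /**
     * Reads up to three bytes, then rewinds the stream so the BOM is still
     * available to whatever reads the stream next.
     */
    public static String peek(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(3);
        byte[] head = new byte[3];
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n < 0) {
                break; // stream is shorter than three bytes
            }
            len += n;
        }
        in.reset();
        return detect(head, len);
    }
}
```

Wrapping the input in a `BufferedInputStream` (or using a `ByteArrayInputStream`) guarantees mark/reset support. After `peek` returns, the next `read()` on the stream still yields the first BOM byte.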
Fixes #6595. Also fixes #6527
Changes proposed in this pull request: