Fix encoding guesser for non-UTF-8 BOM-based files. Fixes #6595 #6596

tfmorris · 2024-05-10T02:51:05Z

Fixes #6595. Also fixes #6527

Changes proposed in this pull request:

- Refactor InputStreamReader handling to DRY up handling of null encoding and our fake UTF-8 + BOM encoding (#6527, "CSV/TSV separators not guessed correctly for files with Byte Order Marks")
- Fix user override of file encoding
- Fix encoding guessing by not skipping the BOM for non-UTF-8 cases, so that it is still available to the character encoding guesser (#6595, "Encoding issue regression for files imported into version 3.8.0")
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor the encoding guesser so that it can be used for a single file
- Enhance format guessing tests so they do not assume UTF-8 and instead guess the encoding using the refactored encoding guesser
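For readers unfamiliar with why the BOM matters to the guesser: the leading bytes of a file can identify its encoding outright, but only if they have not already been consumed. The following is a minimal JDK-only sketch of that idea, not the code in this PR. `BomSniffer` and `detect` are hypothetical names introduced for illustration, and UTF-32 BOMs are deliberately left out to keep it short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not OpenRefine code: maps a leading BOM to a charset.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a BOM at the start of head, or null if none is recognized. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null; // no BOM: a real guesser would fall back to content heuristics
    }

    // Compares the first bytes of data against the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

If an upstream stream wrapper strips the `FF FE` or `FE FF` bytes before a check like this runs, the check sees only the payload and has nothing to go on, which is the failure mode described for #6595.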

wetneb · 2024-05-13T08:39:07Z

This looks like something worth backporting to 3.8, perhaps?

tfmorris · 2024-05-13T15:53:59Z

Yes, I think so. Apparently the UTF-16LE encoding is not uncommon in the Windows ecosystem, even though most of the rest of the world principally uses UTF-8.

The core fix for #6595 is simply adding the second parameter to new UnicodeBOMInputStream(is, true), so I could generate a more minimal fix, but I think all three fixes are useful, and much of the changed code in the PR consists of things like additional tests and added TODOs.
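To illustrate the behaviour described in the PR (skip the BOM only in the UTF-8 case, leave it in place otherwise), here is a rough JDK-only sketch. It does not reproduce OpenRefine's UnicodeBOMInputStream or the meaning of its second parameter. `Utf8BomSkipper`, `skipUtf8BomOnly`, and `remainingBytes` are hypothetical names, and the sketch assumes Java 9 or later for `readNBytes` and `readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not OpenRefine code.
final class Utf8BomSkipper {
    private Utf8BomSkipper() {}

    /** Drops a leading UTF-8 BOM; any other leading bytes (including UTF-16 BOMs) are pushed back. */
    static InputStream skipUtf8BomOnly(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // may return fewer than 3 at end of stream
        boolean utf8Bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!utf8Bom && n > 0) {
            pushback.unread(head, 0, n); // keep the bytes for a later encoding guesser or decoder
        }
        return pushback;
    }

    /** Convenience for experiments: returns what a reader of the wrapped stream would see. */
    static byte[] remainingBytes(byte[] raw) {
        try (InputStream in = skipUtf8BomOnly(new ByteArrayInputStream(raw))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, a UTF-16LE file still begins with `FF FE` when it reaches the guesser, and the JDK's `UTF-16` charset can then use that BOM to pick the byte order when decoding.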

* Refactor InputStreamReader handling
  DRY up handling of null encoding and our special UTF-8-BOM encoding
* Fix user override of encoding guessing
* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595
  - Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
  - Add tests for UTF-16LE and UTF-16BE with BOMs
  - Refactor encoding guesser so that it can be used for a single file
  - Enhanced format guessing tests to not assume UTF-8 and instead guess encoding using refactored encoding guesser
* Use UTF-8 instead of US-ASCII for test fixtures

tfmorris added 3 commits May 9, 2024 22:31

Refactor InputStreamReader handling

076f836

DRY up handling of null encoding and our special UTF-8-BOM encoding

Fix user override of encoding guessing

71cfd1b

github-actions bot added the Type: Bug, Priority: Critical, and encoding labels May 10, 2024

Back out unrelated change

deb53fe

tfmorris mentioned this pull request May 10, 2024

Handle Byte Order Mark (BOM) correctly for CSVs. Fixes #6527 #6528

Closed

github-actions bot added the Status: Pending Review and import labels May 10, 2024

tfmorris removed the Status: Pending Review label May 10, 2024

UTF-8 instead of US-ASCII for test fixtures

c0d55e5

wetneb approved these changes May 13, 2024

tfmorris merged commit 2a38a88 into OpenRefine:master May 13, 2024
19 checks passed

tfmorris deleted the 6595-encoding-bom branch May 13, 2024 19:52
