Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding guesser for non-UTF-8 BOM-based files. Fixes #6595 #6596

Merged
merged 5 commits into from May 13, 2024

Conversation

tfmorris
Copy link
Member

@tfmorris tfmorris commented May 10, 2024

Fixes #6595. Also fixes #6527

Changes proposed in this pull request:

DRY up handling of null encoding and our special UTF-8-BOM encoding
…#6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser
@github-actions github-actions bot added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Priority: Critical Highlights issues that demand immediate action. encoding Selection of encoding at import time, or encoding issues in data cleaning labels May 10, 2024
@github-actions github-actions bot added Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators import About importers in general - add a label for the data format if available labels May 10, 2024
@tfmorris tfmorris removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label May 10, 2024
@wetneb
Copy link
Sponsor Member

wetneb commented May 13, 2024

This looks like something worth backporting to 3.8, perhaps?

@tfmorris
Copy link
Member Author

Yes, I think so. Apparently the UTF-16LE encoding is not uncommon in the Windows ecosystem, even though most of the rest of the world principally uses UTF-8.

The core fix for #6595 is simply adding the second parameter to new UnicodeBOMInputStream(is, true), so I could generate a more minimal fix, but I think all 3 fixes are useful and a bunch of the changed code in the PR are things like additional tests, added TODOs, etc.

@tfmorris tfmorris merged commit 2a38a88 into OpenRefine:master May 13, 2024
19 checks passed
@tfmorris tfmorris deleted the 6595-encoding-bom branch May 13, 2024 19:52
wetneb pushed a commit that referenced this pull request May 14, 2024
* Refactor InputStreamReader handling

DRY up handling of null encoding and our special UTF-8-BOM encoding

* Fix user override of encoding guessing

* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser

* Use UTF-8 instead of US-ASCII for test fixtures
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoding Selection of encoding at import time, or encoding issues in data cleaning import About importers in general - add a label for the data format if available Priority: Critical Highlights issues that demand immediate action. Type: Bug Issues related to software defects or unexpected behavior, which require resolution.
Projects
None yet
2 participants