Encoding issue regression for files imported into version 3.8.0 #6595
Does this happen on both macOS and Windows? I'm unable to reproduce it on macOS 14.4.1 (Sonoma) with either the 3.8.0 tag or the current development head. Did it guess UTF-16LE, or did you specify it? (It's correct, but I'm curious, because my system is guessing Windows-1252.) If you switch to a different encoding (e.g. UTF-8) and then back to UTF-16LE, do the symptoms change? It looks like it might be using the system default encoding instead of the correctly guessed encoding, but I'll have to dig further to understand where things are going awry.
I just double-checked with the binary 3.8.0 kit and bundled JVM (OpenJDK 64-Bit Server VM version 11.0.23+9), to be sure it wasn't some kind of binary-kit build issue, and got the same results as in my previous two tests using builds from source.
The "same" results as in reproducing the issue, or not reproducing it?

It did not guess the UTF-16LE encoding (it does on 3.7.9). On version 3.8, it guesses Windows-1252 when I choose import preview on the ZIP file itself. I just tried a fresh import to double-check, and it looks exactly like the screenshot. If I change the character encoding to UTF-16LE it refreshes the preview, but the NUL characters still show between the letters, as in the image above.

Also interesting, though probably not applicable to this issue: the checkbox for "Use character " to enclose cells containing column separators" is unchecked by default now in 3.8?
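As background on why a mis-guessed encoding looks exactly like this: UTF-16LE stores each ASCII character as its letter byte followed by a zero byte, so decoding those bytes with a single-byte encoding such as Windows-1252 surfaces a NUL between every pair of letters. A minimal Python illustration (not OpenRefine's actual code path):

```python
# "AIFM" encoded as UTF-16LE interleaves a zero byte after each letter
data = "AIFM".encode("utf-16-le")
print(data)  # b'A\x00I\x00F\x00M\x00'

# Decoding with a single-byte encoding like Windows-1252 keeps the
# zero bytes as visible NUL characters between the letters
mojibake = data.decode("windows-1252")
print([ord(c) for c in mojibake])  # [65, 0, 73, 0, 70, 0, 77, 0]
```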
I've been helping Mike Gordon with this issue, emailing back and forth, and we finally discovered that the real problem is the one that came up in https://forum.openrefine.org/t/trouble-with-sql-export/1485
None of my three tests reproduced the issue on macOS. The bug report says:
Did you actually test on both? What version of macOS? Which operating system are the screenshots above from?
I tested and reproduced on Windows 11. |
Also not reproducible with the binary 3.8.0 kit on macOS Big Sur 11.7.10 (20G1427). I'll see if I can find a Windows machine. |
Actually, it looks like my browsers are rendering things differently than yours, and there really are NULs in the data, so let me dig in more now that I can reproduce it.
Ah, the NULs are in the data itself: reading the bytes directly out of the compressed file also shows them.

```python
import zipfile

# Read a file from a Windows path (raw string, so the backslashes
# are not treated as escape sequences)
zip_file_path = r'F:\Downloads\IDENTIFIANTS_AIFM.zip'
file_name = 'IDENTIFIANTS_AIFM.csv'

with zipfile.ZipFile(zip_file_path, 'r') as myzip:
    with myzip.open(file_name, 'r') as f:
        # Read the first 32 bytes of the file and print their
        # decimal codepoint values
        print(list(f.read(32)))
```
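If the CSV begins with a UTF-16LE byte order mark, the first two bytes are FF FE, which is enough on its own for an encoding guesser to identify UTF-16LE. A small sketch (the sample bytes here are hypothetical, not read from the actual CSSF file):

```python
import codecs

# Hypothetical first bytes of the CSV: a UTF-16LE BOM followed by "ID"
sample = codecs.BOM_UTF16_LE + "ID".encode("utf-16-le")

assert sample[:2] == b"\xff\xfe"       # the BOM alone identifies UTF-16LE
print(sample[2:].decode("utf-16-le"))  # ID
```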
There are a number of different problems here:
I have fixes for all of these that I'll put up in a PR. |
* Refactor InputStreamReader handling: DRY up handling of null encoding and our special UTF-8-BOM encoding
* Fix user override of encoding guessing
* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595
  - Don't skip the BOM for non-UTF-8 cases, so that it is still available for the character encoding guesser
  - Add tests for UTF-16LE and UTF-16BE with BOMs
  - Refactor the encoding guesser so that it can be used for a single file
  - Enhance format guessing tests to not assume UTF-8 and instead guess the encoding using the refactored encoding guesser
* Use UTF-8 instead of US-ASCII for test fixtures
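As background for the UTF-16LE/UTF-16BE tests mentioned above: the two byte orders carry distinct BOMs, and a decoder that still sees the BOM can pick the right byte order automatically, which is why consuming the BOM before the guesser runs loses information. A Python sketch of the same idea (not OpenRefine's Java code):

```python
import codecs

# The two UTF-16 byte orders have distinct byte order marks
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"

# Python's "utf-16" codec consumes the BOM to select the byte order;
# strip the BOM up front and this automatic selection is lost
for bom, enc in ((codecs.BOM_UTF16_LE, "utf-16-le"),
                 (codecs.BOM_UTF16_BE, "utf-16-be")):
    data = bom + "AIFM".encode(enc)
    assert data.decode("utf-16") == "AIFM"
```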
In OpenRefine 3.8.0, extra NUL characters are introduced between letters in an imported dataset; this does not happen with the previous version, 3.7.9.
To Reproduce
Steps to reproduce the behavior:
Current Results
NUL characters are erroneously introduced between letters. This appears to be some kind of errant decoding/encoding issue.
Expected Behavior
It should work as it did with OpenRefine 3.7.9, without the extra NUL characters being introduced.
Screenshots
With OpenRefine 3.8.0
With OpenRefine 3.7.9
Versions
Datasets
Download the ZIP file from https://www.cssf.lu/en/Document/identifiers-of-aifms/
direct link: https://www.cssf.lu/wp-content/uploads/IDENTIFIANTS_AIFM.zip
Additional context
Opening the ZIP file directly in OpenRefine, or extracting it first and then opening the CSV in OpenRefine, causes the same encoding issue and the same NUL characters between letters.
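That is consistent with the ZIP handling being blameless: bytes read through a ZIP archive come back identical to the bytes that went in, zero bytes included, so the NULs must originate in how the CSV bytes are decoded. A self-contained sketch with hypothetical UTF-16LE content (file names reused from this report for illustration only):

```python
import io
import zipfile

# Hypothetical UTF-16LE CSV content (not the real CSSF data)
payload = "ID;NAME\n".encode("utf-16-le")

# Build a ZIP in memory, then read the member back out
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("IDENTIFIANTS_AIFM.csv", payload)
with zipfile.ZipFile(buf) as z:
    round_tripped = z.read("IDENTIFIANTS_AIFM.csv")

# zipfile returns exactly the bytes that were written, NULs included
print(round_tripped == payload)  # True
```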