
Encoding issue regression for files imported into version 3.8.0 #6595

Closed
thadguidry opened this issue May 8, 2024 · 10 comments · Fixed by #6596
Assignees
Labels
encoding - Selection of encoding at import time, or encoding issues in data cleaning
Priority: Critical - Highlights issues that demand immediate action.
Type: Bug - Issues related to software defects or unexpected behavior, which require resolution.
Milestone

Comments

@thadguidry
Member

thadguidry commented May 8, 2024

In OpenRefine 3.8.0, extra NUL characters are introduced between the letters of an imported dataset; this does not happen with the previous version, 3.7.9.

To Reproduce

Steps to reproduce the behavior:

  1. First, download dataset zip from below
  2. Open using OpenRefine 3.8.0
  3. Notice NUL chars between letters
  4. Now open using OpenRefine 3.7.9
  5. Notice no NUL chars between letters

Current Results

NUL characters are erroneously introduced between letters. This appears to be some kind of errant decoding/encoding issue.
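For readers unfamiliar with this failure mode, here is a minimal Python sketch (an illustration of the symptom, not OpenRefine code) of why a UTF-16LE file decoded with a single-byte encoding such as Windows-1252 shows a NUL between every letter: UTF-16LE stores each ASCII character as two bytes, the second of which is 0x00.

```python
# Illustration only (not OpenRefine code): UTF-16LE stores ASCII letters
# as two bytes each, the second byte being 0x00.
data = "NNNN".encode("utf-16-le")      # b'N\x00N\x00N\x00N\x00'

# Decoding those bytes with a single-byte encoding keeps each 0x00 byte
# as a U+0000 (NUL) character between the letters.
wrong = data.decode("windows-1252")
print(repr(wrong))                     # 'N\x00N\x00N\x00N\x00'
```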

Expected Behavior

Import should work as it did with OpenRefine 3.7.9, without the extra NUL characters being introduced.

Screenshots

With OpenRefine 3.8.0:
[screenshot]
With OpenRefine 3.7.9:
[screenshot]

Versions

  • Operating System: Mac OS and Windows 11
  • Browser Version: Edge and Safari
  • JRE or JDK Version: embedded
  • OpenRefine: 3.8.0

Datasets

Download the ZIP file from https://www.cssf.lu/en/Document/identifiers-of-aifms/
direct link: https://www.cssf.lu/wp-content/uploads/IDENTIFIANTS_AIFM.zip

Additional context

Opening the ZIP file directly in OpenRefine, or extracting it first and then opening the CSV, causes the same encoding issue and the appearance of NUL characters between letters.

@thadguidry thadguidry added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Priority: Critical Highlights issues that demand immediate action. encoding Selection of encoding at import time, or encoding issues in data cleaning labels May 8, 2024
@tfmorris
Member

tfmorris commented May 9, 2024

This happens on both macOS and Windows? I'm unable to reproduce it on macOS 14.4.1 (Sonoma) with either the 3.8.0 tag or the current development head.

Did it guess UTF-16LE or did you specify it? (It's correct, but I'm curious because my system is guessing Windows-1252)

If you try switching to a different encoding (e.g. UTF-8) and then back to UTF-16LE, do the symptoms change?

It looks like it might be using the system default encoding instead of the correctly guessed encoding, but I'll have to dig further to understand where things are going awry.
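For context, a simplified sketch of how BOM-based encoding detection typically works (an illustration, not OpenRefine's actual guesser): the detector inspects the first few bytes for a known BOM signature, checking the longer signatures first so a UTF-32 BOM is not mistaken for a UTF-16 one.

```python
# Simplified BOM sniffing; longest signatures checked first.
# Not OpenRefine's actual implementation.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]

def guess_from_bom(head):
    """Return the encoding name for a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

print(guess_from_bom(b"\xff\xfeN\x00"))   # UTF-16LE
```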

@tfmorris
Member

tfmorris commented May 9, 2024

I just double checked with the binary 3.8.0 kit and bundled JVM (OpenJDK 64-Bit Server VM version 11.0.23+9) to be sure it wasn't some kind of binary kit build issue and got the same results as with my previous two tests using builds from source.

@thadguidry
Member Author

thadguidry commented May 9, 2024

The "same" results of reproducing the issue? or not reproducing the issue?

It did not guess the UTF-16LE encoding (3.7.9 does). On version 3.8, it guesses Windows-1252. When I choose the import preview on the ZIP file itself, it looks exactly like this when I try again fresh, just to double-check:

[screenshot]

If I change the character encoding to UTF-16LE, the preview refreshes, but the NUL characters still show between the letters as in the image above.

Also interesting, though I don't think it is applicable to this issue: the checkbox for “Use character " to enclose cells containing column separators” is unchecked by default now in 3.8?

@thadguidry
Member Author

I've been helping Mike Gordon with this issue, emailing back and forth, and we finally discovered that the real problem is the one that came up in https://forum.openrefine.org/t/trouble-with-sql-export/1485

@tfmorris
Member

tfmorris commented May 9, 2024

The "same" results of reproducing the issue? or not reproducing the issue?

None of my three tests reproduced the issue on macOS.

The bug report says:

Operating System: Mac OS and Windows 11

Did you actually test on both? What version of macOS? Which operating system are the screenshots above from?

@thadguidry
Member Author

thadguidry commented May 9, 2024

I tested and reproduced on Windows 11.
Mike Gordon in the forum is on a mid-2012 MacBook Pro running macOS Catalina and OpenRefine 3.8.

@tfmorris
Member

tfmorris commented May 9, 2024

Also not reproducible with the binary 3.8.0 kit on macOS Big Sur 11.7.10 (20G1427). I'll see if I can find a Windows machine.

@tfmorris
Member

tfmorris commented May 9, 2024

Actually, it looks like my browsers are rendering things differently than yours and there are actually NULs in the data, so let me dig in more now that I can reproduce it.

@thadguidry
Member Author

thadguidry commented May 10, 2024

Ah, in the data: just reading the bytes directly out of the compressed file, I also see NULs, or rather \u0000 / char(0).
For example, I see the gaps within all the runs of char(78), which is "N", where it should read NNNNNNNN.
Since NUL is a non-printable character, you have to use unicode_escape to surface the non-printable characters when printing.

import zipfile

# Use a raw string so the backslashes in the Windows path are not
# interpreted as escape sequences
zip_file_path = r'F:\Downloads\IDENTIFIANTS_AIFM.zip'

with zipfile.ZipFile(zip_file_path, 'r') as myzip:
    file_name = 'IDENTIFIANTS_AIFM.csv'
    with myzip.open(file_name, 'r') as f:
        # Read the first 32 bytes of the file and print their Unicode
        # decimal codepoints; the embedded NULs show up as 0s
        f_list = [ord(i) for i in f.read(32).decode('unicode_escape')]
        print(f_list)

[screenshot]
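As a complementary check (my own sketch, not from the thread, using stand-in bytes rather than the real CSV): decoding the same two-byte sequences as UTF-16LE, the encoding 3.7.9 guessed, yields clean text with no embedded NULs.

```python
# Stand-in bytes for a run of "N" as it appears in the UTF-16LE CSV.
raw = b"N\x00N\x00N\x00N\x00"

decoded = raw.decode("utf-16-le")   # correct decoding, as 3.7.9 guessed
print(decoded)                      # NNNN, with no embedded \u0000
```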

@tfmorris
Member

There are a number of different problems here:

  1. The fix for the wonky Microsoft UTF-8 + BOM problem (#6025, "CSV Byte Order Mark (BOM) support", addressing #1241, "CSV files with BOM don't correctly strip BOM on import") broke encoding guessing for the other BOM-based encodings (including the UTF-16LE used in this file), because all BOMs were being stripped, making them unavailable to the character encoding guesser.
  2. The manual character encoding override is being ignored. This might have been broken for a long time, but wasn't critical as long as the encoding guessing was working correctly.
  3. There are no tests for encodings with BOM, other than the recently added UTF-8 + BOM test, so there was no test coverage for this failure.

I have fixes for all of these that I'll put up in a PR.
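Point 1 above can be illustrated with a small Python sketch (hypothetical and simplified; the real code is Java inside OpenRefine's importer): if the BOM is stripped before the guesser runs, a UTF-16LE file loses its only reliable signature and falls through to a single-byte fallback.

```python
def strip_bom(data):
    """Remove a leading BOM, as the buggy pre-processing step did."""
    for bom in (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff"):
        if data.startswith(bom):
            return data[len(bom):]
    return data

def guess(data):
    """Toy encoding guesser: BOM signatures first, else a single-byte fallback."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    return "Windows-1252"

raw = b"\xff\xfe" + "N".encode("utf-16-le")
print(guess(strip_bom(raw)))   # Windows-1252: BOM stripped before guessing
print(guess(raw))              # UTF-16LE: BOM preserved for the guesser
```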

@tfmorris tfmorris self-assigned this May 10, 2024
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue May 10, 2024
…#6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser
@tfmorris tfmorris added this to the 3.8 milestone May 13, 2024
wetneb pushed a commit that referenced this issue May 14, 2024
* Refactor InputStreamReader handling

DRY up handling of null encoding and our special UTF-8-BOM encoding

* Fix user override of encoding guessing

* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser

* Use UTF-8 instead of US-ASCII for test fixtures