
Encoding issue regression for files imported into version 3.8.0 #6595

Closed
thadguidry opened this issue May 8, 2024 · 10 comments · Fixed by #6596
Assignees
Labels
encoding - Selection of encoding at import time, or encoding issues in data cleaning
Priority: Critical - Highlights issues that demand immediate action.
Type: Bug - Issues related to software defects or unexpected behavior, which require resolution.
Milestone

Comments

@thadguidry
Member

thadguidry commented May 8, 2024

In OpenRefine 3.8.0, extra NUL characters are introduced between the letters of an imported dataset; this does not happen with the previous version, 3.7.9.

To Reproduce

Steps to reproduce the behavior:

  1. First, download dataset zip from below
  2. Open using OpenRefine 3.8.0
  3. Notice NUL chars between letters
  4. Now open using OpenRefine 3.7.9
  5. Notice no NUL chars between letters

Current Results

NUL characters are erroneously introduced between letters. This appears to be some kind of errant decoding/encoding issue.
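For readers unfamiliar with this failure mode, here is a minimal Python sketch (an illustration of the symptom, not OpenRefine code) of why a UTF-16LE file decoded with a single-byte encoding such as Windows-1252 shows a NUL between every letter: UTF-16LE stores each ASCII character as two bytes, the second of which is 0x00.

```python
# Illustration only (not OpenRefine code): UTF-16LE stores ASCII letters
# as two bytes each, the second byte being 0x00.
data = "NNNN".encode("utf-16-le")      # b'N\x00N\x00N\x00N\x00'

# Decoding those bytes with a single-byte encoding keeps each 0x00 byte
# as a U+0000 (NUL) character between the letters.
wrong = data.decode("windows-1252")
print(repr(wrong))                     # 'N\x00N\x00N\x00N\x00'
```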

Expected Behavior

Import should work as it did with OpenRefine 3.7.9, without the extra NUL characters being introduced.

Screenshots

With OpenRefine 3.8.0:
[screenshot]
With OpenRefine 3.7.9:
[screenshot]

Versions

  • Operating System: Mac OS and Windows 11
  • Browser Version: Edge and Safari
  • JRE or JDK Version: embedded
  • OpenRefine: 3.8.0

Datasets

Download the ZIP file from https://www.cssf.lu/en/Document/identifiers-of-aifms/
direct link: https://www.cssf.lu/wp-content/uploads/IDENTIFIANTS_AIFM.zip

Additional context

Opening the ZIP file directly in OpenRefine, or extracting it first and then opening the CSV, causes the same encoding issue and the appearance of NUL characters between letters.

@thadguidry thadguidry added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Priority: Critical Highlights issues that demand immediate action. encoding Selection of encoding at import time, or encoding issues in data cleaning labels May 8, 2024
@tfmorris
Member

tfmorris commented May 9, 2024

This happens on both macOS and Windows? I'm unable to reproduce it on macOS 14.4.1 (Sonoma) with either the 3.8.0 tag or the current development head.

Did it guess UTF-16LE or did you specify it? (It's correct, but I'm curious because my system is guessing Windows-1252)

If you try switching to a different encoding (e.g. UTF-8) and then back to UTF-16LE, do the symptoms change?

It looks like it might be using the system default encoding instead of the correctly guessed encoding, but I'll have to dig further to understand where things are going awry.
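For context, a simplified sketch of how BOM-based encoding detection typically works (an illustration, not OpenRefine's actual guesser): the detector inspects the first few bytes for a known BOM signature, checking the longer signatures first so a UTF-32 BOM is not mistaken for a UTF-16 one.

```python
# Simplified BOM sniffing; longest signatures checked first.
# Not OpenRefine's actual implementation.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]

def guess_from_bom(head):
    """Return the encoding name for a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

print(guess_from_bom(b"\xff\xfeN\x00"))   # UTF-16LE
```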

@tfmorris
Member

tfmorris commented May 9, 2024

I just double checked with the binary 3.8.0 kit and bundled JVM (OpenJDK 64-Bit Server VM version 11.0.23+9) to be sure it wasn't some kind of binary kit build issue and got the same results as with my previous two tests using builds from source.

@thadguidry
Member Author

thadguidry commented May 9, 2024

The "same" results of reproducing the issue? or not reproducing the issue?

It did not guess the UTF-16LE encoding (3.7.9 does). On version 3.8, it guesses Windows-1252. When I choose the import preview on the ZIP file itself, it looks exactly like this when I try again fresh, just to double-check:

[screenshot]

If I change the character encoding to UTF-16LE, the preview refreshes, but the NUL characters still show between the letters as in the image above.

Also interesting, though I don't think it is applicable to this issue: the checkbox for “Use character " to enclose cells containing column separators” is unchecked by default now in 3.8?

@thadguidry
Member Author

I've been helping Mike Gordon with this issue, emailing back and forth, and we finally discovered that the real problem is the one that came up in https://forum.openrefine.org/t/trouble-with-sql-export/1485

@tfmorris
Member

tfmorris commented May 9, 2024

The "same" results of reproducing the issue? or not reproducing the issue?

None of my three tests reproduced the issue on macOS.

The bug report says:

Operating System: Mac OS and Windows 11

Did you actually test on both? What version of macOS? Which operating system are the screenshots above from?

@thadguidry
Member Author

thadguidry commented May 9, 2024

I tested and reproduced on Windows 11.
Mike Gordon in the forum is on a mid-2012 MacBook Pro running macOS Catalina and OpenRefine 3.8.

@tfmorris
Member

tfmorris commented May 9, 2024

Also not reproducible with the binary 3.8.0 kit on macOS Big Sur 11.7.10 (20G1427). I'll see if I can find a Windows machine.

@tfmorris
Member

tfmorris commented May 9, 2024

Actually, it looks like my browsers are rendering things differently than yours and there are actually NULs in the data, so let me dig in more now that I can reproduce it.

@thadguidry
Member Author

thadguidry commented May 10, 2024

Ah, in the data: just reading the bytes directly out of the compressed file, I also see NULs, or rather \u0000 / char(0).
For example, I see the gaps within all the runs of char(78), which is "N", where it should read NNNNNNNN.
Since NUL is a non-printable character, you have to use unicode_escape to surface the non-printable characters when printing.

import zipfile

# Use a raw string so the backslashes in the Windows path are not
# interpreted as escape sequences
zip_file_path = r'F:\Downloads\IDENTIFIANTS_AIFM.zip'

with zipfile.ZipFile(zip_file_path, 'r') as myzip:
    file_name = 'IDENTIFIANTS_AIFM.csv'
    with myzip.open(file_name, 'r') as f:
        # Read the first 32 bytes of the file and print their Unicode
        # decimal codepoints; the embedded NULs show up as 0s
        f_list = [ord(i) for i in f.read(32).decode('unicode_escape')]
        print(f_list)

[screenshot]
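As a complementary check (my own sketch, not from the thread, using stand-in bytes rather than the real CSV): decoding the same two-byte sequences as UTF-16LE, the encoding 3.7.9 guessed, yields clean text with no embedded NULs.

```python
# Stand-in bytes for a run of "N" as it appears in the UTF-16LE CSV.
raw = b"N\x00N\x00N\x00N\x00"

decoded = raw.decode("utf-16-le")   # correct decoding, as 3.7.9 guessed
print(decoded)                      # NNNN, with no embedded \u0000
```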

@tfmorris
Member

There are a number of different problems here:

  1. The fix for the wonky Microsoft UTF-8 + BOM problem (#6025, "CSV Byte Order Mark (BOM) support", addressing #1241, "CSV files with BOM don't correctly strip BOM on import") broke encoding guessing for the other BOM-based encodings (including the UTF-16LE used in this file), because all BOMs were being stripped, making them unavailable to the character encoding guesser.
  2. The manual character encoding override is being ignored. This might have been broken for a long time, but wasn't critical as long as the encoding guessing was working correctly.
  3. There are no tests for encodings with BOM, other than the recently added UTF-8 + BOM test, so there was no test coverage for this failure.

I have fixes for all of these that I'll put up in a PR.
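Point 1 above can be illustrated with a small Python sketch (hypothetical and simplified; the real code is Java inside OpenRefine's importer): if the BOM is stripped before the guesser runs, a UTF-16LE file loses its only reliable signature and falls through to a single-byte fallback.

```python
def strip_bom(data):
    """Remove a leading BOM, as the buggy pre-processing step did."""
    for bom in (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff"):
        if data.startswith(bom):
            return data[len(bom):]
    return data

def guess(data):
    """Toy encoding guesser: BOM signatures first, else a single-byte fallback."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    return "Windows-1252"

raw = b"\xff\xfe" + "N".encode("utf-16-le")
print(guess(strip_bom(raw)))   # Windows-1252: BOM stripped before guessing
print(guess(raw))              # UTF-16LE: BOM preserved for the guesser
```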

@tfmorris tfmorris self-assigned this May 10, 2024
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue May 10, 2024
…#6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser
@tfmorris tfmorris added this to the 3.8 milestone May 13, 2024
wetneb pushed a commit that referenced this issue May 14, 2024
* Refactor InputStreamReader handling

DRY up handling of null encoding and our special UTF-8-BOM encoding

* Fix user override of encoding guessing

* Fix encoding guess for UTF-16LE & UTF-16BE with BOM. Fixes #6595

- Don't skip BOM for non-UTF-8 cases so that it is still available for character encoding guesser
- Add tests for UTF-16LE and UTF-16BE with BOMs
- Refactor encoding guesser so that it can be used for a single file
- Enhanced format guessing tests to not assume UTF-8 and instead guess encoding
  using refactored encoding guesser

* Use UTF-8 instead of US-ASCII for test fixtures