CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

tfmorris · 2024-04-11T02:45:03Z

It looks like the fix for #1241 is incomplete in that not all places which need to handle the new pseudo encoding which was introduced to handle UTF-8 with Byte Order Marks (BOM) were updated.

To Reproduce

Steps to reproduce the behavior:

Create a project using this url: http://www.biodiversitylibrary.org/data/TSV/hosted/creator.txt

Current Results

An exception is logged on the console for an unsupported character encoding and the separator guessing process aborts.

By inspection, the fixed width importer is also susceptible to the same problem.

Expected Behavior

The separators are guessed correctly for CSV/TSV files

jquartel · 2024-05-15T07:08:17Z

I'm not certain if this is the same issue - I'm finding that my CSVs with BOMs are not raising exceptions but the records are not being separated (i.e. each line is considered a single element). Here is a simple example
SimpleBOM.csv.zip

In this example you can see that it incorrectly guessed tabs as the separator, but if you manually change to commas then it does separate the columns but the elements in the rows are left with their quotes on. You can fix this, too, by manually checking the "Use character to enclose cells..." option but I should think this should be guessed as it clearly did for the column headers line.

thadguidry · 2024-05-15T10:07:02Z

Agreed, the quote guessing doesn't seem to work as good anymore compared to 3.5 (I think? or sometime around there). But that might all be the deeper issue that @tfmorris is trying to fix here generally with the encoding guessing (or not guessing or applying correctly).

Here's my 3 test files:
TEST_WINDOWS-UTF-BOM.zip

jquartel · 2024-05-15T10:43:04Z

Incidentally, I notice that Excel exports of CSV files are no longer including the BOM, so it seems Microsoft might be coming to their senses.

thadguidry · 2024-05-15T11:28:49Z

@tfmorris I double checked those test files and I think I see the issue now. It's more of a case of "mixed column separators" rather than the encoding guessing? or it's a whole mixed bag of it all?

, U+002C : COMMA {decimal separator}
， U+FF0C : FULLWIDTH COMMA

Egad ! So the main issue I have is that of sometimes Chinese input files that sometimes have Full Width Commas!
And the other issue is that of my keyboard flipping back and forth between Chinese keyboard layout (Microsoft Pinyin) (that I need for workflows at times) which it seems by default will use Full Width Comma if in Pinyin input and Comma if in English keyboard input. The Full Width Comma now makes sense to me where it likely avoids display issues and confusion with a Comma appearing too close to a Mandarian glyph and potentially making one type of glyph look like it's a completely different one. Mandarian calligraphy being made of lots of dots, small slashes, small lines, etc... where there is indeed several that look like they have a comma after them sorta. So the extra spacing provided with Full Width Comma's ... makes complete sense to me now.

Hmm, now the question is:

Do we expand our support on parsing options instead of only 1 column separator pattern, going for up to 2 custom column separator patterns? 2 multibyte or only 2 single?
~~2. Or we punt and tell users "Import using Line-based and than perform multiple column split operations." ?~~

OK, forget 2 as it is just a PITA of an option actually. Like pain in the ARSEnic bottle from hell. So option 3?

We modify and enhance the "Split into several columns" operation as a multi-pass operation, since currently it sucks when there might be mixed separators, even if there are just 2 like Comma and Full Width Comma? We could add an OR() for the Separator input?

thadguidry · 2024-05-15T12:43:09Z

Nevermind. I'm an idiot sometimes. Blanked out completely on the existing RegEx option... works beautifully.

Then I just used ALL -> Edit all columns -> Replace smart quotes with ASCII
ALL -> Transform -> value.replace('"','')
DONE.

thadguidry · 2024-05-15T12:54:44Z

So thus back to Question 1. I had... does it make sense to have some sort of better option that can

support on parsing options instead of only 1 column separator pattern, going for up to 2 custom column separator patterns? 2 multibyte or only 2 single?
Use character " or Smart Quotes to enclose cells containing column separators?

tfmorris · 2024-05-15T17:40:12Z

@jquartel I can reproduce your results with a build from the 3.8.0 tag, but using the head of the 3.8 branch (which will become 3.8.1) things work correctly. The fix for this issue was bundled with the fix for #6595 and backported to the 3.8 stream. The 3.8 milestone is a little ambiguous because it represents 3.8.0, 3.8.1, 3.8.2, etc.

@thadguidry I'm going to close this again since the fix has been merged (and backported). If you can reproduce your issues with the head of master (or the head of 3.8 branch), please create a new issue with a concise description of the problem(s) you are seeing.

tfmorris self-assigned this Apr 11, 2024

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Apr 11, 2024

Handle Byte Order Mark (BOM) correctly for CSVs. Fixes OpenRefine#6527

c801e46

tfmorris mentioned this issue Apr 11, 2024

Handle Byte Order Mark (BOM) correctly for CSVs. Fixes #6527 #6528

Closed

tfmorris mentioned this issue May 10, 2024

Fix encoding guesser for non-UTF-8 BOM-based files. Fixes #6595 #6596

Merged

tfmorris removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label May 10, 2024

tfmorris closed this as completed in #6596 May 13, 2024

tfmorris added this to the 3.8 milestone May 13, 2024

thadguidry reopened this May 15, 2024

tfmorris closed this as completed May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

tfmorris commented Apr 11, 2024

jquartel commented May 15, 2024 •

edited

thadguidry commented May 15, 2024

jquartel commented May 15, 2024

thadguidry commented May 15, 2024 •

edited

thadguidry commented May 15, 2024 •

edited

thadguidry commented May 15, 2024 •

edited

tfmorris commented May 15, 2024

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

Comments

tfmorris commented Apr 11, 2024

To Reproduce

Current Results

Expected Behavior

jquartel commented May 15, 2024 • edited

thadguidry commented May 15, 2024

jquartel commented May 15, 2024

thadguidry commented May 15, 2024 • edited

thadguidry commented May 15, 2024 • edited

thadguidry commented May 15, 2024 • edited

tfmorris commented May 15, 2024

jquartel commented May 15, 2024 •

edited

thadguidry commented May 15, 2024 •

edited

thadguidry commented May 15, 2024 •

edited

thadguidry commented May 15, 2024 •

edited