Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

Closed
tfmorris opened this issue Apr 11, 2024 · 7 comments · Fixed by #6596
Closed

CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527

tfmorris opened this issue Apr 11, 2024 · 7 comments · Fixed by #6596
Assignees
Labels
import About importers in general - add a label for the data format if available Type: Bug Issues related to software defects or unexpected behavior, which require resolution.
Milestone

Comments

@tfmorris
Copy link
Member

It looks like the fix for #1241 is incomplete in that not all places which need to handle the new pseudo encoding which was introduced to handle UTF-8 with Byte Order Marks (BOM) were updated.

To Reproduce

Steps to reproduce the behavior:

  1. Create a project using this url: http://www.biodiversitylibrary.org/data/TSV/hosted/creator.txt

Current Results

An exception is logged on the console for an unsupported character encoding and the separator guessing process aborts.

By inspection, the fixed width importer is also susceptible to the same problem.

Expected Behavior

The separators are guessed correctly for CSV/TSV files

@tfmorris tfmorris added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators import About importers in general - add a label for the data format if available labels Apr 11, 2024
@tfmorris tfmorris self-assigned this Apr 11, 2024
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Apr 11, 2024
@tfmorris tfmorris removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label May 10, 2024
@tfmorris tfmorris added this to the 3.8 milestone May 13, 2024
@jquartel
Copy link

jquartel commented May 15, 2024

I'm not certain if this is the same issue - I'm finding that my CSVs with BOMs are not raising exceptions but the records are not being separated (i.e. each line is considered a single element). Here is a simple example
SimpleBOM.csv.zip
Screenshot 2024-05-15 at 07 54 57
In this example you can see that it incorrectly guessed tabs as the separator, but if you manually change to commas then it does separate the columns but the elements in the rows are left with their quotes on. You can fix this, too, by manually checking the "Use character to enclose cells..." option but I should think this should be guessed as it clearly did for the column headers line.

@thadguidry
Copy link
Member

Agreed, the quote guessing doesn't seem to work as good anymore compared to 3.5 (I think? or sometime around there). But that might all be the deeper issue that @tfmorris is trying to fix here generally with the encoding guessing (or not guessing or applying correctly).

Here's my 3 test files:
TEST_WINDOWS-UTF-BOM.zip

@thadguidry thadguidry reopened this May 15, 2024
@jquartel
Copy link

Incidentally, I notice that Excel exports of CSV files are no longer including the BOM, so it seems Microsoft might be coming to their senses.

@thadguidry
Copy link
Member

thadguidry commented May 15, 2024

@tfmorris I double checked those test files and I think I see the issue now. It's more of a case of "mixed column separators" rather than the encoding guessing? or it's a whole mixed bag of it all?

, U+002C : COMMA {decimal separator}
U+FF0C : FULLWIDTH COMMA

Egad ! So the main issue I have is that of sometimes Chinese input files that sometimes have Full Width Commas!
And the other issue is that of my keyboard flipping back and forth between Chinese keyboard layout (Microsoft Pinyin) (that I need for workflows at times) which it seems by default will use Full Width Comma if in Pinyin input and Comma if in English keyboard input. The Full Width Comma now makes sense to me where it likely avoids display issues and confusion with a Comma appearing too close to a Mandarian glyph and potentially making one type of glyph look like it's a completely different one. Mandarian calligraphy being made of lots of dots, small slashes, small lines, etc... where there is indeed several that look like they have a comma after them sorta. So the extra spacing provided with Full Width Comma's ... makes complete sense to me now.

Hmm, now the question is:

  1. Do we expand our support on parsing options instead of only 1 column separator pattern, going for up to 2 custom column separator patterns? 2 multibyte or only 2 single?
    2. Or we punt and tell users "Import using Line-based and than perform multiple column split operations." ?

OK, forget 2 as it is just a PITA of an option actually. Like pain in the ARSEnic bottle from hell. So option 3?

  1. We modify and enhance the "Split into several columns" operation as a multi-pass operation, since currently it sucks when there might be mixed separators, even if there are just 2 like Comma and Full Width Comma? We could add an OR() for the Separator input?
    image

Screenshot 2024-05-15 200115

@thadguidry
Copy link
Member

thadguidry commented May 15, 2024

Nevermind. I'm an idiot sometimes. Blanked out completely on the existing RegEx option... works beautifully.

image
image

Then I just used ALL -> Edit all columns -> Replace smart quotes with ASCII
ALL -> Transform -> value.replace('"','')
DONE.

@thadguidry
Copy link
Member

thadguidry commented May 15, 2024

So thus back to Question 1. I had... does it make sense to have some sort of better option that can

  • support on parsing options instead of only 1 column separator pattern, going for up to 2 custom column separator patterns? 2 multibyte or only 2 single?
  • Use character " or Smart Quotes to enclose cells containing column separators?

@tfmorris
Copy link
Member Author

@jquartel I can reproduce your results with a build from the 3.8.0 tag, but using the head of the 3.8 branch (which will become 3.8.1) things work correctly. The fix for this issue was bundled with the fix for #6595 and backported to the 3.8 stream. The 3.8 milestone is a little ambiguous because it represents 3.8.0, 3.8.1, 3.8.2, etc.

@thadguidry I'm going to close this again since the fix has been merged (and backported). If you can reproduce your issues with the head of master (or the head of 3.8 branch), please create a new issue with a concise description of the problem(s) you are seeing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
import About importers in general - add a label for the data format if available Type: Bug Issues related to software defects or unexpected behavior, which require resolution.
Projects
None yet
3 participants