CSV/TSV separators not guessed correctly for files with Byte Order Marks #6527
Comments
I'm not certain whether this is the same issue: my CSVs with BOMs are not raising exceptions, but the records are not being separated (i.e. each line is treated as a single element). Here is a simple example
Agreed, the quote guessing doesn't seem to work as well anymore compared to 3.5 (I think? or sometime around there). But that might all be part of the deeper issue that @tfmorris is trying to fix here more generally with the encoding guessing (or not guessing, or not applying it correctly). Here are my 3 test files:
Incidentally, I notice that Excel exports of CSV files are no longer including the BOM, so it seems Microsoft might be coming to their senses.
@tfmorris I double-checked those test files and I think I see the issue now. Is it more a case of "mixed column separators" than of encoding guessing? Or is it a mixed bag of both?
Egad! So my main issue is that Chinese input files sometimes contain full-width commas! Hmm, now the question is:
OK, forget 2 as it is just a PITA of an option actually. Like pain in the ARSEnic bottle from hell. So option 3?
So thus back to Question 1. I had... does it make sense to have some sort of better option that can
@jquartel I can reproduce your results with a build from the 3.8.0 tag, but using the head of the 3.8 branch (which will become 3.8.1) things work correctly. The fix for this issue was bundled with the fix for #6595 and backported to the 3.8 stream. The 3.8 milestone is a little ambiguous because it represents 3.8.0, 3.8.1, 3.8.2, etc. @thadguidry I'm going to close this again since the fix has been merged (and backported). If you can reproduce your issues with the head of
It looks like the fix for #1241 is incomplete: not all of the places that need to handle the new pseudo-encoding, which was introduced to handle UTF-8 with Byte Order Marks (BOM), were updated.
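To illustrate the kind of gap described here, this is a minimal Python sketch (not the project's code) of why a pseudo-encoding label has to be translated before it reaches the runtime's codec lookup. The label `UTF-8-BOM` and the helper names `resolve_encoding`, `decode` and `is_known_codec` are assumptions made up for this example.

```python
import codecs
from typing import Tuple

# Hypothetical pseudo-encoding label meaning "UTF-8 with a leading BOM".
# The name is an assumption for this sketch, not taken from the project source.
PSEUDO_UTF8_BOM = "UTF-8-BOM"


def resolve_encoding(label: str) -> Tuple[str, bool]:
    """Map an encoding label to (real codec name, whether to strip a BOM)."""
    if label.upper() == PSEUDO_UTF8_BOM:
        return "utf-8", True
    # codecs.lookup raises LookupError for names the runtime does not know.
    return codecs.lookup(label).name, False


def decode(data: bytes, label: str) -> str:
    """Decode bytes, translating the pseudo-encoding before it is used."""
    codec, strip_bom = resolve_encoding(label)
    if strip_bom and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode(codec)


def is_known_codec(label: str) -> bool:
    """Return True if the runtime can look the label up directly."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False
```

Any code path that skips a translation step like `resolve_encoding` and hands the pseudo label straight to the codec lookup fails in the same way as `is_known_codec("UTF-8-BOM")`, which is the shape of the failure this issue reports.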
To Reproduce
Steps to reproduce the behavior:
Current Results
An exception is logged on the console for an unsupported character encoding and the separator guessing process aborts.
By inspection, the fixed width importer is also susceptible to the same problem.
Expected Behavior
The separators are guessed correctly for CSV/TSV files.
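For readers unfamiliar with separator guessing, here is a rough Python sketch of one possible heuristic, applied after the text has been decoded. It is not the importer's actual algorithm: the candidate list, the scoring and the name `guess_separator` are all assumptions for illustration, and quoted fields are ignored.

```python
from collections import Counter
from typing import Optional

# Candidate separators for this sketch; the real importer may use others.
CANDIDATES = [",", "\t", ";", "|"]


def guess_separator(text: str, max_lines: int = 20) -> Optional[str]:
    """Pick the candidate that appears on every sampled line most consistently."""
    text = text.lstrip("\ufeff")  # drop a decoded BOM if one survived
    lines = [ln for ln in text.splitlines() if ln][:max_lines]
    if not lines:
        return None
    best, best_score = None, 0.0
    for sep in CANDIDATES:
        counts = [ln.count(sep) for ln in lines]
        if min(counts) == 0:
            continue  # a real separator should occur on every line
        # Most frequent per-line count, and how many lines share it.
        common_count, freq = Counter(counts).most_common(1)[0]
        score = (freq / len(lines)) * common_count
        if score > best_score:
            best, best_score = sep, score
    return best
```

With a heuristic like this, a file that starts with a BOM should yield the same separator as the same file without one, provided the decoding step succeeds instead of aborting.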