Add detection for MacRoman encoding #5
MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will output in MacRoman by default. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.
I'm not an authority here, but could you give some live examples we can test?
@rspeer, if you can provide some example documents for testing, I'd gladly merge this.
Since this detector is for archaic technology, and it's given the lowest priority in the universal detector, it seems like this can just go in. Maybe with a warning (or a setting to disable the detector by default) if necessary? Documentation and examples have been pending for almost 2 years so I don't see that ever happening.
@adamn you're right. This hasn't changed in almost 4 years and the pull request doesn't merge cleanly. I'm going to close this unless someone revives it in a new pull request. (We've also had no requests (other than this PR) for this encoding.)
I don't think that's what adamn meant, sigmavirus24, but oh well. MacRoman is the default encoding that Office for Mac uses for export, and is a very frequently used and frequently mis-detected encoding. People who need MacRoman detection don't know they need MacRoman detection, they just know "chardet doesn't work". Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it? Sorry for not hand-holding this PR to completion, but I lost interest in fixing chardet.
Yeah, I agree with @rspeer. chardet is essentially a tool for dealing with archaic encodings.
Found in the wild: http://eclipse.gsfc.nasa.gov/5MCSE/5MKSEcatalog.txt The only non-ASCII content is four occurrences of "\xa1". Decoded as MacRoman, these are degree symbols, as in latitude/longitude. chardet.detect (using cchardet 1.1.1) returns {'confidence': 0.8844350576400757, 'encoding': u'WINDOWS-1252'}.
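The two readings of that byte are easy to compare with Python's built-in codecs. This is a minimal sketch, assuming the standard `mac_roman` and `cp1252` codec names; the sample bytes are made up for illustration and are not taken from the NASA file.

```python
# Hypothetical sample line; only the 0xA1 byte matters here.
sample = b"Lat 12.5\xa1 N"

# In MacRoman, 0xA1 is U+00B0 DEGREE SIGN.
as_mac_roman = sample.decode("mac_roman")

# In Windows-1252, the same byte is U+00A1 INVERTED EXCLAMATION MARK.
as_cp1252 = sample.decode("cp1252")

print(as_mac_roman)
print(as_cp1252)
```

Both decodes succeed without error, which is why a detector has to rely on likelihood rather than validity to tell the two encodings apart.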
@jbrockmendel Thanks for the example! I will likely just add MacRoman to the set of encodings supported by several Western languages in #99 and not use this PR's approach, but I'll keep it open for now until we decide.
I would love to see this fixed. Sadly, any Mac user dealing with CSV files in Excel will end up with MacRoman encoding when they save.
@alichur could we use this behavior to generate a thorough set of samples?
@jbrockmendel yes, I believe so. Open any CSV file in Excel (on a Mac) and when you save it the file encoding will be MacRoman.
A good use case is a CSV file saved with Mac Excel. The typical examples are the Mac Roman variants of the single quotes (left 0xD4, right 0xD5, bottom 0xE2), the double quotes (left 0xD2, right 0xD3, bottom 0xE3) and the dash variants (0xD0, 0xD1). Those hit me all the time!!! https://en.wikipedia.org/wiki/Mac_OS_Roman
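For reference, the bytes listed above can be mapped to their Unicode characters with Python's built-in `mac_roman` codec. This is only an illustrative sketch; the constant name `MAC_ROMAN_PUNCTUATION` is made up here and is not part of chardet.

```python
import unicodedata

# The eight byte values mentioned in the comment above (hypothetical constant name).
MAC_ROMAN_PUNCTUATION = bytes([0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3])

# Print each byte next to the character it represents in MacRoman.
for byte in MAC_ROMAN_PUNCTUATION:
    char = bytes([byte]).decode("mac_roman")
    print(f"0x{byte:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
```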
To detect these cases (0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3) in my custom code, I check whether the surrounding bytes are in the standard ASCII range, to make sure I'm not dealing with part of a UTF-8 sequence.
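A rough sketch of that kind of check might look like the following. It is not the commenter's actual code and not anything chardet does; the function name and the rule "both neighbors must be ASCII" are assumptions made for illustration.

```python
# Candidate MacRoman punctuation bytes (quotes and dashes) from the comment above.
MAC_ROMAN_PUNCTUATION = frozenset({0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3})


def looks_like_mac_roman_punctuation(data: bytes) -> bool:
    """Return True if any candidate byte has only ASCII (or no) neighbors.

    In UTF-8 these byte values are lead bytes and are always followed by a
    continuation byte (>= 0x80), so an ASCII neighbor on both sides suggests
    a single-byte encoding such as MacRoman instead.
    """
    for i, byte in enumerate(data):
        if byte not in MAC_ROMAN_PUNCTUATION:
            continue
        prev_is_ascii = i == 0 or data[i - 1] < 0x80
        next_is_ascii = i == len(data) - 1 or data[i + 1] < 0x80
        if prev_is_ascii and next_is_ascii:
            return True
    return False


# MacRoman right single quote between ASCII letters.
print(looks_like_mac_roman_punctuation(b"it\xd5s"))
# The same text encoded as UTF-8 (E2 80 99) is not flagged.
print(looks_like_mac_roman_punctuation("it\u2019s".encode("utf-8")))
```

This only hints at MacRoman; other single-byte encodings use the same byte values for different characters, so a real detector would still need to weigh this against other candidates.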
Late to the game here.. we were all set to use chardet, even implemented it then realized that mac_roman isn't supported. As of April 2022: Asking
macOS's
On the same computer, |
@YesThatAllen thanks for the tip about |
MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.
This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.
I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.