Add detection for MacRoman encoding #5

rspeer · 2012-11-16T18:11:49Z

MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.


puzzlet · 2012-12-02T10:30:49Z

I'm not an authority here, but could you give some live examples we can test?


dan-blanchard · 2013-12-17T19:31:00Z

@rspeer, if you can provide some example documents for testing, I'd gladly merge this.

adamn · 2016-05-17T13:47:38Z

Since this detector is for archaic technology, and it's given the lowest priority in the universal detector, it seems like this can just go in. Maybe with a warning (or a setting to disable the detector by default) if necessary?

Documentation and examples have been pending for almost 2 years so I don't see that ever happening.

sigmavirus24 · 2016-05-17T14:22:51Z

@adamn you're right. This hasn't changed in almost 4 years and the pull request doesn't merge cleanly. I'm going to close this unless someone revives it in a new pull request. (We've also had no requests (other than this PR) for this encoding.)

rspeer · 2016-05-24T16:41:53Z

I don't think that's what adamn meant, sigmavirus24, but oh well.

MacRoman is the default encoding that Office for Mac uses for export, and is a very frequently used and frequently mis-detected encoding. People who need MacRoman detection don't know they need MacRoman detection, they just know "chardet doesn't work". Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Sorry for not hand-holding this PR to completion, but I lost interest in fixing chardet.

dan-blanchard · 2016-05-24T16:44:32Z

> Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Yeah, I agree with @rspeer. chardet is essentially a tool for dealing with archaic encodings.

jbrockmendel · 2016-12-08T06:36:31Z

> if you can provide some example documents for testing, I'd gladly merge this.

Found in the wild: http://eclipse.gsfc.nasa.gov/5MCSE/5MKSEcatalog.txt

The only non-ASCII byte present is "\xa1", which occurs four times. Decoded as MacRoman, these are degree symbols, as in latitude/longitude. chardet.detect (using cchardet 1.1.1) returns {'confidence': 0.8844350576400757, 'encoding': u'WINDOWS-1252'}.
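To make the mis-detection concrete, here is a minimal sketch using only Python's built-in codecs (not chardet itself). The sample bytes are made up for illustration and are not taken from the NASA file.

```python
# The same byte, 0xA1, maps to different characters depending on the codec.
# The sample string is hypothetical; only the 0xA1 byte matters here.
raw = b"41\xa1 N"

as_mac_roman = raw.decode("mac_roman")  # 0xA1 -> degree sign
as_cp1252 = raw.decode("cp1252")        # 0xA1 -> inverted exclamation mark

print(as_mac_roman)
print(as_cp1252)
```

So a WINDOWS-1252 guess turns every degree sign in such a file into an inverted exclamation mark.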

dan-blanchard · 2017-04-20T13:59:41Z

@jbrockmendel Thanks for the example! I will likely just add MacRoman to the set of encodings supported by several Western languages in #99 and not use this PR's approach, but I'll keep it open for now until we decide.

alichur · 2017-07-20T12:56:18Z

I would love to see this fixed. Sadly, any Mac user dealing with CSV files in Excel will end up with MacRoman encoding when they save.

jbrockmendel · 2017-07-20T18:25:11Z

@alichur could we use this behavior to generate a thorough set of samples?

alichur · 2017-07-25T12:56:41Z

@jbrockmendel yes, I believe so. Open any CSV file in Excel (on a Mac) and when you save it, the file encoding will be MacRoman.

MrCsabaToth · 2018-12-07T17:46:47Z

A good use case is a CSV file saved with Mac Excel. The typical examples are the Mac Roman variants of the single quotes (left 0xD4, right 0xD5, bottom 0xE2), the double quotes (left 0xD2, right 0xD3, bottom 0xE3) and the dash variants (0xD0, 0xD1). Those hit me all the time!!! https://en.wikipedia.org/wiki/Mac_OS_Roman
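For reference, the bytes listed above can be checked against Python's built-in `mac_roman` codec. This is only a lookup sketch; the `describe` helper is a name introduced here for illustration.

```python
import unicodedata

# Byte values quoted in the comment above.
PUNCTUATION_BYTES = [0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3]


def describe(byte_value):
    """Decode one byte as MacRoman and return (character, Unicode name)."""
    char = bytes([byte_value]).decode("mac_roman")
    return char, unicodedata.name(char)


for b in PUNCTUATION_BYTES:
    char, name = describe(b)
    print(f"0x{b:02X} -> {char!r} {name}")
```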

MrCsabaToth · 2018-12-07T17:49:58Z

To detect these cases (0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3) in my custom code, though, I check whether the surrounding characters are in the standard ASCII range, to make sure I'm not dealing with part of a UTF-8 sequence.
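A rough sketch of that kind of neighbour check might look like the following. It is a reconstruction from the description, not MrCsabaToth's actual code and not what chardet does; the function name is hypothetical.

```python
# Candidate MacRoman quote and dash bytes from the comment above.
MAC_ROMAN_PUNCTUATION = frozenset({0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3})


def looks_like_mac_roman_punctuation(data: bytes) -> bool:
    """Return True if a candidate byte occurs with only ASCII neighbours.

    In valid UTF-8, bytes in the 0xD0-0xE3 range are lead bytes and must be
    followed by a continuation byte (0x80-0xBF), so an ASCII byte right
    after one of them suggests a single-byte encoding instead.
    """
    for i, byte in enumerate(data):
        if byte not in MAC_ROMAN_PUNCTUATION:
            continue
        prev_ok = i == 0 or data[i - 1] < 0x80
        next_ok = i == len(data) - 1 or data[i + 1] < 0x80
        if prev_ok and next_ok:
            return True
    return False
```

This only hints at MacRoman: other single-byte encodings use the same byte values for different characters, so it is a heuristic rather than a detector.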

YesThatAllen · 2022-04-14T21:57:18Z

Late to the game here. We were all set to use chardet and had even implemented it, then realized that mac_roman isn't supported.

As of April 2022:

Asking finger for info via Popen will give UTF-8 data in almost all cases, and return mac_roman when double byte characters are in the response, sigh.

> Could you give some live examples we can test?

macOS's ioreg command will vary in its output.

/usr/sbin/ioreg -l will respond using mac_roman encoding if there's an apostrophe in the name of a bluetooth mouse/trackpad: "Product"="Allen’s Trackpad"

On the same computer, /usr/sbin/ioreg -rd1 -c IOPlatformExpertDevice will not include the pointing device, and so ioreg will respond with utf-8
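Until a detector handles this, one workaround for tool output that is sometimes UTF-8 and sometimes mac_roman is to try UTF-8 first and fall back. This is a minimal sketch under that assumption; `decode_tool_output` is a hypothetical helper, and the sample bytes are constructed by hand rather than captured from ioreg.

```python
def decode_tool_output(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else as MacRoman."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Python's mac_roman codec maps every byte value, so this fallback
        # does not raise, though it may still be the wrong guess.
        return raw.decode("mac_roman")


# Hand-built sample: 0xD5 is the MacRoman right single quotation mark.
sample = b'"Product"="Allen\xd5s Trackpad"'
print(decode_tool_output(sample))
```

On a Mac, the bytes could come from something like `subprocess.run(["/usr/sbin/ioreg", "-l"], capture_output=True).stdout`.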

dan-blanchard · 2022-06-28T15:04:50Z

@YesThatAllen thanks for the tip about ioreg. I had no idea we finally had a way to generate MacRoman data. I've added a test for this, and revived this very old PR and manually merged it via c292b52.

dan-blanchard pushed a commit that referenced this pull request Dec 15, 2013

Merge pull request #5 from byroot/fix-bom-detection

0e70614

Fix BOM detection #4 Thanks @byroot

dan-blanchard closed this Dec 2, 2014

dan-blanchard reopened this Dec 2, 2014

sigmavirus24 closed this May 17, 2016

dan-blanchard reopened this May 24, 2016

rstm-sf mentioned this pull request Nov 9, 2019

Add detection for encoding 'x-mac-romanian' CharsetDetector/UTF-unknown#83

Open

dan-blanchard closed this Jun 28, 2022
