Add detection for MacRoman encoding #5

Closed


@rspeer rspeer commented Nov 16, 2012

MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.

MacRoman is not in particularly common use anymore, as it has been
deprecated by Mac OS for over a decade. However, there are programs such
as Microsoft Office for Mac that didn't get the memo, and will output in
MacRoman by default.

The MacRoman detector works similarly to the Latin-1 detector, but
starts at a lower probability.
@puzzlet
Contributor

puzzlet commented Dec 2, 2012

I'm not an authority here, but could you give some live examples we can test?

dan-blanchard pushed a commit that referenced this pull request Dec 15, 2013
@dan-blanchard
Member

@rspeer, if you can provide some example documents for testing, I'd gladly merge this.

@adamn

adamn commented May 17, 2016

Since this detector is for archaic technology, and it's given the lowest priority in the universal detector, it seems like this can just go in. Maybe with a warning (or a setting to disable the detector by default) if necessary?

Documentation and examples have been pending for almost 2 years so I don't see that ever happening.

@sigmavirus24
Member

@adamn you're right. This hasn't changed in almost 4 years and the pull request doesn't merge cleanly. I'm going to close this unless someone revives it in a new pull request. (We've also had no requests (other than this PR) for this encoding.)

@rspeer
Author

rspeer commented May 24, 2016

I don't think that's what adamn meant, sigmavirus24, but oh well.

MacRoman is the default encoding that Office for Mac uses for export, and is a very frequently used and frequently mis-detected encoding. People who need MacRoman detection don't know they need MacRoman detection, they just know "chardet doesn't work". Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Sorry for not hand-holding this PR to completion, but I lost interest in fixing chardet.

@dan-blanchard
Member

> Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Yeah, I agree with @rspeer. chardet is essentially a tool for dealing with archaic encodings.

@dan-blanchard dan-blanchard reopened this May 24, 2016
@jbrockmendel

> if you can provide some example documents for testing, I'd gladly merge this.

Found in the wild: http://eclipse.gsfc.nasa.gov/5MCSE/5MKSEcatalog.txt

The only non-ASCII content is four occurrences of the byte "\xa1". Decoded as MacRoman, these are degree signs, as in latitude/longitude. chardet.detect (using cchardet 1.1.1) returns {'confidence': 0.8844350576400757, 'encoding': u'WINDOWS-1252'}.
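To make the mismatch concrete, here is a minimal sketch using only Python's built-in codecs (not chardet itself). The `sample` bytes are a made-up latitude fragment for illustration, not a line copied from the catalog file.

```python
# The single non-ASCII byte reported in the catalog file.
raw = b"\xa1"

# The same byte means different things in the two encodings.
as_mac_roman = raw.decode("mac_roman")  # U+00B0 DEGREE SIGN
as_cp1252 = raw.decode("cp1252")        # U+00A1 INVERTED EXCLAMATION MARK

print(repr(as_mac_roman), repr(as_cp1252))

# Hypothetical fragment, for illustration only.
sample = b"45\xa1 N"
print(sample.decode("mac_roman"))  # reads as a latitude
print(sample.decode("cp1252"))     # reads as mojibake
```

Both decodes succeed without error, which is why a wrong guess of WINDOWS-1252 goes unnoticed until someone reads the text.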

@dan-blanchard
Member

@jbrockmendel Thanks for the example! I will likely just add MacRoman to the set of encodings supported by several Western languages in #99 and not use this PR's approach, but I'll keep it open for now until we decide.

@alichur

alichur commented Jul 20, 2017

I would love to see this fixed. Sadly, any Mac user dealing with CSV files in Excel will end up with MacRoman encoding when they save.

@jbrockmendel

@alichur could we use this behavior to generate a thorough set of samples?

@alichur

alichur commented Jul 25, 2017

@jbrockmendel yes, I believe so. Open any CSV file in Excel (on a Mac) and when you save it, the file encoding will be MacRoman.

@MrCsabaToth

A good use case is a CSV file saved with Excel for Mac. The typical examples are the MacRoman forms of the single quotation marks (left 0xD4, right 0xD5, low 0xE2), the double quotation marks (left 0xD2, right 0xD3, low 0xE3), and the dashes (en dash 0xD0, em dash 0xD1). Those hit me all the time! https://en.wikipedia.org/wiki/Mac_OS_Roman
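The byte values listed above can be checked against Python's built-in `mac_roman` codec. This is a small sketch; the names `MAC_ROMAN_PUNCTUATION` and `describe` are made up for the example.

```python
import unicodedata

# Byte values from the comment above: dashes, double quotes, single quotes,
# and the two low-9 quotation marks.
MAC_ROMAN_PUNCTUATION = (0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3)


def describe(byte_value: int) -> str:
    """Return the byte, its MacRoman code point, and the Unicode name."""
    ch = bytes([byte_value]).decode("mac_roman")
    return f"0x{byte_value:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}"


for value in MAC_ROMAN_PUNCTUATION:
    print(describe(value))
```

Decoding the same bytes as Windows-1252 or Latin-1 yields accented letters instead (0xD5 becomes "Õ", for example), which is how curly quotes from Excel turn into stray letters in the middle of words.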

@MrCsabaToth

To detect these cases (0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3) in my custom code, though, I check that the surrounding characters are in the standard ASCII range, to make sure I am not dealing with part of a UTF-8 sequence.
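The check described above might look roughly like the following. This is a guess at the idea, not the commenter's actual code, and it is not how chardet's prober works; `SUSPECT_BYTES` and `count_isolated_suspects` are hypothetical names.

```python
# MacRoman quote and dash bytes from the comment above.
SUSPECT_BYTES = frozenset({0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3})


def count_isolated_suspects(data: bytes) -> int:
    """Count suspect bytes whose neighbours are both ASCII (or absent).

    In UTF-8, each of these values is a lead byte that must be followed by
    a continuation byte >= 0x80, so an ASCII neighbour on the right rules
    out a valid UTF-8 sequence.
    """
    count = 0
    last = len(data) - 1
    for i, byte in enumerate(data):
        if byte not in SUSPECT_BYTES:
            continue
        prev_is_ascii = i == 0 or data[i - 1] < 0x80
        next_is_ascii = i == last or data[i + 1] < 0x80
        if prev_is_ascii and next_is_ascii:
            count += 1
    return count


text = "Allen\u2019s Trackpad"
print(count_isolated_suspects(text.encode("mac_roman")))  # apostrophe is one byte, 0xD5
print(count_isolated_suspects(text.encode("utf-8")))      # apostrophe is 0xE2 0x80 0x99
```

A nonzero count is only a hint: the same isolated bytes are also valid in Windows-1252 and Latin-1, so a real detector would still need to weigh which reading is more plausible.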

@YesThatAllen

Late to the game here. We were all set to use chardet, and had even implemented it, when we realized that mac_roman isn't supported.

As of April 2022:

Asking finger for info via Popen will give UTF-8 data in almost all cases, and return mac_roman when double byte characters are in the response, sigh.

> Could you give some live examples we can test?

macOS's ioreg command will vary in its output.

/usr/sbin/ioreg -l will respond using mac_roman encoding if there's an apostrophe in the name of a bluetooth mouse/trackpad: "Product"="Allen’s Trackpad"

On the same computer, /usr/sbin/ioreg -rd1 -c IOPlatformExpertDevice will not include the pointing device, and so ioreg will respond with utf-8
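Until a detector handles this, one workaround for the mixed output described above is to try UTF-8 first and fall back to MacRoman. The sketch below works on bytes already captured from a command (for instance via `subprocess`); `decode_tool_output` is a hypothetical helper, and the sample bytes are built from the "Product" line quoted above rather than taken from a real `ioreg` run.

```python
from typing import Tuple


def decode_tool_output(raw: bytes) -> Tuple[str, str]:
    """Decode captured output, returning (text, encoding_used).

    Assumption: the output is either valid UTF-8 or MacRoman. UTF-8 is
    tried first because MacRoman decoding accepts any byte sequence and
    would silently mangle genuine UTF-8.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("mac_roman"), "mac_roman"


line = '"Product"="Allen\u2019s Trackpad"'

# Simulate the two behaviours reported in the thread.
text, encoding = decode_tool_output(line.encode("mac_roman"))
print(encoding, text)

text, encoding = decode_tool_output(line.encode("utf-8"))
print(encoding, text)
```

The order matters: a lone 0xD5 followed by an ASCII letter is not valid UTF-8, so the first attempt fails cleanly and the fallback recovers the apostrophe.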

@dan-blanchard
Member

@YesThatAllen thanks for the tip about ioreg. I had no idea we finally had a way to generate MacRoman data. I've added a test for this, and revived this very old PR and manually merged it via c292b52.
