Support for EBCDIC detection #122
Comments
Do you have access to data that we could use to train a detection algorithm for EBCDIC?
Thanks! Maybe. I'm working with thousands of old NASA data files from lots of different experiments at http://spdf.gsfc.nasa.gov/pub/data/
Since they're just single-byte character encodings from the looks of it, we could simply add them to the list of encodings we train models for using Wikipedia data in #99. We would only support the ones that are built into Python, though: cp037, cp500, and cp1140. Having the NASA data available as test data would be really helpful.
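To sketch what "built into Python" means here: those code pages ship as standard-library codecs, so round-tripping text through them needs no third-party packages (the sample string below is just an illustration, not from the NASA data):

```python
# Sketch: the EBCDIC code pages mentioned above are available as
# standard-library codecs (cp037, cp500, cp1140).
sample = "HELLO, WORLD 123"

for codec in ("cp037", "cp500", "cp1140"):
    encoded = sample.encode(codec)
    # The bytes differ from ASCII: 'H' is 0xC8 in EBCDIC, not 0x48.
    assert encoded != sample.encode("ascii")
    assert encoded.decode(codec) == sample
```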
Cool! Note that the NASA data has no newlines and appears to be one really long string until you figure out the "physical block size" that was used. Some text-processing programs expect newlines and don't like "really long lines", but I don't know if that's an issue with your training pipeline.
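For what it's worth, re-blocking such a file is a one-liner once the record length is known, as in this minimal sketch (the 80 below is a hypothetical card-image width, not anything measured from the actual NASA files):

```python
def split_records(data: bytes, record_length: int) -> list[bytes]:
    """Split an unbroken byte stream into fixed-length records.

    Files like these have no newlines; the record length (the
    "physical block size") must be known or guessed separately.
    """
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]

# Hypothetical 80-column card-image data with no line terminators.
blob = b"A" * 200
records = split_records(blob, 80)
assert len(records) == 3           # the last record is a short remainder
assert len(records[-1]) == 40
```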
Well, that's not something I like to promise these days, as I only have about 2 days per month to work on chardet. It does seem like something that could be added to the big PR I've been working on for over a year, though (#99). I've recently started splitting parts off of that to make it less of a monster and get us some forward progress, so I'm optimistic it will land in the next few months.
I'm definitely +1 on EBCDIC support. It's not just for "old data". There are still IBM mainframes and minicomputers alive and kicking, generating brand-new data every day in EBCDIC. Indeed, the midrange IBM i is undergoing something of an open source renaissance (with a lot of Unix-style software), so it is quite common for the same machine to generate both ASCII-based and EBCDIC-based data.
When recovering old scientific datasets, which were often stored in fixed-length blocks of characters, it is important to know whether the data is in EBCDIC (e.g. cp500) or ASCII.
Support in chardet for EBCDIC would be great.
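As a rough illustration of why the EBCDIC/ASCII distinction is tractable, here is a naive printable-character heuristic (this is not chardet's actual algorithm, just a sketch): ASCII text decoded as cp500 comes out mostly as accented letters and control characters, while genuine EBCDIC text decodes cleanly.

```python
import string

ASCII_PRINTABLE = set(string.printable)

def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII."""
    return sum(c in ASCII_PRINTABLE for c in text) / max(len(text), 1)

def looks_like_ebcdic(data: bytes) -> bool:
    """Naive heuristic (not chardet's algorithm): guess EBCDIC when
    decoding as cp500 yields more printable ASCII than decoding as
    Latin-1 does."""
    return (printable_ratio(data.decode("cp500"))
            > printable_ratio(data.decode("latin-1")))

assert looks_like_ebcdic("FIXED BLOCK DATA".encode("cp500"))
assert not looks_like_ebcdic(b"FIXED BLOCK DATA")
```

A real detector would need per-code-page statistics (as in the Wikipedia-trained models discussed above), since the single-byte EBCDIC variants differ only in a handful of byte positions.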