Support for EBCDIC detection #122

Open
nealmcb opened this issue Apr 24, 2017 · 6 comments

Comments

@nealmcb

nealmcb commented Apr 24, 2017

When recovering old scientific datasets, which were often stored in fixed-length blocks of characters, it is important to know whether the data is in EBCDIC (e.g. cp500) or ASCII.
Support in chardet for EBCDIC would be great.
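To make the ASCII-versus-EBCDIC question concrete, here is a minimal sketch of one possible heuristic: decode the same bytes both ways and see which decoding yields mostly printable text. This is not how chardet works; `printable_fraction`, `guess_family`, and the 0.9 threshold are hypothetical names and values chosen only for illustration, and cp500 stands in for the whole EBCDIC family.

```python
import string

# Characters we expect to dominate in plain-text records.
PRINTABLE = set(string.printable)


def printable_fraction(data: bytes, encoding: str) -> float:
    """Fraction of bytes that decode to printable ASCII-range characters."""
    if not data:
        return 0.0
    # Both codecs are single-byte, so each byte yields exactly one character.
    text = data.decode(encoding, errors="replace")
    return sum(ch in PRINTABLE for ch in text) / len(text)


def guess_family(data: bytes) -> str:
    """Return 'ascii', 'ebcdic', or 'unknown' (hypothetical helper, not a chardet API)."""
    ascii_score = printable_fraction(data, "ascii")
    ebcdic_score = printable_fraction(data, "cp500")
    if max(ascii_score, ebcdic_score) < 0.9:  # arbitrary threshold, an assumption
        return "unknown"
    return "ascii" if ascii_score >= ebcdic_score else "ebcdic"


record = "STATION 042  TEMP 12.5"
print(guess_family(record.encode("cp500")))
print(guess_family(record.encode("ascii")))
```

The heuristic works here because EBCDIC letters and digits sit above 0x80, where ASCII has no characters at all, while ASCII letters and digits land on control or accented characters when read as cp500. It cannot tell one EBCDIC code page from another.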

@sigmavirus24
Member

Do you have access to data that we could use to train a detection algorithm for EBCDIC?

@nealmcb
Author

nealmcb commented Apr 25, 2017

Thanks! Maybe. I'm working with thousands of old NASA data files from lots of different experiments at http://spdf.gsfc.nasa.gov/pub/data/
I haven't looked for a broader set.
What do you need for training?

@dan-blanchard
Member

dan-blanchard commented Apr 25, 2017

Since they look like plain single-byte character encodings, we could just add them to the list of encodings we train models for from Wikipedia data in #99. We would only support the ones that are built into Python, though: cp037, cp500, and cp1140. Having the NASA data available as test data would be really helpful.
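As a rough illustration of what "single-byte" means for the three code pages named above, the sketch below decodes every byte value with each of Python's `cp037`, `cp500`, and `cp1140` codecs and lists the byte values where two code pages disagree. The helper names `decode_table` and `differing_bytes` are made up for this example and are not part of chardet.

```python
EBCDIC_CODECS = ("cp037", "cp500", "cp1140")  # the three named in this thread


def decode_table(encoding: str) -> list:
    """Decode each of the 256 possible byte values on its own."""
    return [bytes([b]).decode(encoding, errors="replace") for b in range(256)]


def differing_bytes(enc_a: str, enc_b: str) -> list:
    """Byte values that the two single-byte codecs decode to different characters."""
    table_a, table_b = decode_table(enc_a), decode_table(enc_b)
    return [b for b in range(256) if table_a[b] != table_b[b]]


# Compare every pair of code pages and print where they disagree.
for i, enc_a in enumerate(EBCDIC_CODECS):
    for enc_b in EBCDIC_CODECS[i + 1:]:
        diff = differing_bytes(enc_a, enc_b)
        print(f"{enc_a} vs {enc_b}: {len(diff)} differing byte values:",
              " ".join(f"0x{b:02X}" for b in diff))
```

Any detector that wants to tell these code pages apart has to rely on the byte values this prints, since the other positions decode identically.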

@nealmcb
Author

nealmcb commented Apr 25, 2017

Cool! Note that the NASA data has no newlines, so each file appears to be one really long string until you figure out the "physical block size" that was used. Some text-processing programs expect newlines and don't like "really long lines", but I don't know if that's an issue with your training pipeline.
I still need to sort out the EBCDIC files from the others, since there is also a mix of ASCII and binary. What is your timing?
Distinguishing the different code pages could be a challenge.
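For readers unfamiliar with blocked data, here is a minimal sketch of turning a newline-free file into lines once the block size is known. The function name `split_records`, the 8-byte block size, and the cp500 default are assumptions for illustration only; real files need the block size that was actually used when they were written.

```python
def split_records(data: bytes, block_size: int, encoding: str = "cp500") -> list:
    """Decode fixed-length records and return them as a list of strings."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        # Each slice is one physical record; trailing pad spaces are dropped.
        data[i:i + block_size].decode(encoding, errors="replace").rstrip()
        for i in range(0, len(data), block_size)
    ]


# Three space-padded 8-byte records with no newlines between them.
raw = ("AAAA    " + "BBBBBB  " + "CC").encode("cp500")
for line in split_records(raw, block_size=8):
    print(line)
```

Joining the result with `"\n"` gives text that line-oriented tools can handle.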

@dan-blanchard
Member

> What is your timing?

Well, that's not something I like to promise these days, as I only have about 2 days per month to work on chardet. It does seem like this could be added to the big PR I've been working on for over a year, though (#99). I've recently started splitting parts off from that PR to make it less of a monster and get us some forward progress, so I'm optimistic it will land in the next few months.

@jkyeung

jkyeung commented Mar 26, 2019

I'm definitely +1 on EBCDIC support. It's not just for "old data". There are still IBM mainframes and minicomputers alive and kicking, generating brand-new data every day in EBCDIC. Indeed, the midrange IBM i is undergoing something of an open source renaissance (with a lot of Unix-style software), so it is quite common for the same machine to generate both ASCII-based and EBCDIC-based data.
