Support for EBCDIC detection #122
Comments
Do you have access to data that we could use to train a detection algorithm for EBCDIC?
Thanks! Maybe. I'm working with thousands of old NASA data files from lots of different experiments at http://spdf.gsfc.nasa.gov/pub/data/
Since they're just single-byte character encodings from the looks of it, we could simply add them to the list of encodings we train models for using Wikipedia data in #99. We would only support the ones that are built into Python, though: cp037, cp500, and cp1140. Having the NASA data available as test data would be really helpful.
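To sketch what "built into Python" means here: those code pages ship as standard-library codecs, so round-tripping text through them needs no third-party packages (the sample string below is just an illustration, not from the NASA data):

```python
# Sketch: the EBCDIC code pages mentioned above are available as
# standard-library codecs (cp037, cp500, cp1140).
sample = "HELLO, WORLD 123"

for codec in ("cp037", "cp500", "cp1140"):
    encoded = sample.encode(codec)
    # The bytes differ from ASCII: 'H' is 0xC8 in EBCDIC, not 0x48.
    assert encoded != sample.encode("ascii")
    assert encoded.decode(codec) == sample
```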
Cool! Note that the NASA data has no newlines and appears to be one really long string until you figure out the "physical block size" that was used. Some text-processing programs expect newlines and don't like "really long lines", but I don't know if that's an issue with your training pipeline.
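For what it's worth, re-blocking such a file is a one-liner once the record length is known, as in this minimal sketch (the 80 below is a hypothetical card-image width, not anything measured from the actual NASA files):

```python
def split_records(data: bytes, record_length: int) -> list[bytes]:
    """Split an unbroken byte stream into fixed-length records.

    Files like these have no newlines; the record length (the
    "physical block size") must be known or guessed separately.
    """
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]

# Hypothetical 80-column card-image data with no line terminators.
blob = b"A" * 200
records = split_records(blob, 80)
assert len(records) == 3           # the last record is a short remainder
assert len(records[-1]) == 40
```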
Well, that's not something I like to promise these days, as I only have about 2 days per month to work on chardet. It does seem like something that could be added to the big PR I've been working on for over a year, though (#99). I've recently started splitting parts off of that to make it less of a monster and get us some forward progress, so I'm optimistic it will land in the next few months.
I'm definitely +1 on EBCDIC support. It's not just for "old data". There are still IBM mainframes and minicomputers alive and kicking, generating brand-new data every day in EBCDIC. Indeed, the midrange IBM i is undergoing something of an open source renaissance (with a lot of Unix-style software), so it is quite common for the same machine to generate both ASCII-based and EBCDIC-based data.
When recovering old scientific datasets, which were often stored in fixed-length blocks of characters, it is important to know whether the data is in EBCDIC (e.g. cp500) or ASCII.
Support in chardet for EBCDIC would be great.
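As a rough illustration of why the EBCDIC/ASCII distinction is tractable, here is a naive printable-character heuristic (this is not chardet's actual algorithm, just a sketch): ASCII text decoded as cp500 comes out mostly as accented letters and control characters, while genuine EBCDIC text decodes cleanly.

```python
import string

ASCII_PRINTABLE = set(string.printable)

def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII."""
    return sum(c in ASCII_PRINTABLE for c in text) / max(len(text), 1)

def looks_like_ebcdic(data: bytes) -> bool:
    """Naive heuristic (not chardet's algorithm): guess EBCDIC when
    decoding as cp500 yields more printable ASCII than decoding as
    Latin-1 does."""
    return (printable_ratio(data.decode("cp500"))
            > printable_ratio(data.decode("latin-1")))

assert looks_like_ebcdic("FIXED BLOCK DATA".encode("cp500"))
assert not looks_like_ebcdic(b"FIXED BLOCK DATA")
```

A real detector would need per-code-page statistics (as in the Wikipedia-trained models discussed above), since the single-byte EBCDIC variants differ only in a handful of byte positions.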