Retraining and storing data #48
Comments
uchardet-enhanced actually has tools for retraining the C code, so we might not need to do this if we switch to just being a CFFI wrapper around that. That said, uchardet-enhanced hasn't been touched in almost 3 years, so I'm a bit torn about relying on it. I'd also like to see the data files stored in a language-agnostic format, but that might come second to speed for most users.
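For reference, a CFFI wrapper could look roughly like the sketch below. This is only an illustration: the C function names follow the stock uchardet API (uchardet_new, uchardet_handle_data, uchardet_data_end, uchardet_get_charset, uchardet_delete), and uchardet-enhanced's actual headers and library name may differ.

```python
# Sketch only: function names assume a uchardet-style C API; the
# uchardet-enhanced headers and shared-library name may differ.
import cffi

ffi = cffi.FFI()
ffi.cdef("""
    typedef void *uchardet_t;
    uchardet_t uchardet_new(void);
    void uchardet_delete(uchardet_t ud);
    int uchardet_handle_data(uchardet_t ud, const char *data, size_t len);
    void uchardet_data_end(uchardet_t ud);
    const char *uchardet_get_charset(uchardet_t ud);
""")
lib = ffi.dlopen("libuchardet.so")  # assumed library name


def detect(byte_str):
    """Return the charset name the C detector guesses for byte_str."""
    ud = lib.uchardet_new()
    try:
        lib.uchardet_handle_data(ud, byte_str, len(byte_str))
        lib.uchardet_data_end(ud)
        return ffi.string(lib.uchardet_get_charset(ud)).decode("ascii")
    finally:
        lib.uchardet_delete(ud)
```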
Well, so much for that thought. I'm not convinced their updated tables are actually correct. If I swap cChardet in for chardet and run all of our unit tests, there are 53 test failures (vs. our 1), so it looks like we're much slower but more accurate at this point.
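For anyone who wants to reproduce the comparison, a quick-and-dirty script along these lines is enough. It assumes the test files live in encoding-named subdirectories the way our tests directory is laid out; the "tests" path is just a placeholder.

```python
# Rough comparison of chardet vs. cChardet over a directory of test files.
# Assumes each subdirectory of TEST_ROOT is named after the expected encoding.
import os

import cchardet
import chardet

TEST_ROOT = "tests"  # placeholder path


def compare(test_root=TEST_ROOT):
    disagreements = []
    for encoding in os.listdir(test_root):
        subdir = os.path.join(test_root, encoding)
        if not os.path.isdir(subdir):
            continue
        for name in os.listdir(subdir):
            with open(os.path.join(subdir, name), "rb") as test_file:
                data = test_file.read()
            ours = (chardet.detect(data)["encoding"] or "").lower()
            theirs = (cchardet.detect(data)["encoding"] or "").lower()
            if ours != theirs:
                disagreements.append((encoding, name, ours, theirs))
    return disagreements
```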
I've created a new branch that makes chardet work like cChardet/uchardet-enhanced, but in pure Python. It's called feature/uchardet-enhanced-upstream. It performs worse than our current version, so it probably wasn't worth the effort. Oh well.
I have created 7 new language models for Central European countries. The Romanian and Hungarian language models can't distinguish between cp1250 and latin2, because every letter of the Romanian and Hungarian alphabets has the same place in both tables :-( . All the new .py files, together with the modified sbcsgroupprober.py, are in my repository and my fork.
The language models are intentionally language-specific (and not encoding-specific), so it's actually the character-to-order maps that index into the language model tables. Anyway, I don't know much about what subset of CP1250 is used for Hungarian and Romanian, but according to the table at the top of the Wikipedia article, it looks like at a minimum the Euro symbol and the quotation marks are in different places. Therefore, we should be able to differentiate between the two based on at least those characters. If you have updated versions of the language models that work better in your fork, please create a pull request.
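Here's a quick interpreter check that illustrates both points: the accented Hungarian letters really do map to identical bytes in cp1250 and latin2, while the Euro sign and the typographic quotation marks exist only in cp1250, so those bytes are usable as evidence for one encoding over the other.

```python
# The accented letters of the Hungarian alphabet encode to the same byte
# in cp1250 and latin2, so letter frequencies alone can't separate them.
for ch in "áéíóöőúüűÁÉÍÓÖŐÚÜŰ":
    assert ch.encode("cp1250") == ch.encode("latin2")

# The Euro sign and typographic quotation marks exist only in cp1250;
# latin2 has no mapping for them at all.
for ch in "€„“”":
    print(repr(ch), "cp1250 byte:", ch.encode("cp1250"))
    try:
        ch.encode("latin2")
    except UnicodeEncodeError:
        print(repr(ch), "is not encodable in latin2")
```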
From Wikipedia:
I've created a small script, 'create_language_model.py', for building new Python language model files.
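For context, the general shape of such a script is roughly the sketch below: rank the bytes in a training corpus to get a character-to-order map, then bucket bigram counts over the most frequent characters into the small set of likelihood levels the single-byte probers use. The thresholds and names here are illustrative only and are not taken from create_language_model.py.

```python
# Illustrative sketch of building a single-byte language model: a
# character-to-order map plus a SAMPLE_SIZE x SAMPLE_SIZE bigram matrix.
# Thresholds and names are made up; they are not from create_language_model.py.
from collections import Counter

SAMPLE_SIZE = 64  # number of most-frequent characters the model tracks


def build_model(corpus_bytes):
    # Rank byte values by frequency; the rank ("order") is what the
    # char-to-order map stores for each of the 256 possible bytes.
    freq = Counter(corpus_bytes)
    char_to_order = [255] * 256  # 255 means "not one of the frequent characters"
    for order, (byte, _) in enumerate(freq.most_common(SAMPLE_SIZE)):
        char_to_order[byte] = order

    # Count bigrams over the frequent characters only.
    bigram_counts = Counter()
    prev_order = None
    for byte in corpus_bytes:
        order = char_to_order[byte]
        if prev_order is not None and prev_order < SAMPLE_SIZE and order < SAMPLE_SIZE:
            bigram_counts[(prev_order, order)] += 1
        prev_order = order

    # Bucket the counts into coarse likelihood levels (0-3), which is the
    # form the single-byte probers consume. Cutoffs here are arbitrary.
    precedence_matrix = [0] * (SAMPLE_SIZE * SAMPLE_SIZE)
    for (first, second), count in bigram_counts.items():
        if count >= 1000:
            level = 3
        elif count >= 100:
            level = 2
        else:
            level = 1
        precedence_matrix[first * SAMPLE_SIZE + second] = level
    return char_to_order, precedence_matrix
```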
#99 still uses Python language model files, but it at least moves us in the right direction by making retraining possible at all.
Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.
The problems I see with the current approach are:
So if we're in agreement that the current approach is bad, how do we want to fix it?
I propose that we:
- As part of the setup.py install process, convert the files to pickled dictionaries.
- Change chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles (see the sketch below).

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.

@sigmavirus24, what do you think?
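A rough sketch of the caching part, using the existing UniversalDetector interface (reset/feed/close/result); the data loading would then happen only once, when the probers are first constructed:

```python
# Minimal sketch of caching a single UniversalDetector inside chardet.detect
# so prober objects (and their data files) are only built once per process.
from chardet.universaldetector import UniversalDetector

_cached_detector = None


def detect(byte_str):
    global _cached_detector
    if _cached_detector is None:
        _cached_detector = UniversalDetector()
    _cached_detector.reset()
    _cached_detector.feed(byte_str)
    _cached_detector.close()
    return _cached_detector.result
```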