
Retraining and storing data #48

Open
dan-blanchard opened this issue Jan 6, 2015 · 8 comments
@dan-blanchard
Member

dan-blanchard commented Jan 6, 2015

Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.

The problems I see with the current approach are:

  1. Storing large amounts of data in code makes it much harder to read and to tell which files are just data and which contain actual encoding/prober-specific logic.
  2. Retraining the models we have (which are currently based on data from the late 90s) is difficult, because we would have to write a script that generates Python code. Yuck.
  3. It makes the barrier to entry for adding support for new encodings higher than it should be. We should have a tool that takes a pile of text in a given encoding, generates the tables we need, and automatically determines things like the typical "positive ratio" (which is really the ratio of the token frequency of the 512 most common character bigram types to the total number of bigram tokens in a "typical" corpus); see the sketch just after this list. The current layout of the code is very confusing to a new contributor (see point 1).
  4. Because retraining is difficult, chardet is going to get less accurate over time. Speaking as an NLP researcher, I can confidently say that the genre of a text plays a big role in how likely certain character sequences are, and as time goes on the typical web text we see looks less and less like it did when Mozilla collected their original data. Also, our accuracy for text that isn't from webpages is probably not that great.
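
For point 3, here is a rough idea of the kind of calculation such a training tool would need to do. This is just a sketch; no such script exists in the repo yet, and the corpus path is made up:

```python
# Hedged sketch of one piece of a retraining tool: count character bigrams
# in a training corpus and derive the "positive ratio" described in point 3,
# i.e. the share of all bigram tokens covered by the 512 most common
# bigram types.
from collections import Counter

def positive_ratio(text, num_types=512):
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total_tokens = sum(bigrams.values())
    top_tokens = sum(count for _, count in bigrams.most_common(num_types))
    return top_tokens / float(total_tokens) if total_tokens else 0.0

# Usage (hypothetical corpus file): decode the corpus with the target
# encoding first, so bigrams are counted over characters, not raw bytes.
# ratio = positive_ratio(open('hungarian_corpus.txt', encoding='iso-8859-2').read())
```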

So if we're in agreement that the current approach is bad, how do we want to fix it?

I propose that we:

  1. Store the data in either JSON or YAML formats in the GitHub repository. This would potentially allow us to share our data with chardet ports written in other languages (if they wanted to support our format).
  2. As part of the setup.py install process, convert the files to pickled dictionaries.
  3. Modify the prober initializers to take a path to either a pickled dictionary or a JSON/YAML file and load up that data at run-time. Supporting both types of file would simplify development, since we could play around with models without having to constantly convert them to pickles.
  4. Modify chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles.

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.
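
To make points 3 and 4 more concrete, here's roughly what I have in mind. The load_model helper and the module-level cache below are hypothetical, not existing chardet code:

```python
# Hedged sketch of points 3 and 4 above.
import json
import pickle

def load_model(path):
    """Load a prober's language model from a pickle (fast path, built at
    install time) or from JSON (handy while experimenting with models)."""
    if path.endswith('.pickle'):
        with open(path, 'rb') as model_file:
            return pickle.load(model_file)
    with open(path, encoding='utf-8') as model_file:
        return json.load(model_file)

_DETECTOR = None  # cached UniversalDetector, so detect() reuses its probers

def detect(byte_str):
    """chardet.detect-style helper that reuses one UniversalDetector
    instead of rebuilding every prober (and reloading pickles) per call."""
    global _DETECTOR
    if _DETECTOR is None:
        from chardet.universaldetector import UniversalDetector
        _DETECTOR = UniversalDetector()
    _DETECTOR.reset()
    _DETECTOR.feed(byte_str)
    _DETECTOR.close()
    return _DETECTOR.result
```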

@sigmavirus24, what do you think?

@dan-blanchard
Member Author

uchardet-enhanced actually has tools for retraining the C code, so we might not need to do this if we switch to just being a CFFI wrapper around that.

That said, uchardet-enhanced hasn't been touched in almost 3 years, so I'm a bit torn about relying on that. I'd also like to see the data files stored in a language-agnostic format, but that might come second to speed for most users.

@dan-blanchard
Member Author

Well, so much for that thought. I'm not convinced their updated tables are actually correct. If I swap cChardet in for chardet and run all of our unit tests, there are actually 53 test failures (vs our 1), so it looks like we're much slower but more accurate at this point.

@dan-blanchard
Member Author

I've created a new branch that makes chardet work like cChardet/uchardet-enhanced, but in pure Python. It's called feature/uchardet-enhanced-upstream. It performs worse than our current version, so it probably wasn't worth the effort. Oh well.

@ghost

ghost commented Feb 9, 2015

I have created 7 new language models for Central European languages. The Romanian and Hungarian language models can't distinguish between CP1250 and Latin-2, because all the letters of the Romanian and Hungarian alphabets occupy the same positions in both tables :-( . All the new .py files, along with the modified sbcsgroupprober.py, are in my repository and my fork.

@dan-blanchard
Member Author

The language models are intentionally language-specific (and not encoding-specific), so it's actually the character-to-order maps that index into the language model tables. Anyway, I don't know much about what subset of CP1250 is used for Hungarian and Romanian, but according to the table at the top of the Wikipedia article, it looks like at a minimum the Euro symbol and the quotation marks are in different places. Therefore, we should be able to differentiate between the two based on at least those characters.
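
To illustrate how those two tables interact (the names below are made up for the example, not chardet's actual attribute names):

```python
# Hedged sketch: one language model per *language*, shared by every
# encoding of that language; each encoding supplies its own char-to-order
# map that converts a raw byte into the model's frequency rank.
SAMPLE_SIZE = 64  # the language models only rank the 64 most frequent letters

def count_bigram_classes(raw_bytes, char_to_order, precedence_matrix):
    """Rough sketch of how a single-byte prober scores text.

    char_to_order[b] is the frequency rank of byte b for one specific
    encoding; precedence_matrix[prev * SAMPLE_SIZE + cur] is the likelihood
    class of that letter bigram for the language as a whole.
    """
    counters = [0, 0, 0, 0]  # one counter per likelihood class
    prev_order = None
    for byte in raw_bytes:
        order = char_to_order[byte]
        if order < SAMPLE_SIZE:  # one of the 64 tracked letters
            if prev_order is not None:
                counters[precedence_matrix[prev_order * SAMPLE_SIZE + order]] += 1
            prev_order = order
        else:
            prev_order = None  # symbol/digit/control byte: break the bigram
    return counters
```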

If you have updated versions of the language models that work better in your fork, please create a pull request.

@ghost

ghost commented Feb 10, 2015

From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more.
This means that ISO-8859-2 doesn't contain the "special" quotation marks (U+201E, U+201C, U+201D), the Euro symbol, etc. Without these "more characters" in the analyzed text, you can't distinguish between Latin-2 and CP1250. Simply put, ISO-8859-2 is a subset of CP1250 with some characters rearranged (big thanks, M$). This means that every Romanian or Hungarian text (it isn't true for other Central European languages) which contains only ISO-8859-2 characters can just as well be treated as CP1250.
In my opinion, for such text it is simply better to report any ISO-8859-2 result as CP1250, or to add a test (in sbcharsetprober.py) that checks whether the detected text contains any of those extra characters, because the language models are based on bigram (two-character sequence) counts that cover only the 64 most frequent letters (no symbols).
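
For example, a check along these lines could decide between the two (just a sketch; the function name and placement are made up, not part of chardet):

```python
# Hedged sketch: if a buffer is consistent with both ISO-8859-2 and CP1250,
# bytes in the 0x80-0x9F range (Euro sign, curly quotation marks, dashes,
# ...) only make sense as CP1250, since ISO-8859-2 maps that range to C1
# control characters that rarely appear in real text.
CP1250_ONLY_RANGE = range(0x80, 0xA0)

def prefer_cp1250(raw):
    """Return 'windows-1250' if the buffer uses any CP1250-only byte,
    otherwise fall back to 'ISO-8859-2'."""
    if any(b in CP1250_ONLY_RANGE for b in raw):
        return 'windows-1250'
    return 'ISO-8859-2'

# Example: Hungarian text with lower/upper curly quotation marks
sample = 'Azt mondta: „szia”'.encode('windows-1250')
print(prefer_cp1250(sample))  # -> windows-1250
```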

@ghost

ghost commented Feb 11, 2015

I've created a small script, create_language_model.py, for building new Python language model files.
It needs the CharsetsTabs.txt file to run correctly. Please read the comments in its header.

@dan-blanchard
Member Author

#99 still uses Python language model files, but it at least moves us in the right direction by making retraining possible at all.

dan-blanchard modified the milestones: 3.0, 4.0 on Apr 11, 2017