
Retraining and storing data #48

Open
dan-blanchard opened this issue Jan 6, 2015 · 8 comments
@dan-blanchard
Member

dan-blanchard commented Jan 6, 2015

Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.

The problems I see with the current approach are:

  1. Storing large amounts of data in code makes it much harder to read and to tell which files are just data and which contain actual encoding/prober-specific logic.
  2. Retraining the models we have (which are currently based on data from the late 90s) is difficult, because we would have to write a script that generates Python code. Yuck.
  3. It makes the barrier to entry for adding support for new encodings higher than it should be. We should have a tool that takes a pile of text in a given encoding, generates the tables we need, and automatically determines things like the typical "positive ratio" (which is really the ratio of the token frequency of the 512 most common character bigram types to the total number of bigram tokens in a "typical" corpus); see the sketch just after this list. The current layout of the code is very confusing to a new contributor (see point 1).
  4. Because retraining is difficult, chardet is going to get less accurate over time. Speaking as an NLP researcher, I can confidently say that the genre of a text plays a big role in how likely certain character sequences are, and as time goes on the typical web text we see looks less and less like it did when Mozilla collected their original data. Also, our accuracy for text that isn't from webpages is probably not that great.
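
For point 3, here is a rough idea of the kind of calculation such a training tool would need to do. This is just a sketch; no such script exists in the repo yet, and the corpus path is made up:

```python
# Hedged sketch of one piece of a retraining tool: count character bigrams
# in a training corpus and derive the "positive ratio" described in point 3,
# i.e. the share of all bigram tokens covered by the 512 most common
# bigram types.
from collections import Counter

def positive_ratio(text, num_types=512):
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total_tokens = sum(bigrams.values())
    top_tokens = sum(count for _, count in bigrams.most_common(num_types))
    return top_tokens / float(total_tokens) if total_tokens else 0.0

# Usage (hypothetical corpus file): decode the corpus with the target
# encoding first, so bigrams are counted over characters, not raw bytes.
# ratio = positive_ratio(open('hungarian_corpus.txt', encoding='iso-8859-2').read())
```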

So if we're in agreement that the current approach is bad, how do we want to fix it?

I propose that we:

  1. Store the data in either JSON or YAML formats in the GitHub repository. This would potentially allow us to share our data with chardet ports written in other languages (if they wanted to support our format).
  2. As part of the setup.py install process, convert the files to pickled dictionaries.
  3. Modify the prober initializers to take a path to either a pickled dictionary or a JSON/YAML file and load up that data at run-time. Supporting both types of file would simplify development, since we could play around with models without having to constantly convert them to pickles.
  4. Modify chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles.

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.
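
To make points 3 and 4 more concrete, here's roughly what I have in mind. The load_model helper and the module-level cache below are hypothetical, not existing chardet code:

```python
# Hedged sketch of points 3 and 4 above.
import json
import pickle

def load_model(path):
    """Load a prober's language model from a pickle (fast path, built at
    install time) or from JSON (handy while experimenting with models)."""
    if path.endswith('.pickle'):
        with open(path, 'rb') as model_file:
            return pickle.load(model_file)
    with open(path, encoding='utf-8') as model_file:
        return json.load(model_file)

_DETECTOR = None  # cached UniversalDetector, so detect() reuses its probers

def detect(byte_str):
    """chardet.detect-style helper that reuses one UniversalDetector
    instead of rebuilding every prober (and reloading pickles) per call."""
    global _DETECTOR
    if _DETECTOR is None:
        from chardet.universaldetector import UniversalDetector
        _DETECTOR = UniversalDetector()
    _DETECTOR.reset()
    _DETECTOR.feed(byte_str)
    _DETECTOR.close()
    return _DETECTOR.result
```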

@sigmavirus24, what do you think?

@dan-blanchard
Member Author

uchardet-enhanced actually has tools for retraining the C code, so we might not need to do this if we switch to just being a CFFI wrapper around that.

That said, uchardet-enhanced hasn't been touched in almost 3 years, so I'm a bit torn about relying on that. I'd also like to see the data files stored in a language-agnostic format, but that might come second to speed for most users.

@dan-blanchard
Member Author

Well, so much for that thought. I'm not convinced their updated tables are actually correct. If I swap cChardet in for chardet and run all of our unit tests, there are actually 53 test failures (vs our 1), so it looks like we're much slower but more accurate at this point.

@dan-blanchard
Member Author

I've created a new branch that makes chardet work like cChardet/uchardet-enhanced, but in pure Python. It's called feature/uchardet-enhanced-upstream. It performs worse than our current version, so it probably wasn't worth the effort. Oh well.

@ghost

ghost commented Feb 9, 2015

I have created 7 new language models for Central European languages. The Romanian and Hungarian language models can't distinguish between CP1250 and Latin-2, because all the letters of the Romanian and Hungarian alphabets occupy the same positions in both tables :-( . All the new .py files, along with the modified sbcsgroupprober.py, are in my repository and my fork.

@dan-blanchard
Member Author

The language models are intentionally language-specific (and not encoding-specific), so it's actually the character-to-order maps that index into the language model tables. Anyway, I don't know much about what subset of CP1250 is used for Hungarian and Romanian, but according to the table at the top of the Wikipedia article, it looks like at a minimum the Euro symbol and the quotation marks are in different places. Therefore, we should be able to differentiate between the two based on at least those characters.
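
To illustrate how those two tables interact (the names below are made up for the example, not chardet's actual attribute names):

```python
# Hedged sketch: one language model per *language*, shared by every
# encoding of that language; each encoding supplies its own char-to-order
# map that converts a raw byte into the model's frequency rank.
SAMPLE_SIZE = 64  # the language models only rank the 64 most frequent letters

def count_bigram_classes(raw_bytes, char_to_order, precedence_matrix):
    """Rough sketch of how a single-byte prober scores text.

    char_to_order[b] is the frequency rank of byte b for one specific
    encoding; precedence_matrix[prev * SAMPLE_SIZE + cur] is the likelihood
    class of that letter bigram for the language as a whole.
    """
    counters = [0, 0, 0, 0]  # one counter per likelihood class
    prev_order = None
    for byte in raw_bytes:
        order = char_to_order[byte]
        if order < SAMPLE_SIZE:  # one of the 64 tracked letters
            if prev_order is not None:
                counters[precedence_matrix[prev_order * SAMPLE_SIZE + order]] += 1
            prev_order = order
        else:
            prev_order = None  # symbol/digit/control byte: break the bigram
    return counters
```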

If you have updated versions of the language models that work better in your fork, please create a pull request.

@ghost

ghost commented Feb 10, 2015

From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more.
This means that ISO-8859-2 doesn't contain the "special" quotation marks (U+201E, U+201C, U+201D), the Euro symbol, etc. Without these "more characters" in the analyzed text, you can't distinguish between Latin-2 and CP1250. Simply put, ISO-8859-2 is a subset of CP1250 with some characters rearranged (big thanks, M$). This means that every Romanian or Hungarian text (it isn't true for other Central European languages) which contains only ISO-8859-2 characters can just as well be treated as CP1250.
In my opinion, for such text it is simply better to report any ISO-8859-2 result as CP1250, or to add a test (in sbcharsetprober.py) that checks whether the detected text contains any of those extra characters, because the language models are based on bigram (two-character sequence) counts that cover only the 64 most frequent letters (no symbols).
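
For example, a check along these lines could decide between the two (just a sketch; the function name and placement are made up, not part of chardet):

```python
# Hedged sketch: if a buffer is consistent with both ISO-8859-2 and CP1250,
# bytes in the 0x80-0x9F range (Euro sign, curly quotation marks, dashes,
# ...) only make sense as CP1250, since ISO-8859-2 maps that range to C1
# control characters that rarely appear in real text.
CP1250_ONLY_RANGE = range(0x80, 0xA0)

def prefer_cp1250(raw):
    """Return 'windows-1250' if the buffer uses any CP1250-only byte,
    otherwise fall back to 'ISO-8859-2'."""
    if any(b in CP1250_ONLY_RANGE for b in raw):
        return 'windows-1250'
    return 'ISO-8859-2'

# Example: Hungarian text with lower/upper curly quotation marks
sample = 'Azt mondta: „szia”'.encode('windows-1250')
print(prefer_cp1250(sample))  # -> windows-1250
```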

@ghost

ghost commented Feb 11, 2015

I've created a small script, create_language_model.py, for building new Python language model files.
It needs the CharsetsTabs.txt file to run correctly. Please read the comments in its header.

@dan-blanchard
Member Author

#99 still uses Python language model files, but it at least moves us in the right direction by making retraining possible at all.

dan-blanchard modified the milestones: 3.0, 4.0 on Apr 11, 2017