
[WIP] Retrain SBCS Models and some refactoring #99

Open · wants to merge 15 commits into main
Conversation

@dan-blanchard dan-blanchard commented Apr 10, 2017

This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).

The main changes are:

  • Cleans up the abandoned PR #52 ("New language models added; old inacurate models was rebuilded. Hungarian test files changed. Script for language model building added")
  • Adds SBCS language model training script that can train from text files or wikipedia data
  • Adds support for several languages we were missing (I'll enumerate them all when the WIP tag is removed)
  • Makes test.py test all languages and encodings that we have data for, since we now have models for them.
  • Retrains all SBCS models, and even adds support for an English language model that we might be able to use to get rid of the Latin-1-specific prober (more testing is needed here).
  • Fixes a bug in the XML tag filter where parts of the XML tags themselves would be retained (see the first sketch after this list).
  • Adds language to UniversalDetector output (see the detect() usage sketch after this list).
  • Eliminates wrap_ord usage, which provides a nice speedup.
  • All SBCS models are now stored as dicts of dicts, because that is way faster than storing them as giant lists. The model files are much longer (and a bit harder to read), but no one really needs to look through them manually except when retraining them anyway (see the storage comparison after this list).
  • Adds a languages metadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, whether it uses ASCII, etc.); a sketch of one record follows this list.
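To illustrate the XML tag filter bug, here is a toy reproduction of the failure mode; the regex patterns are mine for illustration, not the prober's actual code:

```python
import re

# Illustration of the class of bug described above: an unanchored tag
# pattern strips up to, but not including, the closing '>', so a piece
# of each tag leaks into the text that gets fed to the probers.

buggy_tag = re.compile(rb'<[^>]*')   # leaves the trailing '>' behind
fixed_tag = re.compile(rb'<[^>]*>')  # consumes the whole tag

sample = b'<p class="x">sz\xf6veg</p>'
print(buggy_tag.sub(b' ', sample))   # b' >sz\xf6veg >'
print(fixed_tag.sub(b' ', sample))   # b' sz\xf6veg '
```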
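And this is roughly how the new language field is meant to surface to callers; a minimal sketch, with the example output values invented for illustration (this is still WIP, so they may shift before merge):

```python
import chardet

# Detection results now carry the matched language alongside the
# encoding and confidence.
data = 'Művészet és tudomány'.encode('iso-8859-2')
print(chardet.detect(data))
# e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, 'language': 'Hungarian'}
```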
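For the storage change, an illustrative comparison of the two schemes (names and values are examples, not the branch's actual identifiers):

```python
SAMPLE_SIZE = 64

# Old scheme: one giant flat list of likelihoods, indexed arithmetically.
flat_model = [0] * (SAMPLE_SIZE * SAMPLE_SIZE)

def lookup_flat(first_order, second_order):
    return flat_model[first_order * SAMPLE_SIZE + second_order]

# New scheme: dict of dicts keyed by character order. Missing keys mean
# "negative", so only non-negative entries need to be stored at all.
nested_model = {3: {7: 2, 12: 1}}

def lookup_nested(first_order, second_order):
    return nested_model.get(first_order, {}).get(second_order, 0)

assert lookup_nested(3, 7) == 2
assert lookup_nested(5, 9) == 0  # absent entries default to negative
```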
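Finally, a minimal sketch of what one record in the languages metadata module might hold; the real module may use a class and extra fields (e.g. Wikipedia start pages for the training script), so treat the field names here as assumptions:

```python
from collections import namedtuple

# Hypothetical shape of one metadata record used to drive training.
Language = namedtuple(
    'Language', ['name', 'iso_code', 'use_ascii', 'charsets', 'alphabet']
)

HUNGARIAN = Language(
    name='Hungarian',
    iso_code='hu',
    use_ascii=True,
    charsets=['ISO-8859-2', 'WINDOWS-1250'],
    alphabet='aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz',
)
```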

I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts of it out into separate PRs where possible. For example, the change that recapitalizes all the enum attributes (since they're class attributes, and we're not using the Python-3-style enums because of the extra dependency that would require us to add) could certainly be pulled out.

- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data

We should probably allow tweaking these thresholds when training models, as 64
@dan-blanchard (Member, Author):
These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
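For context, a rough sketch of how such thresholds could bucket bigram counts into likelihood categories. Only the cutoff of 3 comes from the notes above; the other cutoffs are illustrative assumptions, and the real training script now derives its sizing from the alphabet length as noted:

```python
# Hedged sketch: map raw bigram occurrence counts onto sequence
# likelihood categories. Only the ">= 3" cutoff is stated in the notes
# above; positive_cutoff and likely_cutoff are illustrative assumptions.

NEGATIVE, UNLIKELY, LIKELY, POSITIVE = 0, 1, 2, 3

def categorize(count, positive_cutoff, likely_cutoff, unlikely_cutoff=3):
    """Bucket a bigram count into a likelihood category."""
    if count >= positive_cutoff:
        return POSITIVE
    if count >= likely_cutoff:
        return LIKELY
    if count >= unlikely_cutoff:
        return UNLIKELY
    return NEGATIVE
```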

@dan-blanchard (Member, Author) commented:

Status update on this PR: I got a bit discouraged by my initial work on this PR not panning out, because it turned out that the retrained models caused nearly all the unit tests to fail (meaning we failed to detect most encodings). Picking this up again about a month ago, I figured out that there were some bugs in the training code that were letting some bad characters into the training data. I retrained the models with that fixed and... all the tests still failed. Obviously, there's something I'm missing here, but I haven't been able to figure it out quite yet.

So, if anyone wants to help with this PR, looking into the test failures and proposing hypotheses for what's wrong with the new models would go a long way. The fact that I only speak English has also hindered my progress here, as it's hard for me to look at a language model for a foreign language and immediately recognize problems. For example, if the English model said that "qm" was a highly likely character bigram, I would know that was wrong, but I don't have that same innate knowledge of phonotactic patterns for other languages.
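One way to make that kind of review easier without speaking the language: dump the bigrams a retrained model rates most likely so a native speaker can eyeball them. A minimal sketch, assuming this branch's dict-of-dicts layout and a hypothetical order_to_char inverse of the model's char-to-order map:

```python
# Minimal sanity-check helper. Assumes `model` is a dict of dicts
# mapping char order -> char order -> likelihood (this branch's layout)
# and that `order_to_char` is a hypothetical inverse of the model's
# char-to-order map; neither name is a stable API.

def top_bigrams(model, order_to_char, min_likelihood=3):
    """Yield the character bigrams rated at or above min_likelihood."""
    for first, seconds in model.items():
        for second, likelihood in seconds.items():
            if likelihood >= min_likelihood:
                yield order_to_char[first] + order_to_char[second]

# A native speaker could then scan the output for implausible pairs
# (the "qm" problem described above):
# for bigram in sorted(top_bigrams(model, order_to_char)):
#     print(bigram)
```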
