[WIP] Retrain SBCS Models and some refactoring #99

Open · wants to merge 15 commits into base: main
51 changes: 40 additions & 11 deletions NOTES.rst
@@ -21,7 +21,27 @@ contains the same SingleByteCharSetProbers.
SingleByteCharSetProber
-----------------------
A CharSetProber that is used for detecting single-byte encodings by using
a "precedence matrix" (i.e., a character bigram model).
a "precedence matrix" (i.e., a character bigram model). The weird thing about
this precedence matrix is that it's not actually based on all sequences of
characters, but rather just the 64 most frequent letters, numbers, and control
characters. (We should probably have control characters not count here like in
https://github.com/BYVoid/uchardet/commit/55b4f23971db61c9ed93be6630c79a50bda9b.)
To look things up in the language model, we actually look up by "frequency order"
(as in CharDistributionAnalysis), so that we can use one language model for multiple encodings.
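
To make that lookup chain concrete, here is a minimal sketch; ``char_to_order``
stands in for the per-encoding byte-to-rank table and ``precedence_matrix`` for
the shared per-language bigram table (both names are hypothetical, not
chardet's actual variables):

.. code-block:: python

    # Hypothetical sketch of the bigram lookup described above.
    SAMPLE_SIZE = 64  # only the most frequent characters are tracked

    def sequence_likelihood(prev_byte, cur_byte, char_to_order, precedence_matrix):
        """Map two consecutive bytes to a likelihood bucket, or None if
        either byte falls outside the tracked sample."""
        prev_order = char_to_order.get(prev_byte)
        cur_order = char_to_order.get(cur_byte)
        if prev_order is None or cur_order is None:
            return None
        if prev_order >= SAMPLE_SIZE or cur_order >= SAMPLE_SIZE:
            return None
        # Both bytes are now language-level frequency ranks, so the same
        # matrix can serve every encoding of this language.
        return precedence_matrix[prev_order][cur_order]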

Furthermore, when calculating the confidence, what we actually count is the
number of sequences we have seen in each likelihood bucket (a rough sketch
follows the list below):

- positive = in the 512 most frequent sequences
- likely = in the 1024 most frequent sequences
- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data
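
A rough sketch of how those per-bucket counts could feed a confidence score
(the weighting below is purely illustrative, not chardet's actual formula):

.. code-block:: python

    def rough_confidence(positive, likely, unlikely, negative):
        # Bucket names mirror the list above; the weights are made up.
        total = positive + likely + unlikely + negative
        if total == 0:
            return 0.0
        # Frequent sequences push confidence up; never-seen ones pull it down.
        score = positive + 0.5 * likely - negative
        return max(0.0, min(1.0, score / total))

    # e.g. mostly-frequent bigrams with a handful of never-seen ones
    print(rough_confidence(positive=400, likely=80, unlikely=30, negative=10))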

We should probably allow tweaking these thresholds when training models, as 64
is completely arbitrary. Also, there's no real reason we're storing things by
"frequency order" here, since we could just store things by Unicode code points.
This is leftover from the original C++ code.

Member Author comment: These notes are slightly out of date, as we now use the
length of the alphabet for the language instead of 64 here.


MBCSGroupProber
---------------
@@ -42,16 +62,17 @@ byte sequences or sequences that only occur for that particular encoding.

CharDistributionAnalysis
------------------------
Used for character unigram distribution encoding detection. Takes a mapping
from characters to a "frequency order" (i.e., what frequency rank that byte has
in the given encoding) and a "typical distribution ratio", which is the number
of occurrences of the 512 most frequently used characters divided by the number
of occurrences of the rest of the characters for a typical document.
The "characters" in this case are 2-byte sequences and they are first converted
to an "order" (name comes from ord() function, I believe). This "order" is used
to index into the frequency order table to determine the frequency rank of that
byte sequence. The reason this extra step is necessary is that the frequency
rank table is language-specific (and not encoding-specific).
Used for 2-byte character unigram distribution encoding detection. Takes a
mapping from characters to a "frequency order" (i.e., what frequency rank that
2-byte sequence has in the given encoding) and a "typical distribution ratio",
which is the number of occurrences of the 512 most frequently used characters
divided by the number of occurrences of the rest of the characters for a typical
document. The "characters" in this case are 2-byte sequences and they are first
converted to an "order" (name comes from ord() function, I believe). This
"order" is used to index into the frequency order table to determine the
frequency rank of that byte sequence. The reason this extra step is necessary
is that the frequency rank table is language-specific (and not
encoding-specific).
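
A small sketch of that ratio, assuming the frequency ranks have already been
looked up through the char-to-order table (the names and the 0.99 cap below
are illustrative assumptions, not the actual implementation):

.. code-block:: python

    FREQUENT_RANK_CUTOFF = 512

    def distribution_confidence(orders, typical_distribution_ratio):
        """`orders` is a list of frequency ranks for the characters seen."""
        freq_chars = sum(1 for order in orders if order < FREQUENT_RANK_CUTOFF)
        other_chars = len(orders) - freq_chars
        if other_chars == 0:
            return 0.99  # saw nothing but very frequent characters
        ratio = freq_chars / (other_chars * typical_distribution_ratio)
        return min(ratio, 0.99)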


What's where
@@ -64,11 +85,19 @@ Bigram files
- ``hebrewprober.py``
- ``jpcntxprober.py``
- ``langbulgarianmodel.py``
- ``langcroatianmodel.py``
- ``langcyrillicmodel.py``
- ``langczechmodel.py``
- ``langgermanmodel.py``
- ``langgreekmodel.py``
- ``langhebrewmodel.py``
- ``langhungarianmodel.py``
- ``langpolishmodel.py``
- ``langromanianmodel.py``
- ``langslovakmodel.py``
- ``langslovenemodel.py``
- ``langthaimodel.py``
- ``langturkishmodel.py``
- ``latin1prober.py``
- ``sbcharsetprober.py``
- ``sbcsgroupprober.py``
59 changes: 45 additions & 14 deletions README.rst
@@ -17,20 +17,51 @@ Chardet: The Universal Character Encoding Detector


Detects
- ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
- EUC-KR, ISO-2022-KR, Johab (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- ISO-8859-5, windows-1251 (Bulgarian)
- ISO-8859-1, windows-1252 (Western European languages)
- ISO-8859-7, windows-1253 (Greek)
- ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)

.. note::
   Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily
   disabled until we can retrain the models.

- ``ASCII``
- ``Big5`` (Traditional Chinese)
- ``CP720`` (Arabic)
- ``CP855``/``IBM855`` (Bulgarian, Macedonian, Russian, Serbian)
- ``CP864`` (Arabic)
- ``CP866``/``IBM866`` (Belarusian, Russian)
- ``CP874`` (Thai)
- ``CP932`` (Japanese)
- ``EUC-JP`` (Japanese)
- ``EUC-KR`` (Korean)
- ``EUC-TW`` (Traditional Chinese)
- ``GB2312`` (Simplified Chinese)
- ``HZ-GB-2312`` (Simplified Chinese)
- ``ISO-2022-CN`` (Traditional and Simplified Chinese)
- ``ISO-2022-JP`` (Japanese)
- ``ISO-2022-KR`` (Korean)
- ``ISO-8859-1`` (Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish)
- ``ISO-8859-2`` (Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene)
- ``ISO-8859-3`` (Esperanto)
- ``ISO-8859-4`` (Estonian, Latvian, Lithuanian)
- ``ISO-8859-5`` (Belarusian, Bulgarian, Macedonian, Russian, Serbian)
- ``ISO-8859-6`` (Arabic)
- ``ISO-8859-7`` (Greek)
- ``ISO-8859-8`` (Visual and Logical Hebrew)
- ``ISO-8859-9`` (Turkish)
- ``ISO-8859-11`` (Thai)
- ``ISO-8859-13`` (Estonian, Latvian, Lithuanian)
- ``ISO-8859-15`` (Danish, Finnish, French, Italian, Portuguese, Spanish)
- ``Johab`` (Korean)
- ``MacCyrillic`` (Belarusian, Macedonian, Russian, Serbian)
- ``SHIFT_JIS`` (Japanese)
- ``TIS-620`` (Thai)
- ``UTF-8``
- ``UTF-16`` (2 variants)
- ``UTF-32`` (4 variants)
- ``Windows-1250`` (Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene)
- ``Windows-1251`` (Belarusian, Bulgarian, Macedonian, Russian, Serbian)
- ``Windows-1252`` (Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish)
- ``Windows-1253`` (Greek)
- ``Windows-1254`` (Turkish)
- ``Windows-1255`` (Visual and Logical Hebrew)
- ``Windows-1256`` (Arabic)
- ``Windows-1257`` (Estonian, Latvian, Lithuanian)
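
Whichever of these encodings the input uses, the public API stays the same; a
short usage example (the reported encoding and confidence depend on the input):

.. code-block:: python

    import chardet

    # Raw bytes in one of the encodings listed above; longer samples give
    # the probers more evidence to work with.
    raw = "Это пример текста на русском языке.".encode("windows-1251")

    result = chardet.detect(raw)
    print(result)
    # A dict with the guessed encoding, a confidence score, and (for the
    # single-byte models) the detected language, e.g.:
    # {'encoding': 'windows-1251', 'confidence': ..., 'language': 'Russian'}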


Requires Python 3.6+.

2 changes: 1 addition & 1 deletion chardet/hebrewprober.py
@@ -150,7 +150,7 @@ class HebrewProber(CharSetProber):
    MIN_MODEL_DISTANCE = 0.01

    VISUAL_HEBREW_NAME = "ISO-8859-8"
    LOGICAL_HEBREW_NAME = "windows-1255"
    LOGICAL_HEBREW_NAME = "WINDOWS-1255"

    def __init__(self):
        super().__init__()