[WIP] Retrain SBCS Models and some refactoring #99

Open · wants to merge 15 commits into base: main
51 changes: 40 additions & 11 deletions NOTES.rst
@@ -21,7 +21,27 @@ contains the same SingleByteCharSetProbers.
SingleByteCharSetProber
-----------------------
A CharSetProber that is used for detecting single-byte encodings by using
a "precedence matrix" (i.e., a character bigram model).
a "precedence matrix" (i.e., a character bigram model). The weird thing about
this precedence matrix is that it's not actually based on all sequences of
characters, but rather just the 64 most frequent letters, numbers, and control
characters. (We should probably have control characters not count here like in
https://github.com/BYVoid/uchardet/commit/55b4f23971db61c9ed93be6630c79a50bda9b.)
To look things up in the language model, we actually look up by "frequency order"
(as in CharDistributionAnalysis), so that we can use one language model for multiple encodings.
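
To make that lookup chain concrete, here is a minimal sketch; ``char_to_order``
stands in for the per-encoding byte-to-rank table and ``precedence_matrix`` for
the shared per-language bigram table (both names are hypothetical, not
chardet's actual variables):

.. code-block:: python

    # Hypothetical sketch of the bigram lookup described above.
    SAMPLE_SIZE = 64  # only the most frequent characters are tracked

    def sequence_likelihood(prev_byte, cur_byte, char_to_order, precedence_matrix):
        """Map two consecutive bytes to a likelihood bucket, or None if
        either byte falls outside the tracked sample."""
        prev_order = char_to_order.get(prev_byte)
        cur_order = char_to_order.get(cur_byte)
        if prev_order is None or cur_order is None:
            return None
        if prev_order >= SAMPLE_SIZE or cur_order >= SAMPLE_SIZE:
            return None
        # Both bytes are now language-level frequency ranks, so the same
        # matrix can serve every encoding of this language.
        return precedence_matrix[prev_order][cur_order]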

Furthermore, when calculating the confidence, what we actually count is the
number of sequences we have seen in each likelihood bucket (a rough sketch
follows the list below):

- positive = in the 512 most frequent sequences
- likely = in the 1024 most frequent sequences
- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data
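
A rough sketch of how those per-bucket counts could feed a confidence score
(the weighting below is purely illustrative, not chardet's actual formula):

.. code-block:: python

    def rough_confidence(positive, likely, unlikely, negative):
        # Bucket names mirror the list above; the weights are made up.
        total = positive + likely + unlikely + negative
        if total == 0:
            return 0.0
        # Frequent sequences push confidence up; never-seen ones pull it down.
        score = positive + 0.5 * likely - negative
        return max(0.0, min(1.0, score / total))

    # e.g. mostly-frequent bigrams with a handful of never-seen ones
    print(rough_confidence(positive=400, likely=80, unlikely=30, negative=10))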

We should probably allow tweaking these thresholds when training models, as 64
is completely arbitrary. Also, there's no real reason we're storing things by
"frequency order" here, since we could just store things by Unicode code points.
This is leftover from the original C++ code.

Member Author comment: These notes are slightly out of date, as we now use the
length of the alphabet for the language instead of 64 here.


MBCSGroupProber
---------------
@@ -42,16 +62,17 @@ byte sequences or sequences that only occur for that particular encoding.

CharDistributionAnalysis
------------------------
Used for character unigram distribution encoding detection. Takes a mapping
from characters to a "frequency order" (i.e., what frequency rank that byte has
in the given encoding) and a "typical distribution ratio", which is the number
of occurrences of the 512 most frequently used characters divided by the number
of occurrences of the rest of the characters for a typical document.
The "characters" in this case are 2-byte sequences and they are first converted
to an "order" (name comes from ord() function, I believe). This "order" is used
to index into the frequency order table to determine the frequency rank of that
byte sequence. The reason this extra step is necessary is that the frequency
rank table is language-specific (and not encoding-specific).
Used for 2-byte character unigram distribution encoding detection. Takes a
mapping from characters to a "frequency order" (i.e., what frequency rank that
2-byte sequence has in the given encoding) and a "typical distribution ratio",
which is the number of occurrences of the 512 most frequently used characters
divided by the number of occurrences of the rest of the characters for a typical
document. The "characters" in this case are 2-byte sequences and they are first
converted to an "order" (name comes from ord() function, I believe). This
"order" is used to index into the frequency order table to determine the
frequency rank of that byte sequence. The reason this extra step is necessary
is that the frequency rank table is language-specific (and not
encoding-specific).
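
A small sketch of that ratio, assuming the frequency ranks have already been
looked up through the char-to-order table (the names and the 0.99 cap below
are illustrative assumptions, not the actual implementation):

.. code-block:: python

    FREQUENT_RANK_CUTOFF = 512

    def distribution_confidence(orders, typical_distribution_ratio):
        """`orders` is a list of frequency ranks for the characters seen."""
        freq_chars = sum(1 for order in orders if order < FREQUENT_RANK_CUTOFF)
        other_chars = len(orders) - freq_chars
        if other_chars == 0:
            return 0.99  # saw nothing but very frequent characters
        ratio = freq_chars / (other_chars * typical_distribution_ratio)
        return min(ratio, 0.99)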


What's where
@@ -64,11 +85,19 @@ Bigram files
- ``hebrewprober.py``
- ``jpcntxprober.py``
- ``langbulgarianmodel.py``
- ``langcroatianmodel.py``
- ``langcyrillicmodel.py``
- ``langczechmodel.py``
- ``langgermanmodel.py``
- ``langgreekmodel.py``
- ``langhebrewmodel.py``
- ``langhungarianmodel.py``
- ``langpolishmodel.py``
- ``langromanianmodel.py``
- ``langslovakmodel.py``
- ``langslovenemodel.py``
- ``langthaimodel.py``
- ``langturkishmodel.py``
- ``latin1prober.py``
- ``sbcharsetprober.py``
- ``sbcsgroupprober.py``
59 changes: 45 additions & 14 deletions README.rst
@@ -17,20 +17,51 @@ Chardet: The Universal Character Encoding Detector


Detects
- ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
- EUC-KR, ISO-2022-KR, Johab (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- ISO-8859-5, windows-1251 (Bulgarian)
- ISO-8859-1, windows-1252 (Western European languages)
- ISO-8859-7, windows-1253 (Greek)
- ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)

.. note::
   Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily
   disabled until we can retrain the models.

- ``ASCII``
- ``Big5`` (Traditional Chinese)
- ``CP720`` (Arabic)
- ``CP855``/``IBM855`` (Bulgarian, Macedonian, Russian, Serbian)
- ``CP864`` (Arabic)
- ``CP866``/``IBM866`` (Belarusian, Russian)
- ``CP874`` (Thai)
- ``CP932`` (Japanese)
- ``EUC-JP`` (Japanese)
- ``EUC-KR`` (Korean)
- ``EUC-TW`` (Traditional Chinese)
- ``GB2312`` (Simplified Chinese)
- ``HZ-GB-2312`` (Simplified Chinese)
- ``ISO-2022-CN`` (Traditional and Simplified Chinese)
- ``ISO-2022-JP`` (Japanese)
- ``ISO-2022-KR`` (Korean)
- ``ISO-8859-1`` (Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish)
- ``ISO-8859-2`` (Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene)
- ``ISO-8859-3`` (Esperanto)
- ``ISO-8859-4`` (Estonian, Latvian, Lithuanian)
- ``ISO-8859-5`` (Belarusian, Bulgarian, Macedonian, Russian, Serbian)
- ``ISO-8859-6`` (Arabic)
- ``ISO-8859-7`` (Greek)
- ``ISO-8859-8`` (Visual and Logical Hebrew)
- ``ISO-8859-9`` (Turkish)
- ``ISO-8859-11`` (Thai)
- ``ISO-8859-13`` (Estonian, Latvian, Lithuanian)
- ``ISO-8859-15`` (Danish, Finnish, French, Italian, Portuguese, Spanish)
- ``Johab`` (Korean)
- ``MacCyrillic`` (Belarusian, Macedonian, Russian, Serbian)
- ``SHIFT_JIS`` (Japanese)
- ``TIS-620`` (Thai)
- ``UTF-8``
- ``UTF-16`` (2 variants)
- ``UTF-32`` (4 variants)
- ``Windows-1250`` (Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene)
- ``Windows-1251`` (Belarusian, Bulgarian, Macedonian, Russian, Serbian)
- ``Windows-1252`` (Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish)
- ``Windows-1253`` (Greek)
- ``Windows-1254`` (Turkish)
- ``Windows-1255`` (Visual and Logical Hebrew)
- ``Windows-1256`` (Arabic)
- ``Windows-1257`` (Estonian, Latvian, Lithuanian)
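
Whichever of these encodings the input uses, the public API stays the same; a
short usage example (the reported encoding and confidence depend on the input):

.. code-block:: python

    import chardet

    # Raw bytes in one of the encodings listed above; longer samples give
    # the probers more evidence to work with.
    raw = "Это пример текста на русском языке.".encode("windows-1251")

    result = chardet.detect(raw)
    print(result)
    # A dict with the guessed encoding, a confidence score, and (for the
    # single-byte models) the detected language, e.g.:
    # {'encoding': 'windows-1251', 'confidence': ..., 'language': 'Russian'}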


Requires Python 3.6+.

2 changes: 1 addition & 1 deletion chardet/hebrewprober.py
@@ -150,7 +150,7 @@ class HebrewProber(CharSetProber):
    MIN_MODEL_DISTANCE = 0.01

    VISUAL_HEBREW_NAME = "ISO-8859-8"
    LOGICAL_HEBREW_NAME = "windows-1255"
    LOGICAL_HEBREW_NAME = "WINDOWS-1255"

    def __init__(self):
        super().__init__()