Release 1.0.0 (#11)
* Adjustment in frequencies.json for Chinese

Remove Latin-based characters from it

* Added the possibility to list encoding aliases for a match

Encodings are known by many names; this can help when searching for IBM855 while it is listed as CP855.
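A minimal sketch of the underlying idea, built only on the standard library's alias table (the helper name and example are illustrative, not the library's API):

```python
from encodings.aliases import aliases  # maps alias -> canonical codec name


def list_encoding_aliases(name):
    """Return every other name Python knows for the given codec."""
    name = name.lower().replace('-', '_')
    found = set()
    for alias, canonical in aliases.items():
        if name in (alias, canonical):
            found.update({alias, canonical})
    found.discard(name)
    return sorted(found)


print(list_encoding_aliases('cp855'))  # e.g. ['855', 'csibm855', 'ibm855']
```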

* Added submatches to a match

List of submatches that produce the EXACT same output as a match
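A rough standalone sketch of how such duplicates can be detected by fingerprinting each candidate's decoded output (as the normalizer.py diff below does with SHA-256); the function name and sample bytes here are mine:

```python
from hashlib import sha256


def group_identical_decodings(raw, candidate_encodings):
    """Group candidate encodings whose decoded output is byte-for-byte identical.

    Every encoding past the first in a group is what this commit calls a 'submatch'.
    """
    groups = {}
    for enc in candidate_encodings:
        try:
            decoded = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        fingerprint = sha256(decoded.encode('utf-8')).hexdigest()
        groups.setdefault(fingerprint, []).append(enc)
    return list(groups.values())


print(group_identical_decodings('héllo'.encode('utf_8'), ['utf_8', 'latin_1', 'cp1252']))
# [['utf_8'], ['latin_1', 'cp1252']]  -- latin_1 and cp1252 yield the same (wrong) text
```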

* Changes in docs

+ Commented out unused code.

* Added the giveup_threshold param to the ProbeChaos doc

* Doc improvement in unicode.py

* Add static method list_by_range in unicode.py

Sorts letters into a dict keyed by Unicode range
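A rough illustration of the idea (the range table and signature below are illustrative, not copied from unicode.py):

```python
# Tiny excerpt of a Unicode block table; the real one covers every range.
UNICODE_RANGES = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0370, 0x03FF, 'Greek and Coptic'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x4E00, 0x9FFF, 'CJK Unified Ideographs'),
]


def list_by_range(letters):
    """Bucket letters into a dict keyed by the Unicode range they belong to."""
    by_range = {}
    for letter in letters:
        code_point = ord(letter)
        for start, end, name in UNICODE_RANGES:
            if start <= code_point <= end:
                by_range.setdefault(name, []).append(letter)
                break
    return by_range


print(list_by_range('abcабв的'))
# {'Basic Latin': ['a', 'b', 'c'], 'Cyrillic': ['а', 'б', 'в'], 'CJK Unified Ideographs': ['的']}
```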

* ProbeCoherence reliability improved 

Can now probe & sort by alphabet used or unicode range.

* Added coherence_non_latin method in NormalizerMatch

Verifies whether a non-Latin-based language was validated by probe coherence

* CLI is now more verbose

* More tests, yay !

* bump 1.0.0

* README update
Ousret committed Sep 17, 2019
1 parent 232a574 commit d3996ce
Showing 12 changed files with 312 additions and 74 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -18,13 +18,15 @@
</a>
</p>

> Library that helps you read text from an unknown charset encoding.<br /> Project motivated by `chardet`, I'm trying to resolve the issue by taking another approach.
> Library that helps you read text from an unknown charset encoding.<br /> Project motivated by `chardet`,
> I'm trying to resolve the issue by taking another approach.
> All IANA character set names for which the Python core library provides codecs are supported.
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | <br> | ✅ <br>⚡ |
| `Universal**` ||||
| `Reliable` **without** distinguishable standards ||||
| `Reliable` **with** distinguishable standards ||||
@@ -91,6 +93,8 @@ except IOError as e:
from charset_normalizer import detect
```

The above code will behave the same as **chardet**.
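For illustration, a minimal call along those lines (a sketch; the exact values in the returned dict depend on the input):

```python
from charset_normalizer import detect

payload = 'Всеки човек има право на образование.'.encode('cp1251')

result = detect(payload)
print(result)  # chardet-style dict with 'encoding', 'language' and 'confidence' keys
```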

See wiki for advanced usages. *Todo, not yet available.*

## 😇 Why
@@ -119,9 +123,16 @@ In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
improve or rewrite it.

*Coherence :* For each language on Earth (as best we can), we have computed ranked letter-appearance frequencies. So I thought that
this intel would be worth something here, so I use those records against decoded text to check if I can detect intelligent design.
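A toy version of that frequency check, assuming a small hand-made reference ranking (the ranking and helper below are illustrative only, not the actual ProbeCoherence code):

```python
from collections import Counter

# Hypothetical reference: most frequent letters of English, best first.
ENGLISH_RANK = ['e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'u']


def coherence_ratio(decoded_text, reference_rank=ENGLISH_RANK, top_n=6):
    """Share of the text's most common letters that also appear in the reference ranking."""
    letters = [c.lower() for c in decoded_text if c.isalpha()]
    observed = [letter for letter, _ in Counter(letters).most_common(top_n)]
    if not observed:
        return 0.0
    return sum(1 for letter in observed if letter in reference_rank) / len(observed)


print(coherence_ratio('The quick brown fox jumps over the lazy dog'))  # high -> looks like English
print(coherence_ratio('ÃƒÂ©ÃƒÂ¨ÃƒÂ¤ ÃƒÂ©ÃƒÂ¨ÃƒÂ¤'))                    # low -> mojibake from a wrong decode
```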


## ⚡ Known limitations

- Not intended to work on text that is not in a (human) spoken language, e.g. encrypted text.
- When an encoding is declared in headers (XML, HTML, HTTP, etc.), trust it first; a sketch follows this list.
- Language detection is unreliable when the text contains more than one language sharing identical letters.
- Not well tested with tiny content.
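A small sketch of that second point using only the standard library (the helper name is hypothetical):

```python
from email.message import Message


def declared_charset(content_type_header):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg['Content-Type'] = content_type_header
    return msg.get_content_charset()  # None when no charset is declared


print(declared_charset('text/html; charset=ISO-8859-1'))  # 'iso-8859-1' -> trust this first
print(declared_charset('text/html'))                      # None -> fall back to detection
```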

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
26 changes: 4 additions & 22 deletions charset_normalizer/assets/frequencies.json
@@ -310,30 +310,19 @@
"Chinese": [
"\u7684",
"\u5e74",
"a",
"e",
"\u4e00",
"\u5728",
"\u662f",
"\u4e2d",
"i",
"o",
"r",
"n",
"t",
"\u4eba",
"s",
"\u5927",
"\u6709",
"l",
"\u70ba",
"\u548c",
"\u4ee5",
"c",
"\u65e5",
"\u4e86",
"\u6708",
"m"
"\u6708"
],
"Catalan": [
"e",
@@ -920,8 +909,7 @@
"\u05e6",
"\u05df",
"\u05d6",
"\u05da",
"e"
"\u05da"
],
"Bulgarian": [
"\u0430",
@@ -1312,7 +1300,6 @@
"\u0446",
"\u0436",
"\u0444",
"a",
"\u045a"
],
"Serbocroatian": [
@@ -1367,9 +1354,7 @@
"\u0b8e",
"\u0b89",
"\u0b92",
"\u0bb8",
"a",
"e"
"\u0bb8"
],
"Classical Chinese": [
"\u4e4b",
@@ -1386,7 +1371,6 @@
"\u4e8c",
"\u5341",
"\u65bc",
"a",
"\u66f0",
"\u4e09",
"\u4e0d",
@@ -1395,8 +1379,6 @@
"\u5b50",
"\u4e2d",
"\u4e94",
"o",
"\u56db",
"r"
"\u56db"
]
}
18 changes: 15 additions & 3 deletions charset_normalizer/cli/normalizer.py
@@ -42,6 +42,11 @@ def query_yes_no(question, default="yes"):


def cli_detect(argv=None):
"""
CLI assistant using ARGV and ArgumentParser
:param argv:
:return: 0 if everything is fine, anything else means trouble
"""
parser = argparse.ArgumentParser(
description="The Real First Universal Charset Detector. "
"Discover originating encoding used on text file. "
@@ -87,7 +92,7 @@ def cli_detect(argv=None):
)

if len(matches) == 0:
print('Unable to identify originating encoding for "{}".'.format(my_file.name), file=sys.stderr)
print('Unable to identify originating encoding for "{}". {}'.format(my_file.name, 'Maybe try increasing maximum amount of chaos.' if args.threshold < 1. else ''), file=sys.stderr)
if my_file.closed is False:
my_file.close()
continue
@@ -125,8 +130,14 @@ def cli_detect(argv=None):
print(x_)

if args.verbose is True:
print('"{}" could be also originating from {}.'.format(my_file.name, ','.join(r_.could_be_from_charset)))
print('"{}" could be also be written in {}.'.format(my_file.name, ' or '.join(p_.languages)))
if len(r_.could_be_from_charset) > 1:
print('"{}" could be also originating from {}.'.format(my_file.name, ','.join(r_.could_be_from_charset)))
if len(p_.could_be_from_charset) > 1:
print('"{}" produce the EXACT same output with those encoding : {}.'.format(my_file.name, ' OR '.join(p_.could_be_from_charset)))
if len(p_.languages) > 1:
print('"{}" could be also be written in {}.'.format(my_file.name, ' or '.join(p_.languages)))
if p_.byte_order_mark is True:
print('"{}" has a signature or byte order mark (BOM) in it.'.format(my_file.name))

if args.normalize is True:

@@ -154,6 +165,7 @@ def cli_detect(argv=None):
fp.write(
str(p_)
)
print('"{}" has been successfully written to disk.'.format('.'.join(o_)))
except IOError as e:
print(str(e), file=sys.stderr)
if my_file.closed is False:
10 changes: 10 additions & 0 deletions charset_normalizer/constant.py
@@ -569,6 +569,7 @@
"Variation Selectors Supplement"
]

# List of keywords that indicate a secondary Unicode range
UNICODE_SECONDARY_RANGE_KEYWORD = [
'Supplement',
'Extended',
@@ -587,6 +588,7 @@
'Tags'
]

# For each eligible encoding, the BOM/SIG byte sequence (or a list of them)
BYTE_ORDER_MARK = {
'utf_8': BOM_UTF8,
'utf_7': [
@@ -603,6 +605,14 @@
'utf_16_le': BOM_UTF16_LE
}

COHERENCE_ALPHABET_COVERED_IF = 0.8
COHERENCE_PICKING_LETTER_MIN_APPEARANCE = 0.003
COHERENCE_MIN_LETTER_NEEDED = 10
COHERENCE_MAXIMUM_UNAVAILABLE_LETTER = 0.4
COHERENCE_MAXIMUM_NOT_RESPECTED_RANK = 0.5
COHERENCE_ACCEPTED_MARGIN_LETTER_RANK = 3

# Map each Unicode range name to its range, built from UNICODE_RANGES_NAMES and UNICODE_RANGES
UNICODE_RANGES_ZIP = dict(
zip(
UNICODE_RANGES_NAMES,
103 changes: 85 additions & 18 deletions charset_normalizer/normalizer.py
@@ -13,15 +13,19 @@
from charset_normalizer.probe_coherence import ProbeCoherence, HashableCounter


from hashlib import sha256


class CharsetNormalizerMatch:

RE_NOT_PRINTABLE_LETTER = re.compile(r'[0-9\W\n\r\t]+')

def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_bom=False):
def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_bom=False, submatch=None):
"""
:param bytes b_content: Raw binary content
:param str guessed_source_encoding: Guessed source encoding accessible by Python
:param float chaos_ratio: Coefficient of previously detected mess in decoded content
:param list[CharsetNormalizerMatch] submatch: list of submatches that produce the EXACT same output as this one
"""

self._raw = b_content
@@ -36,10 +40,27 @@ def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_

self.ranges = ranges

self._submatch = submatch or list() # type: list[CharsetNormalizerMatch]

@cached_property
def w_counter(self):
"""
By 'word' we mean the output of the split() method *with no args*.
:return: For each 'word' in the string, its occurrence count as provided by collections.Counter
:rtype: collections.Counter
"""
return collections.Counter(self._string_printable_only.split())

@property
def submatch(self):
"""
Return the list of submatches that produce the EXACT same output as this one.
This returns a list of CharsetNormalizerMatch, NOT a CharsetNormalizerMatches.
:return: list of submatches
:rtype: list[CharsetNormalizerMatch]
"""
return self._submatch

@cached_property
def alphabets(self):
"""
@@ -56,14 +77,14 @@ def could_be_from_charset(self):
:return: list of encoding
:rtype: list[str]
"""
return [self.encoding]
return [self.encoding] + [el.encoding for el in self._submatch]

def __eq__(self, other):
"""
:param CharsetNormalizerMatch other:
:return:
"""
return self.chaos == other.chaos and len(self.raw) == len(other.raw) and self.encoding == other.encoding
return self.fingerprint == other.fingerprint and self.encoding == other.encoding

@cached_property
def coherence(self):
@@ -76,6 +97,10 @@ def coherence(self):
"""
return ProbeCoherence(self.char_counter).ratio

@cached_property
def coherence_non_latin(self):
"""
Verify whether a non-Latin-based language was validated by probe coherence.
:return: True if a non-Latin-based language appears covered
"""
return ProbeCoherence(self.char_counter).non_latin_covered_any

@cached_property
def languages(self):
"""
@@ -115,9 +140,10 @@ def chaos(self):
def chaos_secondary_pass(self):
"""
Check once again chaos in decoded text, except this time, with full content.
:return:
:return: Same as the chaos property, except it covers the full content
:rtype: float
"""
return ProbeChaos(str(self))
return ProbeChaos(str(self)).ratio

@property
def encoding(self):
@@ -127,11 +153,26 @@ def encoding(self):
"""
return self._encoding

@property
def encoding_aliases(self):
"""
Encodings are known by many names; using this can help when searching for IBM855 while it is listed as CP855.
:return: List of encoding aliases
:rtype: list[str]
"""
also_known_as = list()
for u, p in aliases.items():
if self.encoding == u:
also_known_as.append(p)
elif self.encoding == p:
also_known_as.append(u)
return also_known_as

@property
def bom(self):
"""
Precise if file has a valid bom associated with discovered encoding
:return: True if a byte order mark was discovered
Indicate whether the file has a valid BOM or SIG associated with the discovered encoding
:return: True if a byte order mark or SIG was discovered
:rtype: bool
"""
return self._bom
@@ -147,6 +188,11 @@ def byte_order_mark(self):

@property
def raw(self):
"""
Get untouched bytes content
:return: Original bytes sequence
:rtype: bytes
"""
return self._raw

def first(self):
@@ -168,6 +214,14 @@ def best(self):
def __str__(self):
return self._string

@cached_property
def fingerprint(self):
"""
Generate the SHA-256 checksum of this match's re-encoded (UTF-8) output
:return: hex digest string
"""
return sha256(self.output()).hexdigest()

def output(self, encoding='utf-8'):
"""
:param encoding:
@@ -302,8 +356,8 @@ def from_bytes(sequences, steps=10, chunk_size=512, threshold=0.20):

chaos_means = statistics.mean(ratios)
chaos_median = statistics.median(ratios)
chaos_min = min(ratios)
chaos_max = max(ratios)
# chaos_min = min(ratios)
# chaos_max = max(ratios)

if (len(r_) >= 4 and nb_gave_up > len(r_) / 4) or chaos_median > threshold:
# print(p, 'is too much chaos for decoded input !')
Expand All @@ -319,17 +373,30 @@ def from_bytes(sequences, steps=10, chunk_size=512, threshold=0.20):

# print(p, 'U RANGES', encountered_unicode_range_occurrences)

matches.append(
CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)
cnm = CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)

# print(p, nb_gave_up, chaos_means, chaos_median, chaos_min, chaos_max, matches[-1].coherence, matches[-1].language)
fingerprint_tests = [el.fingerprint == cnm.fingerprint for el in matches]

if any(fingerprint_tests) is True:
matches[fingerprint_tests.index(True)].submatch.append(cnm)
else:
matches.append(
CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)
)

# print(p, nb_gave_up, chaos_means, chaos_median, chaos_min, chaos_max, matches[-1].coherence, matches[-1].languages,)

if (p == 'ascii' and chaos_median == 0.) or bom_available is True:
return CharsetNormalizerMatches([matches[-1]])
