Release 1.0.0 (#11)
* Adjustment in frequencies.json for Chinese

Remove Latin-based characters from it

* Added the possibility to list encoding aliases for a match

Encodings are known by many names; this can help when searching for IBM855 while it is listed as CP855.
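A minimal sketch of the underlying idea, built only on the standard library's alias table (the helper name and example are illustrative, not the library's API):

```python
from encodings.aliases import aliases  # maps alias -> canonical codec name


def list_encoding_aliases(name):
    """Return every other name Python knows for the given codec."""
    name = name.lower().replace('-', '_')
    found = set()
    for alias, canonical in aliases.items():
        if name in (alias, canonical):
            found.update({alias, canonical})
    found.discard(name)
    return sorted(found)


print(list_encoding_aliases('cp855'))  # e.g. ['855', 'csibm855', 'ibm855']
```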

* Added submatches to a match

List of submatches that produce the EXACT same output as a match
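A rough standalone sketch of how such duplicates can be detected by fingerprinting each candidate's decoded output (as the normalizer.py diff below does with SHA-256); the function name and sample bytes here are mine:

```python
from hashlib import sha256


def group_identical_decodings(raw, candidate_encodings):
    """Group candidate encodings whose decoded output is byte-for-byte identical.

    Every encoding past the first in a group is what this commit calls a 'submatch'.
    """
    groups = {}
    for enc in candidate_encodings:
        try:
            decoded = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        fingerprint = sha256(decoded.encode('utf-8')).hexdigest()
        groups.setdefault(fingerprint, []).append(enc)
    return list(groups.values())


print(group_identical_decodings('héllo'.encode('utf_8'), ['utf_8', 'latin_1', 'cp1252']))
# [['utf_8'], ['latin_1', 'cp1252']]  -- latin_1 and cp1252 yield the same (wrong) text
```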

* Changes in docs

+ Commented out unused code.

* Added the giveup_threshold param to the ProbeChaos doc

* Doc improvement in unicode.py

* Add static method list_by_range in unicode.py

Sorts letters into a dict keyed by Unicode range
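A rough illustration of the idea (the range table and signature below are illustrative, not copied from unicode.py):

```python
# Tiny excerpt of a Unicode block table; the real one covers every range.
UNICODE_RANGES = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0370, 0x03FF, 'Greek and Coptic'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x4E00, 0x9FFF, 'CJK Unified Ideographs'),
]


def list_by_range(letters):
    """Bucket letters into a dict keyed by the Unicode range they belong to."""
    by_range = {}
    for letter in letters:
        code_point = ord(letter)
        for start, end, name in UNICODE_RANGES:
            if start <= code_point <= end:
                by_range.setdefault(name, []).append(letter)
                break
    return by_range


print(list_by_range('abcабв的'))
# {'Basic Latin': ['a', 'b', 'c'], 'Cyrillic': ['а', 'б', 'в'], 'CJK Unified Ideographs': ['的']}
```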

* ProbeCoherence reliability improved 

Can now probe & sort by alphabet used or unicode range.

* Added coherence_non_latin method in NormalizerMatch

Verifies whether a non-Latin-based language was validated by probe coherence

* CLI is now more verbose

* More tests, yay !

* bump 1.0.0

* README update
Ousret committed Sep 17, 2019
1 parent 232a574 commit d3996ce
Showing 12 changed files with 312 additions and 74 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -18,13 +18,15 @@
</a>
</p>

> Library that helps you read text from an unknown charset encoding.<br /> Project motivated by `chardet`, I'm trying to resolve the issue by taking another approach.
> Library that helps you read text from an unknown charset encoding.<br /> Project motivated by `chardet`,
> I'm trying to resolve the issue by taking another approach.
> All IANA character set names for which the Python core library provides codecs are supported.
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | <br> | ✅ <br>⚡ |
| `Universal**` ||||
| `Reliable` **without** distinguishable standards ||||
| `Reliable` **with** distinguishable standards ||||
@@ -91,6 +93,8 @@ except IOError as e:
from charset_normalizer import detect
```

The above code will behave the same as **chardet**.
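For illustration, a minimal call along those lines (a sketch; the exact values in the returned dict depend on the input):

```python
from charset_normalizer import detect

payload = 'Всеки човек има право на образование.'.encode('cp1251')

result = detect(payload)
print(result)  # chardet-style dict with 'encoding', 'language' and 'confidence' keys
```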

See wiki for advanced usages. *Todo, not yet available.*

## 😇 Why
@@ -119,9 +123,16 @@ In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
improve or rewrite it.

*Coherence :* For each language on Earth (as best we can), we have computed ranked letter-appearance frequencies. So I thought that
this intel would be worth something here, so I use those records against decoded text to check if I can detect intelligent design.
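A toy version of that frequency check, assuming a small hand-made reference ranking (the ranking and helper below are illustrative only, not the actual ProbeCoherence code):

```python
from collections import Counter

# Hypothetical reference: most frequent letters of English, best first.
ENGLISH_RANK = ['e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'u']


def coherence_ratio(decoded_text, reference_rank=ENGLISH_RANK, top_n=6):
    """Share of the text's most common letters that also appear in the reference ranking."""
    letters = [c.lower() for c in decoded_text if c.isalpha()]
    observed = [letter for letter, _ in Counter(letters).most_common(top_n)]
    if not observed:
        return 0.0
    return sum(1 for letter in observed if letter in reference_rank) / len(observed)


print(coherence_ratio('The quick brown fox jumps over the lazy dog'))  # high -> looks like English
print(coherence_ratio('ÃƒÂ©ÃƒÂ¨ÃƒÂ¤ ÃƒÂ©ÃƒÂ¨ÃƒÂ¤'))                    # low -> mojibake from a wrong decode
```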


## ⚡ Known limitations

- Not intended to work on text that is not in a (human) spoken language, e.g. encrypted text.
- When an encoding is declared in headers (XML, HTML, HTTP, etc.), trust it first; a sketch follows this list.
- Language detection is unreliable when the text contains more than one language sharing identical letters.
- Not well tested with tiny content.
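A small sketch of that second point using only the standard library (the helper name is hypothetical):

```python
from email.message import Message


def declared_charset(content_type_header):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg['Content-Type'] = content_type_header
    return msg.get_content_charset()  # None when no charset is declared


print(declared_charset('text/html; charset=ISO-8859-1'))  # 'iso-8859-1' -> trust this first
print(declared_charset('text/html'))                      # None -> fall back to detection
```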

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
26 changes: 4 additions & 22 deletions charset_normalizer/assets/frequencies.json
@@ -310,30 +310,19 @@
"Chinese": [
"\u7684",
"\u5e74",
"a",
"e",
"\u4e00",
"\u5728",
"\u662f",
"\u4e2d",
"i",
"o",
"r",
"n",
"t",
"\u4eba",
"s",
"\u5927",
"\u6709",
"l",
"\u70ba",
"\u548c",
"\u4ee5",
"c",
"\u65e5",
"\u4e86",
"\u6708",
"m"
"\u6708"
],
"Catalan": [
"e",
@@ -920,8 +909,7 @@
"\u05e6",
"\u05df",
"\u05d6",
"\u05da",
"e"
"\u05da"
],
"Bulgarian": [
"\u0430",
@@ -1312,7 +1300,6 @@
"\u0446",
"\u0436",
"\u0444",
"a",
"\u045a"
],
"Serbocroatian": [
@@ -1367,9 +1354,7 @@
"\u0b8e",
"\u0b89",
"\u0b92",
"\u0bb8",
"a",
"e"
"\u0bb8"
],
"Classical Chinese": [
"\u4e4b",
@@ -1386,7 +1371,6 @@
"\u4e8c",
"\u5341",
"\u65bc",
"a",
"\u66f0",
"\u4e09",
"\u4e0d",
@@ -1395,8 +1379,6 @@
"\u5b50",
"\u4e2d",
"\u4e94",
"o",
"\u56db",
"r"
"\u56db"
]
}
18 changes: 15 additions & 3 deletions charset_normalizer/cli/normalizer.py
@@ -42,6 +42,11 @@ def query_yes_no(question, default="yes"):


def cli_detect(argv=None):
"""
CLI assistant using ARGV and ArgumentParser
:param argv:
:return: 0 if everything is fine, anything else means trouble
"""
parser = argparse.ArgumentParser(
description="The Real First Universal Charset Detector. "
"Discover originating encoding used on text file. "
@@ -87,7 +92,7 @@ def cli_detect(argv=None):
)

if len(matches) == 0:
print('Unable to identify originating encoding for "{}".'.format(my_file.name), file=sys.stderr)
print('Unable to identify originating encoding for "{}". {}'.format(my_file.name, 'Maybe try increasing maximum amount of chaos.' if args.threshold < 1. else ''), file=sys.stderr)
if my_file.closed is False:
my_file.close()
continue
@@ -125,8 +130,14 @@ def cli_detect(argv=None):
print(x_)

if args.verbose is True:
print('"{}" could be also originating from {}.'.format(my_file.name, ','.join(r_.could_be_from_charset)))
print('"{}" could be also be written in {}.'.format(my_file.name, ' or '.join(p_.languages)))
if len(r_.could_be_from_charset) > 1:
print('"{}" could be also originating from {}.'.format(my_file.name, ','.join(r_.could_be_from_charset)))
if len(p_.could_be_from_charset) > 1:
print('"{}" produce the EXACT same output with those encoding : {}.'.format(my_file.name, ' OR '.join(p_.could_be_from_charset)))
if len(p_.languages) > 1:
print('"{}" could be also be written in {}.'.format(my_file.name, ' or '.join(p_.languages)))
if p_.byte_order_mark is True:
print('"{}" has a signature or byte order mark (BOM) in it.'.format(my_file.name))

if args.normalize is True:

@@ -154,6 +165,7 @@ def cli_detect(argv=None):
fp.write(
str(p_)
)
print('"{}" has been successfully written to disk.'.format('.'.join(o_)))
except IOError as e:
print(str(e), file=sys.stderr)
if my_file.closed is False:
10 changes: 10 additions & 0 deletions charset_normalizer/constant.py
@@ -569,6 +569,7 @@
"Variation Selectors Supplement"
]

# List of keywords that indicate a secondary Unicode range
UNICODE_SECONDARY_RANGE_KEYWORD = [
'Supplement',
'Extended',
@@ -587,6 +588,7 @@
'Tags'
]

# For each eligible encoding, the BOM/SIG byte sequence (or a list of them)
BYTE_ORDER_MARK = {
'utf_8': BOM_UTF8,
'utf_7': [
@@ -603,6 +605,14 @@
'utf_16_le': BOM_UTF16_LE
}

COHERENCE_ALPHABET_COVERED_IF = 0.8
COHERENCE_PICKING_LETTER_MIN_APPEARANCE = 0.003
COHERENCE_MIN_LETTER_NEEDED = 10
COHERENCE_MAXIMUM_UNAVAILABLE_LETTER = 0.4
COHERENCE_MAXIMUM_NOT_RESPECTED_RANK = 0.5
COHERENCE_ACCEPTED_MARGIN_LETTER_RANK = 3

# Map each Unicode range name to its range, built from UNICODE_RANGES_NAMES and UNICODE_RANGES
UNICODE_RANGES_ZIP = dict(
zip(
UNICODE_RANGES_NAMES,
103 changes: 85 additions & 18 deletions charset_normalizer/normalizer.py
@@ -13,15 +13,19 @@
from charset_normalizer.probe_coherence import ProbeCoherence, HashableCounter


from hashlib import sha256


class CharsetNormalizerMatch:

RE_NOT_PRINTABLE_LETTER = re.compile(r'[0-9\W\n\r\t]+')

def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_bom=False):
def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_bom=False, submatch=None):
"""
:param bytes b_content: Raw binary content
:param str guessed_source_encoding: Guessed source encoding accessible by Python
:param float chaos_ratio: Coefficient of previously detected mess in decoded content
:param list[CharsetNormalizerMatch] submatch: list of submatches that produce the EXACT same output as this one
"""

self._raw = b_content
@@ -36,10 +40,27 @@ def __init__(self, b_content, guessed_source_encoding, chaos_ratio, ranges, has_

self.ranges = ranges

self._submatch = submatch or list() # type: list[CharsetNormalizerMatch]

@cached_property
def w_counter(self):
"""
By 'word' we mean the output of the split() method *with no args*.
:return: For each 'word' in the string, its occurrence count as provided by collections.Counter
:rtype: collections.Counter
"""
return collections.Counter(self._string_printable_only.split())

@property
def submatch(self):
"""
Return the list of submatches that produce the EXACT same output as this one.
This returns a list of CharsetNormalizerMatch, NOT a CharsetNormalizerMatches.
:return: list of submatches
:rtype: list[CharsetNormalizerMatch]
"""
return self._submatch

@cached_property
def alphabets(self):
"""
@@ -56,14 +77,14 @@ def could_be_from_charset(self):
:return: list of encoding
:rtype: list[str]
"""
return [self.encoding]
return [self.encoding] + [el.encoding for el in self._submatch]

def __eq__(self, other):
"""
:param CharsetNormalizerMatch other:
:return:
"""
return self.chaos == other.chaos and len(self.raw) == len(other.raw) and self.encoding == other.encoding
return self.fingerprint == other.fingerprint and self.encoding == other.encoding

@cached_property
def coherence(self):
@@ -76,6 +97,10 @@ def coherence(self):
"""
return ProbeCoherence(self.char_counter).ratio

@cached_property
def coherence_non_latin(self):
"""
Verify whether a non-Latin-based language was validated by probe coherence.
:return: True if a non-Latin-based language appears covered
"""
return ProbeCoherence(self.char_counter).non_latin_covered_any

@cached_property
def languages(self):
"""
@@ -115,9 +140,10 @@ def chaos(self):
def chaos_secondary_pass(self):
"""
Check once again chaos in decoded text, except this time, with full content.
:return:
:return: Same as the chaos property, except it covers the full content
:rtype: float
"""
return ProbeChaos(str(self))
return ProbeChaos(str(self)).ratio

@property
def encoding(self):
@@ -127,11 +153,26 @@ def encoding(self):
"""
return self._encoding

@property
def encoding_aliases(self):
"""
Encodings are known by many names; using this can help when searching for IBM855 while it is listed as CP855.
:return: List of encoding aliases
:rtype: list[str]
"""
also_known_as = list()
for u, p in aliases.items():
if self.encoding == u:
also_known_as.append(p)
elif self.encoding == p:
also_known_as.append(u)
return also_known_as

@property
def bom(self):
"""
Precise if file has a valid bom associated with discovered encoding
:return: True if a byte order mark was discovered
Indicate whether the file has a valid BOM or SIG associated with the discovered encoding
:return: True if a byte order mark or SIG was discovered
:rtype: bool
"""
return self._bom
@@ -147,6 +188,11 @@ def byte_order_mark(self):

@property
def raw(self):
"""
Get untouched bytes content
:return: Original bytes sequence
:rtype: bytes
"""
return self._raw

def first(self):
@@ -168,6 +214,14 @@ def best(self):
def __str__(self):
return self._string

@cached_property
def fingerprint(self):
"""
Generate the SHA-256 checksum of this match's re-encoded (UTF-8) output
:return: hex digest string
"""
return sha256(self.output()).hexdigest()

def output(self, encoding='utf-8'):
"""
:param encoding:
@@ -302,8 +356,8 @@ def from_bytes(sequences, steps=10, chunk_size=512, threshold=0.20):

chaos_means = statistics.mean(ratios)
chaos_median = statistics.median(ratios)
chaos_min = min(ratios)
chaos_max = max(ratios)
# chaos_min = min(ratios)
# chaos_max = max(ratios)

if (len(r_) >= 4 and nb_gave_up > len(r_) / 4) or chaos_median > threshold:
# print(p, 'is too much chaos for decoded input !')
Expand All @@ -319,17 +373,30 @@ def from_bytes(sequences, steps=10, chunk_size=512, threshold=0.20):

# print(p, 'U RANGES', encountered_unicode_range_occurrences)

matches.append(
CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)
cnm = CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)

# print(p, nb_gave_up, chaos_means, chaos_median, chaos_min, chaos_max, matches[-1].coherence, matches[-1].language)
fingerprint_tests = [el.fingerprint == cnm.fingerprint for el in matches]

if any(fingerprint_tests) is True:
matches[fingerprint_tests.index(True)].submatch.append(cnm)
else:
matches.append(
CharsetNormalizerMatch(
sequences if not bom_available else sequences[bom_len:],
p,
chaos_means,
encountered_unicode_range_occurrences,
bom_available
)
)

# print(p, nb_gave_up, chaos_means, chaos_median, chaos_min, chaos_max, matches[-1].coherence, matches[-1].languages,)

if (p == 'ascii' and chaos_median == 0.) or bom_available is True:
return CharsetNormalizerMatches([matches[-1]])
