New language models added; old inaccurate models were rebuilt. Hungarian test files changed. Script for building language models added #52

Closed
wants to merge 65 commits
Changes from 61 commits
Commits (65)
a50b784
Fix Latin5TurkishModel key names.
dan-blanchard Jan 11, 2015
1db438b
Badges!
dan-blanchard Jan 27, 2015
184761d
HTTPS shields badges
dan-blanchard Jan 27, 2015
e1f1e53
First commit
helour Feb 15, 2015
cff4b4d
First commit
helour Feb 15, 2015
da0e3d5
Update saraspatak.hu.txt
helour Feb 15, 2015
c1840be
Delete my_test.py
helour Feb 15, 2015
d1d9258
Removed print
helour Feb 16, 2015
adfd0b8
Update create_language_model.py
helour Feb 16, 2015
56d8efa
Update test.py
helour Feb 16, 2015
6f9d07f
Delete saraspatak.hu.txt
helour Feb 16, 2015
28e68e9
Changed value from 254 to 253: SYMBOL_CAT_ORDER = 253
helour Feb 17, 2015
80428f0
Delete shamalt.uw.hu.txt
helour Feb 18, 2015
46841a1
Delete auto-apro.hu.txt
helour Feb 18, 2015
c7e69fa
Delete cigartower.hu.txt
helour Feb 18, 2015
4ca74f9
Delete hirtv.hu.txt
helour Feb 18, 2015
b85d055
Delete honositomuhely.hu.txt
helour Feb 18, 2015
83903ee
Delete shamalt.uw.hu.mr.txt
helour Feb 18, 2015
5e79c30
Delete bbc.co.uk.hu.forum.txt
helour Feb 18, 2015
360c82c
Delete bbc.co.uk.hu.learningenglish.txt
helour Feb 18, 2015
ae3cc6a
Delete bbc.co.uk.hu.pressreview.txt
helour Feb 18, 2015
b8b9917
Delete bbc.co.uk.hu.txt
helour Feb 18, 2015
b19c7c0
Delete objektivhir.hu.txt
helour Feb 18, 2015
ff38fc0
chmod -x
helour Feb 18, 2015
8adf4fb
Update
helour Mar 1, 2015
d96229a
Changed charmap list name from C to python style.
helour Mar 4, 2015
2249a0b
Added ISO-8859-1 German language model.
helour Mar 4, 2015
19f7e7e
Renamed ISO-8859-7 Greek language model.
helour Mar 4, 2015
f511d49
Added ISO-8859-2 Hungarian language model.
helour Mar 4, 2015
1c8a86e
Changed charmap list name from C to python style.
helour Mar 4, 2015
b6d48a7
Added ISO-8859-2 Romanian language model.
helour Mar 4, 2015
874711d
Changed charmap list name from C to python style.
helour Mar 4, 2015
88747f0
Added ISO-8859-9 Turkish language model.
helour Mar 4, 2015
0edf7bf
Increased coefficient in the confidence calculation.
helour Mar 4, 2015
89e6a51
Changed formula for confidence value calculation.
helour Mar 4, 2015
9f672e5
Added ISO Hungarian, Romanian, German language models into probers.
helour Mar 4, 2015
55e56ce
Added function which can distinguish between ISO and Windows charsets.
helour Mar 4, 2015
5dbf64d
Removed test for equivalent encodings.
helour Mar 4, 2015
ca8a019
Renamed folder from alias to true ISO name.
helour Mar 4, 2015
f9d2977
Renamed folder from alias to true ISO name.
helour Mar 4, 2015
fe76cff
Updated test text.
helour Mar 4, 2015
36693a0
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
935e269
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
99ff367
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
05ae54a
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
13d8d25
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
94a4246
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
06c9012
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
0bea16f
Added character/symbol which enables distinguishing between ISO and W…
helour Mar 4, 2015
acf12e0
Unnecessary command byte_str += '\n' removed
helour Mar 10, 2015
6166cfe
Added optional flag for text cleanup in the API
helour Mar 11, 2015
28291e8
Improved detection for Portuguese and Spanish texts.
helour Mar 11, 2015
24d6bd1
Changed test for minimum number of total sequences
helour Mar 11, 2015
1a2dd6c
Updated/shortened text test files. Added new text files.
helour Mar 11, 2015
88b686f
Test text updated
helour Mar 12, 2015
0c9fdb1
Updated to be Python 3.0 compatible
helour Mar 12, 2015
a0c14a7
Changed tested/input string from 'str' to 'bytearray' (python 2.x).
helour Mar 16, 2015
86ee587
Update README.rst
helour Mar 16, 2015
69e7e47
Update NOTES.rst
helour Mar 16, 2015
e129f60
Converted string from 'str' to 'bytearray'. Removed compat.py
helour Mar 17, 2015
1d11c98
Update
helour Mar 17, 2015
1c27668
Rewritten to be Python 3.x compatible
helour Mar 18, 2015
c423b41
chmod 755 to 644
helour Mar 18, 2015
a052ebf
Removed unnecessary brackets
helour Mar 18, 2015
0b3a6d3
Optimized for speed
helour Mar 18, 2015
780 changes: 780 additions & 0 deletions CharsetsTabs.txt

Large diffs are not rendered by default.

9 changes: 8 additions & 1 deletion NOTES.rst
@@ -64,11 +64,19 @@ Bigram files
 - ``hebrewprober.py``
 - ``jpcntxprober.py``
 - ``langbulgarianmodel.py``
+- ``langcroatianmodel.py``
 - ``langcyrillicmodel.py``
+- ``langczechmodel.py``
+- ``langgermanmodel.py``
 - ``langgreekmodel.py``
 - ``langhebrewmodel.py``
 - ``langhungarianmodel.py``
+- ``langpolishmodel.py``
+- ``langromanianmodel.py``
+- ``langslovakmodel.py``
+- ``langslovenemodel.py``
 - ``langthaimodel.py``
 - ``langturkishmodel.py``
 - ``latin1prober.py``
 - ``sbcharsetprober.py``
 - ``sbcsgroupprober.py``

@@ -111,7 +119,6 @@ Misc files
 ----------

 - ``__init__.py`` (currently has ``detect`` function in it)
-- ``compat.py``
 - ``enums.py``
 - ``universaldetector.py``
 - ``version.py``
27 changes: 22 additions & 5 deletions README.rst
@@ -1,22 +1,39 @@
 Chardet: The Universal Character Encoding Detector
 --------------------------------------------------

+.. image:: https://img.shields.io/travis/chardet/chardet/stable.svg
+   :alt: Build status
+   :target: https://travis-ci.org/chardet/chardet
+
+.. image:: https://img.shields.io/coveralls/chardet/chardet/stable.svg
+   :target: https://coveralls.io/r/chardet/chardet
+
+.. image:: https://img.shields.io/pypi/dm/chardet.svg
+   :target: https://warehouse.python.org/project/chardet/
+   :alt: PyPI downloads
+
+.. image:: https://img.shields.io/pypi/v/chardet.svg
+   :target: https://warehouse.python.org/project/chardet/
+   :alt: Latest version on PyPI
+
+.. image:: https://img.shields.io/pypi/l/chardet.svg
+   :alt: License
+

 Detects
  - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  - EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  - EUC-KR, ISO-2022-KR (Korean)
  - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
+ - ISO-8859-2, windows-1250 (Czech, Croatian, Hungarian, Polish, Romanian, Slovak, Slovene)
  - ISO-8859-5, windows-1251 (Bulgarian)
- - windows-1252 (English)
+ - ISO-8859-1, windows-1252 (Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish)
  - ISO-8859-7, windows-1253 (Greek)
  - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
+ - ISO-8859-9, windows-1254 (Turkish)
  - TIS-620 (Thai)

-.. note::
-   Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily
-   disabled until we can retrain the models.
-
 Requires Python 2.6 or later

 Installation
17 changes: 8 additions & 9 deletions chardet/__init__.py
@@ -16,18 +16,17 @@
 ######################### END LICENSE BLOCK #########################


-from .compat import PY2, PY3
+import sys
 from .universaldetector import UniversalDetector
 from .version import __version__, VERSION


-def detect(byte_str):
-    if (PY2 and isinstance(byte_str, unicode)) or (PY3 and
-                                                   not isinstance(byte_str,
-                                                                  bytes)):
+def detect(byte_str, txt_cleanup=True):
+    PY_VER = 2 if sys.version_info < (3, 0) else 3
+    if ((PY_VER == 2 and isinstance(byte_str, unicode)) or
+            (PY_VER == 3 and not isinstance(byte_str, bytes))):
         raise ValueError('Expected a bytes object, not a unicode object')

+    if PY_VER == 2:
+        byte_str = bytearray(byte_str)
     u = UniversalDetector()
-    u.feed(byte_str)
+    u.feed(byte_str, txt_cleanup)
     u.close()
     return u.result
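
A minimal usage sketch of the API change above; the sample text and printed result are illustrative, and per commit 6166cfe the new txt_cleanup flag toggles the optional text-cleanup pass:

import chardet

# Hypothetical input; any bytes object works.
raw = u'Ez egy magyar szöveg.'.encode('iso-8859-2')

print(chardet.detect(raw))                     # default: cleanup enabled
print(chardet.detect(raw, txt_cleanup=False))  # feed the raw bytes as-is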
4 changes: 4 additions & 0 deletions chardet/big5prober.py
@@ -41,3 +41,7 @@ def __init__(self):
     @property
     def charset_name(self):
         return "Big5"
+
+    @property
+    def language(self):
+        return "Chinese"
19 changes: 9 additions & 10 deletions chardet/chardistribution.py
@@ -35,7 +35,6 @@
                         BIG5_TYPICAL_DISTRIBUTION_RATIO)
 from .jisfreq import (JIS_CHAR_TO_FREQ_ORDER, JIS_TABLE_SIZE,
                       JIS_TYPICAL_DISTRIBUTION_RATIO)
-from .compat import wrap_ord


 class CharDistributionAnalysis(object):

@@ -123,9 +122,9 @@ def get_order(self, byte_str):
         # first byte range: 0xc4 -- 0xfe
         # second byte range: 0xa1 -- 0xfe
         # no validation needed here. State machine has done that
-        first_char = wrap_ord(byte_str[0])
+        first_char = byte_str[0]
         if first_char >= 0xC4:
-            return 94 * (first_char - 0xC4) + wrap_ord(byte_str[1]) - 0xA1
+            return 94 * (first_char - 0xC4) + byte_str[1] - 0xA1
         else:
             return -1

@@ -142,9 +141,9 @@ def get_order(self, byte_str):
         # first byte range: 0xb0 -- 0xfe
         # second byte range: 0xa1 -- 0xfe
         # no validation needed here. State machine has done that
-        first_char = wrap_ord(byte_str[0])
+        first_char = byte_str[0]
         if first_char >= 0xB0:
-            return 94 * (first_char - 0xB0) + wrap_ord(byte_str[1]) - 0xA1
+            return 94 * (first_char - 0xB0) + byte_str[1] - 0xA1
         else:
             return -1

@@ -161,7 +160,7 @@ def get_order(self, byte_str):
         # first byte range: 0xb0 -- 0xfe
         # second byte range: 0xa1 -- 0xfe
         # no validation needed here. State machine has done that
-        first_char, second_char = wrap_ord(byte_str[0]), wrap_ord(byte_str[1])
+        first_char, second_char = byte_str[0], byte_str[1]
         if (first_char >= 0xB0) and (second_char >= 0xA1):
             return 94 * (first_char - 0xB0) + second_char - 0xA1
         else:

@@ -180,7 +179,7 @@ def get_order(self, byte_str):
         # first byte range: 0xa4 -- 0xfe
         # second byte range: 0x40 -- 0x7e , 0xa1 -- 0xfe
         # no validation needed here. State machine has done that
-        first_char, second_char = wrap_ord(byte_str[0]), wrap_ord(byte_str[1])
+        first_char, second_char = byte_str[0], byte_str[1]
         if first_char >= 0xA4:
             if second_char >= 0xA1:
                 return 157 * (first_char - 0xA4) + second_char - 0xA1 + 63

@@ -202,7 +201,7 @@ def get_order(self, byte_str):
         # first byte range: 0x81 -- 0x9f , 0xe0 -- 0xfe
         # second byte range: 0x40 -- 0x7e, 0x81 -- oxfe
         # no validation needed here. State machine has done that
-        first_char, second_char = wrap_ord(byte_str[0]), wrap_ord(byte_str[1])
+        first_char, second_char = byte_str[0], byte_str[1]
         if (first_char >= 0x81) and (first_char <= 0x9F):
             order = 188 * (first_char - 0x81)
         elif (first_char >= 0xE0) and (first_char <= 0xEF):

@@ -227,8 +226,8 @@ def get_order(self, byte_str):
         # first byte range: 0xa0 -- 0xfe
         # second byte range: 0xa1 -- 0xfe
         # no validation needed here. State machine has done that
-        char = wrap_ord(byte_str[0])
+        char = byte_str[0]
         if char >= 0xA0:
-            return 94 * (char - 0xA1) + wrap_ord(byte_str[1]) - 0xa1
+            return 94 * (char - 0xA1) + byte_str[1] - 0xa1
         else:
             return -1
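
For clarity, each two-byte order formula above maps a lead/trail byte pair onto an index into the corresponding character-frequency table; a standalone sketch of the EUC-JP case (not part of the chardet API):

def eucjp_order(first_byte, second_byte):
    # 94 cells per row; rows keyed off lead byte 0xA1, cells off trail
    # byte 0xA1, mirroring the formula in the diff above
    if first_byte >= 0xA0:
        return 94 * (first_byte - 0xA1) + second_byte - 0xA1
    return -1

print(eucjp_order(0xA4, 0xA2))  # 94 * 3 + 1 = 283 (hiragana "a")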
10 changes: 9 additions & 1 deletion chardet/charsetgroupprober.py
@@ -54,6 +54,14 @@ def charset_name(self):
             return None
         return self._best_guess_prober.charset_name

+    @property
+    def language(self):
+        if not self._best_guess_prober:
+            self.get_confidence()
+        if not self._best_guess_prober:
+            return None
+        return self._best_guess_prober.language
+
     def feed(self, byte_str):
         for prober in self.probers:
             if not prober:

@@ -89,7 +97,7 @@ def get_confidence(self):
                 self.logger.debug('%s not active', prober.charset_name)
                 continue
             conf = prober.get_confidence()
-            self.logger.debug('%s confidence = %s', prober.charset_name, conf)
+            self.logger.debug('%s %s confidence = %s', prober.charset_name, prober.language, conf)
             if best_conf < conf:
                 best_conf = conf
                 self._best_guess_prober = prober
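
The group prober now also surfaces the winning sub-prober's language. A usage sketch, assuming (as elsewhere in this branch) that every single-byte prober exposes a language property; the sample text and printed result are illustrative:

from chardet.sbcsgroupprober import SBCSGroupProber

prober = SBCSGroupProber()
prober.feed(bytearray(u'Příliš žluťoučký kůň úpěl ďábelské ódy.'.encode('windows-1250')))
print(prober.charset_name, prober.language, prober.get_confidence())
# e.g. windows-1250 Czech 0.9...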
18 changes: 8 additions & 10 deletions chardet/charsetprober.py
@@ -28,7 +28,6 @@

 import logging
 import re
-from io import BytesIO

 from .enums import ProbingState

@@ -79,16 +78,16 @@ def filter_international_words(buf):

         This filter applies to all scripts which do not use English characters.
         """
-        filtered = BytesIO()
+        out = b''

Member:
bytes are not mutable in Python, so switching this to using bytes and then concatenating to it with += means creating lots of temporary strings. BytesIO should be faster (although, feel free to prove me wrong).

Author:
You are wrong, please try this:

import time

c = []
for i in range(0, 256):
    c.append(bytearray(i))

start = time.time()
from io import BytesIO
filtered = BytesIO()
for i in range(0, 10000000):
    filtered.write(c[i % 255])
ret = filtered.getvalue()
print time.time() - start

start = time.time()
s = b''
for i in range(0, 10000000):
    s += c[i % 255]
ret = s
print time.time() - start

Member:
Umm... I just ran this and it completely confirmed my suspicions.

The BytesIO part (with Python 3) finished in 3.5 seconds, and the part using bytes with concatenation was running for over 5 minutes before I killed it.

Author:
Your suspicion is right for Python 3 but wrong for Python 2.7.
It is not a good idea to develop one project for several Python versions; there are many compatibility and speed-optimization problems.

Member:
I agree supporting both Python versions is difficult, but I'm not quite willing to leave Python 2 users completely in the dust yet, since there are so many of them. Especially when the hard work for maintaining compatibility has mostly been done already.

That said, I'll definitely target Python 3 for optimizations.

Author:
Maybe this new piece of code (Python 2 and 3 compatible) is what you need:

import time

c = []
for i in range(0, 256):
    c.append(bytearray(i))

start = time.time()
from io import BytesIO
filtered = BytesIO()
for i in range(0, 10000000):
    filtered.write(c[i % 255])
ret = filtered.getvalue()
print(time.time() - start)

start = time.time()
s = bytearray()
for i in range(0, 10000000):
    s.extend(c[i % 255])
ret = s
print(time.time() - start)

BTW the second part is still the winner because it doesn't use a stream :D

Member:
Nice! Thanks for the suggestion. 👍

         # This regex expression filters out only words that have at-least one
         # international character. The word may include one marker character at
         # the end.
         words = re.findall(
             b'[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?', buf)

         for word in words:
-            filtered.write(word[:-1])
+            out += word[:-1]

             # If the last character in the word is a marker, replace it with a
             # space as markers shouldn't affect our analysis (they are used

@@ -97,9 +96,9 @@ def filter_international_words(buf):
             last_char = word[-1:]
             if not last_char.isalpha() and last_char < b'\x80':
                 last_char = b' '
-            filtered.write(last_char)
+            out += last_char

-        return filtered.getvalue()
+        return out

     @staticmethod
     def filter_with_english_letters(buf):

@@ -113,7 +112,6 @@ def filter_with_english_letters(buf):
         characters and extended ASCII characters, but is currently only used by
         ``Latin1Prober``.
         """
-        filtered = BytesIO()
         in_tag = False
         prev = 0

@@ -132,15 +130,15 @@ def filter_with_english_letters(buf):
                 if curr > prev and not in_tag:
                     # Keep everything after last non-extended-ASCII,
                     # non-alphabetic character
-                    filtered.write(buf[prev:curr])
+                    out += buf[prev:curr]
                     # Output a space to delimit stretch we kept
-                    filtered.write(b' ')
+                    out += b' '
                     prev = curr + 1

         # If we're not in a tag...
         if not in_tag:
             # Keep everything after last non-extended-ASCII, non-alphabetic
             # character
-            filtered.write(buf[prev:])
+            out += buf[prev:]

-        return filtered.getvalue()
+        return out
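
To make the word filter's behavior concrete, here is a standalone run of the same regex and marker logic on made-up input (not chardet code):

import re

buf = b'caf\xe9 and <tag> r\xe9sum\xe9, plain ascii'
words = re.findall(b'[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?', buf)
print(words)  # [b'caf\xe9 ', b'r\xe9sum', b'\xe9,']

out = b''
for word in words:
    out += word[:-1]
    last_char = word[-1:]
    # replace a non-alphabetic ASCII marker with a space
    if not last_char.isalpha() and last_char < b'\x80':
        last_char = b' '
    out += last_char
print(out)  # b'caf\xe9 r\xe9sum\xe9 ' -- only the international words survive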
11 changes: 6 additions & 5 deletions chardet/cli/chardetect.py
@@ -19,12 +19,11 @@
 import sys
 from io import open

-from chardet import __version__
-from chardet.compat import PY2
+from chardet.version import __version__
 from chardet.universaldetector import UniversalDetector


+PY_VER = 2 if sys.version_info < (3, 0) else 3
Member:
it's common practice in Python to include checks like this in a compat.py module, which is what we were doing before. Please change this back to just using PY2 instead of PY_VER == 2.


 def description_of(lines, name='stdin'):
     """

@@ -38,10 +37,12 @@ def description_of(lines, name='stdin'):
     """
     u = UniversalDetector()
     for line in lines:
+        if PY_VER == 2:
+            line = bytearray(line)
         u.feed(line)
     u.close()
     result = u.result
-    if PY2:
+    if PY_VER == 2:
         name = name.decode(sys.getfilesystemencoding(), 'ignore')
     if result['encoding']:
         return '{0}: {1} with confidence {2}'.format(name, result['encoding'],

@@ -66,7 +67,7 @@ def main(argv=None):
                         help='File whose encoding we would like to determine. \
                               (default: stdin)',
                         type=argparse.FileType('rb'), nargs='*',
-                        default=[sys.stdin if PY2 else sys.stdin.buffer])
+                        default=[sys.stdin if PY_VER == 2 else sys.stdin.buffer])
     parser.add_argument('--version', action='version',
                         version='%(prog)s {0}'.format(__version__))
     args = parser.parse_args(argv)
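
For reference, the helper above can also be driven directly from Python; a hedged sketch (file name is a placeholder):

from chardet.cli.chardetect import description_of

with open('sample.txt', 'rb') as f:
    print(description_of(f, f.name))
    # e.g. "sample.txt: utf-8 with confidence 0.99"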
3 changes: 1 addition & 2 deletions chardet/codingstatemachine.py
@@ -28,7 +28,6 @@
 import logging

 from .enums import MachineState
-from .compat import wrap_ord


 class CodingStateMachine(object):

@@ -67,7 +66,7 @@ def reset(self):
     def next_state(self, c):
         # for each byte we get its class
         # if it is first byte, we also get byte length
-        byte_class = self._model['class_table'][wrap_ord(c)]
+        byte_class = self._model['class_table'][c]
         if self._curr_state == MachineState.start:
             self._curr_byte_pos = 0
             self._curr_char_len = self._model['char_len_table'][byte_class]
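
Since wrap_ord() is gone, the state machine now expects plain integer byte values; a minimal sketch of feeding it some (the model name comes from chardet's escsm module):

from chardet.codingstatemachine import CodingStateMachine
from chardet.escsm import HZ_SM_MODEL

sm = CodingStateMachine(HZ_SM_MODEL)
for byte in bytearray(b'~{'):      # bytearray yields ints on Python 2 and 3
    print(sm.next_state(byte))     # no ord() wrapping needed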
41 changes: 0 additions & 41 deletions chardet/compat.py

This file was deleted.

4 changes: 4 additions & 0 deletions chardet/cp949prober.py
@@ -43,3 +43,7 @@ def __init__(self):
     @property
     def charset_name(self):
         return "CP949"
+
+    @property
+    def language(self):
+        return "Korean"
6 changes: 3 additions & 3 deletions chardet/escprober.py
@@ -27,7 +27,6 @@

 from .charsetprober import CharSetProber
 from .codingstatemachine import CodingStateMachine
-from .compat import wrap_ord
 from .enums import LanguageFilter, ProbingState, MachineState
 from .escsm import (HZ_SM_MODEL, ISO2022CN_SM_MODEL, ISO2022JP_SM_MODEL,
                     ISO2022KR_SM_MODEL)

@@ -76,11 +75,12 @@ def get_confidence(self):
         return 0.00

     def feed(self, byte_str):
-        for c in byte_str:
+        num_bytes = len(byte_str)
+        for i in range(0, num_bytes):
             for coding_sm in self.coding_sm:
                 if not coding_sm or not coding_sm.active:
                     continue
-                coding_state = coding_sm.next_state(wrap_ord(c))
+                coding_state = coding_sm.next_state(byte_str[i])
                 if coding_state == MachineState.error:
                     coding_sm.active = False
                     self.active_sm_count -= 1
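
The switch from iterating byte_str to indexing it matters because iteration semantics differ between Python versions; a standalone illustration (not chardet code):

data = bytearray(b'\x1b$B')   # ESC $ B -- an ISO-2022-JP escape sequence
for i in range(len(data)):
    print(data[i])            # 27, 36, 66: ints on both Python 2 and 3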