New language models added; old inaccurate models were rebuilt. Hungarian test files changed. Script for language model building added #52
Changes from 61 commits
@@ -19,12 +19,11 @@
 import sys
 from io import open

-from chardet import __version__
-from chardet.compat import PY2
+from chardet.version import __version__
 from chardet.universaldetector import UniversalDetector



+PY_VER = 2 if sys.version_info < (3, 0) else 3
Review comment on the line above: it's common practice in Python to include checks like this in a

 def description_of(lines, name='stdin'):
     """

@@ -38,10 +37,12 @@ def description_of(lines, name='stdin'):
     """
     u = UniversalDetector()
     for line in lines:
+        if PY_VER == 2:
+            line = bytearray(line)
         u.feed(line)
     u.close()
     result = u.result
-    if PY2:
+    if PY_VER == 2:
         name = name.decode(sys.getfilesystemencoding(), 'ignore')
     if result['encoding']:
         return '{0}: {1} with confidence {2}'.format(name, result['encoding'],

@@ -66,7 +67,7 @@ def main(argv=None):
                         help='File whose encoding we would like to determine. \
                               (default: stdin)',
                         type=argparse.FileType('rb'), nargs='*',
-                        default=[sys.stdin if PY2 else sys.stdin.buffer])
+                        default=[sys.stdin if PY_VER == 2 else sys.stdin.buffer])
     parser.add_argument('--version', action='version',
                         version='%(prog)s {0}'.format(__version__))
     args = parser.parse_args(argv)

This file was deleted.
`bytes` are not mutable in Python, so switching this to using `bytes` and then concatenating to it with `+=` means creating lots of temporary strings. `BytesIO` should be faster (although, feel free to prove me wrong).
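To make the trade-off described above concrete, here is a minimal sketch that accumulates the same chunks both ways and times them. It is not the benchmark used in this thread; the function names and the workload size are assumptions chosen for illustration, and the timings will vary by interpreter and version.

```python
import io
import timeit


def concat_bytes(chunks):
    """Accumulate by rebinding an immutable bytes object on every step."""
    buf = b""
    for chunk in chunks:
        buf += chunk  # may allocate a new object and copy the old contents
    return buf


def concat_bytesio(chunks):
    """Accumulate by writing into an in-memory binary stream."""
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)
    return stream.getvalue()


# Hypothetical workload: many small chunks, as when reading a file line by line.
chunks = [b"x" * 64] * 2000

# Both strategies must produce identical output before timing means anything.
assert concat_bytes(chunks) == concat_bytesio(chunks)

for func in (concat_bytes, concat_bytesio):
    seconds = timeit.timeit(lambda: func(chunks), number=5)
    print("{0}: {1:.4f}s".format(func.__name__, seconds))
```

Increasing the number of chunks should make any quadratic copying cost in the `+=` version more visible, if the interpreter in use has one.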
You are wrong, please try this:
Umm... I just ran this and it completely confirmed my suspicions.
The `BytesIO` part (with Python 3) finished in 3.5 seconds, and the part using `bytes` with concatenation was running for over 5 minutes before I killed it.
Your suspicion is right for Python 3 but wrong for Python 2.7.
It is not a good idea to maintain one project for several Python versions. There are many problems with compatibility and speed optimization.
I agree supporting both Python versions is difficult, but I'm not quite willing to leave Python 2 users completely in the dust yet, since there are so many of them. Especially when the hard work for maintaining compatibility has mostly been done already.
That said, I'll definitely target Python 3 for optimizations.
Maybe this new piece of code (Python 2 and 3 compatible) is what you need:
BTW, the second part is still the winner because it doesn't use a stream :D
Nice! Thanks for the suggestion. 👍
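The snippet suggested above is not preserved on this page. As a rough illustration of what a stream-free approach that runs on both Python 2.7 and Python 3 could look like, the sketch below uses a mutable `bytearray` and `b"".join`. Both function names are hypothetical, and this is not necessarily the code that was proposed.

```python
def concat_bytearray(chunks):
    """Accumulate into a mutable buffer; no stream object involved."""
    buf = bytearray()
    for chunk in chunks:
        buf += chunk  # extends the same bytearray in place
    return bytes(buf)


def concat_join(chunks):
    """Collect the chunks first and join them in a single pass."""
    return b"".join(chunks)


# Hypothetical input standing in for lines read from a binary file.
lines = [b"first line\n", b"second line\n", b"third line\n"]

assert concat_bytearray(lines) == concat_join(lines)
print(len(concat_join(lines)))
```

`bytearray` is convenient when chunks arrive one at a time, while `b"".join` fits the case where all chunks are already available in a list.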