Release 4.0.0 #140

Merged
merged 34 commits into from Dec 10, 2020
ec3bce7
Convert single-byte charset probers to use nested dicts for language …
dan-blanchard Apr 27, 2017
c68f120
Add API option to get all the encodings confidence #96 (#111)
mdamien Oct 3, 2017
d7c7343
Make sure pyc files are not in tarballs
dan-blanchard Oct 11, 2017
b3d867a
Bump version to 4.0.0
dan-blanchard Oct 19, 2017
d702545
add benchmark script
dan-blanchard Oct 19, 2017
8dccd00
Add more info and options to bench.py
dan-blanchard Oct 19, 2017
726973e
Fix miscalculation in bench.py
dan-blanchard Oct 19, 2017
71a0fad
Simplify timing steps in bench.py
dan-blanchard Oct 19, 2017
5b05c5d
Include license file in the generated wheel package
jdufresne Oct 21, 2017
c4c1ba0
Merge pull request #141 from jdufresne/wheel-license
sigmavirus24 Oct 21, 2017
d94c13b
Drop support for Python 2.6 (#143)
jdufresne Dec 11, 2017
38b43cd
Remove unused coverage configuration (#142)
jdufresne Dec 11, 2017
53914f3
Doc the chardet package suitable for production (#144)
jdufresne Jan 18, 2018
d79a194
Pass python_requires argument to setuptools (#150)
jdufresne Apr 26, 2018
1721846
Update pypi.python.org URL to pypi.org (#155)
jdufresne Jun 26, 2018
b5194bf
Typo fix (#159)
saintamh Aug 8, 2018
440828f
Support pytest 4, don't apply marks directly to parameters (#174)
hroncok Nov 11, 2019
388501a
Test Python 3.7 and 3.8 and document support (#175)
jdufresne Nov 11, 2019
a4605d5
Drop support for end-of-life Python 3.4 (#181)
jdufresne Nov 11, 2019
b411a97
Workaround for distutils bug in python 2.7 (#165)
xeor Nov 12, 2019
eb1a10a
Remove deprecated license_file from setup.cfg (#182)
jdufresne Nov 12, 2019
96f8cff
Remove deprecated 'sudo: false' from Travis configuration (#200)
jdufresne Nov 1, 2020
1be32c9
Add testing for Python 3.9 (#201)
jdufresne Dec 8, 2020
0608f05
Adds explicit os and distro definitions
May 5, 2020
5b1d7d5
Merge branch 'edumco-upgrade-travis-syntax'
dan-blanchard Dec 8, 2020
4650dbf
Remove shebang from nonexecutable script (#192)
hrnciar Dec 8, 2020
6a59c4b
Remove use of deprecated 'setup.py test' (#187)
jdufresne Dec 8, 2020
e4290b6
Remove unnecessary numeric placeholders from format strings (#176)
jdufresne Dec 8, 2020
55ef330
Update links (#152)
aaaxx Dec 8, 2020
056a2a4
Remove shebang and executable bit from chardet/cli/chardetect.py (#171)
jdufresne Dec 8, 2020
1db0347
Handle weird logging edge case in universaldetector.py
dan-blanchard Dec 8, 2020
a9286f7
Try to switch from Travis to GitHub Actions (#204)
dan-blanchard Dec 10, 2020
1e208b7
Properly set CharsetGroupProber.state to FOUND_IT (#203)
dan-blanchard Dec 10, 2020
53854fb
Add language to detect_all output
dan-blanchard Dec 10, 2020
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -4,3 +4,5 @@ include requirements.txt
 include test.py
 recursive-include docs *
 recursive-include tests *
+global-exclude *.pyc
+global-exclude __pycache__
45 changes: 44 additions & 1 deletion chardet/__init__.py
@@ -16,11 +16,14 @@
 ######################### END LICENSE BLOCK #########################


-from .compat import PY2, PY3
 from .universaldetector import UniversalDetector
+from .enums import InputState
 from .version import __version__, VERSION


+__all__ = ['UniversalDetector', 'detect', 'detect_all', '__version__', 'VERSION']
+
+
 def detect(byte_str):
     """
     Detect the encoding of the given byte string.
@@ -37,3 +40,43 @@ def detect(byte_str):
     detector = UniversalDetector()
     detector.feed(byte_str)
     return detector.close()
+
+
+def detect_all(byte_str):
+    """
+    Detect all the possible encodings of the given byte string.
+
+    :param byte_str: The byte sequence to examine.
+    :type byte_str: ``bytes`` or ``bytearray``
+    """
+    if not isinstance(byte_str, bytearray):
+        if not isinstance(byte_str, bytes):
+            raise TypeError('Expected object of type bytes or bytearray, got: '
+                            '{0}'.format(type(byte_str)))
+        else:
+            byte_str = bytearray(byte_str)
+
+    detector = UniversalDetector()
+    detector.feed(byte_str)
+    detector.close()
+
+    if detector._input_state == InputState.HIGH_BYTE:
+        results = []
+        for prober in detector._charset_probers:
+            if prober.get_confidence() > detector.MINIMUM_THRESHOLD:
+                charset_name = prober.charset_name
+                lower_charset_name = prober.charset_name.lower()
+                # Use Windows encoding name instead of ISO-8859 if we saw any
+                # extra Windows-specific bytes
+                if lower_charset_name.startswith('iso-8859'):
+                    if detector._has_win_bytes:
+                        charset_name = detector.ISO_WIN_MAP.get(lower_charset_name,
+                                                            charset_name)
+                results.append({
+                    'encoding': charset_name,
+                    'confidence': prober.get_confidence()
+                })
+        if len(results) > 0:
+            return sorted(results, key=lambda result: -result['confidence'])
+
+    return [detector.result]
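
For reviewers, the filtering and ranking step inside `detect_all` can be sketched in isolation. This is a minimal illustration and does not import chardet: `rank_candidates` is a hypothetical helper, the `(charset_name, confidence)` tuples stand in for real prober objects, and the threshold and mapping values are assumptions chosen for the example, not values read from `UniversalDetector`.

```python
# Stand-in values: assumptions for illustration, not read from chardet itself.
MINIMUM_THRESHOLD = 0.20
ISO_WIN_MAP = {
    "iso-8859-1": "Windows-1252",
    "iso-8859-2": "Windows-1250",
}


def rank_candidates(candidates, has_win_bytes):
    """Filter and sort (charset_name, confidence) pairs the way the new loop does.

    `candidates` replaces the detector's prober list (hypothetical input shape).
    """
    results = []
    for charset_name, confidence in candidates:
        # Drop guesses at or below the threshold.
        if confidence <= MINIMUM_THRESHOLD:
            continue
        lower_name = charset_name.lower()
        # Prefer the Windows name when Windows-specific bytes were seen.
        if lower_name.startswith("iso-8859") and has_win_bytes:
            charset_name = ISO_WIN_MAP.get(lower_name, charset_name)
        results.append({"encoding": charset_name, "confidence": confidence})
    # Highest confidence first.
    return sorted(results, key=lambda result: -result["confidence"])


if __name__ == "__main__":
    sample = [("ISO-8859-1", 0.73), ("UTF-8", 0.10), ("SHIFT_JIS", 0.45)]
    for entry in rank_candidates(sample, has_win_bytes=True):
        print(entry)
```

In the diff itself, an empty result list falls through to `return [detector.result]`, so callers always receive a list with at least one entry; the sketch omits that fallback.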
6 changes: 4 additions & 2 deletions chardet/compat.py
@@ -25,10 +25,12 @@
 if sys.version_info < (3, 0):
     PY2 = True
     PY3 = False
-    base_str = (str, unicode)
+    string_types = (str, unicode)
     text_type = unicode
+    iteritems = dict.iteritems
 else:
     PY2 = False
     PY3 = True
-    base_str = (bytes, str)
+    string_types = (bytes, str)
     text_type = str
+    iteritems = dict.items
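
As a rough illustration of how the renamed `string_types` alias and the new `iteritems` alias might be consumed, the sketch below re-creates the shim locally rather than importing `chardet.compat`. The `describe` helper is hypothetical and exists only for this example.

```python
import sys

# Local copy of the shim from the diff, so the example has no chardet dependency.
if sys.version_info < (3, 0):
    string_types = (str, unicode)  # noqa: F821 (name exists on Python 2 only)
    iteritems = dict.iteritems
else:
    string_types = (bytes, str)
    iteritems = dict.items


def describe(mapping):
    """Return sorted 'key=value' strings for entries whose value is string-like.

    Hypothetical helper; it is not part of chardet.
    """
    return sorted(
        "{}={!r}".format(key, value)
        for key, value in iteritems(mapping)  # unbound method, called on the dict
        if isinstance(value, string_types)
    )


if __name__ == "__main__":
    print(describe({"name": "chardet", "major": 4}))
```

Because `iteritems` is bound to the unbound `dict` method, it is called as `iteritems(mapping)` on both interpreter lines, which keeps call sites identical across Python 2 and 3.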