UTF detection when missing Byte Order Mark #109

Closed · wants to merge 4 commits

Changes from 2 commits
17 changes: 17 additions & 0 deletions chardet/universaldetector.py
@@ -46,6 +46,7 @@ class a user of ``chardet`` should use.
from .latin1prober import Latin1Prober
from .mbcsgroupprober import MBCSGroupProber
from .sbcsgroupprober import SBCSGroupProber
from .utf1632prober import UTF1632Prober


class UniversalDetector(object):
@@ -80,6 +81,7 @@ class UniversalDetector(object):

def __init__(self, lang_filter=LanguageFilter.ALL):
self._esc_charset_prober = None
self._utf1632_prober = None
self._charset_probers = []
self.result = None
self.done = None
@@ -105,6 +107,8 @@ def reset(self):
self._last_char = b''
if self._esc_charset_prober:
self._esc_charset_prober.reset()
if self._utf1632_prober:
self._utf1632_prober.reset()
for prober in self._charset_probers:
prober.reset()

@@ -179,6 +183,19 @@ def feed(self, byte_str):

self._last_char = byte_str[-1:]

# next we will look to see if it appears to be either a UTF-16 or
# UTF-32 encoding
if not self._utf1632_prober:
self._utf1632_prober = UTF1632Prober()

if self._utf1632_prober.state() == ProbingState.DETECTING:
if self._utf1632_prober.feed(byte_str) == ProbingState.FOUND_IT:
self.result = {'encoding': self._utf1632_prober.charset_name,
'confidence': self._utf1632_prober.get_confidence(),
'language': ''}
self.done = True
return

# If we've seen escape sequences, use the EscCharSetProber, which
# uses a simple state machine to check for known escape sequences in
# HZ and ISO-2022 encodings, since those are the only encodings that
211 changes: 211 additions & 0 deletions chardet/utf1632prober.py
@@ -0,0 +1,211 @@
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
Member:
Nitpick: You can remove the bits from this about the original code being mozilla/netscape code, since this part is all new.

#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Jason Zavaglia
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from chardet.enums import ProbingState
Member:
Nitpick: For consistency, please just make this from .enums import ProbingState

from .charsetprober import CharSetProber


class UTF1632Prober(CharSetProber):
Member:
The way we would usually handle this sort of multi-byte encoding is with a CodingStateMachine and CharDistributionAnalysis-based implementation. You can see the prober we have for UTF-8 (which is byte-oriented) here, which only uses the CodingStateMachine, but the approach is close to what we would need to do for UTF-16 and UTF-32. In fact, through some crazy bit of coincidence, we already have unused state machines for UTF-16LE and UTF-16BE in our code that I've never noticed until now. They appear to have been there since version 1.0, but never used.

Anyway, I know it would be a bit of work to switch this over to using CodingStateMachine like the UTF-8 prober, so I'm not going to require that to merge this, but if you don't do it, I'll probably end up switching this over to that eventually.

Contributor Author:
Ah, this would make sense, as it would be cleaner: you could verify the existence of the correct escape sequences. My implementation is more of a quick and dirty heuristic.

I feel grokking this state machine would take me days, so if you don't mind I'll defer that task.
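The "quick and dirty heuristic" being discussed rests on a simple observation: for mostly-ASCII text, where the zero bytes fall (byte offset modulo 4) is a fingerprint of the encoding. A minimal stdlib-only sketch (the `zero_positions` helper is illustrative and not part of chardet):

```python
def zero_positions(data):
    """Count zero bytes at each byte offset modulo 4."""
    zeros = [0, 0, 0, 0]
    for i, byte in enumerate(data):
        if byte == 0:
            zeros[i % 4] += 1
    return zeros

text = "hello world, hello chardet"  # 26 ASCII characters

# UTF-32LE stores ASCII as [char, 0, 0, 0]: zeros pile up at offsets 1-3.
print(zero_positions(text.encode("utf-32-le")))  # [0, 26, 26, 26]

# UTF-16LE stores ASCII as [char, 0]: zeros pile up at odd offsets.
print(zero_positions(text.encode("utf-16-le")))  # [0, 13, 0, 13]
```

The prober below does essentially this counting in `feed()`, then compares the per-offset ratios against EXPECTED_RATIO.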

"""
This class simply looks for occurrences of zero bytes, and infers
whether the file is UTF16 or UTF32 (low-endian or big-endian)
For instance, files looking like ( \0 \0 \0 [nonzero] )+
appear to be UTF32LE. Files looking like ( \0 [notzero] )+ appear
may be guessed to be UTF16LE.
"""

# how many logical characters to scan before feeling confident of prediction
MIN_CHARS_FOR_DETECTION = 20
# a fixed constant ratio of expected zeros or non-zeros in modulo-position.
EXPECTED_RATIO = 0.94

def __init__(self):
super(UTF1632Prober, self).__init__()
self.position = 0
self.zeros_at_mod = [0] * 4
self.nonzeros_at_mod = [0] * 4
self._state = ProbingState.DETECTING
self.quad = [0, 0, 0, 0]
self.invalid_utf16be = False
self.invalid_utf16le = False
self.invalid_utf32be = False
self.invalid_utf32le = False
self.first_half_surrogate_pair_detected_16be = False
self.first_half_surrogate_pair_detected_16le = False
self.reset()

def reset(self):
super(UTF1632Prober, self).reset()
self.position = 0
self.zeros_at_mod = [0] * 4
self.nonzeros_at_mod = [0] * 4
self._state = ProbingState.DETECTING
self.invalid_utf16be = False
self.invalid_utf16le = False
self.invalid_utf32be = False
self.invalid_utf32le = False
self.first_half_surrogate_pair_detected_16be = False
self.first_half_surrogate_pair_detected_16le = False
self.quad = [0, 0, 0, 0]

@property
def charset_name(self):
if self.is_likely_utf32be():
return "utf-32be"
if self.is_likely_utf32le():
return "utf-32le"
if self.is_likely_utf16be():
return "utf-16be"
if self.is_likely_utf16le():
return "utf-16le"
# default to something valid
return "utf-16"

@property
def language(self):
return ""

def approx_32bit_chars(self):
return max(1.0, self.position / 4.0)

def approx_16bit_chars(self):
return max(1.0, self.position / 2.0)

def is_likely_utf32be(self):
approx_chars = self.approx_32bit_chars()
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and (
self.zeros_at_mod[0] / approx_chars > self.EXPECTED_RATIO and
self.zeros_at_mod[1] / approx_chars > self.EXPECTED_RATIO and
self.zeros_at_mod[2] / approx_chars > self.EXPECTED_RATIO and
self.nonzeros_at_mod[3] / approx_chars > self.EXPECTED_RATIO and
not self.invalid_utf32be)


def is_likely_utf32le(self):
approx_chars = self.approx_32bit_chars()
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and (
self.nonzeros_at_mod[0] / approx_chars > self.EXPECTED_RATIO and
self.zeros_at_mod[1] / approx_chars > self.EXPECTED_RATIO and
self.zeros_at_mod[2] / approx_chars > self.EXPECTED_RATIO and
self.zeros_at_mod[3] / approx_chars > self.EXPECTED_RATIO and
not self.invalid_utf32le)

def is_likely_utf16be(self):
approx_chars = self.approx_16bit_chars()
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and (
(self.nonzeros_at_mod[1] + self.nonzeros_at_mod[3]) / approx_chars > self.EXPECTED_RATIO and
(self.zeros_at_mod[0] + self.zeros_at_mod[2]) / approx_chars > self.EXPECTED_RATIO and
not self.invalid_utf16be)

def is_likely_utf16le(self):
approx_chars = self.approx_16bit_chars()
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and (
(self.nonzeros_at_mod[0] + self.nonzeros_at_mod[2]) / approx_chars > self.EXPECTED_RATIO and
(self.zeros_at_mod[1] + self.zeros_at_mod[3]) / approx_chars > self.EXPECTED_RATIO and
not self.invalid_utf16le)

def validate_utf32_characters(self, quad):
"""
Identify if the quad of bytes is not valid UTF-32.

UTF-32 is valid in the range 0x00000000 - 0x0010FFFF
excluding 0x0000D800 - 0x0000DFFF

https://en.wikipedia.org/wiki/UTF-32
"""
if quad[0] != 0 or quad[1] > 0x10 or (
quad[0] == 0 and quad[1] == 0 and 0xD8 <= quad[2] <= 0xDF):
self.invalid_utf32be = True
if quad[3] != 0 or quad[2] > 0x10 or (
quad[3] == 0 and quad[2] == 0 and 0xD8 <= quad[1] <= 0xDF):
self.invalid_utf32le = True

def validate_utf16_characters(self, pair):
"""
Identify if the pair of bytes is not valid UTF-16.

UTF-16 is valid in the range 0x0000 - 0xFFFF excluding 0xD800 - 0xDFFF
with an exception for surrogate pairs, which must be in the range
0xD800-0xDBFF followed by 0xDC00-0xDFFF

https://en.wikipedia.org/wiki/UTF-16
"""
if not self.first_half_surrogate_pair_detected_16be:
if 0xD8 <= pair[0] <= 0xDB:
self.first_half_surrogate_pair_detected_16be = True
elif 0xDC <= pair[0] <= 0xDF:
self.invalid_utf16be = True
else:
if 0xDC <= pair[0] <= 0xDF:
self.first_half_surrogate_pair_detected_16be = False
else:
self.invalid_utf16be = True

if not self.first_half_surrogate_pair_detected_16le:
if 0xD8 <= pair[1] <= 0xDB:
self.first_half_surrogate_pair_detected_16le = True
elif 0xDC <= pair[1] <= 0xDF:
self.invalid_utf16le = True
else:
if 0xDC <= pair[1] <= 0xDF:
self.first_half_surrogate_pair_detected_16le = False
else:
self.invalid_utf16le = True

def feed(self, byte_str):
for c in byte_str:
mod4 = self.position % 4
self.quad[mod4] = c
if mod4 == 3:
self.validate_utf32_characters(self.quad)
self.validate_utf16_characters(self.quad[0:2])
self.validate_utf16_characters(self.quad[2:4])
if c == 0:
self.zeros_at_mod[mod4] += 1
else:
self.nonzeros_at_mod[mod4] += 1
self.position += 1
return self.state()

def state(self):
if self._state in [ProbingState.NOT_ME, ProbingState.FOUND_IT]:
Member:
Might as well use a tuple instead of a list here since this is immutable.

# terminal, decided states
return self._state
elif self.get_confidence() > 0.80:
self._state = ProbingState.FOUND_IT
elif self.position > 4 * 1024:
# if we get to 4kb into the file, and we can't conclude it's UTF,
# let's give up
self._state = ProbingState.NOT_ME
return self._state

def get_confidence(self):
confidence = 0.85

if self.is_likely_utf16le() or self.is_likely_utf16be() or self.is_likely_utf32le() or self.is_likely_utf32be():
Member:
I'd really like to see a more mathematical basis for the confidence here, rather than the essentially binary 0.99 or 0.01 we have now. Maybe something based on the difference between EXPECTED_RATIO and the values you have in nonzeros_at_mod and zeros_at_mod? 0.99 is a really high confidence, and I like to reserve it for definite yeses. In fact, calculating this confidence is a lot of what the CharDistributionAnalysis class does, which is part of why I think that's a more natural fit for this codec.

Contributor Author:
I'd say if we look at even 100 bytes (25 code points) and they are all "0 0 0 x 0 0 0 y 0 0 0 z" etc., our confidence that this is UTF-32 is likely to be sky high. Likewise for UTF-16.

That was my thinking. I realise it's not particularly rigorous to state this without statistical analysis of many texts, though, and maybe there are counterexamples I'm not aware of.

One way to improve rigour would be to verify that the exceptions to zeros are actually extended characters (surrogate pairs) of UTF-16. For instance, in UTF-16 you'd be looking for quads of (D8-DB) xx (DC-DF) xx to be recognised; see https://en.m.wikipedia.org/wiki/UTF-16#Examples

There are some other unused ranges in UTF16 and UTF32 also, so the detector could be made to flunk to zero if any of these are detected.

You're making me think again about correctness of my implementation :)
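The surrogate-pair quad pattern suggested above can be checked against the Wikipedia example U+10437 (a plane-1 character) using only the stdlib:

```python
# U+10437 encodes as the surrogate pair D801 DC37 in UTF-16.
be = "\U00010437".encode("utf-16-be")
print(be.hex())  # d801dc37

# The lead (high) surrogate byte falls in D8-DB and the trail (low)
# surrogate byte in DC-DF, matching the (D8-DB) xx (DC-DF) xx
# quad pattern described above.
assert 0xD8 <= be[0] <= 0xDB
assert 0xDC <= be[2] <= 0xDF
```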

Contributor Author:
I reduced the estimate to 0.85 arbitrarily, as I don't have rationale to set it at 0.99. Happy to discuss this point further.

I also did some additional checking for valid UTF-16/UTF-32 with the escapes, which improves robustness.

return confidence
else:
return 0.00
2 changes: 1 addition & 1 deletion test.py
@@ -53,7 +53,7 @@ def gen_test_params():
continue
# Test encoding detection for each file we have of encoding for
for file_name in listdir(path):
ext = splitext(file_name)[1].lower()
ext = splitext(file_name)[1].lower()
Member:
Nitpick: You ended up with a bunch of trailing whitespace here.

if ext not in ['.html', '.txt', '.xml', '.srt']:
continue
full_path = join(path, file_name)
Expand Down
Binary file added tests/UTF-16BE/nobom-utf16be.txt
Binary file added tests/UTF-16BE/plane1-utf-16be.html
Binary file added tests/UTF-16LE/nobom-utf16le.txt
Binary file added tests/UTF-16LE/plane1-utf-16le.html
Binary file added tests/UTF-32BE/nobom-utf32be.txt
Binary file added tests/UTF-32BE/plane1-utf-32be.html
Binary file added tests/UTF-32LE/nobom-utf32le.txt
Binary file added tests/UTF-32LE/plane1-utf-32le.html
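For readers who want to experiment with the idea behind this PR outside of chardet, the zero-byte heuristic can be sketched as a standalone toy function. This is not chardet's API; the function name, thresholds, and sample text are illustrative (the defaults mirror the prober's MIN_CHARS_FOR_DETECTION and EXPECTED_RATIO):

```python
def guess_utf_by_zeros(data, ratio=0.94, min_chars=20):
    """Toy re-implementation of the PR's heuristic: classify a BOM-less
    byte stream by where its zero bytes fall (offset modulo 4)."""
    zeros = [0, 0, 0, 0]
    for i, byte in enumerate(data):
        if byte == 0:
            zeros[i % 4] += 1
    chars32 = max(1, len(data) // 4)
    chars16 = max(1, len(data) // 2)
    # Check UTF-32 first: a valid UTF-32 stream also looks UTF-16-like.
    if chars32 >= min_chars:
        if (all(zeros[m] / chars32 > ratio for m in (0, 1, 2))
                and zeros[3] / chars32 < 1 - ratio):
            return "utf-32be"
        if (all(zeros[m] / chars32 > ratio for m in (1, 2, 3))
                and zeros[0] / chars32 < 1 - ratio):
            return "utf-32le"
    if chars16 >= min_chars:
        if ((zeros[0] + zeros[2]) / chars16 > ratio
                and (zeros[1] + zeros[3]) / chars16 < 1 - ratio):
            return "utf-16be"
        if ((zeros[1] + zeros[3]) / chars16 > ratio
                and (zeros[0] + zeros[2]) / chars16 < 1 - ratio):
            return "utf-16le"
    return None  # not confidently UTF-16/32 without a BOM

sample = "No byte order mark here, just plain ASCII text " * 3
print(guess_utf_by_zeros(sample.encode("utf-16-le")))  # utf-16le
print(guess_utf_by_zeros(sample.encode("utf-32-be")))  # utf-32be
```

Unlike the real prober, this sketch skips the surrogate-range validation entirely, so it is only a first-pass guess rather than a substitute for the invalid_* checks above.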