UTF detection when missing Byte Order Mark #109
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,211 @@ | ||
######################## BEGIN LICENSE BLOCK ######################## | ||
# The Original Code is mozilla.org code. | ||
# | ||
# The Initial Developer of the Original Code is | ||
# Netscape Communications Corporation. | ||
# Portions created by the Initial Developer are Copyright (C) 1998 | ||
# the Initial Developer. All Rights Reserved. | ||
# | ||
# Contributor(s): | ||
# Jason Zavaglia | ||
# | ||
# This library is free software; you can redistribute it and/or | ||
# modify it under the terms of the GNU Lesser General Public | ||
# License as published by the Free Software Foundation; either | ||
# version 2.1 of the License, or (at your option) any later version. | ||
# | ||
# This library is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU | ||
# Lesser General Public License for more details. | ||
# | ||
# You should have received a copy of the GNU Lesser General Public | ||
# License along with this library; if not, write to the Free Software | ||
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA | ||
# 02110-1301 USA | ||
######################### END LICENSE BLOCK ######################### | ||
from chardet.enums import ProbingState | ||
Review comment: Nitpick: For consistency, please just make this |
||
from .charsetprober import CharSetProber | ||
|
||
|
||
class UTF1632Prober(CharSetProber): | ||
Review comment: The way we would usually handle this sort of multi-byte encoding is with a Anyway, I know it would be a bit of work to switch this over to using
Reply: Ah, this would make sense, as it would be cleaner: you could verify the existence of the correct escape sequences. My implementation is more of a quick and dirty heuristic. I feel grokking this state machine would take me days, so if you don't mind I may defer this task. |
||
""" | ||
This class simply looks for occurrences of zero bytes, and infers | ||
whether the file is UTF16 or UTF32 (little-endian or big-endian). | ||
For instance, files looking like ( \0 \0 \0 [nonzero] )+ | ||
appear to be UTF32BE. Files looking like ( \0 [nonzero] )+ | ||
may be guessed to be UTF16BE, and inversely for the little-endian variants. | ||
""" | ||
|
||
# how many logical characters to scan before feeling confident of prediction | ||
MIN_CHARS_FOR_DETECTION = 20 | ||
# a fixed constant ratio of expected zeros or non-zeros in modulo-position. | ||
EXPECTED_RATIO = 0.94 | ||
|
||
def __init__(self): | ||
super(UTF1632Prober, self).__init__() | ||
self.position = 0 | ||
self.zeros_at_mod = [0] * 4 | ||
self.nonzeros_at_mod = [0] * 4 | ||
self._state = ProbingState.DETECTING | ||
self.quad = [0, 0, 0, 0] | ||
self.invalid_utf16be = False | ||
self.invalid_utf16le = False | ||
self.invalid_utf32be = False | ||
self.invalid_utf32le = False | ||
self.first_half_surrogate_pair_detected_16be = False | ||
self.first_half_surrogate_pair_detected_16le = False | ||
self.reset() | ||
|
||
def reset(self): | ||
super(UTF1632Prober, self).reset() | ||
self.position = 0 | ||
self.zeros_at_mod = [0] * 4 | ||
self.nonzeros_at_mod = [0] * 4 | ||
self._state = ProbingState.DETECTING | ||
self.invalid_utf16be = False | ||
self.invalid_utf16le = False | ||
self.invalid_utf32be = False | ||
self.invalid_utf32le = False | ||
self.first_half_surrogate_pair_detected_16be = False | ||
self.first_half_surrogate_pair_detected_16le = False | ||
self.quad = [0, 0, 0, 0] | ||
|
||
@property | ||
def charset_name(self): | ||
if self.is_likely_utf32be(): | ||
return "utf-32be" | ||
if self.is_likely_utf32le(): | ||
return "utf-32le" | ||
if self.is_likely_utf16be(): | ||
return "utf-16be" | ||
if self.is_likely_utf16le(): | ||
return "utf-16le" | ||
# default to something valid | ||
return "utf-16" | ||
|
||
@property | ||
def language(self): | ||
return "" | ||
|
||
def approx_32bit_chars(self): | ||
return max(1.0, self.position / 4.0) | ||
|
||
def approx_16bit_chars(self): | ||
return max(1.0, self.position / 2.0) | ||
|
||
def is_likely_utf32be(self): | ||
approx_chars = self.approx_32bit_chars() | ||
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and ( | ||
self.zeros_at_mod[0] / approx_chars > self.EXPECTED_RATIO and | ||
self.zeros_at_mod[1] / approx_chars > self.EXPECTED_RATIO and | ||
self.zeros_at_mod[2] / approx_chars > self.EXPECTED_RATIO and | ||
self.nonzeros_at_mod[3] / approx_chars > self.EXPECTED_RATIO and | ||
not self.invalid_utf32be) | ||
|
||
|
||
def is_likely_utf32le(self): | ||
approx_chars = self.approx_32bit_chars() | ||
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and ( | ||
self.nonzeros_at_mod[0] / approx_chars > self.EXPECTED_RATIO and | ||
self.zeros_at_mod[1] / approx_chars > self.EXPECTED_RATIO and | ||
self.zeros_at_mod[2] / approx_chars > self.EXPECTED_RATIO and | ||
self.zeros_at_mod[3] / approx_chars > self.EXPECTED_RATIO and | ||
not self.invalid_utf32le) | ||
|
||
def is_likely_utf16be(self): | ||
approx_chars = self.approx_16bit_chars() | ||
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and ( | ||
(self.nonzeros_at_mod[1] + self.nonzeros_at_mod[3]) / approx_chars > self.EXPECTED_RATIO and | ||
(self.zeros_at_mod[0] + self.zeros_at_mod[2]) / approx_chars > self.EXPECTED_RATIO and | ||
not self.invalid_utf16be) | ||
|
||
def is_likely_utf16le(self): | ||
approx_chars = self.approx_16bit_chars() | ||
return approx_chars >= self.MIN_CHARS_FOR_DETECTION and ( | ||
(self.nonzeros_at_mod[0] + self.nonzeros_at_mod[2]) / approx_chars > self.EXPECTED_RATIO and | ||
(self.zeros_at_mod[1] + self.zeros_at_mod[3]) / approx_chars > self.EXPECTED_RATIO and | ||
not self.invalid_utf16le) | ||
|
||
def validate_utf32_characters(self, quad): | ||
""" | ||
Identify if the quad of bytes is not valid UTF-32. | ||
|
||
UTF-32 is valid in the range 0x00000000 - 0x0010FFFF | ||
excluding 0x0000D800 - 0x0000DFFF | ||
|
||
https://en.wikipedia.org/wiki/UTF-32 | ||
""" | ||
if quad[0] != 0 or quad[1] > 0x10 or ( | ||
quad[0] == 0 and quad[1] == 0 and 0xD8 <= quad[2] <= 0xDF): | ||
self.invalid_utf32be = True | ||
if quad[3] != 0 or quad[2] > 0x10 or ( | ||
quad[3] == 0 and quad[2] == 0 and 0xD8 <= quad[1] <= 0xDF): | ||
self.invalid_utf32le = True | ||
|
||
def validate_utf16_characters(self, pair): | ||
""" | ||
Identify if the pair of bytes is not valid UTF-16. | ||
|
||
UTF-16 is valid in the range 0x0000 - 0xFFFF excluding 0xD800 - 0xDFFF | ||
with an exception for surrogate pairs, which must be in the range | ||
0xD800-0xDBFF followed by 0xDC00-0xDFFF | ||
|
||
https://en.wikipedia.org/wiki/UTF-16 | ||
""" | ||
if not self.first_half_surrogate_pair_detected_16be: | ||
if 0xD8 <= pair[0] <= 0xDB: | ||
self.first_half_surrogate_pair_detected_16be = True | ||
elif 0xDC <= pair[0] <= 0xDF: | ||
self.invalid_utf16be = True | ||
else: | ||
if 0xDC <= pair[0] <= 0xDF: | ||
self.first_half_surrogate_pair_detected_16be = False | ||
else: | ||
self.invalid_utf16be = True | ||
|
||
if not self.first_half_surrogate_pair_detected_16le: | ||
if 0xD8 <= pair[1] <= 0xDB: | ||
self.first_half_surrogate_pair_detected_16le = True | ||
elif 0xDC <= pair[1] <= 0xDF: | ||
self.invalid_utf16le = True | ||
else: | ||
if 0xDC <= pair[1] <= 0xDF: | ||
self.first_half_surrogate_pair_detected_16le = False | ||
else: | ||
self.invalid_utf16le = True | ||
|
||
def feed(self, byte_str): | ||
for c in byte_str: | ||
mod4 = self.position % 4 | ||
self.quad[mod4] = c | ||
if mod4 == 3: | ||
self.validate_utf32_characters(self.quad) | ||
self.validate_utf16_characters(self.quad[0:2]) | ||
self.validate_utf16_characters(self.quad[2:4]) | ||
if c == 0: | ||
self.zeros_at_mod[mod4] += 1 | ||
else: | ||
self.nonzeros_at_mod[mod4] += 1 | ||
self.position += 1 | ||
return self.state() | ||
|
||
def state(self): | ||
if self._state in [ProbingState.NOT_ME, ProbingState.FOUND_IT]: | ||
Review comment: Might as well use a |
||
# terminal, decided states | ||
return self._state | ||
elif self.get_confidence() > 0.80: | ||
self._state = ProbingState.FOUND_IT | ||
elif self.position > 4 * 1024: | ||
# if we get to 4kb into the file, and we can't conclude it's UTF, | ||
# let's give up | ||
self._state = ProbingState.NOT_ME | ||
return self._state | ||
|
||
def get_confidence(self): | ||
confidence = 0.85 | ||
|
||
if self.is_likely_utf16le() or self.is_likely_utf16be() or self.is_likely_utf32le() or self.is_likely_utf32be(): | ||
Review comment: I'd really like to see a more mathematical basis for the confidence here, rather than the essentially binary 0.99 or 0.01 we have now. Maybe something based on the difference between
Reply: I'd say if we look at even 100 bytes (25 code points) and they are all "0 0 0 x 0 0 0 y 0 0 0 z" etc., our confidence that this is UTF32 is likely to be sky high. Likewise for UTF16. That was my thinking. I realise it's not particularly rigorous to state this without statistical analysis of many texts, however, and maybe there are counterexamples I'm not aware of. One way to improve rigour would be to verify that exceptions to the zeros are actually extended characters (surrogate pairs) of UTF16. For instance, in UTF16 you'd be looking for quads of (D8-DB) xx (DC-DF) xx to be recognised, e.g. https://en.m.wikipedia.org/wiki/UTF-16#Examples. There are some other unused ranges in UTF16 and UTF32 also, so the detector could be made to flunk to zero if any of these are detected. You're making me think again about the correctness of my implementation :)
Reply: I reduced the estimate to 0.85 arbitrarily, as I don't have a rationale to set it at 0.99. Happy to discuss this point further. I also did some additional checking for valid UTF16/UTF32 with the escapes, which improves robustness. |
||
return confidence | ||
else: | ||
return 0.00 |
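
For readers skimming the diff, the core idea of the prober (counting where zero bytes fall, position modulo 4) can be tried in isolation. The following is only a minimal sketch under stated assumptions, not the code in this pull request: `guess_bomless_utf` is a hypothetical helper name, it reuses the 20-character minimum and 0.94 ratio from the class above, and it deliberately omits the UTF-16/UTF-32 validity checks.

```python
def guess_bomless_utf(data, min_chars=20, ratio=0.94):
    """Guess a BOM-less UTF-16/32 variant from zero-byte positions (mod 4).

    Hypothetical standalone helper; thresholds mirror the constants in the
    prober above, but no range or surrogate validation is performed.
    """
    zeros = [0] * 4
    nonzeros = [0] * 4
    for position, byte in enumerate(bytearray(data)):
        if byte == 0:
            zeros[position % 4] += 1
        else:
            nonzeros[position % 4] += 1

    # Approximate number of logical characters under each interpretation.
    chars32 = max(1.0, len(data) / 4.0)
    chars16 = max(1.0, len(data) / 2.0)

    def mostly(count, total):
        return count / total > ratio

    if chars32 >= min_chars:
        # Big-endian UTF-32: three zero bytes, then one non-zero byte.
        if (all(mostly(zeros[i], chars32) for i in (0, 1, 2))
                and mostly(nonzeros[3], chars32)):
            return "utf-32be"
        # Little-endian UTF-32: one non-zero byte, then three zero bytes.
        if (mostly(nonzeros[0], chars32)
                and all(mostly(zeros[i], chars32) for i in (1, 2, 3))):
            return "utf-32le"
    if chars16 >= min_chars:
        # Big-endian UTF-16: zero byte first, non-zero byte second.
        if (mostly(zeros[0] + zeros[2], chars16)
                and mostly(nonzeros[1] + nonzeros[3], chars16)):
            return "utf-16be"
        # Little-endian UTF-16: non-zero byte first, zero byte second.
        if (mostly(nonzeros[0] + nonzeros[2], chars16)
                and mostly(zeros[1] + zeros[3], chars16)):
            return "utf-16le"
    return None


sample = "Hello, byte order marks are optional. " * 2
print(guess_bomless_utf(sample.encode("utf-16-le")))
print(guess_bomless_utf(sample.encode("utf-32-be")))
```

Like the prober, this sketch only works well for text dominated by characters below U+0100 (mostly ASCII or Latin-1); text in other scripts has few zero bytes and would not reach the ratio.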
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,7 +53,7 @@ def gen_test_params(): | |
continue | ||
# Test encoding detection for each file we have of encoding for | ||
for file_name in listdir(path): | ||
ext = splitext(file_name)[1].lower() | ||
ext = splitext(file_name)[1].lower() | ||
Review comment: Nitpick: You ended up with a bunch of trailing whitespace here. |
||
if ext not in ['.html', '.txt', '.xml', '.srt']: | ||
dan-blanchard marked this conversation as resolved.
|
||
continue | ||
full_path = join(path, file_name) | ||
|
Review comment: Nitpick: You can remove the bits from this about the original code being mozilla/netscape code, since this part is all new.
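
The surrogate-pair rule raised in the review discussion (a 16-bit unit in 0xD800-0xDBFF must be followed by one in 0xDC00-0xDFFF) can also be sketched on its own. This is an illustrative sketch, not the code under review: `surrogates_well_formed` is a hypothetical name, and it assumes the whole buffer is available at once, whereas the prober keeps state across `feed` calls.

```python
def surrogates_well_formed(data, big_endian):
    """Check surrogate ordering in a buffer of 16-bit code units.

    Hypothetical helper: only the most significant byte of each unit is
    inspected, which is enough to tell high surrogates (0xD8-0xDB) from
    low surrogates (0xDC-0xDF).
    """
    data = bytearray(data)
    expecting_low = False
    for i in range(0, len(data) - 1, 2):
        # Most significant byte of the current code unit.
        lead = data[i] if big_endian else data[i + 1]
        if expecting_low:
            if not 0xDC <= lead <= 0xDF:
                return False  # high surrogate not followed by a low one
            expecting_low = False
        elif 0xD8 <= lead <= 0xDB:
            expecting_low = True
        elif 0xDC <= lead <= 0xDF:
            return False  # low surrogate with no preceding high surrogate
    # A trailing high surrogate is tolerated: more data may still arrive.
    return True


emoji_text = "a\U0001F600b"
print(surrogates_well_formed(emoji_text.encode("utf-16-le"), big_endian=False))
print(surrogates_well_formed(b"\x00\xdc", big_endian=False))
```

A check like this can only rule an encoding out; passing it says nothing about which endianness is more likely, so it complements the zero-byte counts and does not replace them.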