UTF detection when missing Byte Order Mark #109
Rebased the PR to origin/master.
Thanks so much for this contribution! I haven't gotten a chance to run through all of it yet, but from what I've seen so far, it looks good.
As discussed in this comment from an old PR, this is entirely the correct and required behavior: UTF-16LE and UTF-16BE are what one is required to report when there is no BOM, but they are not allowed if there is a BOM.
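A quick illustration of why that distinction matters, using CPython's standard codecs (a standalone sketch, not code from the PR):

```python
# UTF-16LE bytes for "hi", with an explicit little-endian BOM prepended.
data = b"\xff\xfe" + "hi".encode("utf-16-le")

# The "utf-16" codec consumes the BOM and picks the byte order from it.
assert data.decode("utf-16") == "hi"

# The "utf-16-le" codec treats the BOM bytes as an ordinary character
# (U+FEFF), so it leaks into the decoded text.
assert data.decode("utf-16-le") == "\ufeffhi"
```

So reporting "UTF-16LE" for BOM-prefixed data would leave a stray U+FEFF at the start of the decoded string, while reporting plain "UTF-16" for BOM-less data forces the decoder to guess the byte order.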
There are some minor changes that are definitely required before merging, but if you're up for it, I'd love to see this reworked to use CharDistributionAnalysis like most of our other multi-byte probers.
# appear to be UTF32LE. Files looking like ( \0 [notzero] )+
# may be guessed to be UTF16LE.

class UTF1632Prober(CharSetProber):
The way we would usually handle this sort of multi-byte encoding is with a CodingStateMachine and CharDistributionAnalysis-based implementation. You can see the prober we have for UTF-8 here, which only uses the CodingStateMachine, but the approach is close to what we would need to do for UTF-16 and UTF-32. In fact, through some crazy bit of coincidence, we already have unused state machines for UTF-16LE and UTF-16BE in our code that I had never noticed until now. They appear to have been there since version 1.0, but never used.
Anyway, I know it would be a bit of work to switch this over to using CodingStateMachine like the UTF-8 prober, so I'm not going to require that to merge this, but if you don't do it, I'll probably end up switching it over eventually.
Ah, this would make sense, as it would be cleaner - you could verify the existence of the correct byte sequences - whereas my implementation is more of a quick-and-dirty heuristic.
I feel grokking this state machine would take me days, so if you don't mind, I'd prefer to defer that task.
def get_confidence(self):
    confidence = 0.99

    if self.is_likely_utf16le() or self.is_likely_utf16be() or self.is_likely_utf32le() or self.is_likely_utf32be():
I'd really like to see a more mathematical basis for the confidence here, rather than the essentially binary 0.99 or 0.01 we have now. Maybe something based on the difference between EXPECTED_RATIO and the values you have in nonzeros_at_mod and zeros_at_mod? 0.99 is a really high confidence, and I like to reserve it for definite yeses. In fact, calculating this confidence is a lot of what the CharDistributionAnalysis class does, which is part of why I think that's a more natural fit for this codec.
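One rough sketch of what a ratio-based confidence could look like. This is a hypothetical illustration, not the PR's code: the EXPECTED_RATIO value, the function name, and the argument layout (a per-offset list of zero-byte counts, a tuple of offsets expected to be zero, and a word count) are all assumptions loosely mirroring the names mentioned above.

```python
EXPECTED_RATIO = 0.94  # assumed target fraction of zero bytes at "zero" offsets

def ratio_confidence(zeros_at_mod, zero_positions, total_words):
    """Scale confidence by how close the observed zero-byte fraction at the
    expected-zero offsets comes to EXPECTED_RATIO, instead of returning a
    binary 0.99 / 0.01.

    zeros_at_mod: count of zero bytes seen at each offset (mod word size)
    zero_positions: offsets that should be zero for the candidate encoding
    total_words: number of complete code-unit words examined
    """
    if total_words == 0:
        return 0.0
    # The weakest expected-zero position drives the overall confidence.
    worst = min(zeros_at_mod[p] / total_words for p in zero_positions)
    if worst >= EXPECTED_RATIO:
        # Ramp linearly from 0.5 at EXPECTED_RATIO up to 0.99 at a perfect fit.
        return 0.5 + 0.49 * (worst - EXPECTED_RATIO) / (1 - EXPECTED_RATIO)
    # Below the expected ratio, fall off toward the 0.01 floor.
    return max(0.01, 0.5 * worst / EXPECTED_RATIO)
```

For instance, a buffer whose expected-zero offsets are zero in every word would score 0.99, while one where only a third of the words fit the pattern would score well under 0.5.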
I'd say if we look at even 100 bytes (25 code points) and they are all "0 0 0 x 0 0 0 y 0 0 0 z" and so on, our confidence that this is UTF-32 is likely to be sky-high. Likewise for UTF-16.
That was my thinking - I realise it's not particularly rigorous to state this without statistical analysis of many texts, though, and maybe there are counterexamples I'm not aware of.
One way to improve rigour would be to verify that the exceptions to the zeros are actually extended characters (surrogate pairs) of UTF-16. For instance, in UTF-16 you'd be looking for quads of (D8-DB) xx (DC-DF) xx to be recognised - see https://en.m.wikipedia.org/wiki/UTF-16#Examples
There are also some other unused ranges in UTF-16 and UTF-32, so the detector could be made to flunk to zero if any of these are detected.
You're making me think again about correctness of my implementation :)
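The surrogate-pair check described above could be sketched roughly as follows. This is a standalone illustration, not the PR's implementation; the function name is hypothetical.

```python
def valid_utf16le_surrogates(data: bytes) -> bool:
    """Check that in UTF-16LE data every high surrogate (D800-DBFF) is
    immediately followed by a low surrogate (DC00-DFFF), and that no low
    surrogate appears on its own."""
    # Interpret the buffer as little-endian 16-bit code units.
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data) - 1, 2)]
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be paired with a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Lone low surrogate: invalid UTF-16.
            return False
        else:
            i += 1
    return True
```

For example, the musical symbol U+1D11E encodes as the surrogate pair D834 DD1E and passes, while a high surrogate followed by an ordinary BMP character fails.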
I reduced the estimate to 0.85, arbitrarily, as I don't have a rationale to set it at 0.99. Happy to discuss this point further.
I also did some additional checking for valid UTF-16/UTF-32 surrogate sequences, which improves robustness.
Also, thanks so much for the PR!
- Restored file suffix filter in test.py
- Added functionality to identify valid unicode, to enhance detection
- Generated some non-trivial unicode examples using supplementary plane 1
Let me know if you have further things you'd like changed - when you're happy, I'll squash the commits so you can do a clean merge. (I've left the history there for the moment.)
Thanks for the PR! Sorry about the delay in reviewing it. I reviewed this months ago and never hit "Submit review".
@@ -0,0 +1,211 @@
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
Nitpick: You can remove the bits from this about the original code being mozilla/netscape code, since this part is all new.
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from chardet.enums import ProbingState
Nitpick: For consistency, please just make this from .enums import ProbingState
test.py
@@ -53,7 +53,7 @@ def gen_test_params():
            continue
        # Test encoding detection for each file we have of encoding for
        for file_name in listdir(path):
            ext = splitext(file_name)[1].lower()
            ext = splitext(file_name)[1].lower()
Nitpick: You ended up with a bunch of trailing whitespace here.
        return self.state()

    def state(self):
        if self._state in [ProbingState.NOT_ME, ProbingState.FOUND_IT]:
Might as well use a tuple instead of a list here, since this is immutable.
Hi, guys... What's the status of this patch?
Anyone?
ping @jpz , a few nits left.
Did these changes get merged?
The PR is still open, so no.
Any update on this? I have recently run into several UTF-16 files without a BOM, and chardet fails horribly with a "100% ascii" result. @dan-blanchard / @jpz
@dan-blanchard I lost my momentum when I was working on this PR a few years ago. I didn't get any feedback on my PR for about 4 months, and I had moved on to other things at that stage. I'm sure you were busy with your life, so I'm just explaining how I dropped off the PR. If you support me making these changes, let me know, and I'll be happy to take another look and fix this up. I'm not sure how busy you are maintaining/extending this any more, as the repo seems to have a low rate of change, but I'm happy to help out if it's of value.
I would love to see this get included. This issue has been a real pain point in my project and has forced me to resort to some ugly hacks.
Bump, any updates on this? This would still be a very valuable change to chardet.
There is now a conflict in |
I just ran into this issue myself. It would be nice to see these improvements incorporated!
This was replaced by #206.
This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.
At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR, by the way - happy to have other eyes on it.
Note that I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. This is justified because endianness is quite material: Python will not decode these files correctly when passed decode("utf-16"), since it has to assume an endianness. This matters less for files with BOMs, as UTF-aware readers will generally inspect the BOM and decode accordingly, but for files missing the BOM, it's really important to be told the endianness to have any hope of reading the file correctly.
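To illustrate the endianness point concretely (a standalone sketch using the standard codecs, not code from the PR):

```python
data = "hi".encode("utf-16-le")   # b'h\x00i\x00' - no BOM is written

# Decoding with the correct endianness round-trips.
assert data.decode("utf-16-le") == "hi"

# Decoding with the wrong endianness silently produces garbage:
# 0x68 0x00 read big-endian is U+6800, and 0x69 0x00 is U+6900.
assert data.decode("utf-16-be") == "\u6800\u6900"
```

With no BOM present, nothing in the bytes themselves flags the mistake, which is why the detector reporting the specific byte order is so useful.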
We could change the reporting of files with a BOM to include the endianness also, but I've avoided having an opinion on that for the moment.