UTF detection when missing Byte Order Mark #109

Closed
wants to merge 4 commits into from

Conversation

@jpz jpz commented Apr 11, 2017

This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.

At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR by the way, happy to have other eyes on it.

Note I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. This is justified because it's quite material whether the file is little-endian or big-endian: Python will not decode these files correctly when they are passed to decode("utf-16"), since with no BOM it has to assume an endianness. This matters less for files with BOMs, as UTF-aware readers will generally inspect the BOM and auto-decode the file, but for files missing the BOM it's really important to be told the endianness in order to have any hope of reading the file.
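
For example, a minimal sketch of that decode behaviour (the exact garbled output depends on the machine's native byte order):

    # "hi" encoded as UTF-16BE with no BOM
    data = "hi".encode("utf-16-be")   # b'\x00h\x00i'

    print(data.decode("utf-16-be"))   # 'hi' - correct once the endianness is known
    print(data.decode("utf-16"))      # mojibake (e.g. '栀椀' on a little-endian machine):
                                      # with no BOM, Python falls back to a default byte order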

We could change the reporting of files with a BOM to state the endianness as well, but I've avoided having an opinion on that for the moment.

jpz commented Apr 12, 2017

Rebased the PR to origin/master.

@dan-blanchard
Member

Thanks so much for this contribution! I haven't gotten a chance to run through all of it yet, but from what I've seen so far, it looks good.

Note I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32.

As discussed in this comment from an old PR this is entirely the correct and required behavior. UTF-16LE and UTF-16BE are what one is required to report when there is no BOM, but they are not allowed if there is a BOM.

@dan-blanchard dan-blanchard left a comment

There are some minor changes that are definitely required before merging, but if you're up for it, I'd love to see this reworked to use CharDistributionAnalysis like most of our other multi-byte probers.

test.py (outdated; resolved)
chardet/utf1632prober.py (outdated; resolved)
# appear to be UTF32LE. Files looking like ( \0 [notzero] )+
# may be guessed to be UTF16LE.

class UTF1632Prober(CharSetProber):
Member

The way we would usually handle this sort of multi-byte encoding is with a CodingStateMachine and CharDistributionAnalysis-based implementation. You can see the prober we have for UTF-8 here, which only uses the CodingStateMachine, but the approach is close to what we would need to do for UTF-16 and UTF-32. In fact, through some crazy bit of coincidence, we already have unused state machines for UTF-16LE and UTF-16BE in our code that I'd never noticed until now. They appear to have been there since version 1.0, but have never been used.

Anyway, I know it would be a bit of work to switch this over to using CodingStateMachine like the UTF-8 prober, so I'm not going to require that to merge this, but if you don't do it, I'll probably end up switching this over to that eventually.
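
For reference, the rough shape of that approach, as a simplified sketch of the feed loop in the existing UTF-8 prober (not a drop-in copy of chardet/utf8prober.py; details there may differ):

    from .charsetprober import CharSetProber
    from .codingstatemachine import CodingStateMachine
    from .enums import MachineState, ProbingState
    from .mbcssm import UTF8_SM_MODEL


    class UTF8Prober(CharSetProber):
        def __init__(self):
            super(UTF8Prober, self).__init__()
            # The state machine encodes which byte sequences are legal UTF-8.
            self.coding_sm = CodingStateMachine(UTF8_SM_MODEL)
            self.reset()

        def feed(self, byte_str):
            for byte in byte_str:
                coding_state = self.coding_sm.next_state(byte)
                if coding_state == MachineState.ERROR:
                    # An illegal sequence rules this encoding out.
                    self._state = ProbingState.NOT_ME
                    break
                if coding_state == MachineState.ITS_ME:
                    self._state = ProbingState.FOUND_IT
                    break
            return self.state

A UTF-16/UTF-32 prober built the same way would feed the bytes through per-endianness state machines and let CharDistributionAnalysis supply the confidence.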

Contributor Author

Ah, this would make sense as it would be cleaner - you could verify the existence of the correct byte sequences - my implementation is more of a quick and dirty heuristic.

I feel grokking this state machine would take me days, so if you don't mind I'd prefer to defer that task.

    def get_confidence(self):
        confidence = 0.99

        if self.is_likely_utf16le() or self.is_likely_utf16be() or self.is_likely_utf32le() or self.is_likely_utf32be():
Member

I'd really like to see a more mathematical basis for the confidence here, rather than the essentially binary 0.99 or 0.01 we have now. Maybe something based on the difference between EXPECTED_RATIO and the values you have in nonzeros_at_mod and zeros_at_mod? 0.99 is a really high confidence, and I like to reserve it for definite yeses. In fact, calculating this confidence is a lot of what the CharDistributionAnalysis class does, which is part of why I think that's a more natural fit for this codec.
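
Something along these lines, maybe (a rough sketch only; EXPECTED_RATIO's value and the helper's name here are placeholders rather than the PR's actual code):

    EXPECTED_RATIO = 0.94  # placeholder: expected share of the dominant byte value per offset

    def scaled_confidence(zeros_at_mod, nonzeros_at_mod):
        """Scale confidence by how closely the zero/non-zero byte counts at
        each offset (mod 4) match what UTF-32 text would produce, instead of
        returning a flat 0.99 or 0.01."""
        confidence = 1.0
        for zeros, nonzeros in zip(zeros_at_mod, nonzeros_at_mod):
            total = zeros + nonzeros
            if total == 0:
                return 0.0
            # How strongly this offset is dominated by either zeros or non-zeros.
            ratio = max(zeros, nonzeros) / float(total)
            confidence *= min(1.0, ratio / EXPECTED_RATIO)
        return confidence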

Contributor Author

I'd say if we look at even 100 bytes (25 code points) and they are all "0 0 0 x 0 0 0 y 0 0 0 z" etc., our confidence that this is UTF32 is likely to be sky high. Likewise for UTF16.

That was my thinking - I realise it's not particularly rigorous to state this without statistical analysis of many texts however, and maybe there are counter examples I'm not aware of.

One way to improve rigour would be to verify that the exceptions to the zeros are actually extended characters (surrogate pairs) of UTF16. For instance, in UTF16 you'd be looking for quads of (D8-DB) xx (DC-DF) xx to be recognised; see https://en.m.wikipedia.org/wiki/UTF-16#Examples
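
A hypothetical helper along those lines, for the big-endian case (Python 3 bytes assumed; not the code in this PR):

    def surrogates_paired_utf16be(data):
        """Return False if data, read as UTF-16BE code units, contains an
        unpaired surrogate; True otherwise."""
        units = [(data[i] << 8) | data[i + 1] for i in range(0, len(data) - 1, 2)]
        i = 0
        while i < len(units):
            unit = units[i]
            if 0xD800 <= unit <= 0xDBFF:
                # High surrogate: the next unit must be a low surrogate.
                if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False
                i += 2
            elif 0xDC00 <= unit <= 0xDFFF:
                # A low surrogate with no preceding high surrogate is invalid.
                return False
            else:
                i += 1
        return True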

There are some other unused ranges in UTF16 and UTF32 also, so the detector could be made to flunk to zero if any of these are detected.

You're making me think again about correctness of my implementation :)

Contributor Author

I reduced the estimate to 0.85, somewhat arbitrarily, as I don't have a rationale for setting it at 0.99. Happy to discuss this point further.

I also added some additional checking for valid UTF16/UTF32 sequences, which improves robustness.

@dan-blanchard
Member

Also, thanks so much for the PR!

- Restored file suffix filter in test.py
- Added functionality to identify valid unicode, to enhance detection
- Generated some non-trivial unicode examples using supplementary plane 1

jpz commented May 31, 2017

Let me know if you have further things you'd like changed - when you're happy, I'll squash the commits so you can do a clean merge. (I've left the history there for the moment.)

@dan-blanchard dan-blanchard left a comment

Thanks for the PR! Sorry about the delay in reviewing it. I reviewed this months ago and never hit "Submit review". ☹️

@@ -0,0 +1,211 @@
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
Member

Nitpick: You can remove the bits from this about the original code being mozilla/netscape code, since this part is all new.

# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from chardet.enums import ProbingState
Member

Nitpick: For consistency, please just make this from .enums import ProbingState

test.py (outdated)
@@ -53,7 +53,7 @@ def gen_test_params():
             continue
         # Test encoding detection for each file we have of encoding for
         for file_name in listdir(path):
-            ext = splitext(file_name)[1].lower()
+            ext = splitext(file_name)[1].lower()
Member

Nitpick: You ended up with a bunch of trailing whitespace here.

        return self.state()

    def state(self):
        if self._state in [ProbingState.NOT_ME, ProbingState.FOUND_IT]:
Member

Might as well use a tuple instead of a list here since this is immutable.
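
That is, the membership check on the line above would read:

    if self._state in (ProbingState.NOT_ME, ProbingState.FOUND_IT):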

@Naitreey

Hi, guys... What's the status of this patch?

@dargueta

Anyone?

jayvdb commented Jul 29, 2019

ping @jpz , a few nits left.

@vikingspy

Did these changes get merged?

@dargueta

The PR is still open, so no.

Larrax commented May 12, 2020

Any update on this? I have recently run into several UTF-16 files without BOM and chardet fails horribly with "100% ascii" result. @dan-blanchard / @jpz

jpz commented May 12, 2020

@dan-blanchard I lost my momentum when I was working on this PR a few years ago. I didn't get any feedback on it for about 4 months, and I had moved on to other things by that stage. I'm sure you were busy with your life, so I'm just explaining how I dropped off the PR. If you support me making these changes, let me know and I'll be happy to take another look and fix it up.

Not sure how busy you are maintaining/extending this any more as the repo seems to have a low rate of change, but happy to help out if it's of value.

@jpsnyder

I would love to see this get included. This issue has been a real pain point in my project and has forced me to resort to some ugly hacks.

jpsnyder commented Oct 12, 2020

Bump, any updates on this? This would still be a very valuable change to chardet.

jayvdb commented Dec 11, 2020

There is now a conflict in test.py. @jpz, any chance you can rebase and fix the whitespace and other simple style changes requested?

@terrdavis

I just ran into this issue myself. It would be nice to see these improvements incorporated!

@dan-blanchard
Member

This was replaced by #206

@dan-blanchard dan-blanchard mentioned this pull request Jun 25, 2022