UTF detection when missing Byte Order Mark #109
Rebased the PR to origin/master.
Thanks so much for this contribution! I haven't gotten a chance to run through all of it yet, but from what I've seen so far, it looks good.
As discussed in this comment from an old PR, this is entirely the correct and required behavior: UTF-16LE and UTF-16BE are what one is required to report when there is no BOM, but they are not allowed if there is a BOM.
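A quick illustration of why that distinction matters, using CPython's standard codecs (a standalone sketch, not code from the PR):

```python
# UTF-16LE bytes for "hi", with an explicit little-endian BOM prepended.
data = b"\xff\xfe" + "hi".encode("utf-16-le")

# The "utf-16" codec consumes the BOM and picks the byte order from it.
assert data.decode("utf-16") == "hi"

# The "utf-16-le" codec treats the BOM bytes as an ordinary character
# (U+FEFF), so it leaks into the decoded text.
assert data.decode("utf-16-le") == "\ufeffhi"
```

So reporting "UTF-16LE" for BOM-prefixed data would leave a stray U+FEFF at the start of the decoded string, while reporting plain "UTF-16" for BOM-less data forces the decoder to guess the byte order.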
There are some minor changes that are definitely required before merging, but if you're up for it, I'd love to see this reworked to use CharDistributionAnalysis like most of our other multi-byte probers.
# appear to be UTF32LE. Files looking like ( \0 [notzero] )+
# may be guessed to be UTF16LE.

class UTF1632Prober(CharSetProber):
The way we would usually handle this sort of multi-byte encoding is with a CodingStateMachine and CharDistributionAnalysis-based implementation. You can see the prober we have for UTF-8 here, which only uses the CodingStateMachine, but the approach is close to what we would need to do for UTF-16 and UTF-32. In fact, through some crazy bit of coincidence, we already have unused state machines for UTF-16LE and UTF-16BE in our code that I had never noticed until now. They appear to have been there since version 1.0, but never used.
Anyway, I know it would be a bit of work to switch this over to using CodingStateMachine like the UTF-8 prober, so I'm not going to require that to merge this, but if you don't do it, I'll probably end up switching it over eventually.
Ah, this would make sense, as it would be cleaner - you could verify the existence of the correct byte sequences - whereas my implementation is more of a quick-and-dirty heuristic.
I feel grokking this state machine would take me days, so if you don't mind, I'd prefer to defer that task.
def get_confidence(self):
    confidence = 0.99

    if self.is_likely_utf16le() or self.is_likely_utf16be() or self.is_likely_utf32le() or self.is_likely_utf32be():
I'd really like to see a more mathematical basis for the confidence here, rather than the essentially binary 0.99 or 0.01 we have now. Maybe something based on the difference between EXPECTED_RATIO and the values you have in nonzeros_at_mod and zeros_at_mod? 0.99 is a really high confidence, and I like to reserve it for definite yeses. In fact, calculating this confidence is a lot of what the CharDistributionAnalysis class does, which is part of why I think that's a more natural fit for this codec.
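One rough sketch of what a ratio-based confidence could look like. This is a hypothetical illustration, not the PR's code: the EXPECTED_RATIO value, the function name, and the argument layout (a per-offset list of zero-byte counts, a tuple of offsets expected to be zero, and a word count) are all assumptions loosely mirroring the names mentioned above.

```python
EXPECTED_RATIO = 0.94  # assumed target fraction of zero bytes at "zero" offsets

def ratio_confidence(zeros_at_mod, zero_positions, total_words):
    """Scale confidence by how close the observed zero-byte fraction at the
    expected-zero offsets comes to EXPECTED_RATIO, instead of returning a
    binary 0.99 / 0.01.

    zeros_at_mod: count of zero bytes seen at each offset (mod word size)
    zero_positions: offsets that should be zero for the candidate encoding
    total_words: number of complete code-unit words examined
    """
    if total_words == 0:
        return 0.0
    # The weakest expected-zero position drives the overall confidence.
    worst = min(zeros_at_mod[p] / total_words for p in zero_positions)
    if worst >= EXPECTED_RATIO:
        # Ramp linearly from 0.5 at EXPECTED_RATIO up to 0.99 at a perfect fit.
        return 0.5 + 0.49 * (worst - EXPECTED_RATIO) / (1 - EXPECTED_RATIO)
    # Below the expected ratio, fall off toward the 0.01 floor.
    return max(0.01, 0.5 * worst / EXPECTED_RATIO)
```

For instance, a buffer whose expected-zero offsets are zero in every word would score 0.99, while one where only a third of the words fit the pattern would score well under 0.5.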
I'd say if we look at even 100 bytes (25 code points) and they are all "0 0 0 x 0 0 0 y 0 0 0 z" and so on, our confidence that this is UTF-32 is likely to be sky-high. Likewise for UTF-16.
That was my thinking - I realise it's not particularly rigorous to state this without statistical analysis of many texts, though, and maybe there are counterexamples I'm not aware of.
One way to improve rigour would be to verify that the exceptions to the zeros are actually extended characters (surrogate pairs) of UTF-16. For instance, in UTF-16 you'd be looking for quads of (D8-DB) xx (DC-DF) xx to be recognised - see https://en.m.wikipedia.org/wiki/UTF-16#Examples
There are also some other unused ranges in UTF-16 and UTF-32, so the detector could be made to flunk to zero if any of these are detected.
You're making me think again about correctness of my implementation :)
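The surrogate-pair check described above could be sketched roughly as follows. This is a standalone illustration, not the PR's implementation; the function name is hypothetical.

```python
def valid_utf16le_surrogates(data: bytes) -> bool:
    """Check that in UTF-16LE data every high surrogate (D800-DBFF) is
    immediately followed by a low surrogate (DC00-DFFF), and that no low
    surrogate appears on its own."""
    # Interpret the buffer as little-endian 16-bit code units.
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data) - 1, 2)]
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be paired with a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Lone low surrogate: invalid UTF-16.
            return False
        else:
            i += 1
    return True
```

For example, the musical symbol U+1D11E encodes as the surrogate pair D834 DD1E and passes, while a high surrogate followed by an ordinary BMP character fails.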
I reduced the estimate to 0.85, arbitrarily, as I don't have a rationale to set it at 0.99. Happy to discuss this point further.
I also did some additional checking for valid UTF-16/UTF-32 surrogate sequences, which improves robustness.
Also, thanks so much for the PR!
- Restored file suffix filter in test.py
- Added functionality to identify valid unicode, to enhance detection
- Generated some non-trivial unicode examples using supplementary plane 1
Let me know if you have further things you'd like changed - when you're happy, I'll squash the commits so you can do a clean merge. (I've left the history there for the moment.)
Thanks for the PR! Sorry about the delay in reviewing it. I reviewed this months ago and never hit "Submit review".
@@ -0,0 +1,211 @@
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
Nitpick: You can remove the bits from this about the original code being mozilla/netscape code, since this part is all new.
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from chardet.enums import ProbingState
Nitpick: For consistency, please just make this from .enums import ProbingState
test.py
@@ -53,7 +53,7 @@ def gen_test_params():
            continue
        # Test encoding detection for each file we have of encoding for
        for file_name in listdir(path):
            ext = splitext(file_name)[1].lower()
            ext = splitext(file_name)[1].lower()
Nitpick: You ended up with a bunch of trailing whitespace here.
        return self.state()

    def state(self):
        if self._state in [ProbingState.NOT_ME, ProbingState.FOUND_IT]:
Might as well use a tuple instead of a list here, since this is immutable.
Hi, guys... What's the status of this patch?
Anyone?
ping @jpz , a few nits left.
Did these changes get merged?
The PR is still open, so no.
Any update on this? I have recently run into several UTF-16 files without a BOM, and chardet fails horribly with a "100% ascii" result. @dan-blanchard / @jpz
@dan-blanchard I lost my momentum when I was working on this PR a few years ago. I didn't get any feedback on my PR for about 4 months, and I had moved on to other things at that stage. I'm sure you were busy with your life, so I'm just explaining how I dropped off the PR. If you support me making these changes, let me know, and I'll be happy to take another look and fix this up. I'm not sure how busy you are maintaining/extending this any more, as the repo seems to have a low rate of change, but I'm happy to help out if it's of value.
I would love to see this get included. This issue has been a real pain point in my project and has forced me to resort to some ugly hacks.
Bump, any updates on this? This would still be a very valuable change to chardet.
There is now a conflict in |
I just ran into this issue myself. It would be nice to see these improvements incorporated!
This was replaced by #206.
This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.
At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR, by the way - happy to have other eyes on it.
Note that I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. This is justified because endianness is quite material: Python will not decode these files correctly when passed decode("utf-16"), since it has to assume an endianness. This matters less for files with BOMs, as UTF-aware readers will generally inspect the BOM and decode accordingly, but for files missing the BOM, it's really important to be told the endianness to have any hope of reading the file correctly.
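To illustrate the endianness point concretely (a standalone sketch using the standard codecs, not code from the PR):

```python
data = "hi".encode("utf-16-le")   # b'h\x00i\x00' - no BOM is written

# Decoding with the correct endianness round-trips.
assert data.decode("utf-16-le") == "hi"

# Decoding with the wrong endianness silently produces garbage:
# 0x68 0x00 read big-endian is U+6800, and 0x69 0x00 is U+6900.
assert data.decode("utf-16-be") == "\u6800\u6900"
```

With no BOM present, nothing in the bytes themselves flags the mistake, which is why the detector reporting the specific byte order is so useful.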
We could change the reporting of files with a BOM to include the endianness also, but I've avoided having an opinion on that for the moment.