UTF-16 without BOM detection #105

jpz · 2017-04-10T17:15:44Z

I have a sample file, which I downloaded, that is UTF-16LE and/or UCS-2LE: every second byte is 0. My guess is that it was produced by a Java program prior to Java 1.5, and that the program did not write a BOM at the head of the file. Another candidate is Windows prior to Windows 2000, which evidently did not write BOMs either.

This file is detected as "ascii". I think guessing that it is UTF-16LE should be fairly trivial.

If we reach the end of processing without having encountered a BOM, and we still think the input is ASCII, we can check the bytes at odd offsets: if they are '\0' with high probability (e.g. greater than 95%), the input is likely UTF-16LE. A detector for UTF-16BE is trivially the inverse, checking the bytes at even offsets instead.
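A minimal sketch of that heuristic, for illustration only. The function name `guess_utf16_without_bom` and its `threshold` parameter are hypothetical and are not part of the library. The extra requirement that the other byte of each pair is mostly non-zero is my own assumption, added so that a file consisting entirely of NUL bytes is not labelled UTF-16.

```python
def guess_utf16_without_bom(data: bytes, threshold: float = 0.95):
    """Guess 'utf-16le' or 'utf-16be' for BOM-less input, else return None.

    Hypothetical helper: it only recognises text that is mostly in the
    ASCII range, where one byte of every 2-byte code unit is zero.
    """
    n_pairs = len(data) // 2
    if n_pairs == 0:
        return None

    usable = n_pairs * 2  # ignore a trailing odd byte, if any
    # Fraction of NUL bytes at even offsets (0, 2, 4, ...) and odd offsets (1, 3, 5, ...).
    even_nul = data[0:usable:2].count(0) / n_pairs
    odd_nul = data[1:usable:2].count(0) / n_pairs

    # UTF-16LE stores the low byte first, so the high (zero) byte sits at odd offsets.
    # Assumption: also require the other byte to be mostly non-zero.
    if odd_nul > threshold and even_nul < 1 - threshold:
        return "utf-16le"
    # UTF-16BE is the inverse: the zero byte sits at even offsets.
    if even_nul > threshold and odd_nul < 1 - threshold:
        return "utf-16be"
    return None
```

Such a check would only make sense as a fallback, after BOM detection has failed and the input would otherwise be reported as ASCII.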

Now one thing to note is that one could get a false-positive detection of UTF-16 if it were UTF-32, so I think you'd want a combined UTF-16 & UTF-32 detector, and the UTF-32 detector would be built around the same principles.
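To illustrate why the two checks belong together, here is a rough sketch of a combined detector. All names (`nul_fraction_by_offset`, `guess_utf16_or_utf32`, `threshold`) are hypothetical, and it assumes mostly ASCII-range text. It tests the UTF-32 pattern (three NUL bytes in every four) first, because UTF-32LE input also has NUL bytes at every odd offset and would otherwise pass a naive UTF-16LE test.

```python
def nul_fraction_by_offset(data: bytes, width: int) -> list:
    """Return the fraction of NUL bytes at each offset within width-byte units."""
    usable = len(data) - len(data) % width  # drop an incomplete trailing unit
    if usable == 0:
        return []
    units = usable // width
    return [data[i:usable:width].count(0) / units for i in range(width)]


def guess_utf16_or_utf32(data: bytes, threshold: float = 0.95):
    """Guess a BOM-less UTF-16/UTF-32 encoding, or return None (hypothetical helper)."""
    # UTF-32 first: for ASCII-range text, three of every four bytes are NUL.
    f4 = nul_fraction_by_offset(data, 4)
    if f4:
        if f4[0] < 1 - threshold and all(f > threshold for f in f4[1:]):
            return "utf-32le"
        if f4[3] < 1 - threshold and all(f > threshold for f in f4[:3]):
            return "utf-32be"

    # Then UTF-16: one of every two bytes is NUL, the other mostly is not.
    f2 = nul_fraction_by_offset(data, 2)
    if f2:
        if f2[1] > threshold and f2[0] < 1 - threshold:
            return "utf-16le"
        if f2[0] > threshold and f2[1] < 1 - threshold:
            return "utf-16be"
    return None
```

For UTF-32LE input the per-pair NUL fractions come out as roughly 50% at even offsets and 100% at odd offsets, which is the false-positive case described above if only the odd-offset test is applied.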

Lastly, this test case shows the ASCII prediction is wrong: 50% of the file is the null character, which is clearly outside what one would regard as ASCII text.

If there is a suitable place for my test-case input file, I can make a PR to add it to the repo as a test case: an interesting file that we currently fail to detect.

jpz · 2017-04-11T11:07:39Z

I've opened pull request #109, which deals with this.

RaiKoHoff mentioned this issue Jul 21, 2019

NUL character rizonesoft/Notepad3#1446

Closed
