UTF-16 without BOM detection #105

Open
jpz opened this issue Apr 10, 2017 · 1 comment

jpz commented Apr 10, 2017

I have a sample file, which I downloaded, that is UTF-16LE and/or UCS-2LE: every second byte is 0. My guess is that it was produced by a Java program prior to Java 1.5, which did not write a BOM at the head of the file. Another candidate is Windows prior to Windows 2000, which evidently did not write BOMs either.

This file is detected as "ascii". I think guessing that it is UTF-16LE should be fairly trivial.

If we reach the end of processing without having encountered a BOM and still think the input is ASCII, we can look at where the null bytes fall. If the bytes at odd offsets (position modulo 2 equal to 1) are '\0' with high probability (e.g. greater than 95%), the input is likely UTF-16LE. A detector for UTF-16BE is trivially the inverse: the null bytes fall at even offsets.
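A minimal sketch of that heuristic, to make the idea concrete. The function name `guess_utf16_without_bom`, the `threshold` parameter, and the extra requirement that the opposite byte column be mostly non-null are assumptions for illustration; this is not the code from the library or from any pull request.

```python
def guess_utf16_without_bom(data: bytes, threshold: float = 0.95):
    """Guess 'utf-16le' or 'utf-16be' for BOM-less input, else None.

    Hypothetical helper: it only recognises text that is mostly ASCII-range
    characters, because it relies on the high byte of each code unit being 0.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    data = data[: pairs * 2]  # ignore a trailing odd byte, if any

    even_nulls = data[0::2].count(0) / pairs  # offsets 0, 2, 4, ...
    odd_nulls = data[1::2].count(0) / pairs   # offsets 1, 3, 5, ...

    # Little-endian: low byte first, so the null high byte sits at odd offsets.
    if odd_nulls >= threshold and even_nulls < 1 - threshold:
        return "utf-16le"
    # Big-endian is the mirror image: nulls at even offsets.
    if even_nulls >= threshold and odd_nulls < 1 - threshold:
        return "utf-16be"
    return None
```

For example, `guess_utf16_without_bom("Hello, world".encode("utf-16-le"))` yields `"utf-16le"`, while plain `b"Hello, world"` yields `None`.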

One thing to note is that UTF-32 input could produce a false-positive UTF-16 detection, so I think you'd want a combined UTF-16 & UTF-32 detector, with the UTF-32 detector built on the same principles.
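A rough sketch of such a combined detector, checking the UTF-32 pattern (three null bytes per four-byte unit) before falling back to the UTF-16 pattern. The names `null_fraction` and `guess_utf16_or_utf32` and the `threshold` parameter are hypothetical, and the sketch assumes mostly ASCII-range text.

```python
def null_fraction(data: bytes, offset: int, stride: int) -> float:
    """Share of null bytes among data[offset], data[offset + stride], ..."""
    column = data[offset::stride]
    return column.count(0) / len(column) if column else 0.0


def guess_utf16_or_utf32(data: bytes, threshold: float = 0.95):
    """Guess a BOM-less UTF-32 or UTF-16 encoding, else None (hypothetical)."""
    # UTF-32 first: UTF-32LE ASCII text has nulls at odd offsets too, so a
    # check that only looks at odd offsets would mistake it for UTF-16LE.
    if len(data) >= 4:
        quads = data[: len(data) - len(data) % 4]
        f = [null_fraction(quads, i, 4) for i in range(4)]
        if f[0] < 1 - threshold and min(f[1:]) >= threshold:
            return "utf-32le"  # pattern: x 0 0 0
        if f[3] < 1 - threshold and min(f[:3]) >= threshold:
            return "utf-32be"  # pattern: 0 0 0 x

    if len(data) >= 2:
        pairs = data[: len(data) - len(data) % 2]
        even = null_fraction(pairs, 0, 2)
        odd = null_fraction(pairs, 1, 2)
        if odd >= threshold and even < 1 - threshold:
            return "utf-16le"  # pattern: x 0
        if even >= threshold and odd < 1 - threshold:
            return "utf-16be"  # pattern: 0 x
    return None
```

With this ordering, `"Hello".encode("utf-32-le")` is reported as `"utf-32le"` rather than as UTF-16LE, and `"Hello".encode("utf-16-le")` still comes back as `"utf-16le"`.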

Lastly, this test case shows that the ASCII prediction is wrong: byte values that one would clearly not expect in ASCII text are present (e.g. 50% of the file is the null character).
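A small illustration of the point, using a made-up sample string rather than the actual file. One plausible reason such input passes as ASCII (an assumption, not something confirmed in this issue) is that the null byte is itself a valid 7-bit value, so no byte ever falls outside the ASCII range.

```python
# Hypothetical stand-in for the real file: ASCII text encoded as UTF-16LE, no BOM.
sample = "Hello, world\n".encode("utf-16-le")

# Half of the bytes are the null character.
null_share = sample.count(0) / len(sample)

# Every byte is below 0x80, so a strict ASCII decode raises no error,
# even though the result is clearly not ordinary ASCII text.
all_seven_bit = all(b < 0x80 for b in sample)
as_ascii = sample.decode("ascii")
```

Here `null_share` is `0.5` and `all_seven_bit` is `True`, which is why a share of null bytes this high is a better signal than the byte range alone.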

If there's somewhere useful for me to add my test-case input file, I can make a PR to add it to the repo as a test case: an interesting file that we currently fail to detect.

jpz commented Apr 11, 2017

I've opened pull request #109, which deals with this.
