I have a sample file which I downloaded that is UTF-16LE (or equivalently UCS-2LE): every second byte is 0. My guess is that it was produced by a Java program prior to Java 1.5, which did not write a BOM at the head of the file. Another candidate is Windows prior to Windows 2000, which evidently did not write BOMs either.
This file gets detected as "ascii". I think guessing that such a file is actually UTF-16LE should be fairly trivial.
If we reach the end of processing without having encountered a BOM, and we still think the input is ASCII, and the byte at every odd offset (position modulo 2 == 1) is '\0' with high probability (e.g. greater than 95%), then the input is likely UTF-16LE. A UTF-16BE detector is trivially the inverse: NUL bytes at the even offsets.
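A minimal sketch of that heuristic, assuming a byte buffer as input; the function name, signature, and the 95% threshold are illustrative, not part of any existing detector API:

```python
from typing import Optional


def guess_utf16(data: bytes, threshold: float = 0.95) -> Optional[str]:
    """Guess BOM-less UTF-16 by counting NUL bytes at alternating offsets.

    `threshold` is the hypothetical cutoff from the proposal above (95%),
    not a parameter of any existing library.
    """
    if len(data) < 2:
        return None
    odd = data[1::2]   # high-order bytes if the text is little-endian
    even = data[0::2]  # high-order bytes if the text is big-endian
    if odd.count(0) / len(odd) >= threshold:
        return "UTF-16LE"
    if even.count(0) / len(even) >= threshold:
        return "UTF-16BE"
    return None
```

For mostly-ASCII text encoded as UTF-16LE, the high-order byte of every code unit is 0, so the odd-offset NUL ratio approaches 100% while genuine ASCII input contains no NULs at all.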
One thing to note: this could produce a false-positive UTF-16 detection if the file were actually UTF-32, so you'd probably want a combined UTF-16 & UTF-32 detector, with the UTF-32 side built around the same principle.
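One way to sketch the combined detector (again with illustrative names and threshold): check the 4-byte UTF-32 pattern first, since ASCII-range UTF-32LE also has NULs at every odd offset and would otherwise be misreported as UTF-16LE.

```python
from typing import Optional


def guess_utf16_or_utf32(data: bytes, threshold: float = 0.95) -> Optional[str]:
    """Combined BOM-less UTF-16/UTF-32 guesser based on NUL-byte positions."""

    def nul_ratio(b: bytes) -> float:
        return b.count(0) / len(b) if b else 0.0

    if len(data) >= 4:
        # ASCII-range UTF-32LE looks like "X 0 0 0": offsets 1, 2 and 3 of
        # every 4-byte group are NUL. UTF-16 text fails this check because
        # offset 2 of each group carries a real character byte.
        if min(nul_ratio(data[1::4]), nul_ratio(data[2::4]),
               nul_ratio(data[3::4])) >= threshold:
            return "UTF-32LE"
        if min(nul_ratio(data[0::4]), nul_ratio(data[1::4]),
               nul_ratio(data[2::4])) >= threshold:
            return "UTF-32BE"
    if len(data) >= 2:
        # Fall back to the two-byte pattern only after UTF-32 is ruled out.
        if nul_ratio(data[1::2]) >= threshold:
            return "UTF-16LE"
        if nul_ratio(data[0::2]) >= threshold:
            return "UTF-16BE"
    return None
```

Ordering the checks this way means UTF-32 input is never shadowed by the looser UTF-16 test.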
Lastly, this test case shows the ASCII prediction is simply wrong: the byte distribution is clearly outside anything one would regard as ASCII (e.g. 50% of the file is the NUL byte).
If there's somewhere useful for me to add my test-case input file, I can open a PR adding it to the repo as an example of an interesting file that we currently fail to detect.