GB18030 BOM confuses detection #178

jayvdb · 2019-07-30T07:26:55Z

While it isn't common for text to start with a GB18030 BOM (\uFEFF), when it does, chardet either fails to detect the encoding or detects the wrong one.
https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

import chardet

text = '我没有埋怨，磋砣的只是一些时间。'

# Prepend U+FEFF so the GB18030-encoded bytes start with a BOM.
print(chardet.detect(('\uFEFF' + text).encode('GB18030')))

The result is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}, whereas without the BOM the result is {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}.
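As a possible caller-side workaround, the BOM can be checked for and stripped before the bytes are handed to the detector. This is only a sketch: `GB18030_BOM` and `split_gb18030_bom` are hypothetical names used for illustration and are not part of chardet.

```python
# GB18030 encodes U+FEFF as the four bytes 84 31 95 33.
GB18030_BOM = '\ufeff'.encode('gb18030')


def split_gb18030_bom(data: bytes):
    """Return (has_bom, payload), dropping a leading GB18030 BOM if present.

    Hypothetical helper for illustration; not a chardet API.
    """
    if data.startswith(GB18030_BOM):
        return True, data[len(GB18030_BOM):]
    return False, data


text = '我没有埋怨，磋砣的只是一些时间。'
raw = ('\ufeff' + text).encode('gb18030')

has_bom, payload = split_gb18030_bom(raw)
# When has_bom is True the encoding is already known to be GB18030;
# otherwise payload can be passed to chardet.detect() as usual.
```

If `has_bom` is true, the caller could skip detection entirely and decode `payload` as GB18030.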

See also http://www.0x08.org/posts/UTF8-BOM

jayvdb · 2019-07-30T11:57:41Z

It seems GB18030 detection is not reliable even without the BOM. 你好 encoded as GB18030 is detected as TIS-620, which of course will not decode it correctly: TIS-620 decodes it as ฤใบร.
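To illustrate why such a short input is ambiguous, the same four bytes are valid in both encodings. This sketch uses only Python's built-in codecs and does not call chardet:

```python
# 你好 is two 2-byte sequences in GB18030.
raw = '你好'.encode('gb18030')

# The same bytes are also valid TIS-620, each byte mapping to one Thai character.
as_thai = raw.decode('tis_620')

# Decoding with the intended codec round-trips the original text.
as_chinese = raw.decode('gb18030')

print(raw.hex(' '), as_thai, as_chinese)
```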

jayvdb mentioned this issue Jul 30, 2019

Add flag to strip bom from input timrburnham/bom_open#1

Closed

This was referenced Jul 30, 2019

GB18030 thombashi/mbstrdecoder#3

Open

GB18030 without BOM press-index/pychardet#4

Open
