GB18030 encoded file incorrectly classified as GB2312 #168

wesinator · 2018-11-13T19:30:52Z

aadsm/jschardet#49

chardetect 3.0.4

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

chardetect userdb_panda.yar

Actual: userdb_panda.yar: GB2312 with confidence 0.99

Expected: explicitly detected as GB18030

iconv gives an error converting from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230

iconv -f GB18030 -t UTF-8 userdb_panda.yar works
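
The same check can be reproduced without iconv. This is a minimal Python sketch, assuming Python's built-in `gb2312`, `gbk`, and `gb18030` codecs are as strict as iconv's. `first_working_codec` is a hypothetical helper name, and the sample bytes are illustrative, not taken from userdb_panda.yar:

```python
def first_working_codec(data, candidates=("gb2312", "gbk", "gb18030")):
    """Return the first codec in `candidates` that decodes `data` strictly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# U+20000 takes four bytes in GB18030 and has no GB2312 or GBK encoding.
sample = "\U00020000".encode("gb18030")
print(first_working_codec(sample))
```

Ordering the candidates from the smallest repertoire to the largest means the helper reports the narrowest codec that still decodes the whole input.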

#94


x1angli · 2018-12-21T13:54:27Z

I can reproduce this issue. It seems we have had it since the beginning of chardet.

@wesinator
A quick but dirty workaround: whenever chardet returns "GB2312", always open the file with the "gb18030" codec.
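
A minimal sketch of that workaround, assuming chardet is installed. `SUPERSET`, `promote_encoding`, and `read_text` are hypothetical names, and the UTF-8 fallback with `errors="replace"` is an assumption for the case where detection returns nothing:

```python
# Hypothetical mapping: promote a detected encoding to a superset codec.
SUPERSET = {"gb2312": "gb18030", "gbk": "gb18030"}


def promote_encoding(detected):
    """Map a detected encoding name to a superset codec that is safer for decoding."""
    if detected is None:
        return None
    return SUPERSET.get(detected.lower(), detected)


def read_text(path):
    """Detect a file's encoding with chardet, then decode with the promoted codec."""
    import chardet  # imported lazily so promote_encoding works without chardet

    with open(path, "rb") as f:
        raw = f.read()
    detected = chardet.detect(raw)["encoding"]
    # Assumption: fall back to UTF-8 when chardet cannot name an encoding.
    return raw.decode(promote_encoding(detected) or "utf-8", errors="replace")
```

Calling `read_text("userdb_panda.yar")` would then decode with gb18030 even though chardet reports GB2312.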

x1angli · 2018-12-21T13:56:24Z

Duplicate of #94 #33

x1angli · 2018-12-22T10:10:16Z

@wesinator It seems that the original author of chardet is unwilling to make the gb18030 change or to distinguish between gb18030 and gb2312, so I forked a separate repo: https://github.com/x1angli/chardet
Hope it helps.

wesinator · 2018-12-24T15:57:24Z

@x1angli Could you submit a PR ?

x1angli · 2018-12-25T09:23:30Z

x1angli Could you submit a PR ?

Not yet.
Treating all gb2312-encoded files as gb18030 is not an elegant solution, even though it works.
I would recommend using convert2utf to convert your .yar files to utf-8 or utf-8-bom.
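
For a one-off conversion, the same idea can be sketched in plain Python. This is not how convert2utf itself works. `convert_to_utf8` and its parameters are hypothetical, and the source encoding is assumed to be gb18030:

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gb18030", bom=False):
    """Re-encode the file at `src` as UTF-8 (optionally with a BOM) and write it to `dst`."""
    # Work on raw bytes so that line endings pass through unchanged.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8-sig" if bom else "utf-8"))
```

For example, `convert_to_utf8("userdb_panda.yar", "userdb_panda.utf8.yar")` writes a UTF-8 copy and leaves the original file untouched.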

dan-blanchard · 2022-06-29T03:45:17Z

I'm adding a should_rename_legacy flag in #264 that does not go as far as treating gb2312 as gb18030, but it does return gbk instead of gb2312 if you enable the flag.
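
A hedged sketch of how a caller might opt in, assuming the flag is accepted as a keyword argument by `chardet.detect` in releases that include #264. `detect_renaming_legacy` and its `detect_fn` parameter are hypothetical, and the fallback for older releases is an assumption:

```python
def detect_renaming_legacy(data, detect_fn=None):
    """Run a chardet-style detector, asking it to rename legacy encodings when supported."""
    if detect_fn is None:
        import chardet  # lazy import; only needed when no detector is injected

        detect_fn = chardet.detect
    try:
        return detect_fn(data, should_rename_legacy=True)
    except TypeError:
        # Assumed fallback: older detectors do not accept the keyword.
        return detect_fn(data)
```

Catching `TypeError` this broadly can also hide a genuine error raised inside the detector, so a stricter version could check the installed chardet version instead.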

If someone would like to provide us with a proper gb18030 prober, we're always open to contributions.

zyjdmmm · 2023-04-19T09:34:33Z

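I also found that chardet cannot recognize the Chinese GBK encoding, although the older gb2312 is detected accurately. That said, I think the library used by VS Code detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate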

orinbai · 2023-05-23T08:59:10Z

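I also found that chardet cannot recognize the Chinese GBK encoding, although the older gb2312 is detected accurately. That said, I think the library used by VS Code detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate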

gb2312 was published in 1981, gbk in the 1990s, and gb18030 in the 2000s; gb18030 is backward compatible.
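
The backward compatibility can be checked with Python's built-in codecs. This is a small sketch, and the sample string is an arbitrary choice of characters from the GB2312 repertoire:

```python
text = "中文编码"  # every character here is in the GB2312 repertoire

as_gb2312 = text.encode("gb2312")
# The same text yields identical bytes under the two later standards...
same_bytes = as_gb2312 == text.encode("gbk") == text.encode("gb18030")
# ...so GB2312 bytes decode unchanged with the newer codec.
round_trips = as_gb2312.decode("gb18030") == text
print(same_bytes, round_trips)
```

This is also why a detector cannot tell the three encodings apart from the bytes alone when a file only uses GB2312-range characters.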

zyjdmmm · 2023-05-25T08:14:09Z

I also found that chardet cannot recognize the Chinese GBK encoding, although the older gb2312 is detected accurately. That said, I think the library used by VS Code detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate

gb2312 was published in 1981, gbk in the 1990s, and gb18030 in the 2000s; gb18030 is backward compatible.

In fact, since I started working, I have found that Chinese government bodies and companies generally require gb2312 encoding. Although gbk and gb18030 are newer, most people do not use them.

bemoody mentioned this issue Aug 22, 2023

Handling of Chinese text encodings MIT-LCP/physionet-build#2063

Open
