GB18030 encoded file incorrectly classified as GB2312 #168

Open
wesinator opened this issue Nov 13, 2018 · 9 comments


@wesinator

aadsm/jschardet#49

chardetect 3.0.4

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

  • chardetect userdb_panda.yar

Actual: userdb_panda.yar: GB2312 with confidence 0.99

Expected: explicitly detected as GB18030

iconv gives an error converting from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230

iconv -f GB18030 -t UTF-8 userdb_panda.yar works
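
The iconv check above can also be reproduced with Python's standard-library codecs. This is only a minimal sketch: the helper name `first_decode_error` and the sample bytes are made up for illustration and are not part of chardet or iconv.

```python
def first_decode_error(raw, codec):
    """Return the byte offset of the first sequence that `codec` rejects,
    or None if `raw` decodes cleanly. (Hypothetical helper for illustration.)"""
    try:
        raw.decode(codec)  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# ASCII text followed by U+9AD4, a character that GB18030 can encode
# but that lies outside the GB2312 repertoire.
sample = ("abc" + "\u9ad4").encode("gb18030")

print(first_decode_error(sample, "gb2312"))   # offset of the rejected bytes
print(first_decode_error(sample, "gb18030"))  # None
```

For a real file, the bytes would come from something like `open(path, "rb").read()` rather than the in-memory sample used here.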

#94

@x1angli

x1angli commented Dec 21, 2018

I can reproduce this issue. It seems we have had it since the beginning of chardet.

@wesinator
A quick-and-dirty workaround: whenever chardet returns "GB2312", always open the file with the "gb18030" codec instead.
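
A rough sketch of that workaround in Python follows. The names `SUPERSET_CODEC`, `widen_codec` and `decode_with_widening` are hypothetical, and the detected name is passed in as a plain string so the example runs without chardet installed; in practice it would presumably come from `chardet.detect(raw)["encoding"]`.

```python
# Detected name -> wider codec to decode with (assumption: only the
# GB2312 -> GB18030 substitution discussed in this thread).
SUPERSET_CODEC = {"gb2312": "gb18030"}


def widen_codec(detected):
    """Swap a detected encoding name for a known superset, if there is one."""
    if detected is None:
        return None
    return SUPERSET_CODEC.get(detected.lower(), detected)


def decode_with_widening(raw, detected):
    """Decode `raw` with the widened codec instead of the detected one."""
    return raw.decode(widen_codec(detected))


# Bytes that a detector might label "GB2312" even though they contain
# U+9AD4, a character outside the GB2312 repertoire.
raw = "\u9ad4".encode("gb18030")
print(decode_with_widening(raw, "GB2312") == "\u9ad4")  # True
```

The lookup is case-insensitive because detectors may report the name in upper or lower case; any name without an entry in the table is passed through unchanged.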

@x1angli

x1angli commented Dec 21, 2018

Duplicate of #94 #33

@x1angli

x1angli commented Dec 22, 2018

@wesinator It seems that the original author of chardet is unwilling to make the gb18030 change or to distinguish between gb18030 and gb2312, so I forked a separate repo: https://github.com/x1angli/chardet
Hope it helps.

@wesinator
Author

@x1angli Could you submit a PR?

@x1angli

x1angli commented Dec 25, 2018

> @x1angli Could you submit a PR?

Not yet.
Treating all gb2312-encoded files as gb18030 is not an elegant approach, even though it works.
I would recommend using convert2utf to convert your .yar files to utf-8 or utf-8-bom.

@dan-blanchard
Member

I'm adding a should_rename_legacy flag in #264 that does not go as far as treating gb2312 as gb18030, but it does return gbk instead of gb2312 if you enable the flag.

If someone would like to provide us with a proper gb18030 prober, we're always open to contributions.

@zyjdmmm

zyjdmmm commented Apr 19, 2023

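I have also found that chardet cannot recognize China's GBK encoding, although the older gb2312 is detected accurately. However, I think the library VS Code uses detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate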

@orinbai

orinbai commented May 23, 2023

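> I have also found that chardet cannot recognize China's GBK encoding, although the older gb2312 is detected accurately. However, I think the library VS Code uses detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate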

GB2312 was published in 1981, GBK in the 1990s, and GB18030 in the 2000s; GB18030 is backward compatible with both.
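
The backward-compatibility point can be checked with Python's built-in codecs. This is a small sketch only; the helper `decodes_as` and the sample strings are chosen for illustration.

```python
def decodes_as(raw, codec, expected):
    """True if `raw` decodes under `codec` to exactly `expected`."""
    try:
        return raw.decode(codec) == expected
    except UnicodeDecodeError:
        return False


# Four common simplified characters, all within the GB2312 repertoire.
narrow_text = "\u4e2d\u6587\u7f16\u7801"
narrow_raw = narrow_text.encode("gb2312")

# GB2312 bytes decode identically under the two newer codecs.
for codec in ("gb2312", "gbk", "gb18030"):
    print(codec, decodes_as(narrow_raw, codec, narrow_text))

# The reverse does not hold: U+9AD4 is encodable in GB18030 but its
# bytes are rejected by the strict gb2312 decoder.
wide_text = "\u9ad4"
wide_raw = wide_text.encode("gb18030")
print("gb2312", decodes_as(wide_raw, "gb2312", wide_text))
```

This asymmetry is why decoding with the wider codec is safe for GB2312 data, while a "GB2312" label on wider data leads to decode errors.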

@zyjdmmm

zyjdmmm commented May 25, 2023

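> I have also found that chardet cannot recognize China's GBK encoding, although the older gb2312 is detected accurately. However, I think the library VS Code uses detects GBK and similar encodings very accurately: https://github.com/microsoft/vscode-textmate

> GB2312 was published in 1981, GBK in the 1990s, and GB18030 in the 2000s; GB18030 is backward compatible with both.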

In fact, since I started working, I have found that Chinese government agencies and companies generally require GB2312 encoding. Although GBK and GB18030 are newer, most people do not use them.
