GB18030 encoded file incorrectly classified as GB2312 #168
Comments
I can reproduce this issue. It seems we have had it since the beginning of chardet. @wesinator
@wesinator It seems that the original author of chardet is unwilling to make the gb18030 change or to distinguish between gb18030 and gb2312, so I forked a separate repo: https://github.com/x1angli/chardet
@x1angli Could you submit a PR?
Not yet.
I'm adding a If someone would like to provide us with a proper gb18030 prober, we're always open to contributions.
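I have also found that chardet cannot recognize China's GBK encoding, although the older gb2312 is detected accurately. However, I think the library used by vscode detects gbk and similar encodings very accurately: https://github.com/microsoft/vscode-textmate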
gb2312 was published in 1981, gbk in 199x, and gb18030 in 200x. gb18030 is backward compatible with the two older encodings.
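To make the compatibility claim concrete, here is a minimal sketch using only Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. The helper name `encodable` and the sample characters are my own illustrative choices, not anything taken from chardet.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character in `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Characters covered by gb2312 produce identical bytes in all three codecs,
# which is why a byte-level detector cannot tell them apart on such text.
simple = "中文"
same_bytes = (
    simple.encode("gb2312") == simple.encode("gbk") == simple.encode("gb18030")
)

# A traditional character that gbk adds on top of gb2312.
gbk_only = "們"
# A supplementary-plane character that only gb18030 can represent.
gb18030_only = "\U00020000"

for sample in (simple, gbk_only, gb18030_only):
    coverage = {c: encodable(sample, c) for c in ("gb2312", "gbk", "gb18030")}
    print(repr(sample), coverage)
```

The practical consequence is that decoding with `gb18030` is always safe for data that is really gb2312 or gbk, while the reverse is not true.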
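In fact, since I started working, I have found that government bodies and companies in China basically require gb2312 encoding. Although gbk and gb18030 are newer, almost nobody uses them.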
aadsm/jschardet#49
chardetect 3.0.4
Steps to Reproduce
https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar
chardetect userdb_panda.yar
Actual:
userdb_panda.yar: GB2312 with confidence 0.99
Expected: explicitly detected as GB18030
iconv gives an error converting from GB2312, but works with GB18030:
iconv -f GB18030 -t UTF-8 userdb_panda.yar
works#94
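The iconv check above can be reproduced in Python as a post-processing step after detection. This is only a sketch of a possible workaround, not chardet's behaviour or API: the function name `narrowest_gb_codec` is hypothetical, and it relies solely on the standard library's strict decoders.

```python
from typing import Optional

# Ordered from the smallest character set to the largest superset.
GB_CODECS = ("gb2312", "gbk", "gb18030")


def narrowest_gb_codec(data: bytes) -> Optional[str]:
    """Return the first GB-family codec that decodes `data` without errors.

    Hypothetical helper: it mirrors running iconv with each encoding in
    turn and keeping the first one that does not fail.
    """
    for codec in GB_CODECS:
        try:
            data.decode(codec)  # strict mode raises on invalid sequences
            return codec
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    # Assumes the file from the reproduction steps is in the working directory.
    try:
        with open("userdb_panda.yar", "rb") as handle:
            print(narrowest_gb_codec(handle.read()))
    except FileNotFoundError:
        print("userdb_panda.yar not found; download it first")
```

A caller could run this whenever a detector reports GB2312 and then prefer the returned codec. When the bytes are valid in more than one codec, the sketch reports the narrowest one, so it only upgrades the label when gb2312 actually fails.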