forked from saintfish/chardet

Charset detector library for golang derived from ICU

asdfsx/gochardet

# chardet

chardet is a library that automatically detects the charset of text for the Go programming language. It is based on the algorithm and data in ICU's implementation.

## Documentation and Usage

See pkgdoc

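## Notes on multi-byte character detection

I took a quick look at the Shift-JIS handling logic. It has a few internal limitations:

  • The last character may not be counted. For double-byte text, each call to DecodeOneChar reads two bytes, i.e. one double-byte character. The loop condition `len(raw) > 0` is checked after decoding, so if a call consumes all of the remaining data (even the very first call), the loop exits immediately and the character just read is never added to totalCharCount or the other counters. In other words, out of 10 Japanese characters, only the first 9 may take part in detection.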

    for c, raw, err = r.decoder.DecodeOneChar(raw); len(raw) > 0; c, raw, err = r.decoder.DecodeOneChar(raw)
    
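  • The character counts must clear the threshold of 10 used in the checks below. Because of the previous point, the last character is not counted, so the input needs at least 11 characters to satisfy this condition.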

    if doubleByteCharCount <= 10 && badCharCount == 0 
    	if doubleByteCharCount == 0 && totalCharCount < 10 
    
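  • The number of double-byte characters must be large enough relative to 4. With exactly 4 double-byte characters, maxVal = log(4/4) = 0 in the calculation below, and with fewer than 4 it is negative. Either way, the scaleFactor and confidence values derived from it are distorted.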

    maxVal := math.Log(float64(doubleByteCharCount) / 4)
    scaleFactor := 90 / maxVal
    confidence := int(math.Log(float64(commonCharCount)+1)*scaleFactor + 10)
    
