Skip to content

zhenzou/html2article

 
 

Repository files navigation

基于文本密度的html2article实现[golang]

Install

go get -u -v github.com/sundy-li/html2article

Performance

  • Accuracy: >= 98%

  • Qps: 2w/s , 0.06ms/op go test -bench=. BenchmarkExtract-4 20000 66341 ns/op

  • 说明(对比其他开源实现,可能是目前最快的html2article实现,我们测试的数据集约3kw来自于微信公众号,各大类中文科技媒体历史文章,目前能达到98%以上准确率)

Examples

参考examples from_url.go

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime)
	println("article content is =>", article.Content)
	
	//parse the article to be readability
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}

Algorithm

About

基于文本密度的html2article实现[golang]

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Go 100.0%