wikiextract

A tool that extracts the plainest possible text from Wikipedia XML dumps and writes it to a CSV file. Inspired by attardi/Wikiextractor.

Installation

go install github.com/prithvidasgupta/wikiextract@latest

Example usage:

wikiextract extract -f ./samplewiki.xml.bz2 -o out.csv

The above command is comparable to running wikiextractor ./sample.xml.bz2 --no-templates -ns ns0 with attardi/Wikiextractor.

Benchmark device

Windows WSL, 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz (3.11 GHz), 16.0 GB RAM

Benchmarking with Simple EN wiki dump

time	wikiextract	wikiextractor
real	0m45.275s	1m26.966s
user	1m37.802s	4m25.311s
sys	0m6.076s	0m25.905s

Benchmarking with BN wiki dump

time	wikiextract	wikiextractor
real	1m36.287s	2m45.072s
user	2m59.901s	6m0.504s
sys	0m7.948s	0m35.705s

Current Goal: To achieve parity with attardi/Wikiextractor

Why do this at all?

I am trying to learn Go, and I saw there were no Wikipedia XML dump extractors written in Go, so I might as well write one.
