This repository has been archived by the owner on Oct 29, 2019. It is now read-only.

Golang WARC (Web ARChive) Library

warc


warc is an implementation of ISO 28500 (WARC 1.0), the Web ARChive specification. It provides readers, writers, and structs for working with WARC records.

From the spec:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file is used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

License & Copyright

Affero General Public License v3

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests, and Pull Requests (PRs) for submitting changes.

Usage

import "github.com/datatogether/warc"