iterscraper

A basic package for scraping information from a website whose URLs contain an incrementing integer. Information is retrieved from HTML5 elements and written to a CSV file.

Thanks to Francesc for featuring this repo in episode #1 of Just For Func. Watch the video or review Francesc's pull request.

Flags

Flags are all optional, and are set with a single dash on the command line, e.g.

iterscraper \
-url            "http://foo.com/%d" \
-from           1                   \
-to             10                  \
-concurrency    10                  \
-output         foo.csv             \
-nameQuery      ".name"             \
-addressQuery   ".address"          \
-phoneQuery     ".phone"            \
-emailQuery     ".email"

For an explanation of the options, type iterscraper -help

General usage of iterscraper:

  -addressQuery string
        JQuery-style query for the address element (default ".address")
  -concurrency int
        How many scrapers to run in parallel. (More scrapers are faster, but more prone to rate limiting or bandwidth issues) (default 1)
  -emailQuery string
        JQuery-style query for the email element (default ".email")
  -from int
        The first ID that should be searched in the URL - inclusive.
  -nameQuery string
        JQuery-style query for the name element (default ".name")
  -output string
        Filename to export the CSV results (default "output.csv")
  -phoneQuery string
        JQuery-style query for the phone element (default ".phone")
  -to int
        The last ID that should be searched in the URL - exclusive (default 1)
  -url string
        The URL you wish to scrape, containing "%d" where the id should be substituted (default "http://example.com/v/%d")

URL Structure

Successive pages must look like:

http://example.com/foo/1/bar
http://example.com/foo/2/bar
http://example.com/foo/3/bar

iterscraper would then accept the URL in Printf style, with %d marking where each number is substituted:

http://example.com/foo/%d/bar
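
The substitution can be sketched in Go with fmt.Sprintf. The helper name buildURLs is hypothetical and not part of iterscraper. The loop follows the flag descriptions above: the -from value is inclusive and the -to value is exclusive.

```go
package main

import "fmt"

// buildURLs expands a Printf-style template for every ID in [from, to).
// Hypothetical helper for illustration; not taken from iterscraper's source.
func buildURLs(template string, from, to int) []string {
	var urls []string
	for id := from; id < to; id++ {
		urls = append(urls, fmt.Sprintf(template, id))
	}
	return urls
}

func main() {
	// from=1 (inclusive), to=4 (exclusive) yields IDs 1, 2 and 3.
	for _, u := range buildURLs("http://example.com/foo/%d/bar", 1, 4) {
		fmt.Println(u)
	}
}
```

Run with these values, the sketch prints the three successive page URLs listed above.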

Installation

Building the source requires the Go programming language and the Glide package manager.

# Dependency is GoQuery
go get github.com/PuerkitoBio/goquery
# Get and build source
go get github.com/philipithomas/iterscraper
# If your $PATH is configured correctly, you can call it directly
iterscraper [flags]

Errata

This is purpose-built for some internal scraping. It's not meant to be the scraping tool for every use case, but you're welcome to modify it for your purposes.
On a 429 (too many requests) error, the app logs the error and continues, skipping that request.
The package will follow up to 10 redirects.
On a 404 (not found) error, the app logs the miss and continues. The missing page is not exported to the CSV.

Extensions

calini/grape is an extension of iterscraper that also adds the ability to swap the incremental indexes with a dictionary file, and query for different attributes.
