
gcrawler


⚠️ 🔥 A pull request adding concurrency support will be merged soon. 🎉

gcrawler is a simple (not concurrent) web crawler written in Java.

Given a starting URL, it visits every reachable page under that domain without crossing into subdomains.

gcrawler outputs JSON to stdout, listing every visited page and the static assets (images, JS, CSS) found on each page.

Example output:

[
  {
    "url": "http://www.debian.org",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org/",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org#content",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org/intro/about",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debian.css",
      "http://www.debian.org/debian-en.css"
    ]
  },
  ...
]

How to build

gcrawler uses Gradle as its build system. You don't need to install Gradle, as the Gradle wrapper is shipped with the sources.

The following will create an executable uber-jar under build/libs/gcrawler-all.jar:

$ ./gradlew clean shadowJar

How to run

Run the gcrawler executable jar, passing the starting URL as a parameter:

$ java -jar build/libs/gcrawler-all.jar http://www.debian.org

gcrawler supports a few command line options; see below.

High level design

gcrawler is written in Java and uses the fantastic JSoup library to fetch and parse HTML documents, plus the Jackson JSON serializer.
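
As an illustration of the JSoup side of this, here is a minimal, hypothetical sketch of how a single page could be fetched and its asset URLs collected; the class and method names are invented for this example and are not gcrawler's actual API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class PageFetcherSketch {

    // Fetches a page and returns the absolute URLs of its static assets.
    static List<String> fetchAssets(String url, int timeoutMillis) throws IOException {
        Document doc = Jsoup.connect(url)
                .timeout(timeoutMillis)  // socket timeout for connect and read
                .get();

        List<String> assets = new ArrayList<>();
        // Images and scripts reference assets via "src", stylesheets and favicons via "href".
        for (Element e : doc.select("img[src], script[src]")) {
            assets.add(e.absUrl("src"));
        }
        for (Element e : doc.select("link[href]")) {
            assets.add(e.absUrl("href"));
        }
        return assets;
    }
}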

The implementation is single-threaded on purpose, but it could easily be parallelized with minor changes.

CPU usage is very low, as the only CPU-intensive activity (DOM parsing) runs in the main thread, one document at a time.

Memory usage is generally low, because documents are not retained after parsing and JSON results are emitted as soon as possible. In fact, gcrawler uses the Jackson streaming APIs to flush output as soon as documents are parsed.
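
To make the streaming-output point concrete, the sketch below shows one way a page result could be written with Jackson's streaming JsonGenerator and flushed immediately; it illustrates the technique and is not gcrawler's actual code:

import java.io.IOException;
import java.util.List;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

class StreamingOutputSketch {

    public static void main(String[] args) throws IOException {
        JsonGenerator gen = new JsonFactory().createGenerator(System.out);
        gen.useDefaultPrettyPrinter();

        gen.writeStartArray();  // opens the top-level JSON list

        // In a real crawler this would run once per visited page.
        writePage(gen, "http://www.debian.org",
                List.of("http://www.debian.org/favicon.ico"));

        gen.writeEndArray();    // close the list so the output stays valid JSON
        gen.close();
    }

    // Emits one {"url": ..., "assets": [...]} object and flushes it right away.
    static void writePage(JsonGenerator gen, String url, List<String> assets) throws IOException {
        gen.writeStartObject();
        gen.writeStringField("url", url);
        gen.writeArrayFieldStart("assets");
        for (String asset : assets) {
            gen.writeString(asset);
        }
        gen.writeEndArray();
        gen.writeEndObject();
        gen.flush();  // the page appears on stdout as soon as it has been parsed
    }
}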

Stopping the crawler

gcrawler tries hard to emit valid JSON to stdout.

Upon cancellation (e.g. when hitting ^C), the crawler stops as soon as possible (i.e. at the end of the currently running task) and emits valid JSON for the pages visited so far.
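
One common way to implement this kind of graceful stop in Java is a shutdown hook that sets a flag checked between tasks; the sketch below illustrates the pattern and is not necessarily how gcrawler implements it:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;

class GracefulStopSketch {

    public static void main(String[] args) {
        AtomicBoolean stopRequested = new AtomicBoolean(false);
        Thread mainThread = Thread.currentThread();

        // On ^C (SIGINT) request a stop, then wait for the crawl loop to finish
        // its current task and close the JSON output cleanly.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            stopRequested.set(true);
            try {
                mainThread.join();
            } catch (InterruptedException ignored) {
                // nothing to do: the JVM is shutting down anyway
            }
        }));

        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("http://www.debian.org");

        while (!frontier.isEmpty() && !stopRequested.get()) {
            String url = frontier.poll();
            // ... fetch url, emit its JSON entry, enqueue newly discovered links ...
        }
        // ... write the closing bracket here so the pages visited so far form valid JSON ...
    }
}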

Error handling

Again, gcrawler tries hard to emit valid JSON to stdout and, in general, avoids intermingling stack traces with the JSON output.

Errors are printed to stderr instead.

Errors happening before any JSON has been emitted (e.g. wrong command line parameters) are printed to stderr and cause the application to exit early.

Errors happening while the crawler is running and JSON is being printed are muted by default. To print crawling errors to stderr, use the --print-errors true flag.

Command line parameters

gcrawler expects an odd number of command line arguments: zero or more flag/value pairs followed by the starting URL, e.g.:

$ java -jar build/libs/gcrawler-all.jar [param1 value1 param2 value2 ...] <url>

Each flag must be followed by its value, and the starting URL must be passed as the last argument (a small parsing sketch follows the table below).

flag              type     default  description
--print-errors    boolean  false    Print any exception to stderr
--halt-on-error   boolean  false    Stop crawling after any error occurs
--normalize-urls  boolean  true     Normalize URLs before crawling (remove trailing # and ?)
--timeout-millis  integer  5000     Socket timeout in milliseconds for connect and read operations
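
As mentioned above, here is a hypothetical sketch of how this flag/value layout could be parsed; the names are invented for illustration and this is not gcrawler's actual parser:

import java.util.HashMap;
import java.util.Map;

class ArgsParsingSketch {

    public static void main(String[] args) {
        // Zero or more flag/value pairs followed by the starting URL: an odd count.
        if (args.length % 2 != 1) {
            System.err.println("usage: gcrawler [flag value ...] <url>");
            System.exit(1);
        }

        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            flags.put(args[i], args[i + 1]);
        }
        String startUrl = args[args.length - 1];

        boolean printErrors = Boolean.parseBoolean(flags.getOrDefault("--print-errors", "false"));
        int timeoutMillis = Integer.parseInt(flags.getOrDefault("--timeout-millis", "5000"));
        // Log to stderr so stdout stays reserved for the JSON output.
        System.err.printf("crawling %s (timeout %d ms, print errors: %b)%n",
                startUrl, timeoutMillis, printErrors);
    }
}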

For example, the following:

$ java -jar build/libs/gcrawler-all.jar --print-errors true --timeout-millis 10000 www.debian.org 2>errors.txt >output.txt

will apply a 10 second timeout, print any skipped URL to errors.txt, and save the JSON output to output.txt.

Notes

  • If the starting URL returns an error, the empty JSON list [] is printed.
  • If a page does not reference any static asset, an empty "assets": [] list is printed for it.
  • gcrawler does not apply any URL normalization, except for removing a trailing fragment marker (#) and question mark (?) in order to avoid too many duplicate URLs (a small sketch follows this list). This behaviour can be configured with the --normalize-urls parameter.
  • At the moment, only the JSoup connection timeout is configurable. However, JSoup provides sensible defaults: it will follow redirects, will not ignore HTTP errors, will validate certificates, and will handle response bodies up to 1MB.
  • gcrawler will cross between http and https without problems, but will refuse to cross between domains; therefore crawling debian.org is different from crawling www.debian.org.
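
A minimal sketch of the trailing-character stripping described in the third note, under the assumption that "trailing" means a bare # or ? at the end of the URL; gcrawler's actual rule may be broader:

class NormalizeSketch {

    // Drops a bare trailing '#' or '?' so that, for example,
    // "http://www.debian.org/#" and "http://www.debian.org/" map to the same URL.
    // This is an illustration of the idea, not gcrawler's exact implementation.
    static String normalize(String url) {
        while (url.endsWith("#") || url.endsWith("?")) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.debian.org/#"));            // http://www.debian.org/
        System.out.println(normalize("http://www.debian.org/intro/about"));  // unchanged
    }
}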

License

The gcrawler source files are distributed under a BSD-style license.
