Peng Cheng edited this page Jan 5, 2015 · 10 revisions

SpookyStuff

SpookyStuff is a fast and simple query engine for web scraping, data enrichment, and acceptance QA. It aims to let unstructured web resources be queried and linked like tables in a relational database.

SpookyStuff is the fastest and most scalable of its kind, with a speed record of querying 330404 dynamic pages per hour on 300 cores.
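To put the quoted record in more familiar units, the short Python sketch below derives the aggregate and per-core rates from the two figures above. It assumes the load was spread evenly across all 300 cores, which the record itself does not state; the function and key names are illustrative only.

```python
# Figures quoted in the text above.
PAGES_PER_HOUR = 330404
CORES = 300
SECONDS_PER_HOUR = 3600


def throughput(pages_per_hour: int, cores: int) -> dict:
    """Derive aggregate and per-core rates from an hourly page count.

    Assumes work is divided evenly across cores (an assumption, not a
    measured property of the benchmark).
    """
    return {
        # Pages fetched per second by the whole cluster.
        "pages_per_second": pages_per_hour / SECONDS_PER_HOUR,
        # Pages fetched per hour by a single core.
        "pages_per_core_hour": pages_per_hour / cores,
        # Average core-seconds spent on one page.
        "seconds_per_page_per_core": SECONDS_PER_HOUR * cores / pages_per_hour,
    }


if __name__ == "__main__":
    for name, value in throughput(PAGES_PER_HOUR, CORES).items():
        print(f"{name}: {value:.2f}")
```

Under that even-load assumption, the record works out to roughly 92 pages per second for the cluster, or about 3.3 core-seconds per dynamic page.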

SpookyStuff is tightly integrated with the Spark ecosystem and can export structured data directly as an RDD or a Spark SQL table.

Powered by

  • Apache Spark
  • Selenium
    • GhostDriver/PhantomJS
  • JSoup
  • Apache Tika
  • (built by) Apache Maven
    • Scala/ScalaTest plugins
  • (deployed by) Ansible
  • The current implementation is influenced by Spark SQL and Mahout Sparkbinding.


Demo

Click me for a quick impression.

This environment is deployed on a Spark cluster with 8+ cores. It may not be accessible during system upgrades or maintenance. Please contact a committer or project manager for a customized demo.
