# Cloud Crawler

This repository contains cloud crawler functions used by scrapeulous.com.

If you want to add your own crawler function to the crawling infrastructure of scrapeulous, please contact us.

## Quickstart

This repository contains a test_runner program that lets you test all crawling functions locally.

For example, execute the Google scraper with:

```bash
node test_runner.js google_scraper.js '["keyword 1"]'
```

or run the Amazon crawler with:

```bash
node test_runner.js amazon.js '["Notebook"]'
```

or the reverse image crawler for Google with:

```bash
node test_runner.js reverse_image_google_url.js '["https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Mohamed_Atta.jpg/220px-Mohamed_Atta.jpg", "https://aldianews.com/sites/default/files/styles/article_image/public/articles/ISISAmenaza.jpg?itok=u7Nhc41a"]'
```

or:

```bash
node test_runner.js reverse_image_bing_url.js '["https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Mohamed_Atta.jpg/220px-Mohamed_Atta.jpg"]'
```

or:

```bash
node test_runner.js reverse_image_bing.js '["AC_I161709.jpg"]'
```

or run the social scraper with:

```bash
node test_runner.js social.js '["http://www.flinders.edu.au/", "http://www.latrobe.edu.au/", "http://www.griffith.edu.au/", "http://www.murdoch.edu.au/", "https://www.qut.edu.au/"]'
```

## Examples of crawler functions

## Crawling class description

You can add two types of Cloud Crawler functions:

  1. For crawling with the Chrome browser controlled via puppeteer, use the `BrowserWorker` base class
  2. For scraping with the HTTP library got and parsing HTML with cheerio, use the `HttpWorker` base class

The function prototype for browser workers looks like this:

```js
/**
 * The BrowserWorker class contains your scraping/crawling logic.
 *
 * Each BrowserWorker class must declare a crawl() function, which is executed on a
 * dedicated remote machine with its own CPU, memory and browser instance. A unique IP
 * is not guaranteed, but it is the norm.
 *
 * Scraping workers time out after 200 seconds, so the function
 * should return before this hard limit.
 *
 * Each worker has a `page` property: a puppeteer-like page object. See:
 * https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#class-page
 */
class Worker extends BrowserWorker {
  /**
   * Implement your crawling logic here. You have access to `this.page`,
   * a fully loaded browser page set up according to your configuration.
   *
   * @param item: the item that this crawl() invocation processes
   */
  async crawl(item) {

  }
}
```
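
For instance, a minimal concrete BrowserWorker might look like the sketch below. It assumes each item is a URL and that the value returned by crawl() is collected as the result for that item; these assumptions are not confirmed by this README.

```js
/**
 * A minimal sketch, assuming each item is a URL and that the value
 * returned by crawl() is collected as the result for that item.
 */
class Worker extends BrowserWorker {
  async crawl(item) {
    // navigate the puppeteer page to the item and wait until the DOM is ready
    await this.page.goto(item, { waitUntil: 'domcontentloaded' });
    // return the final URL and the page title as this item's result
    return {
      url: this.page.url(),
      title: await this.page.title(),
    };
  }
}
```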

And the function prototype for HttpWorker instances looks similar:

```js
/**
 * The HttpWorker class contains your scraping/crawling logic.
 *
 * Each HttpWorker class must declare a crawl() function, which is executed on a
 * dedicated remote machine with its own CPU and memory. A unique IP is not
 * guaranteed, but it is the norm.
 *
 * Scraping workers time out after 200 seconds, so the function
 * should return before this hard limit.
 *
 * The class has access to the `this.Got` http library and to `this.Cheerio`
 * for parsing html documents.
 * https://github.com/sindresorhus/got
 */
class Worker extends HttpWorker {
  /**
   * Implement your crawling logic here. You have access to `this.Got`,
   * a powerful http client library.
   *
   * @param item: the item that this crawl() invocation processes
   */
  async crawl(item) {

  }
}
```
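
As a concrete illustration, the sketch below fetches each item (assumed to be a URL) with got and extracts the page title with cheerio; everything beyond the `this.Got` and `this.Cheerio` accessors described above is an assumption, not a documented part of the infrastructure.

```js
/**
 * A minimal sketch, assuming each item is a URL and that the value
 * returned by crawl() is collected as the result for that item.
 */
class Worker extends HttpWorker {
  async crawl(item) {
    // plain GET request; response.body holds the raw HTML string
    const response = await this.Got(item);
    // load the HTML into cheerio for jQuery-like querying
    const $ = this.Cheerio.load(response.body);
    return {
      url: item,
      title: $('title').text().trim(),
    };
  }
}
```

Such a file could then be tested locally in the same way as the bundled functions, e.g. `node test_runner.js my_http_worker.js '["https://example.com"]'`, where my_http_worker.js is a hypothetical file name.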