James McKinney edited this page Oct 14, 2013 · 4 revisions

Troubleshooting

If you do any of the following, you may cause duplicates to be imported:

  • drop a MongoDB collection (this breaks foreign keys, which may be part of an object's fingerprint)
  • add details to objects (this may change the fingerprint)
  • change the fingerprint method (this will change the fingerprint)
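The effect of a changed fingerprint can be illustrated with a minimal in-memory sketch. It assumes a fingerprint is simply the subset of properties used to look up an existing document; the `fingerprint` and `import` methods and the array standing in for a MongoDB collection are hypothetical and are not Pupa.rb's actual implementation.

```ruby
# Hypothetical fingerprint: the subset of properties used to look up an
# existing document. This is an assumption, not Pupa.rb's actual code.
def fingerprint(object, keys)
  object.select { |key, _| keys.include?(key) }
end

# Hypothetical import step: update the stored document that matches the
# fingerprint, or insert a new document if nothing matches.
def import(store, object, keys)
  selector = fingerprint(object, keys)
  match = store.find { |doc| selector.all? { |key, value| doc[key] == value } }
  if match
    match.merge!(object)
  else
    store << object.dup
  end
  store
end

store = [] # stands in for a MongoDB collection
person = { 'name' => 'Jane Doe', 'role' => 'Mayor' }

# Running the same scraper twice matches the stored document both times.
import(store, person, %w[name role])
import(store, person, %w[name role])
puts "after re-run: #{store.size} document(s)"

# Adding a detail that is part of the fingerprint means that no stored
# document matches, so a duplicate is inserted.
detailed = person.merge('email' => 'jane@example.com')
import(store, detailed, %w[name role email])
puts "after adding a detail: #{store.size} document(s)"
```

In this sketch, the third call inserts a second document for the same person because the stored document has no `email` property to match against.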

If you hardcode the _id of an object in your scraper, you will not be able to update any of the properties that appear in its fingerprint. When one of those properties changes, MongoDB will attempt to insert a new document with the same _id but different properties, and the insert will fail.

Workflow

The Pupa.rb workflow is as follows:

  • The user may select which actions and which scraping tasks to run from the command line
  • Pupa.rb runs the given actions in order (scrape and import by default)

scrape

  • Pupa.rb runs the given scraping tasks in order (all scraping tasks by default)
  • The objects yielded by the scraping tasks are dumped as JSON documents either to disk or to Redis
  • Pupa.rb validates all JSON documents that have a JSON Schema, unless this option is disabled
  • The user may inspect the JSON documents while the scraper runs
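The scrape steps above can be sketched roughly as follows, using only the Ruby standard library. Everything here is an assumption for illustration: `scrape_people`, `validate!` and `dump_all` are hypothetical names, the objects are plain hashes rather than Pupa.rb models, the file naming is invented, and the required-keys check is only a stand-in for JSON Schema validation.

```ruby
require 'json'
require 'securerandom'
require 'tmpdir'

# Hypothetical scraping task that yields plain hashes. A real Pupa.rb task
# would yield model objects instead.
def scrape_people
  return enum_for(:scrape_people) unless block_given?

  yield({ 'name' => 'Jane Doe', 'role' => 'Mayor' })
  yield({ 'name' => 'John Roe', 'role' => 'Councillor' })
end

# Stand-in for JSON Schema validation: only checks that required keys exist.
def validate!(object, required = %w[_id name])
  missing = required.reject { |key| object.key?(key) }
  raise ArgumentError, "missing properties: #{missing.join(', ')}" unless missing.empty?
end

# Dumps each yielded object to its own JSON file and returns the file paths.
def dump_all(output_dir)
  scrape_people.map do |object|
    object = { '_id' => SecureRandom.uuid }.merge(object)
    validate!(object)
    path = File.join(output_dir, "person_#{object['_id']}.json")
    File.write(path, JSON.pretty_generate(object))
    path
  end
end

Dir.mktmpdir do |dir|
  paths = dump_all(dir)
  puts "Wrote #{paths.size} JSON documents to #{dir}"
end
```

Because each object is written as soon as it is yielded, the files can be inspected while later tasks are still running.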

import

  • Pupa.rb loads the JSON documents either from disk or from Redis
  • It removes any duplicate objects
  • It determines an evaluation order, such that all referenced objects are saved before the referencing object
  • It creates or updates objects in MongoDB, changing primary keys and foreign keys as necessary
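The duplicate removal and evaluation order steps above can be sketched with Ruby's standard `TSort` module. This is a simplified illustration under stated assumptions, not Pupa.rb's implementation: the `FOREIGN_KEYS` list, the `deduplicate` method and the `EvaluationOrder` class are hypothetical, documents are plain hashes, and duplicates are detected in a single pass by comparing all properties except `_id`.

```ruby
require 'tsort'

# Hypothetical foreign-key properties; real models would declare their own.
FOREIGN_KEYS = %w[organization_id parent_id].freeze

# Drops documents whose properties (ignoring _id) duplicate an earlier
# document, and repoints foreign keys at the document that was kept.
def deduplicate(documents)
  kept = {}     # properties without _id => kept document
  replaced = {} # dropped _id => kept _id
  documents.each do |doc|
    key = doc.reject { |name, _| name == '_id' }
    if kept.key?(key)
      replaced[doc['_id']] = kept[key]['_id']
    else
      kept[key] = doc
    end
  end
  kept.values.map do |doc|
    doc.map do |name, value|
      [name, FOREIGN_KEYS.include?(name) ? replaced.fetch(value, value) : value]
    end.to_h
  end
end

# Orders documents so that every referenced document comes before the
# documents that reference it. TSort raises TSort::Cyclic on circular references.
class EvaluationOrder
  include TSort

  def initialize(documents)
    @by_id = documents.map { |doc| [doc['_id'], doc] }.to_h
  end

  def tsort_each_node(&block)
    @by_id.each_key(&block)
  end

  def tsort_each_child(id, &block)
    @by_id[id].values_at(*FOREIGN_KEYS).compact.select { |ref| @by_id.key?(ref) }.each(&block)
  end

  def documents
    tsort.map { |id| @by_id[id] }
  end
end

documents = [
  { '_id' => 'm1', 'name' => 'Jane Doe', 'organization_id' => 'o2' },
  { '_id' => 'o1', 'name' => 'City Council' },
  { '_id' => 'o2', 'name' => 'City Council' }
]

ordered = EvaluationOrder.new(deduplicate(documents)).documents
puts ordered.map { |doc| doc['_id'] }.inspect
```

Here `o2` duplicates `o1`, so it is dropped and the membership's `organization_id` is rewritten to `o1` before the order is computed. Saving the documents in the resulting order means the organization exists before anything that references it.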