Home
James McKinney edited this page Oct 14, 2013 · 4 revisions
If you do any of the following, you may cause duplicates to be imported:
- drop a MongoDB collection (this breaks foreign keys, which may be part of an object's fingerprint)
- add details to objects (this may change the fingerprint)
- change the fingerprint method (this will change the fingerprint)
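To illustrate how a changed fingerprint leads to a duplicate, here is a minimal sketch. It does not use Pupa.rb's API: the `STORE` array stands in for a MongoDB collection, and the `fingerprint` and `import` helpers are hypothetical. It assumes a fingerprint is the subset of an object's properties used to match it against stored documents.

```ruby
# In-memory stand-in for a MongoDB collection (hypothetical, for illustration).
STORE = []

# Hypothetical fingerprint: the subset of properties used to match an
# incoming object against the stored documents.
def fingerprint(object, keys)
  object.select { |key, _| keys.include?(key) }
end

# Update the stored document whose fingerprint matches, or insert a new one.
def import(object, keys)
  selector = fingerprint(object, keys)
  existing = STORE.find { |doc| fingerprint(doc, keys) == selector }
  if existing
    existing.merge!(object)
  else
    STORE << object.dup
  end
end

# First import: fingerprint on "name" only.
import({ 'name' => 'Jane Doe' }, ['name'])

# Adding a detail that is not part of the fingerprint updates in place.
import({ 'name' => 'Jane Doe', 'email' => 'jane@example.com' }, ['name'])
count_after_update = STORE.size

# Changing the fingerprint method (now "name" and "role") no longer matches
# the stored document, so a second document is inserted: a duplicate.
import({ 'name' => 'Jane Doe', 'role' => 'Mayor' }, ['name', 'role'])
count_after_change = STORE.size
```

In this sketch `count_after_update` is 1 and `count_after_change` is 2: the stored document has no `role`, so its fingerprint differs from the incoming one.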
If you hardcode the `_id` of an object in your scraper, you will not be able to update any of the properties that appear in its fingerprint, because MongoDB will attempt to insert a new document with the same `_id` but different properties, and it will fail.
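The failure mode can be sketched with an in-memory collection that enforces unique `_id` values. The `Collection` class, its `upsert` method, and `DuplicateKeyError` are hypothetical stand-ins, not Pupa.rb or MongoDB driver APIs; the sketch assumes the importer looks up an object by its fingerprint and inserts it when nothing matches.

```ruby
# Hypothetical error, mimicking a unique-index violation on _id.
class DuplicateKeyError < StandardError; end

# Hypothetical in-memory collection that enforces unique _id values.
class Collection
  def initialize
    @documents = {}
  end

  # Update the document matching the selector, or insert a new one.
  def upsert(selector, document)
    match = @documents.values.find do |doc|
      selector.all? { |key, value| doc[key] == value }
    end
    if match
      match.merge!(document)
    else
      id = document['_id']
      raise DuplicateKeyError, "duplicate _id: #{id}" if @documents.key?(id)
      @documents[id] = document.dup
    end
  end
end

people = Collection.new

# First run: the scraper hardcodes the _id; "name" is the fingerprint.
people.upsert({ 'name' => 'Jane Doe' }, { '_id' => 'jane', 'name' => 'Jane Doe' })

# Second run: the name changed, so the fingerprint matches nothing and the
# importer tries to insert a second document with the same _id.
result = begin
  people.upsert({ 'name' => 'Jane Q. Doe' }, { '_id' => 'jane', 'name' => 'Jane Q. Doe' })
  :updated
rescue DuplicateKeyError
  :failed
end
```

Here `result` ends up as `:failed`, because the changed name no longer matches the stored document and the insert collides on `_id`.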
The Pupa.rb workflow is as follows:
- The user may select which actions and which scraping tasks to run from the command line
- Pupa.rb runs the given actions in order (`scrape` and `import` by default)
- Pupa.rb runs the given scraping tasks in order (all scraping tasks by default)
- The objects yielded by the scraping tasks are dumped as JSON documents either to disk or to Redis
- Pupa.rb validates all JSON documents that have a JSON Schema, unless this option is disabled
- The user may inspect the JSON documents while the scraper runs
- Pupa.rb loads the JSON documents either from disk or from Redis
- It removes any duplicate objects
- It determines an evaluation order, such that all referenced objects are saved before the referencing object
- It creates or updates objects in MongoDB, changing primary keys and foreign keys as necessary
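The evaluation-order step amounts to a topological sort of the objects by their foreign keys. The following is a minimal sketch using Ruby's standard `tsort` library; the object IDs and the `objects` hash are made up for illustration, and this is not necessarily how Pupa.rb implements the step.

```ruby
require 'tsort'

# Hypothetical scraped objects, keyed by ID. Each value holds the object's
# foreign keys, i.e. the IDs of the objects it references.
objects = {
  'membership-1' => { 'person_id' => 'person-1', 'organization_id' => 'org-1' },
  'person-1'     => {},
  'org-1'        => {},
}

# TSort needs a way to enumerate nodes and a way to enumerate each node's
# dependencies (here, the referenced objects).
each_node  = ->(&block) { objects.each_key(&block) }
each_child = ->(id, &block) { objects[id].each_value(&block) }

# The resulting order lists every referenced object before the objects
# that reference it, so it is safe to save them in this order.
order = TSort.tsort(each_node, each_child)
```

With this ordering, `person-1` and `org-1` are saved before `membership-1`, so the membership's foreign keys can be resolved to the saved documents' primary keys. `TSort.tsort` raises `TSort::Cyclic` if the references form a cycle.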