Home

Troubleshooting

If you do any of the following, you may cause duplicates to be imported:

drop a MongoDB collection (this breaks foreign keys, which may be part of an object's fingerprint)
add details to objects (this may change the fingerprint)
change the fingerprint method (this will change the fingerprint)
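
To see why these changes can cause duplicates, here is a minimal sketch in plain Ruby. It does not use Pupa.rb's actual API: the `fingerprint` and `import` helpers and the in-memory `store` are hypothetical stand-ins, assuming a fingerprint is the set of properties used to look up an existing document.

```ruby
# Hypothetical stand-in for a fingerprint: the properties used to look up
# an existing document. Not Pupa.rb's actual implementation.
def fingerprint(object, keys)
  object.select { |key, _| keys.include?(key) }
end

# Hypothetical import: update the document whose fingerprint matches,
# otherwise insert a new document.
def import(store, object, keys)
  selector = fingerprint(object, keys)
  existing = store.find { |document| fingerprint(document, keys) == selector }
  if existing
    existing.merge!(object)
  else
    store << object.dup
  end
  store
end

store = []
import(store, { name: 'Jane Doe' }, [:name])
# Same fingerprint: the existing document is updated.
import(store, { name: 'Jane Doe' }, [:name])
# A new detail is now part of the fingerprint: nothing matches, so a
# duplicate is inserted.
import(store, { name: 'Jane Doe', email: 'jane@example.com' }, [:name, :email])
puts store.size # prints 2
```

The third call shows both of the last two cases at once: the object gained a detail, and the set of fingerprinted properties changed, so the lookup no longer finds the earlier document.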

If you hardcode the _id of an object in your scraper, you will not be able to update any of the properties that appear in its fingerprint. When such a property changes, the fingerprint no longer matches the existing document, so Pupa.rb attempts to insert a new document with the same _id but different properties, and MongoDB rejects the insert.
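
The failure can be reproduced with a small in-memory sketch. The `Collection` class below is a hypothetical stand-in that, like MongoDB, rejects a second document with the same _id. It is not the MongoDB driver or Pupa.rb's import code, and it assumes `name` is part of the fingerprint.

```ruby
# Hypothetical in-memory collection that, like MongoDB, rejects a second
# document with the same _id.
class Collection
  class DuplicateKey < StandardError; end

  def initialize
    @documents = {}
  end

  # Returns the first document whose properties match the selector, or nil.
  def find_by(selector)
    @documents.values.find do |document|
      selector.all? { |key, value| document[key] == value }
    end
  end

  def insert(document)
    if @documents.key?(document[:_id])
      raise DuplicateKey, "duplicate _id: #{document[:_id]}"
    end
    @documents[document[:_id]] = document
  end
end

collection = Collection.new
collection.insert({ _id: 'jane', name: 'Jane Doe' })

# The scraper now yields the same hardcoded _id with a changed name.
# Assuming name is part of the fingerprint, the lookup finds nothing...
changed = { _id: 'jane', name: 'Jane Q. Doe' }
existing = collection.find_by({ name: changed[:name] })

# ...so the object is treated as new, and the insert fails on the _id.
begin
  collection.insert(changed) if existing.nil?
  outcome = :inserted
rescue Collection::DuplicateKey => e
  outcome = :failed
  puts e.message
end
```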

Workflow

The Pupa.rb workflow is as follows:

The user may select which actions and which scraping tasks to run from the command line
Pupa.rb runs the given actions in order (scrape and import by default)

`scrape`

Pupa.rb runs the given scraping tasks in order (all scraping tasks by default)
The objects yielded by the scraping tasks are dumped as JSON documents either to disk or to Redis
Pupa.rb validates all JSON documents that have a JSON Schema, unless this option is disabled
The user may inspect the JSON documents while the scraper runs
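
The disk-based variant of these steps can be sketched with Ruby's standard library alone. Everything here is an assumption for illustration: `scrape_people` is a hypothetical scraping task that yields plain hashes, the file naming is invented, and `valid?` is a much simpler stand-in for JSON Schema validation.

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# Hypothetical scraping task: yields plain hashes rather than Pupa.rb models.
def scrape_people
  yield({ _id: '1', _type: 'person', name: 'Jane Doe' })
  yield({ _id: '2', _type: 'person', name: '' })
end

# Hypothetical stand-in for JSON Schema validation: a person needs a name.
def valid?(document)
  document['name'].is_a?(String) && !document['name'].empty?
end

output_dir = Dir.mktmpdir

# Dump each yielded object as its own JSON document.
scrape_people do |object|
  path = File.join(output_dir, "#{object[:_type]}_#{object[:_id]}.json")
  File.write(path, JSON.pretty_generate(object))
end

# The documents can be inspected on disk; here they are read back and checked.
documents = Dir[File.join(output_dir, '*.json')].sort.map do |path|
  JSON.parse(File.read(path))
end
invalid = documents.reject { |document| valid?(document) }
puts "#{documents.size} documents, #{invalid.size} invalid"

FileUtils.remove_entry(output_dir)
```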

`import`

Pupa.rb loads the JSON documents either from disk or from Redis
It removes any duplicate objects
It determines an evaluation order, such that all referenced objects are saved before the referencing object
It creates or updates objects in MongoDB, changing primary keys and foreign keys as necessary
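
One way to compute such an evaluation order is a topological sort, which Ruby's standard library provides through `TSort`. The sketch below is not Pupa.rb's import code: `DependencyGraph` and the `references` hash are hypothetical, and map each object's _id to the _ids it references.

```ruby
require 'tsort'

# Hypothetical dependency graph: each key is an object's _id, each value
# lists the _ids that the object references through foreign keys.
class DependencyGraph
  include TSort

  def initialize(references)
    @references = references
  end

  def tsort_each_node(&block)
    @references.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @references.fetch(node).each(&block)
  end
end

references = {
  'membership' => ['person', 'organization'],
  'person' => [],
  'organization' => ['parent'],
  'parent' => [],
}

# tsort lists children before the nodes that depend on them, so every
# referenced object comes before the object that references it.
order = DependencyGraph.new(references).tsort
p order
```

If two objects reference each other, `tsort` raises `TSort::Cyclic`, so a real importer would need to handle that case separately.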
