A simple web crawler.
Given a starting URL, it crawls the site and creates a simple sitemap in JSON format showing the links between URLs. It is intended to be run using AWS serverless services, and the output is uploaded to an S3 bucket.
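As a rough illustration only (the field layout below is an assumption, not the spider's actual output schema), the sitemap can be thought of as mapping each crawled URL to the links found on that page:

# Hypothetical shape of the sitemap data; the real schema is defined by the spider.
sitemap_example = {
    "http://books.toscrape.com": [
        "http://books.toscrape.com/index.html",
        "http://books.toscrape.com/catalogue/page-2.html",
    ],
}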
- The crawler uses Scrapy (a free and open-source web-crawling framework written in Python).
- AWS account: the crawler is intended to be run on AWS serverless services (API Gateway, Lambda, Step Functions, S3).
- Serverless Framework: deployment is done via the Serverless Framework, which can be installed as a Node.js package.
- Serverless Framework plugins: serverless-python-requirements and serverless-pseudo-parameters.
To set up and deploy:
- Get Python 3; this was tested against 3.7, but any 3.x should do.
- Install the Serverless Framework:
npm install -g serverless
- Change directory to the project folder and install the Serverless plugins:
sls plugin install -n serverless-python-requirements
sls plugin install -n serverless-pseudo-parameters
- Deploy to AWS:
sls deploy
Once deployed, sls will output a URL to be used, similar to:
Serverless StepFunctions OutPuts
endpoints:
POST - https://7hy0y72rb1.execute-api.eu-west-1.amazonaws.com/dev/startCrawl
Use the URL provided and post a payload similar to the ones below. Simple payload:
{
  "spiderConfig": {
    "url": "http://books.toscrape.com"
  }
}
More complex example:
{
  "spiderConfig": {
    "url": "http://books.toscrape.com",
    "dry_run": "no",
    "scrapy_settings": {
      "LOG_LEVEL": "ERROR",
      "CONCURRENT_ITEMS": "400",
      "CONCURRENT_REQUESTS": "64",
      "CONCURRENT_REQUESTS_PER_DOMAIN": "32",
      "CONCURRENT_REQUESTS_PER_IP": "0",
      "DNSCACHE_ENABLED": "True"
    }
  }
}
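Any HTTP client can be used to send the request. As a minimal sketch in Python using only the standard library (the endpoint URL below is the placeholder from above; use the one printed by sls deploy):

# Sketch: POST a crawl request to the deployed endpoint.
import json
import urllib.request

endpoint = "https://7hy0y72rb1.execute-api.eu-west-1.amazonaws.com/dev/startCrawl"
payload = {
    "spiderConfig": {
        "url": "http://books.toscrape.com",
        "dry_run": "no",
    }
}

req = urllib.request.Request(
    endpoint,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))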
- Required: url is the starting URL for the crawl.
- Optional: dry_run tells the spider whether to do a dry run. Values: yes or no (defaults to no if not set).
- Optional: scrapy_settings can be used to set spider settings. For a list of possible settings, see Scrapy's Built-in settings reference.
- Optional: spiderConfig.scrapy_settings.FEED_URI can override the S3 bucket the results are uploaded to. Defaults to s3://url-crawler-#{AWS::AccountId} (can use /tmp/results.json if testing locally).
The results will be uploaded to an S3 bucket named url-crawler-#{AWS::AccountId}.
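To check the uploaded results, you can list the bucket contents; a minimal sketch using boto3 (an extra dependency, and it assumes your AWS credentials are configured for the same account):

# Sketch: list the crawl results in the default output bucket
# (url-crawler-<account id>, resolved from the current credentials).
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket = f"url-crawler-{account_id}"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])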
To run it locally you'll have to:
- virtualenv venv --python=python3
- source venv/bin/activate
- pip install Scrapy
- sls invoke local -f crawl -p payload.json
There is a payload.json file to be used as an example; adjust as you see fit. Results will be sent to the FEED_URI as set in payload.json.
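For local testing, you can point FEED_URI at a local file instead of the S3 bucket. A minimal sketch that writes such a payload.json (the values shown are only illustrative):

# Sketch: write a payload.json that keeps the results local for testing.
# FEED_URI points at /tmp/results.json instead of the default S3 bucket.
import json

payload = {
    "spiderConfig": {
        "url": "http://books.toscrape.com",
        "dry_run": "no",
        "scrapy_settings": {
            "LOG_LEVEL": "ERROR",
            "FEED_URI": "/tmp/results.json",
        },
    }
}

with open("payload.json", "w") as f:
    json.dump(payload, f, indent=2)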
Just run sls deploy with the desired parameters if needed (e.g. -r region -s stage).