A command-line tool written in Python 3 to scrape HTML and PDF articles from a given URL.
- Install Python 3
- cd to the project folder
- Create and activate a virtual env, e.g.:
python3 -m virtualenv env && source env/bin/activate
- Install required libraries:
pip install -r requirements.txt
- Install the application:
pip install .
- Ensure that you are in the virtualenv where the libraries were installed (see step 3 in Installation)
- cd to the project folder and run:
python -m unittest discover -s tests
To view the available command-line options, run in a terminal: scrape -h
scrape $url --dry-run
Where $url is the URL you wish to scrape (the content type must be HTML or PDF).
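The content-type requirement can be illustrated with a small sketch. This is not the tool's actual code: the `classify_content_type` function and the `SUPPORTED_TYPES` mapping are hypothetical names, and the assumption is that the check is made against the `Content-Type` header of the HTTP response.

```python
from typing import Optional

# Hypothetical mapping; the media types the tool really accepts may differ.
SUPPORTED_TYPES = {
    "text/html": "html",
    "application/pdf": "pdf",
}


def classify_content_type(header_value: str) -> Optional[str]:
    """Return "html" or "pdf" for a supported Content-Type value, else None."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return SUPPORTED_TYPES.get(media_type)
```

Under these assumptions, `classify_content_type("text/html; charset=UTF-8")` yields `"html"`, while an unsupported value such as `"image/png"` yields `None`.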
scrape $url
The JSON file will be saved in a /articles directory.
The directory will be created if it doesn't exist and the location will be printed as the articles are saved.
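The save behaviour described above could look roughly like the following sketch. The `save_article` function, its parameters, and the printed message are assumptions made for illustration; the tool's real implementation and file naming may differ.

```python
import json
from pathlib import Path


def save_article(article: dict, output_dir: str, filename: str) -> Path:
    """Write one article as JSON, creating the output directory if needed."""
    directory = Path(output_dir)
    # parents=True also creates missing parent folders;
    # exist_ok=True makes this a no-op when the directory already exists.
    directory.mkdir(parents=True, exist_ok=True)

    target = directory / filename
    with target.open("w", encoding="utf-8") as handle:
        json.dump(article, handle, ensure_ascii=False, indent=2)

    # Report where the file ended up (hypothetical message format).
    print(f"Saved {target.resolve()}")
    return target
```

For example, `save_article({"title": "Example"}, "articles", "example.json")` would create `articles/` on first use and print the full path of the saved file.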
scrape $url -O /path/to/custom/directory
scrape $url1 $url2 $url3
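The invocations shown in this README (one or more URLs, `--dry-run`, and `-O`) can be modelled with `argparse` as in the sketch below. It only illustrates how such a command line might be parsed. The `build_parser` function, the `urls` and `output_dir` attribute names, the help strings, and the `articles` default are assumptions, not the tool's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options shown in this README."""
    parser = argparse.ArgumentParser(
        prog="scrape",
        description="Scrape HTML and PDF articles from one or more URLs.",
    )
    # nargs="+" accepts one or more URLs, e.g. scrape $url1 $url2 $url3.
    parser.add_argument("urls", nargs="+", metavar="url", help="URL to scrape")
    # Hypothetical help text; the README does not describe what a dry run does.
    parser.add_argument("--dry-run", action="store_true", help="perform a dry run")
    # Assumed default, chosen to mirror the articles directory mentioned above.
    parser.add_argument(
        "-O",
        dest="output_dir",
        default="articles",
        metavar="DIR",
        help="directory in which to save the JSON files",
    )
    return parser
```

With this parser, `build_parser().parse_args(["https://example.com/a", "-O", "/tmp/out"])` gives a namespace whose `urls` is a one-element list and whose `output_dir` is `/tmp/out`.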