This Reddit scraper collects info on posts and their authors from a given Reddit web page. The collected info is saved to a txt file, a database table, or a database collection, while logging info goes to a log file. CRUD operations are performed via a simple RESTful API.
- Create a directory on your machine and set up a virtual environment in it with Python >= 3.9.
- Download the files from this repo and place them next to the folder with your newly created virtualenv.
- Install the packages from the requirements.txt file into your virtualenv
(either with
$ pip install -r requirements.txt
or with the package manager you'd normally use).
- To run the scraper you'll need the Chrome browser on your machine, as well as a ChromeDriver of a matching version. To find out which version of ChromeDriver you need, open Chrome and hit Customize and control Google Chrome -> Settings -> About Chrome.
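Chrome and ChromeDriver are matched by their major version (the number before the first dot). As a sketch, here is one way to pull that major version out of the version strings the browser and driver report; the sample strings below are illustrative, not tied to this repo:

```python
import re

def chrome_major_version(version_output: str) -> str:
    """Extract the major version from a browser/driver version string,
    e.g. 'Google Chrome 120.0.6099.109' -> '120'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version found in: {version_output!r}")
    return match.group(1)

# The major versions of Chrome and ChromeDriver should match:
print(chrome_major_version("Google Chrome 120.0.6099.109"))  # -> 120
print(chrome_major_version("ChromeDriver 120.0.6099.109"))   # -> 120
```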
- To be able to perform CRUD operations within a database, make sure you have connections to PostgreSQL and MongoDB. Both services are available for free for development purposes.
- To run CRUD operations on the output file/table/collection you will need either your favourite command line utility or a GUI app. Postman will come in handy; it is also free of charge, at least to the extent needed here.
Move to the folder with the downloaded files from this repo, open the terminal and run python main.py --help.
You will see the arguments you'll need to provide as --flags to run the program. Here's a demo:
$ python main.py --help
usage: main.py [-h]
[--chromedriver-path CHROMEDRIVER_PATH]
[--target-dir-path TARGET_DIR_PATH]
[--url URL]
[--number NUMBER]
[--host HOST]
[--port PORT]
[--server SERVER]
[--postgres-host POSTGRES_HOST]
[--postgres-port POSTGRES_PORT]
[--postgres-db POSTGRES_DB]
[--postgres-user POSTGRES_USER]
[--postgres-pass POSTGRES_PASS]
[--mongo-host MONGO_HOST]
[--mongo-port MONGO_PORT]
[--mongo-db MONGO_DB]
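For reference, a parser producing the usage message above could look roughly like the sketch below. The defaults and help strings here are illustrative guesses, not necessarily what this repo's main.py actually uses:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI shown above; defaults are illustrative.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--chromedriver-path", help="path to the ChromeDriver binary")
    parser.add_argument("--target-dir-path", help="directory for the output/log files")
    parser.add_argument("--url", help="Reddit page to scrape")
    parser.add_argument("--number", type=int, help="number of posts to collect")
    parser.add_argument("--host", default="localhost", help="REST API host")
    parser.add_argument("--port", type=int, default=8087, help="REST API port")
    parser.add_argument("--server", help="server type")
    parser.add_argument("--postgres-host")
    parser.add_argument("--postgres-port", type=int)
    parser.add_argument("--postgres-db")
    parser.add_argument("--postgres-user")
    parser.add_argument("--postgres-pass")
    parser.add_argument("--mongo-host")
    parser.add_argument("--mongo-port", type=int)
    parser.add_argument("--mongo-db")
    return parser

args = build_parser().parse_args(["--url", "https://www.reddit.com/top/", "--number", "100"])
print(args.url, args.number)
```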
Open the settings.py file and overwrite your --chromedriver-path and --target-dir-path.
Also make sure your db instances are running (on Linux, use service postgresql status and systemctl status mongod).
Note that you must set your db credentials and details.
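If you are not on Linux, or just want a quick sanity check from Python, a rough liveness probe is to try opening a TCP connection to each service's default port. This does not validate credentials, and the ports below are the usual defaults, not necessarily the ones in your settings.py:

```python
import socket

def is_service_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.
    A rough liveness check only; it does not validate credentials."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usual defaults; adjust to your own settings:
print("postgres:", is_service_up("localhost", 5432))
print("mongo:", is_service_up("localhost", 27017))
```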
Before you launch the script, go to the main.py module and set the type of CRUD executor you would like to use: TXT, SQL (Postgres), or NoSQL (Mongo). The info will be saved to a txt file, to tables, or to collections, respectively.
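One plausible way to wire up such a choice is a dict dispatching the executor name to a class. The class and method names below are illustrative, not the ones actually used in this repo:

```python
# Illustrative sketch of executor selection; names are hypothetical.
class TxtExecutor:
    def save(self, entry: dict) -> None:
        print("appending to txt file:", entry)

class SqlExecutor:
    def save(self, entry: dict) -> None:
        print("inserting into Postgres tables:", entry)

class NoSqlExecutor:
    def save(self, entry: dict) -> None:
        print("inserting into Mongo collections:", entry)

EXECUTORS = {"TXT": TxtExecutor, "SQL": SqlExecutor, "NoSQL": NoSqlExecutor}

executor = EXECUTORS["TXT"]()  # switch to "SQL" or "NoSQL" as needed
executor.save({"post_url": "https://www.reddit.com/...", "username": "someone"})
```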
After you launch the script, it will collect the raw info from the web page, process it, and temporarily place it into a collector dict (so it will be held in RAM).
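Conceptually, that in-RAM collector can be pictured as a dict of entries keyed by a unique id; since Python 3.7 dicts preserve insertion order, "the first entry" is well defined. The field names here are illustrative:

```python
from uuid import uuid4

# Sketch of the in-RAM collector: scraped entries keyed by a unique id.
collector = {}
for username in ("alice", "bob"):
    collector[uuid4().hex] = {"username": username, "karma": 0}

def pop_first(store: dict):
    """Remove and return the oldest (unique_id, entry) pair,
    as a PUT to /posts/ might do."""
    unique_id = next(iter(store))
    return unique_id, store.pop(unique_id)

uid, entry = pop_first(collector)
print(entry["username"], "remaining:", len(collector))  # -> alice remaining: 1
```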
Open Postman and use the following URIs for the corresponding CRUD operations.
Here is the scheme:
http://localhost:8087/posts/ with PUT --> fetch the first entry from RAM and save it.
http://localhost:8087/posts/remaining/ with PUT --> fetch all remaining entries from RAM and save them.
http://localhost:8087/posts/ with GET --> get all the entries already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with GET --> get a specific entry already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with PUT --> update a specific entry in the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with DELETE --> remove a specific entry from the file/database.
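If you'd rather script the calls than click through Postman, a minimal client can be sketched with the standard library alone. It assumes the scraper's REST server is already running on localhost:8087; the helper names are hypothetical:

```python
import json
import urllib.request

BASE = "http://localhost:8087/posts/"

def post_uri(unique_id=None, remaining=False):
    """Build one of the URIs from the scheme above."""
    if remaining:
        return BASE + "remaining/"
    if unique_id is not None:
        return BASE + str(unique_id) + "/"
    return BASE

def call_api(url, method, payload=None):
    """Minimal urllib helper; assumes the REST server is running."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    with urllib.request.urlopen(request) as response:
        return response.read()

# With the server running you could do, for example:
# call_api(post_uri(), "PUT")                # save the first entry from RAM
# call_api(post_uri(remaining=True), "PUT")  # save all remaining entries
# call_api(post_uri(), "GET")                # list everything saved so far
# call_api(post_uri("some-id"), "DELETE")    # remove one saved entry
```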
Here are some API demos:
http://localhost:8087/posts/ with PUT
http://localhost:8087/posts/ with GET
http://localhost:8087/posts/UNIQUE_ID/ with PUT
Go try the Reddit scraper!
You are strongly discouraged from abusing the program with massive and frequent requests to the Reddit servers. Scrape responsibly.