Stack: Python with predominantly its standard libraries, Selenium, Beautiful Soup, aiohttp, MongoDB, PostgreSQL. RESTful API via custom async web-server.

rustworthy/WebCrawler

REDDIT SCRAPER

You may scrape but don't you rape!

ABOUT

This Reddit scraper collects information on posts and their authors from a given Reddit page. The collected data is saved to a text file, a database table, or a database collection, while logging output goes to a log file. CRUD operations are performed via a fairly simple RESTful API.

INSTALLATION

  1. Create a directory on your machine and set up a virtual environment there with Python >= 3.9.
  2. Download the files from this repo and place them next to the folder with your newly created virtual environment.
  3. Install the packages from the requirements.txt file into your virtual environment (either with $ pip install -r requirements.txt or with the package manager you'd normally use).
  4. To run the scraper you'll need the Chrome browser on your machine, as well as a ChromeDriver of the corresponding version. To find out which version of ChromeDriver you need, open Chrome and go to Customize and control Google Chrome -> Settings -> About Chrome.
  5. To be able to perform CRUD operations against a database, make sure you have connections to PostgreSQL and MongoDB. Both services are available free of charge for development purposes.
  6. To run CRUD operations on the output file/table/collection you will need either your favourite command-line utility or a GUI app. Postman will come in handy; it is also free of charge, at least to the extent needed.
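To double-check that your ChromeDriver matches your Chrome build, you can compare their major versions programmatically. The helper below is only a sketch, not part of this repo: the function names and the Linux binary names ("google-chrome", "chromedriver") are assumptions.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Extract the major version number from a '--version' string."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return int(match.group(1))


def binary_major_version(binary: str) -> int:
    """Ask a binary (e.g. 'google-chrome' or 'chromedriver') for its major version."""
    output = subprocess.run([binary, "--version"],
                            capture_output=True, text=True).stdout
    return parse_major_version(output)


if __name__ == "__main__":
    # Binary names differ per OS; these are common Linux defaults.
    chrome = binary_major_version("google-chrome")
    driver = binary_major_version("chromedriver")
    print("match" if chrome == driver
          else f"mismatch: Chrome {chrome} vs ChromeDriver {driver}")
```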

LAUNCHING AND USING

Move to the folder with the files downloaded from this repo, open a terminal and run python main.py --help. You will see the arguments you'll need to provide as dash-dash flags to run the program. Here's the demo.

$ python main.py --help

usage: main.py [-h]
               [--chromedriver-path CHROMEDRIVER_PATH]
               [--target-dir-path TARGET_DIR_PATH]
               [--url URL]
               [--number NUMBER]
               [--host HOST]
               [--port PORT]
               [--server SERVER]
               [--postgres-host POSTGRES_HOST]
               [--postgres-port POSTGRES_PORT]
               [--postgres-db POSTGRES_DB]
               [--postgres-user POSTGRES_USER]
               [--postgres-pass POSTGRES_PASS]
               [--mongo-host MONGO_HOST]
               [--mongo-port MONGO_PORT]
               [--mongo-db MONGO_DB]
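The flags above map onto a standard argparse parser. A minimal sketch of how such a parser could be declared (this mirrors the usage output, but the types and grouping are assumptions, not the repo's actual main.py):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    # Scraper inputs
    parser.add_argument("--chromedriver-path")
    parser.add_argument("--target-dir-path")
    parser.add_argument("--url")
    parser.add_argument("--number", type=int)
    # Web-server settings
    parser.add_argument("--host")
    parser.add_argument("--port", type=int)
    parser.add_argument("--server")
    # PostgreSQL connection details
    parser.add_argument("--postgres-host")
    parser.add_argument("--postgres-port", type=int)
    parser.add_argument("--postgres-db")
    parser.add_argument("--postgres-user")
    parser.add_argument("--postgres-pass")
    # MongoDB connection details
    parser.add_argument("--mongo-host")
    parser.add_argument("--mongo-port", type=int)
    parser.add_argument("--mongo-db")
    return parser


# Example invocation with a couple of the flags:
args = build_parser().parse_args(
    ["--url", "https://www.reddit.com/top/", "--number", "100"])
```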

Open the settings.py file and override your --chromedriver-path and --target-dir-path. Also make sure your database instances are running (on Linux, check with service postgresql status and systemctl status mongod). Note that you must set your database credentials and details.

Before you launch the script, open the main.py module and set the type of CRUD executor you would like to use: TXT, SQL (Postgres), or NoSQL (Mongo). The info will then be saved to a txt file, tables, or collections respectively.
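Choosing between the three executors can be pictured as a simple lookup. This is a sketch under the assumption that each executor exposes the same CRUD interface; the class names here are hypothetical stand-ins, not the repo's actual ones.

```python
# Hypothetical stand-ins for the repo's TXT / SQL / NoSQL executors:
class TxtExecutor: ...
class PostgresExecutor: ...
class MongoExecutor: ...


EXECUTORS = {
    "TXT": TxtExecutor,
    "SQL": PostgresExecutor,
    "NoSQL": MongoExecutor,
}


def make_executor(kind: str):
    """Return a CRUD executor for the chosen backend; unknown kinds fail loudly."""
    try:
        return EXECUTORS[kind]()
    except KeyError:
        raise ValueError(f"unknown executor type: {kind!r}") from None


executor = make_executor("TXT")
```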

After you launch the script, it will collect raw info from the webpage, process it and temporarily place it into a collector dict (i.e. it will be held in RAM).
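The in-RAM collector can be thought of as an ordered mapping from a unique post id to the processed post data. A minimal sketch (the field names and ids are illustrative, not the scraper's real schema):

```python
# Regular dicts preserve insertion order in Python 3.7+, so the "first
# entry" that PUT /posts/ would persist is simply the oldest one.
collector: dict[str, dict] = {}


def add_post(unique_id: str, post: dict) -> None:
    """Stash a processed post in RAM until the API persists it."""
    collector[unique_id] = post


def pop_first() -> tuple[str, dict]:
    """Remove and return the oldest collected entry."""
    unique_id = next(iter(collector))
    return unique_id, collector.pop(unique_id)


add_post("a1b2", {"post_url": "url-1", "username": "u/example", "karma": 1234})
add_post("c3d4", {"post_url": "url-2", "username": "u/other", "karma": 56})
```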

Open Postman and use the following URIs for the corresponding CRUD operations.

Here is the scheme:

http://localhost:8087/posts/ with PUT --> to fetch the first entry from RAM and save it.
http://localhost:8087/posts/remaining/ with PUT --> to fetch all remaining entries from RAM and save them.
http://localhost:8087/posts/ with GET --> to get all the entries already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with GET --> to get a specific entry already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with PUT --> to update a specific entry in the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with DELETE --> to remove a specific entry from the file/database.
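From Python, the same scheme can be exercised with any HTTP client. The sketch below only builds the (method, URL) pair for each logical operation rather than issuing live requests, since it assumes a server running on localhost:8087; the operation names are my own labels, not part of the API.

```python
from typing import Optional, Tuple

BASE = "http://localhost:8087/posts"


def crud_request(operation: str,
                 unique_id: Optional[str] = None) -> Tuple[str, str]:
    """Map a logical CRUD operation onto the scraper's (HTTP method, URL) scheme."""
    routes = {
        "save_first": ("PUT", f"{BASE}/"),
        "save_remaining": ("PUT", f"{BASE}/remaining/"),
        "get_all": ("GET", f"{BASE}/"),
        "get_one": ("GET", f"{BASE}/{unique_id}/"),
        "update": ("PUT", f"{BASE}/{unique_id}/"),
        "delete": ("DELETE", f"{BASE}/{unique_id}/"),
    }
    return routes[operation]


method, url = crud_request("update", unique_id="a1b2c3")
```

With aiohttp, for instance, the pair could then be dispatched as `await session.request(method, url, json=payload)`.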

Here are some API demos:

  [demo screenshot: http://localhost:8087/posts/ with PUT]

  [demo screenshot: http://localhost:8087/posts/ with GET]

  [demo screenshot: http://localhost:8087/posts/UNIQUE_ID/ with PUT]

Go try the Reddit scraper!

DISCLAIMER

You are strongly discouraged from abusing the program by running massive and frequent requests against the Reddit servers. Remember, you may scrape but don't you rape!
