Stack: Python with predominantly its standard libraries, Selenium, Beautiful Soup, aiohttp, MongoDB, PostgreSQL. RESTful API via custom async web-server.

rustworthy/WebCrawler

REDDIT SCRAPER

You may scrape but don't you rape!

ABOUT

This Reddit scraper collects information on posts and their authors from a given Reddit page. The collected data is saved to a text file, a database table, or a database collection, while logging output goes to a log file. CRUD operations are performed via a fairly simple RESTful API.

INSTALLATION

  1. Create a directory on your machine and set up a virtual environment there with Python >= 3.9.
  2. Download the files from this repo and place them next to the folder with your newly created virtual environment.
  3. Install the packages from the requirements.txt file into your virtual environment (either with $ pip install -r requirements.txt or with the package manager you'd normally use).
  4. To run the scraper you'll need the Chrome browser on your machine, as well as a ChromeDriver of the corresponding version. To find out which version of ChromeDriver you need, open Chrome and go to Customize and control Google Chrome -> Settings -> About Chrome.
  5. To be able to perform CRUD operations against a database, make sure you have connections to PostgreSQL and MongoDB. Both services are available free of charge for development purposes.
  6. To run CRUD operations on the output file/table/collection you will need either your favourite command-line utility or a GUI app. Postman will come in handy; it is also free of charge, at least to the extent needed.
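To double-check that your ChromeDriver matches your Chrome build, you can compare their major versions programmatically. The helper below is only a sketch, not part of this repo: the function names and the Linux binary names ("google-chrome", "chromedriver") are assumptions.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Extract the major version number from a '--version' string."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return int(match.group(1))


def binary_major_version(binary: str) -> int:
    """Ask a binary (e.g. 'google-chrome' or 'chromedriver') for its major version."""
    output = subprocess.run([binary, "--version"],
                            capture_output=True, text=True).stdout
    return parse_major_version(output)


if __name__ == "__main__":
    # Binary names differ per OS; these are common Linux defaults.
    chrome = binary_major_version("google-chrome")
    driver = binary_major_version("chromedriver")
    print("match" if chrome == driver
          else f"mismatch: Chrome {chrome} vs ChromeDriver {driver}")
```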

LAUNCHING AND USING

Move to the folder with the files downloaded from this repo, open a terminal and run python main.py --help. You will see the arguments you'll need to provide as dash-dash flags to run the program. Here's the demo.

$ python main.py --help

usage: main.py [-h]
               [--chromedriver-path CHROMEDRIVER_PATH]
               [--target-dir-path TARGET_DIR_PATH]
               [--url URL]
               [--number NUMBER]
               [--host HOST]
               [--port PORT]
               [--server SERVER]
               [--postgres-host POSTGRES_HOST]
               [--postgres-port POSTGRES_PORT]
               [--postgres-db POSTGRES_DB]
               [--postgres-user POSTGRES_USER]
               [--postgres-pass POSTGRES_PASS]
               [--mongo-host MONGO_HOST]
               [--mongo-port MONGO_PORT]
               [--mongo-db MONGO_DB]
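The flags above map onto a standard argparse parser. A minimal sketch of how such a parser could be declared (this mirrors the usage output, but the types and grouping are assumptions, not the repo's actual main.py):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    # Scraper inputs
    parser.add_argument("--chromedriver-path")
    parser.add_argument("--target-dir-path")
    parser.add_argument("--url")
    parser.add_argument("--number", type=int)
    # Web-server settings
    parser.add_argument("--host")
    parser.add_argument("--port", type=int)
    parser.add_argument("--server")
    # PostgreSQL connection details
    parser.add_argument("--postgres-host")
    parser.add_argument("--postgres-port", type=int)
    parser.add_argument("--postgres-db")
    parser.add_argument("--postgres-user")
    parser.add_argument("--postgres-pass")
    # MongoDB connection details
    parser.add_argument("--mongo-host")
    parser.add_argument("--mongo-port", type=int)
    parser.add_argument("--mongo-db")
    return parser


# Example invocation with a couple of the flags:
args = build_parser().parse_args(
    ["--url", "https://www.reddit.com/top/", "--number", "100"])
```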

Open the settings.py file and override your --chromedriver-path and --target-dir-path. Also make sure your database instances are running (on Linux, check with service postgresql status and systemctl status mongod). Note that you must set your database credentials and details.

Before you launch the script, open the main.py module and set the type of CRUD executor you would like to use: TXT, SQL (Postgres), or NoSQL (Mongo). The info will then be saved to a txt file, tables, or collections respectively.
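Choosing between the three executors can be pictured as a simple lookup. This is a sketch under the assumption that each executor exposes the same CRUD interface; the class names here are hypothetical stand-ins, not the repo's actual ones.

```python
# Hypothetical stand-ins for the repo's TXT / SQL / NoSQL executors:
class TxtExecutor: ...
class PostgresExecutor: ...
class MongoExecutor: ...


EXECUTORS = {
    "TXT": TxtExecutor,
    "SQL": PostgresExecutor,
    "NoSQL": MongoExecutor,
}


def make_executor(kind: str):
    """Return a CRUD executor for the chosen backend; unknown kinds fail loudly."""
    try:
        return EXECUTORS[kind]()
    except KeyError:
        raise ValueError(f"unknown executor type: {kind!r}") from None


executor = make_executor("TXT")
```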

After you launch the script, it will collect raw info from the webpage, process it and temporarily place it into a collector dict (i.e. it will be held in RAM).
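The in-RAM collector can be thought of as an ordered mapping from a unique post id to the processed post data. A minimal sketch (the field names and ids are illustrative, not the scraper's real schema):

```python
# Regular dicts preserve insertion order in Python 3.7+, so the "first
# entry" that PUT /posts/ would persist is simply the oldest one.
collector: dict[str, dict] = {}


def add_post(unique_id: str, post: dict) -> None:
    """Stash a processed post in RAM until the API persists it."""
    collector[unique_id] = post


def pop_first() -> tuple[str, dict]:
    """Remove and return the oldest collected entry."""
    unique_id = next(iter(collector))
    return unique_id, collector.pop(unique_id)


add_post("a1b2", {"post_url": "url-1", "username": "u/example", "karma": 1234})
add_post("c3d4", {"post_url": "url-2", "username": "u/other", "karma": 56})
```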

Open Postman and use the following URIs for the corresponding CRUD operations.

Here is the scheme:

http://localhost:8087/posts/ with PUT --> to fetch the first entry from RAM and save it.
http://localhost:8087/posts/remaining/ with PUT --> to fetch all remaining entries from RAM and save them.
http://localhost:8087/posts/ with GET --> to get all the entries already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with GET --> to get a specific entry already saved to the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with PUT --> to update a specific entry in the file/database.
http://localhost:8087/posts/UNIQUE_ID/ with DELETE --> to remove a specific entry from the file/database.
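From Python, the same scheme can be exercised with any HTTP client. The sketch below only builds the (method, URL) pair for each logical operation rather than issuing live requests, since it assumes a server running on localhost:8087; the operation names are my own labels, not part of the API.

```python
from typing import Optional, Tuple

BASE = "http://localhost:8087/posts"


def crud_request(operation: str,
                 unique_id: Optional[str] = None) -> Tuple[str, str]:
    """Map a logical CRUD operation onto the scraper's (HTTP method, URL) scheme."""
    routes = {
        "save_first": ("PUT", f"{BASE}/"),
        "save_remaining": ("PUT", f"{BASE}/remaining/"),
        "get_all": ("GET", f"{BASE}/"),
        "get_one": ("GET", f"{BASE}/{unique_id}/"),
        "update": ("PUT", f"{BASE}/{unique_id}/"),
        "delete": ("DELETE", f"{BASE}/{unique_id}/"),
    }
    return routes[operation]


method, url = crud_request("update", unique_id="a1b2c3")
```

With aiohttp, for instance, the pair could then be dispatched as `await session.request(method, url, json=payload)`.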

Here are some API demos:

  [demo screenshot: http://localhost:8087/posts/ with PUT]

  [demo screenshot: http://localhost:8087/posts/ with GET]

  [demo screenshot: http://localhost:8087/posts/UNIQUE_ID/ with PUT]

Go try the Reddit scraper!

DISCLAIMER

You are strongly discouraged from abusing the program by running massive and frequent requests against the Reddit servers. Remember, you may scrape but don't you rape!
