
indicnlp/newspaper-crawler-scripts

 
 


Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below, grouped by language.

Available scripts

Tamil

| Site | URL | Script |
| --- | --- | --- |
| Nakkheeran | http://nakkheeran.in/ | tamil/crawler-nakkheeran.py |
| Dailythanthi | http://dailythanthi.com/ | tamil/crawler-dailythanthi.py |
| Tamil The Hindu | http://tamil.thehindu.com/ | tamil/crawler-tamil-hindu.py |
| Puthiyathalaimurai | http://puthiyathalaimurai.com/ | tamil/crawler-puthiyathalaimurai.py |
| Dinamani | http://dinamani.com/ | tamil/crawler-dinamani.py |

Malayalam

| Site | URL | Script |
| --- | --- | --- |
| Manorama | http://www.manoramaonline.com/ | malayalam/crawler-manorama.py |

Contribute

Scripts for more news websites are welcome. Save the scraped text in UTF-8 encoding. To choose a site, refer to the newspapers list file and pick one to scrape.
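As an illustration of the UTF-8 requirement, here is a minimal sketch of how a new script might write scraped text to disk. The helper name `save_text` and its arguments are hypothetical and do not come from the existing scripts; the key point is passing `encoding="utf-8"` explicitly instead of relying on the platform default.

```python
from pathlib import Path


def save_text(path, text):
    """Write scraped text to `path` as UTF-8 (hypothetical helper).

    Parent directories are created if they do not exist yet.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Always name the encoding: the platform default may not be UTF-8,
    # which would corrupt Tamil or Malayalam text on some systems.
    target.write_text(text, encoding="utf-8")
    return target
```

Reading the file back with `Path(path).read_text(encoding="utf-8")` should return the same string that was scraped.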

Todo

[ ] Extract common code into a decorator

Setup

pip3 install -r requirements.txt

Latest Script

The most recent script, crawler-viduthalai4.py in the tamil directory, uses the latest MultiThreadedCrawler2.

Directory structure

<newspaper_name>
  title.list --> acts as an index for the other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan  
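
The layout above can be sketched in code as follows. This is only an illustration under assumptions: the function name `output_paths`, the `root` argument, and the per-file naming are hypothetical, since the tree shows only the year and month levels and does not say how individual files are named.

```python
from datetime import date
from pathlib import Path

# Fixed abbreviations matching the tree above (Jan, May, Jun, ...),
# so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def output_paths(root, newspaper_name, published, filename):
    """Return the article and abstract paths for one news item.

    `filename` is an assumed per-item file name; the README does not
    specify a naming scheme.
    """
    base = Path(root) / newspaper_name
    subdir = Path(str(published.year)) / MONTHS[published.month - 1]
    return {
        "index": base / "title.list",
        "article": base / "articles" / subdir / filename,
        "abstract": base / "abstracts" / subdir / filename,
    }
```

For example, `output_paths("data", "dinamani", date(2018, 12, 5), "1.txt")` places the article under `data/dinamani/articles/2018/Dec/` and the abstract under `data/dinamani/abstracts/2018/Dec/`.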
