Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below.

Available scripts.

Tamil

Site	URL	Script
Nakkheeran	http://nakkheeran.in/	tamil/crawler-nakkheeran.py
Dailythanthi	http://dailythanthi.com/	tamil/crawler-dailythanthi.py
Tamil The Hindu	http://tamil.thehindu.com/	tamil/crawler-tamil-hindu.py
Puthiyathalaimurai	http://puthiyathalaimurai.com/	tamil/crawler-puthiyathalaimurai.py
Dinamani	http://dinamani.com/	tamil/crawler-dinamani.py

Malayalam

Site	URL	Script
Manorama	http://www.manoramaonline.com/	malayalam/crawler-manorama.py

Contribute

Scripts for more news websites are welcome. Save the scraped text in UTF-8 encoding. To choose a site, refer to the newspapers list file (newspapers.csv) and pick one that does not have a script yet.
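As a minimal sketch of the UTF-8 requirement, the helper below writes scraped text to disk with an explicit encoding rather than relying on the platform default. The function name `save_article` and the example path are hypothetical and are not taken from the existing scripts.

```python
from pathlib import Path


def save_article(text: str, path: Path) -> None:
    """Write scraped text to `path`, always encoded as UTF-8.

    `save_article` is a hypothetical helper name used for illustration only.
    """
    # Create the parent directories if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Pass the encoding explicitly; the platform default may not be UTF-8.
    path.write_text(text, encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly matters most on systems whose default encoding cannot represent Tamil or Malayalam characters.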

Todo

[ ] Extract common code into a decorator

Setup

pip3 install -r requirements.txt

Latest Script

The script tamil/crawler-viduthalai4.py uses the latest MultiThreadedCrawler2. Use it as the starting point for new scripts.

Directory structure

<newspaper_name>
  title.list --> acts as an index for the other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
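The layout above can be sketched as a small path-building helper. This is an illustration under assumptions, not code from the repository: the function name `output_path`, the `root` argument, and the file name are hypothetical, and the three-letter month names are inferred from the folder names shown above.

```python
from datetime import date
from pathlib import Path

# Month folder names, inferred from the examples above (Jan, May, Jun, ...).
# A fixed tuple is used so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def output_path(root: str, newspaper: str, kind: str,
                published: date, filename: str) -> Path:
    """Build <root>/<newspaper>/<kind>/<year>/<month>/<filename>.

    `kind` is either "articles" or "abstracts".
    All names here are hypothetical and used for illustration only.
    """
    if kind not in ("articles", "abstracts"):
        raise ValueError("kind must be 'articles' or 'abstracts'")
    month = MONTHS[published.month - 1]
    return Path(root) / newspaper / kind / str(published.year) / month / filename
```

For example, `output_path("data", "dinamani", "articles", date(2018, 12, 5), "1.txt")` yields the path `data/dinamani/articles/2018/Dec/1.txt`.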


indicnlp/newspaper-crawler-scripts
