Startup news scraping and classification

This repository contains the source code for MonkeyLearn's series of posts on analyzing startup news with machine learning models.

Code organization

The project is a Scrapy project that gathers data from sites such as TechCrunch and VentureBeat. It also includes Python scripts and Jupyter notebooks that implement the remaining logic, such as data processing and communication with the MonkeyLearn API.

Filtering startup news with Machine Learning

The TechCrunch, VentureBeat, and Recode spiders (startup_news/spiders) gather the data used to train a topic classifier in MonkeyLearn. Each sample text is built from the article's title, subtitle (if present), text, and tags. A subsample of the whole dataset has to be tagged by a human before a model can be trained.
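As an illustration of how those fields could be combined into one sample text, here is a minimal sketch. It is not the code used in the posts: the function name `build_sample_text`, the dictionary keys (`title`, `subtitle`, `text`, `tags`), and the newline separator are all assumptions made for this example.

```python
def build_sample_text(item):
    """Join an article's fields into one text sample.

    Assumes `item` is a dict with hypothetical keys "title", "subtitle",
    "text" and "tags"; the real spider items may use different names.
    """
    parts = [item.get("title"), item.get("subtitle"), item.get("text")]

    tags = item.get("tags") or []
    if isinstance(tags, str):
        # A CSV export flattens lists; assume a comma-separated string here.
        tags = tags.split(",")
    parts.append(" ".join(tag.strip() for tag in tags if tag.strip()))

    # Skip missing or empty fields, such as an article without a subtitle.
    return "\n".join(p.strip() for p in parts if p and p.strip())
```

An article with no subtitle then yields its title, text, and tags on separate lines.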

To crawl these sites, run:

scrapy crawl techcrunch -o itemsTechCrunch.csv
scrapy crawl venturebeat -o itemsVentureBeat.csv
scrapy crawl recode -o itemsRecode.csv
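Once the crawls have finished, the CSV files can be merged and a random subsample drawn for manual tagging. The sketch below is one possible way to do this with the standard library only. The helper names (`load_items`, `subsample`, `write_rows`) are hypothetical and not part of this repository, and the code makes no assumption about which columns the spiders export.

```python
import csv
import random


def load_items(paths):
    """Read every row from the given CSV files into a list of dicts."""
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows


def subsample(rows, k, seed=0):
    """Return up to k rows chosen at random; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def write_rows(rows, path):
    """Write rows to a CSV file, using the union of their column names."""
    fieldnames = sorted({key for row in rows for key in row if key is not None})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_rows(subsample(load_items(["itemsTechCrunch.csv", "itemsVentureBeat.csv", "itemsRecode.csv"]), 500), "to_tag.csv")` would write 500 randomly chosen articles to a file for tagging. Both the sample size and the output file name are placeholders.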

The untagged training set used in the post is available as training_set.csv.

Creating machine learning models to analyze startup news

Try out the events and industry classifier.ipynb is a notebook that does exactly what its name says, both with and without a pipeline. Feel free to try both versions and see which one performs better.

