Skip to content

jamur2/rsscluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rsscluster

Gensim document similarity demonstration using RSS feeds as document sources

About

Gensim is a Python library used to perform "topic modeling". The most practical purpose for topic modelling is finding similar documents. This simple script serves as an example of the usage and power of this library. Given a list of RSS feeds (in OPML format), the script will create a database of all stories contained in those feeds, and show some sample clusters of similar stories.

For example, if an OPML file contains both The Washington Post and New York Times feeds, a good cluster would include stories from both papers on the same event, such as a recent presidential tour of Africa.

Installation

First off, you'll need virtualenv. Depending on your operating system and your tolerance for system packages, you have a choice of installation methods. Install it the way you want (or just follow the directions).

Once you're done with that, you can set up your environment. From this directory, run:

$ virtualenv --no-site-packages .
$ bin/pip install -r requirements.txt

At this point, you'll probably have some errors. Numpy and scipy (two dependencies of gensim) are notoriously difficult to build on your first try. The first thing to check is that you have all of the dependencies installed (eg., gcc). Hopefully you'll see an error message indicating what's missing.

I also sometimes have an issue installing numpy from requirements.txt. If I build it separately with:

$ bin/pip install numpy==1.6.2

And then build the rest with:

$ bin/pip install -r requirements.txt

That can move the build process along. You also may get things working by installing system packages for numpy and scipy, but that can bring its own problems. If none of the above helps, I apologize, and can only say that Google is your friend.

Usage

First, you'll need an OPML file. Most readers allow you to export your feeds as OPML. If you can't, or you don't have enough feeds to generate interesting output, feel free to use my sample file in this directory, named "feeds.opml"

The basic usage of rsscluster.py is:

Usage: rsscluster.py [options] OPML_FILE

Options:
-h, --help            show this help message and exit
-t THRESHOLD, --threshold=THRESHOLD
                        Documents whose similarity is larger than this
                        threshold will be considered similar (0-1,
                        default=0.6)
-d DATE, --date=DATE  Publication date of stories to base clusters around
                        (format=YYY-MM-DD, default=today)
-s, --skip-training   Skip training (if you already have an existing
                        database, you may want to skip the training step)
-m, --html            HTML output
-f OUTPUT_FILE, --output-file=OUTPUT_FILE
                        Output file (default=stdout)

For example, if you want to generate HTML output of feeds.opml in feeds.html, you would run:

$ bin/python rsscluster.py --html --output-file=feeds.html feeds.opml

By default, rsscluster will generate clusters around stories published on the current day. If you want to generate clusters around stories published on other days, you would run:

$ bin/python rsscluster.py --date=2013-06-29 feeds.opml

Also, rsscluster keeps the database around between runs. This way, as older stories fall off RSS feeds, they can still be indexed for similiarity in the future. Because the database sticks around, you don't really need to retrain it on each run; you just need to index the new documents. To skip the training phase, you run:

$ bin/python rsscluster.py --skip-training feeds.opml

About

Gensim document similarity demonstration using RSS feeds as document sources

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages