Telofy edited this page Jul 22, 2014 · 6 revisions

Precise Altruism

The Idea

While working at Ferret Go, Telofy had the idea of starting Twitter accounts that would tweet news articles about altruism and charity based on Kuerzr feeds. A full-text search for a small number of keywords, however, produced many false positives. Kashif Rasul’s course on Data Science at the Freie Universität Berlin then led to a cooperation between Lea and Telofy and spawned the idea for Precise Altruism.

The name is a reference to the concept of precision (as opposed to recall) and effective altruism. Consequently, our news sources consist of a Kuerzr feed, a Google News feed, and the feeds of several organizations associated with the effective altruism movement.

The Corpus

The unannotated training data was provided by Ferret Go and annotated by us according to principles set forth in our readme. The article texts themselves are not part of the repository.

The Classifier

The heart of our application is a classification pipeline built with scikit-learn. It uses tf-idf to generate a feature matrix from our news data and then a Stochastic Gradient Descent (SGD) classifier to assign each article one of our two categories.
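A pipeline of this shape can be sketched as follows (a minimal illustration, not the project's actual code; the toy training texts and labels below are invented stand-ins for the annotated corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# tf-idf features feeding an SGD classifier, as described above.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("clf", SGDClassifier(loss="hinge", penalty="l2",
                          shuffle=True, random_state=0)),
])

# Toy stand-ins for the annotated articles and their two categories.
texts = [
    "charity evaluator publishes new cost-effectiveness estimates",
    "donors pledge to give ten percent of their income",
    "local sports team wins championship game",
    "stock market closes higher after earnings reports",
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

pipeline.fit(texts, labels)
print(pipeline.predict(["new charity evaluation released"]))
```

Wrapping both steps in a `Pipeline` keeps the vectorizer and classifier fitted together, which matters later for cross-validation: each fold refits the tf-idf vocabulary on its own training split.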

We used grid search and cross-validation to determine the optimal classifier and an optimal set of parameters for it. Using only a small set of plausible parameters and three cross-validation splits, we quickly narrowed the initial ten classification algorithms down to the four that performed best on our data: Stochastic Gradient Descent, Logistic Regression, and two variants of the Support Vector Machine classifier. In our final, most finely tuned run, Stochastic Gradient Descent achieved an F1 score of 93%, about two percentage points more than the best of the other three classifiers.

The clearest takeaways from the grid search over a plausible set of SGD parameters were that the loss functions log, hinge, modified_huber, and perceptron all performed well; that the penalties l2 and elasticnet performed well; that enabling shuffling helped; that using bigrams in addition to unigrams was useful, whereas 3-grams brought no further improvement in the F1 score; and that the best values for alpha and n_iter varied widely among the best configurations.
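A search of this kind can be sketched with scikit-learn's `GridSearchCV` (the parameter values and toy data below are illustrative, not the project's actual grids; note that recent scikit-learn releases renamed the log loss and replaced `n_iter` with `max_iter`, so the grid here sticks to names valid across versions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(shuffle=True, random_state=0)),
])

# Parameters prefixed with the pipeline step they belong to.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. + bigrams
    "clf__loss": ["hinge", "modified_huber", "perceptron"],
    "clf__penalty": ["l2", "elasticnet"],
    "clf__alpha": [1e-5, 1e-4, 1e-3],
}

# Three splits, scored by F1, as in the write-up above.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)

# Invented stand-ins for the annotated corpus (three per category so
# that three stratified splits are possible).
texts = [
    "effective altruism charity evaluation published today",
    "donors pledge to give more to effective charities",
    "new research on global health charity impact",
    "local team wins the championship game",
    "markets close higher after strong earnings",
    "celebrity couple announces a summer wedding",
]
labels = ["relevant"] * 3 + ["irrelevant"] * 3

search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, the grid can vary `ngram_range` and the classifier's parameters in a single search.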

The Daemon

The daemon is the service that runs continuously on the server and periodically checks the source feeds. It sends If-Modified-Since and If-None-Match headers whenever possible to minimize server load and traffic. New feed entries are then compared to those in the database to filter out known ones; as part of this step, we also compute the Jaccard distance between the preprocessed titles to avoid posting the same press release over and over.
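The two deduplication aids described above can be sketched like this (function names and the word-level preprocessing are illustrative assumptions, not the daemon's actual code):

```python
import urllib.error
import urllib.request


def fetch_feed(url, etag=None, last_modified=None):
    """Fetch a feed, sending cached validators; return None on 304."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: nothing new to process
            return None
        raise


def jaccard_distance(title_a, title_b):
    """1 - |intersection| / |union| over lowercased word sets."""
    a = set(title_a.lower().split())
    b = set(title_b.lower().split())
    return 1 - len(a & b) / len(a | b)


# Two outlets carrying the same press release yield near-zero distance,
# so the second copy can be filtered out.
d = jaccard_distance("Charity X Announces New Grant",
                     "charity x announces new grant program")
```

A 304 response costs almost no bandwidth on either side, which is what makes polling the feeds continuously acceptable.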

The articles associated with these entries are then fetched, stripped of boilerplate using Readability, summarized using Sumy, and finally posted to Tumblr. We extended the extraction step to also extract a featured image, and we added a naive keyword extraction to generate the post tags on Tumblr. Since the articles are also posted to Twitter, we found it necessary to truncate overlong titles within the tweets and to do so gracefully.
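Graceful truncation might look roughly like this (the helper name and limits are assumptions, not the project's actual code; at the time, tweets were capped at 140 characters and an attached link consumed a fixed-length t.co slot):

```python
TWEET_LIMIT = 140
LINK_LENGTH = 23  # every link counts as a fixed-length t.co URL


def truncate_title(title, limit=TWEET_LIMIT - LINK_LENGTH - 1):
    """Shorten a title at a word boundary and append an ellipsis."""
    if len(title) <= limit:
        return title
    # Cut one character short of the limit to leave room for the
    # ellipsis, then back up to the last full word.
    truncated = title[: limit - 1].rsplit(" ", 1)[0]
    return truncated.rstrip(" ,;:") + "…"


short = truncate_title("GiveWell publishes updated charity recommendations")
long_title = truncate_title(
    "A very long headline about the effectiveness of deworming "
    "interventions in low-income countries and what new evidence says "
    "about their long-term impact on earnings"
)
```

Cutting at a word boundary and trimming trailing punctuation is what makes the truncation "graceful": the tweet never ends mid-word or with a dangling comma.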

The Website

Please follow the Altrunews Tumblr.
