GSoC 2014 Proposal: Improved Linear Models

Name: Manoj Kumar

Email: manojkumarsivaraj334@gmail.com

IRC: ManojKumar

GitHub: MechCoder

Blog: http://manojbits.wordpress.com/

Programming Skills

Projects done

I am Manoj, a Mechanical Engineering junior at Birla Institute of Technology and Science, Goa. Python has been my primary programming language and I have done a number of useful projects in it, including a GUI for solving thermodynamics problems (ThermoPython). My other minor projects can be seen here (Projects) and on my blog, where I occasionally write notes. Though I major in Mechanical Engineering, I have taken a number of Math courses as electives out of interest, such as Optimization, Numerical Analysis, and Graphs and Networks. I participated in Google Summer of Code 2013 under SymPy (a symbolic mathematics library), in which I added Lie Group and SeriesSolve support to the Ordinary Differential Equation module and improved the Partial Differential Equation module to an extent. This involved knowledge of advanced mathematics, which I picked up quickly before the application period. My proposal and commits can be seen here: (Proposal) and (Commits). Other than this, I have a semester's worth of programming experience in C++ (programming assignments), Matlab (for my Mechanical Engineering assignments), and a little Java for an Android app.

Machine Learning and Scikit Learn Experience

I have coded a few key Machine Learning algorithms, based on Andrew Ng's lectures and a few research papers, which can be seen here: https://github.com/Manoj-Kumar-S/Machine_Learning . I have been quite an active contributor to scikit-learn since December. I am in no way an expert in Machine Learning, and I look forward to Summer of Code as an opportunity to hone my ML skills. Here is a list of my Pull Requests, sorted by date (at the time of writing):

  1. https://github.com/scikit-learn/scikit-learn/pull/2493 - (not merged) - My first Pull Request. It could not be merged since we could not come to a conclusion on which is right, macro or micro averaging.
  2. https://github.com/scikit-learn/scikit-learn/pull/2566 - (merged) - A constant-output DummyClassifier that predicts a constant label regardless of the input.
  3. https://github.com/scikit-learn/scikit-learn/pull/2590 - (merged) - Simple error checking when y.ndim > 2 for ElasticNetCV and LassoCV.
  4. https://github.com/scikit-learn/scikit-learn/pull/2717 - (merged) - Testing LogLoss and HingeLoss under thresholded metrics.
  5. https://github.com/scikit-learn/scikit-learn/pull/2598 - (merged) - Added MultiTaskElasticNetCV and MultiTaskLassoCV.
  6. https://github.com/scikit-learn/scikit-learn/pull/2788 - (merged) - Proper fitting of alpha_grid for sparse matrices.
  7. https://github.com/scikit-learn/scikit-learn/pull/2866 - (merged) - Testing fit_intercept=True for RidgeCV using Newton's method.
  8. https://github.com/scikit-learn/scikit-learn/pull/2876 - (merged) - Fixing compute_class_weight for class_weight="auto".
  9. https://github.com/scikit-learn/scikit-learn/pull/2951 - (merged) - Speeding up sparse coordinate_descent.

Project Idea

Motivation

I have a decent knowledge of the codebase of the linear models, and have worked with my potential mentor Alexandre Gramfort on a number of pull requests. After proposing a few ideas which were considered a bit too heavy for a Summer of Code, I've zeroed in on this idea, based on the discussion on the mailing list linked below, since I believe it would be useful to both me and the sklearn community.

Mailing list discussion

My project idea is motivated by the following facts.

  1. There is a good chance that performing random updates, or skipping coordinates that are not active, will lead to faster convergence.
  2. For Logistic Regression, sklearn depends on an external library called LibLinear. This prevents an ElasticNet penalty and fitting a regularization path.
  3. sklearn also lacks a complete Logistic Regression model when the task is multi-output.

Abstract

I propose to implement the following goals this summer.

  1. Random / Cyclic co-ordinate descent

  2. Finishing Gael's Logistic Regression CV PR

  3. Finishing Larsmans' Multinomial Regression PR + MultinomialRegressionCV

  4. Strong Rules for (ElasticNet + Lasso) and (ElasticNetCV + LassoCV)

  5. (L1 + L2) regression using CDN

  6. Fixing any issues that I come across while working on the above goals.

Theory and Implementation

Random / Cyclic co-ordinate descent.

There is a pretty good possibility that random coordinate descent converges faster than updating all co-ordinates cyclically in every iteration. This can be done in two ways.

a] Picking random features to update across every iteration.

b] Permuting the order of the features before every iteration, which is similar to randomised descent but without replacement, as suggested by Mathieu. Care should be taken that the speed actually improves and tests aren't broken. Both strategies are sketched below.
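
To make the comparison concrete, here is a minimal, self-contained sketch of the two sampling strategies around a plain Lasso coordinate descent loop. The pure-Python cd_lasso and soft_threshold helpers are illustrative stand-ins for sklearn's Cython coordinate_descent code, not the actual implementation.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def cd_lasso(X, y, alpha, max_iter=100, order="cyclic", seed=0):
    """Coordinate descent on (1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(max_iter):
        if order == "cyclic":
            indices = np.arange(n_features)                    # current behaviour
        elif order == "permuted":
            indices = rng.permutation(n_features)              # b] without replacement
        else:
            indices = rng.randint(0, n_features, n_features)   # a] with replacement
        for j in indices:
            # Partial residual with coordinate j taken out of the model.
            residual = y - X.dot(w) + X[:, j] * w[j]
            rho = X[:, j].dot(residual)
            w[j] = soft_threshold(rho, alpha * n_samples) / col_norms[j]
    return w
```

Only the choice of `indices` differs between the three variants, so benchmarking boils down to counting the sweeps each ordering needs to reach the same tolerance.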

Completing Logistic Regression CV

The second aim would be to complete the Logistic Regression CV started by Gael and Fabian, finishing the TODOs mentioned in this PR: https://github.com/scikit-learn/scikit-learn/pull/2862 .

a] Fix the existing test failures and add more tests.

b] Refactor the helper functions to avoid duplicate code.

c] Sparse matrix support for Logistic Regression CV. It seems logistic_regression_path works well for sparse matrices, but LogisticRegressionCV is broken for them (the warm-started path idea behind the helper is sketched after this list).

d] Document tolerance and stopping criterion.
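
For reference, here is a minimal sketch of the warm-started regularization path idea behind logistic_regression_path, assuming an L2-penalized binary problem solved with L-BFGS. The names logistic_path and _logistic_loss_grad are illustrative, not the PR's actual API.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b
from scipy.special import expit

def _logistic_loss_grad(w, X, y, alpha):
    """L2-penalized logistic loss and its gradient; y must be in {-1, +1}."""
    z = y * X.dot(w)
    loss = np.logaddexp(0, -z).sum() + 0.5 * alpha * w.dot(w)
    grad = -X.T.dot(y * expit(-z)) + alpha * w
    return loss, grad

def logistic_path(X, y, alphas):
    """Fit one model per penalty, warm-starting each fit from the previous one."""
    w = np.zeros(X.shape[1])
    coefs = []
    for alpha in sorted(alphas, reverse=True):   # strongest penalty first
        w, _, _ = fmin_l_bfgs_b(_logistic_loss_grad, w, args=(X, y, alpha))
        coefs.append(w.copy())
    return coefs
```

Since only X.dot(w) and X.T.dot(...) touch the data, the same formulation runs unchanged on scipy.sparse matrices, which is what makes the sparse support in c] tractable.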

Completing MultinomialLogisticRegression and adding a regularization path.

The third aim would be to complete Larsmans' multinomial logistic regression work [6].

a] There is scope for adding more tests and improving test coverage.

b] Fixing tests, especially those related to the case where the samples in X and y are weighted.

c] Do extensive profiling to see where the slowdown happens. Fix the unravelling of the parameters, and examine the L-BFGS code to see if computing the loss and gradient separately actually helps (a combined loss-and-gradient baseline is sketched after this list).

d] Add MultinomialLogisticRegressionCV to fit a regularization path similar to LogisticRegressionCV.

e] If everything works well, and we manage to have a consistent and clean API across LogisticRegression for both the mono- and multi-output cases, this can be done away with.
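
As a baseline for the profiling in c], here is a minimal sketch of a multinomial (softmax) loss that computes the loss and gradient in a single pass, sharing the probability matrix between them, with the parameter ravelling that L-BFGS requires. The shapes and the one-hot encoding of Y are illustrative assumptions, not Larsmans' actual code.

```python
import numpy as np

def softmax_loss_grad(w, X, Y, alpha):
    """Loss and gradient for L2-penalized multinomial logistic regression.

    w is the raveled (n_features, n_classes) weight matrix and Y is the
    one-hot encoded (n_samples, n_classes) label matrix.
    """
    W = w.reshape(X.shape[1], Y.shape[1])
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # class probabilities
    loss = -np.sum(Y * np.log(P)) + 0.5 * alpha * np.sum(W ** 2)
    grad = X.T.dot(P - Y) + alpha * W             # reuses P from the loss
    return loss, grad.ravel()
```

Computing the two together avoids a second pass over X, which is exactly the kind of saving the profiling should confirm or rule out.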

Strong rules for ElasticNet and Lasso

The strong rules are described in this paper [2]. The idea is to skip co-ordinates that cannot be active at the current regularization strength.

The pseudo-code for the algorithm can be seen here, [3]. A slightly (and maybe better) alternative is discussed in this Pull Request discussion. A good challenge would be to integrate it with the existing enet_path/lasso_path and coordinate_descent code, but this would be simpler than before thanks to the improvements done in this Pull Request: https://github.com/scikit-learn/scikit-learn/pull/2598. The screening step is sketched below.
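
For concreteness, here is a minimal sketch of the sequential strong rule screening step for the Lasso, following [2] and ignoring sklearn's 1/n_samples scaling of the objective. The rule is only a heuristic filter: after fitting on the surviving coordinates, the KKT conditions must be checked on the discarded ones and any violators added back.

```python
import numpy as np

def strong_rule_active_set(X, y, w_prev, alpha, alpha_prev):
    """Coordinates surviving the sequential strong rule at penalty `alpha`.

    w_prev is the solution at the previous, larger penalty alpha_prev.
    Coordinate j is discarded when |x_j^T r| < 2 * alpha - alpha_prev,
    where r is the residual at w_prev.
    """
    residual = y - X.dot(w_prev)
    corr = np.abs(X.T.dot(residual))
    return np.where(corr >= 2 * alpha - alpha_prev)[0]
```

In enet_path/lasso_path this screening would shrink the problem handed to the coordinate descent solver at each point on the alpha grid, which is where the speed-up comes from.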

L1 + L2 Logistic Regression using co-ordinate descent.

L1 regression isn't quite straightforward due to the non-differentiability of the L1 norm. This paper [1] (Algorithm 2) describes the method to find the Newton direction d and the term 'lambda' used for the line search. This has been implemented in Mathieu Blondel's Lightning. Using that as a base, together with Appendix B of [1] (which describes how to compute the Newton direction), the method can perhaps be extended to L1 + L2 regularization, as sketched below.
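
Here is a minimal sketch of the one-variable subproblem at the heart of CDN (Algorithm 2 in [1]), with the L2 term folded into the gradient and Hessian so the same closed form covers L1 + L2 regularization. In this sketch g and h stand for the first and second derivatives of the smooth part of the loss with respect to coordinate j; the full algorithm then follows this direction with a backtracking line search (the 'lambda' term mentioned above).

```python
def cdn_direction(w_j, g, h, l1, l2):
    """Newton direction d minimizing g*d + 0.5*h*d**2 + l1*|w_j + d|."""
    g = g + l2 * w_j        # elastic net: the smooth part gains 0.5 * l2 * w**2
    h = h + l2
    if g + l1 <= h * w_j:
        return -(g + l1) / h
    if g - l1 >= h * w_j:
        return -(g - l1) / h
    return -w_j             # step lands exactly on zero
```

The three cases come from minimizing the piecewise-quadratic model on each side of the kink at w_j + d = 0; when neither one-sided minimum is feasible, the optimum is the kink itself.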

If there is time left, the improved GLMNET algorithm (described in the same paper [1]), which is more advanced and complex than plain coordinate descent, can be ported (or rewritten) from LibLinear.

Timeline

Pre-GSoC (Today - April 21)

Goal: Increase familiarity

Contribute as many patches as possible, to become more and more familiar with the codebase.

Community Bonding Period (April 22 - May 18)

Goal: Community Bonding

For organizations like scikit-learn and SymPy, I believe GSoC students are already part of the community, since they will have contributed quite a few patches by the time they apply. In this period, I will:

  1. Read more extensively on the algorithms that I'm going to implement.
  2. Try to get my open PRs, if any, merged.
  3. If possible, start to code.

P.S.: I also have exams from May 5 - May 17, during which my rate of contribution will probably slow down, but I will try to pick up the pace afterwards.

Week 1 (May 19 - May 25)

Goal: Randomised / Cyclic coordinate descent

Weeks 2, 3 (May 26 - June 8)

Goal: Finish Logistic Regression CV PR

Weeks 4, 5, 6 (June 9 - June 29)

Goal: Finish MultinomialLR and MultinomialLRCV

Note - Pass midterm evaluation on June 27

Weeks 7, 8, 9 (June 30 - July 20)

Goal: Implement strong rules for ElasticNet and Lasso.

Weeks 10, 11, 12 (July 21 - August 11)

Goal: L1 + L2 Logistic Regression

Week 13 (August 12 onwards)

Goal: Improve documentation, tests (and maybe tutorials)

Notes

  • The timeline is tentative. In my experience, things can change in a very short period of time. A good example: when I was playing around with the code of Larsmans' PR, I came across a bug in compute_class_weight which had to be fixed before anything more could be done.
  • I have no exams during this period, which means I am free to work a minimum of 40 hours per week. I also have no planned vacations this summer.
  • I am enthusiastic about seeing my work merged into master. If by any chance my work does not get merged by the end of Summer of Code, I will keep working towards merging it beyond the summer.

References

  1. http://jmlr.org/papers/v13/yuan12a.html
  2. http://www-stat.stanford.edu/~jbien/jrssb2011strong.pdf
  3. https://gist.github.com/fabianp/3097107
  4. https://github.com/scikit-learn/scikit-learn/pull/2788
  5. https://github.com/scikit-learn/scikit-learn/pull/2862
  6. https://github.com/scikit-learn/scikit-learn/pull/2814