
GSoC 2015 (Proposal and Report) : Various enhancements to the model selection API of scikit learn by Raghav R V (raghavrv)


### SUB-ORGANIZATION: scikit-learn

### Detailed Project Information

EDIT (14-April-2016) - As I finished only the first two goals during my GSoC, I have removed all the absurd time promises I had given. Right now I am aiming to finish off the rest of the pending work, and I'm using this wiki to track my progress on it.

NOTE - I have added descriptive tags to all the links; reviewers may find it handy to hover over the links instead of actually visiting them.

#### 1. Make CV iterators data independent and provide a clean API with deprecations.

Status: Done and merged in #4294

Motivation:

Cross validation is an important tool to avoid overfitting the model. scikit-learn has a nice set of tools to split data into train and test sets based on various strategies. However, the CV iterator objects are currently data dependent, in the sense that they are initialized with data-dependent parameters like y, labels, etc. This restricts usability, especially if one wishes to use the same generator object for multiple datasets.

Goal:

The goal here is to make these generator objects data independent and provide them the kind of clean API that estimators currently enjoy.

Implementation:

I have already started work on this, building upon Ignacio Rosi's work in #3340. Refer to PR #4294.

This essentially separates the data-dependent parameters out of __init__ and provides a clean API via a split(X, y=None) method. This needs to be done without breaking existing functionality, so as to retain backward compatibility with previous version(s).
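For illustration, here is a minimal sketch of the intended usage, assuming the constructor and split signatures as finally released (parameter names such as n_splits were still being settled at this point):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# Old (data dependent): the iterator was bound to one dataset at
# construction time, e.g. KFold(n=len(X), n_folds=3).
# New (data independent): construction takes only the strategy's own
# parameters; the data is supplied later via split(X, y=None), so the
# same object can be reused across datasets.
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
```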

At the end, all the examples that use CV directly also need to be modified to conform to and showcase the new API.

#### 2. Group together, clean up and organize the model evaluation and optimization modules.

Status: Done and merged in #4294

This was suggested by Joel in issue #1848.

Motivation:

The cross validation and model selection related modules have grown over time via many great contributions. They need a clean-up and assembly into a top-level module, model_selection, grouping related algorithms together to enhance their usability. For instance, grid_search.py contains RandomizedSearchCV, which is not quite an appropriate structure, as pointed out in the above issue.

Goal: (Note that there were slight changes in the organization and naming)

The goal is to group related algorithms / classes / functions into the structure below, taken from Joel's comment in issue #1848:

sklearn/
  model_selection/
    partition.py -- KFold, train_test_split, check_cv
    validate.py -- cross_val_score, permutation_test_score, learning_curve, validation_curve
    search.py -- GridSearchCV, RandomizedSearchCV
    scoring.py -- make_scorer, get_scorer, check_scorer, etc.
    utils.py -- ParameterGrid (may be used by validation_curve), ParameterSampler

Implementation:

This should be fairly straightforward: move all the classes/functions of learning_curve.py, cross_validation.py and grid_search.py into the respective files shown above, provide a clean module-level import path as discussed in #1848, and also support the current import structure for backward compatibility, issuing a deprecation warning for anyone who attempts to use it.
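As a rough sketch of the backward-compatibility mechanism (the warning text and re-exported names below are illustrative, not the final implementation):

```python
# sklearn/cross_validation.py -- hypothetical backward-compat shim
import warnings

warnings.warn("This module was deprecated in favor of the "
              "model_selection module.", DeprecationWarning)

# Re-export the moved names so old import paths keep working.
from sklearn.model_selection import KFold, train_test_split  # noqa
```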

I am planning to salvage PR #4254 for this.

To keep the diff reviewable (as this was a problem previously), I would like to do this, if possible, via three different PRs, one each for grid_search.py, learning_curve.py and cross_validation.py.

#### 3. Multiple metric support to enhance scikit-learn's CV objects.

Status:

  • Part 1/3 (Restructure grid_scores_ to a dict of 1D arrays that can be imported into pandas as a DataFrame; see the sketch after this list) - Done and merged in #6697

  • Part 2/3 (Attempt improving the scorer API to enable multiple metric grid search)

  • Part 3/3 (Enable multiple metric grid search.)
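A minimal sketch of the Part 1/3 result, assuming the attribute name cv_results_ introduced by #6697:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]})
search.fit(iris.data, iris.target)

# cv_results_ is a dict of 1D arrays, one entry per parameter candidate,
# which imports directly into pandas.
df = pd.DataFrame(search.cv_results_)
print(df[["param_C", "mean_test_score", "rank_test_score"]])
```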

This was suggested by Joel in issue #1848.

Motivation:

The search algorithms that optimize and fine-tune the model currently work only with a single metric. Support for multiple metrics would be really useful as a diagnostic tool, providing more insight into the parameter exploration.

Goal:

The end goal here is to provide a mechanism with a clean API for multiple metric support (the ability to explore the model simultaneously w.r.t. multiple metrics without having to manually repeat the fit/predict for each metric).

Implementation:

This is major work requiring a clean design and a lot of discussion regarding the API structure and backward compatibility. I would like to devote the entire month of May to doing this as cleanly as possible.

Mathieu Blondel has done most of the work here in PR #2759. I will be building upon his PR and finishing up whatever needs to be done based on further discussions.
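One possible shape for the resulting API (hypothetical at this stage; the scoring-list and refit-by-name semantics below are one of the designs under discussion, not a settled scikit-learn API):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10]},
    scoring=["accuracy", "f1_macro"],  # several metrics in one search
    refit="accuracy",  # metric used to pick best_estimator_
)
search.fit(iris.data, iris.target)
# cv_results_ would then carry one set of columns per metric,
# e.g. mean_test_accuracy and mean_test_f1_macro.
```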

NOTE:

There are a few other things that need to be discussed and consolidated before coding them up. Thanks to Vlad for pointing this out.

On the issue of how the output should be handled, people have suggested masked arrays, pandas, a better dict of arrays, and a dedicated class to handle the output. This needs to be discussed with the other developers and consolidated into a nice solution, now that we would be supporting multiple metrics.

Before such a discussion, I should also acquaint myself well with the current situation and the ideas proposed in #1020, where Andreas explores how to better present the output of the search algorithms; #1034, where he attempts to fix the same by introducing a new class to handle the output; and #1842, where Joel explores another way to solve #1020 by introducing a method to index the parameter grid.

Related issues/ideas that are probably worth a look are #2733 and #2079.

This would involve discussing what needs to be done on top of Mathieu's existing work and scavenging the PR (#2759) and issue (#1850) for comments and discussions to frame a clear TODO.

#### 4. sample_weight support in grid_search et al.

Motivation:

Currently, custom scorers that take sample_weight cannot be used effectively in grid search, which does not support delegating the sample_weight parameter. This hampers usability and hence needs to be fixed.

Goal: Support delegating sample_weight (and related parameters(?)) to the scorers.

Implementation:

Noel Dawe has attempted this and got it reviewed with positive feedback from core devs in PR #1574, and Vlad, in PR #3524, has put forth ideas that provide mechanisms for passing multiple parameters to the scorer neatly.

To be frank, I am a bit fuzzy on the various options here, as I have not looked into them very closely, but I believe the following ideas were attempted/suggested:

  • Simply supporting sample_weight alone at the top level in GridSearchCV itself.
  • Adding fit_params and scorer_params to allow multiple such estimator / scorer parameters, which adds flexibility and also provides a way to support sample_group (see the sketch after this list).
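A minimal sketch of the current state, assuming GridSearchCV's existing fit_params argument: the weights reach the estimator's fit, but there is no analogous hook for the scorer yet, which is the gap this deliverable targets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
w = np.ones(len(y))  # per-sample weights

search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10]},
    fit_params={"sample_weight": w},  # delegated to SVC.fit only
)
search.fit(X, y)
# A hypothetical scorer_params={"sample_weight": w} (from the ideas in
# #3524) would delegate the same weights to the scorer as well.
```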

Hence I'll start working towards this deliverable by raising a new thread, initiating discussion on the API front, and later proceeding with the evaluation and implementation.

#### 5. Generalized grid search and early stopping

There is a detailed discussion of this at #1626.

The basic idea behind generalized CV is that estimators should provide nice functionality for tuning their own parameters, and this should work seamlessly with grid search. Such a setup would build the estimator-specific CV functionality into the estimator itself, with the more general machinery in our grid search module, which should ideally work together with the estimator's CV functionality.
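As a small illustration of the existing estimator-specific CV functionality this would generalize, the real LassoCV tunes alpha internally along the regularization path, far more cheaply than an external grid search over the same values:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# The estimator performs its own internal cross validation over alpha.
model = LassoCV(cv=5).fit(X, y)
print(model.alpha_)  # the alpha selected by the internal CV
```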

This involves heavy discussion and the exchange of multiple ideas, as it is a major API change that would touch several estimators as well as the grid search module. As Andreas and Olivier note, this is probably not easy and might not get fully implemented. But it should at the least kindle enough discussion and an attempt at the end goal, paving the way for a full-fledged implementation, perhaps in the near future.

#### 6. Introducing additional CV strategies for non-trivial CV tasks

Recently (as of March 27th), DisjointKFold was proposed in PR #4444.

Olivier suggests including a similar one that is a blend of ShuffleSplit and LeavePLabelOut.
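A rough sketch of what such a blend could look like (the function name and signature are illustrative only, not a scikit-learn API): shuffle the unique labels and hold out a random subset of whole labels on each iteration.

```python
import numpy as np

def label_shuffle_split(labels, n_iter=3, test_fraction=0.2, seed=0):
    """Yield (train, test) index arrays, holding out whole labels."""
    rng = np.random.RandomState(seed)
    unique = np.unique(labels)
    n_test = max(1, int(test_fraction * len(unique)))
    for _ in range(n_iter):
        test_labels = set(rng.permutation(unique)[:n_test])
        mask = np.array([l in test_labels for l in labels])
        yield np.where(~mask)[0], np.where(mask)[0]

labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
for train_index, test_index in label_shuffle_split(labels):
    print(train_index, test_index)
```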

I would like to include more such CV strategies that can help our advanced users with non-trivial CV tasks.

Estimated completion date: August 10th

#### 7. Make an extensive Cross Validation tutorial.

Motivation: Cross validation is an important tool for everyone, and our users could benefit from an exhaustive CV tutorial.

Goal: To forge an exhaustive CV tutorial that helps users use scikit-learn's CV tools effectively.

Implementation: I need a thorough understanding of CV, since my entire GSoC proposal revolves around it. This deliverable, I hope, will help me understand cross validation more deeply, along with its intricacies and all the trivial as well as non-trivial CV use cases. Hence I will do this in parallel with the main work, investing 5 hours per week on it.

The following sections, as suggested by Olivier, should be added, along with others that I may frame based on discussions with my mentors as I go.

These are just a few of the topics that could be included. I'll add more after an exhaustive survey of the available texts on CV.

#### 8. Improve docstrings, examples and contributor documentation.

Motivation:

  • The current contributors' guide could be more exhaustive and help new contributors, who often get stuck on similar git issues or other minor issues related to conventions / code formatting, etc. In general it should serve as a quick reference guide for most version control / convention / code formatting issues. This should also be helpful in code reviews, as core devs can quickly point to it instead of correcting minor mistakes like code formatting themselves.
  • Docstrings are the first place people look when they get stuck with a particular module; it would definitely help them if we added a minimal example to each model as a quick reference.

Goal:

  • I've suggested a new structure for the contributors' guide at this wiki page. The goal would be to fill in all the sections of the suggested tree of headings. Refer to #3912.
  • To add an Examples: section for as many models as possible (see the sketch after this list). Refer to #3846.
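For instance, a minimal sketch of the kind of Examples section proposed in #3846 (the model and numbers here are illustrative):

```python
class SomeModel:
    """...

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.linear_model import LinearRegression
    >>> X = np.array([[1.0], [2.0], [3.0]])
    >>> y = np.array([2.0, 4.0, 6.0])
    >>> reg = LinearRegression().fit(X, y)
    >>> round(float(reg.predict([[4.0]])[0]), 1)
    8.0
    """
```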

#### 9. Participate in project-wide bug fixes/code reviews along the way.

I am planning to tackle at least one bug fix, however minor it may be, unrelated to the scheduled work for that week, to help bring minor improvements across our code base.

### STUDENT INFORMATION:

#### PERSONAL INFORMATION:

NAME: RAGHAV RV

EMAIL: rvraghav93@gmail.com

TELEPHONE: +33 752421000

TIME ZONE: GMT + 1:00 [ Paris / France ]

IRC NICKNAME: raghavrv

GITHUB HANDLE: raghavrv

BLOG: http://rvraghav93.blogspot.com

BLOG FEED URL: http://rvraghav93.blogspot.com/feeds/posts/default

#### UNIVERSITY INFORMATION:

UNIVERSITY NAME: ANNA UNIVERSITY

COLLEGE NAME: SRI VENKATESWARA COLLEGE OF ENGINEERING

DEGREE: BACHELOR OF ENGINEERING

MAJOR: ELECTRONICS AND COMMUNICATIONS

GRADUATION YEAR: 2015

#### ABOUT ME:

I am Raghav R V, a final-year undergrad studying at SVCE under Anna University, India. I have taken up quite a few projects in Python over the past two years. I also successfully completed my project in Google Summer of Code 2014 under the Python Software Foundation / BinPy, where I implemented simulation of various digital components, ASCII-based logic visualization tools, binary multiplication algorithms based on the bitstring library, and a few selected analog components like the signal generator module, analog buffers, etc.

While I would like to note that most of the work I did in BinPy was nowhere near professional standards, I nevertheless ended up learning a lot of Python / git / boolean algorithms and got the chance to interact with an awesome open source community.

I started with machine learning around September 2014 and have contributed to scikit-learn since Nov 2014 in the form of minor bug fixes / documentation improvements, etc.

I have made bug fixes in quite a few places and hence am quite well versed with our API.

#### My Contributions to scikit-learn

(Sorted by the number of discussions)
