
GSoC 2015 Proposal: Cross validation and Meta estimators for Semi supervised Learning


Student Information

University Information

  • University: Saarland University (Universität des Saarlandes)
  • Major: Erasmus Mundus LCT (mainly natural language processing; I am also doing machine learning and information retrieval at the Max Planck Institute for Informatics. After GSoC I will move to another university, as arranged by the program.)
  • Current Year and Expected Graduation date: Year 1, expected to graduate in the latter half of 2016.
  • Degree: MSc

Other Related Backgrounds

KDD Cup 2013 - Author-Paper Identification Challenge (Track 1): ranked 6th out of 554.

Project Proposal

Abstract

Although scikit-learn is the de facto statistical machine learning library for Python, its capabilities for semi-supervised learning are still not fully established.

The goal of this project is to provide new algorithm implementations for the sklearn.semi_supervised subpackage, improve the existing ones, and enable the subpackage to interact smoothly and correctly with other components. In particular, we want to support cross validation for semi-supervised learning.

Details

At the moment, anyone who wants to do a semi-supervised learning task with scikit-learn will find that only two graph-based methods, LabelPropagation and LabelSpreading, are available in the sklearn.semi_supervised subpackage.

Compared to the other subpackages it looks rather deserted, and there is a lot of room for improvement. I decompose the task into three parts.

Cross Validation for Semi-supervised Learning

Currently the sklearn.cross_validation module is unaware of unlabeled data. When splitting a dataset, it blindly puts unlabeled samples into the test set, which is meaningless and also confuses the scoring function.

We have to modify the current cross validation infrastructure so that it works correctly for semi-supervised algorithms (including the newly added ones), while trying to maintain backward compatibility (existing code that cross-validates supervised learning should run without modification).

Since we are going to modify the cross validation API, this step should be done before the new algorithm implementations.
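To make the idea concrete, here is a rough sketch of one possible splitting strategy: keep all unlabeled samples (marked with -1 under the current convention) in the training fold and score only on held-out labeled samples. The helper name and exact behaviour below are placeholders for the discussion, not a settled API:

```python
import numpy as np
from sklearn.cross_validation import StratifiedKFold

def semi_supervised_splits(y, n_folds=3):
    """Yield (train, test) index arrays in which unlabeled samples
    (y == -1) always stay in the training fold, so scoring is done
    on held-out labeled samples only."""
    y = np.asarray(y)
    labeled = np.where(y != -1)[0]
    unlabeled = np.where(y == -1)[0]
    # Stratify only over the labeled part of the data.
    for train_lab, test_lab in StratifiedKFold(y[labeled], n_folds=n_folds):
        train = np.concatenate([labeled[train_lab], unlabeled])
        test = labeled[test_lab]
        yield train, test
```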

Note:

Multiple projects will touch the cross validation infrastructure. I'm glad to know that the prospective participant of the “Multiple Metric Support for Cross Validation and Gridsearches” project wants to finish some critical parts in April, as this will greatly reduce the chance of merge conflicts.

I'll actively participate in the discussion in April and give my suggestions related to semi-supervised learning.

Improve Existing Implementations

There is still room for improvement in the existing implementations, for example better documentation and the closed form solution presented in [2], which surpasses the current iterative framework.

LabelPropagation and LabelSpreading, as stated in the documentation, were implemented according to a book chapter [3], which uses different parameter notations and settings than the original papers.

I'm already working on clarifying this in the documentation as well as adjusting the parameter settings to make them conform to the original meanings.

Also, the current implementation requires that unlabeled samples carry the pseudo label "-1"; we may change this after the new cross validation API has been settled.
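For reference, this is the current convention (a minimal, illustrative snippet):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [5.0], [5.1], [0.9], [5.2]])
y = np.array([0, 0, 1, 1, -1, -1])   # -1 marks the unlabeled samples

model = LabelPropagation().fit(X, y)
print(model.transduction_)           # labels inferred for every sample
```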

Besides that, LabelSpreading (possibly a name coined by the authors of [3]) corresponds to the iterative framework in [2]. But [2] clearly recommends the closed form solution it derives and only uses the iterative framework as an intermediate step in the proof. [3] does not emphasize this point because it only intends to give an introduction to different label propagation algorithms on a similarity graph.

I'll implement the previously neglected closed form solution as it is expected to be much faster and more elegant.
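To illustrate the direction, here is a minimal NumPy sketch of that closed form, F* = (I - αS)^{-1} Y with S = D^{-1/2} W D^{-1/2} as in [2] (the constant factor 1 - α is dropped since it does not change the argmax). This is illustrative only, not the code I intend to merge:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def label_spreading_closed_form(X, y, alpha=0.2, gamma=20):
    """Closed-form label spreading of Zhou et al. [2].
    Unlabeled samples are marked with y == -1."""
    classes = np.unique(y[y != -1])
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0.0)                      # no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * np.outer(d_inv_sqrt, d_inv_sqrt)      # D^{-1/2} W D^{-1/2}
    # One-hot label matrix; rows of unlabeled samples stay all-zero.
    Y = np.zeros((X.shape[0], classes.shape[0]))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    F = np.linalg.solve(np.eye(X.shape[0]) - alpha * S, Y)
    return classes[F.argmax(axis=1)]
```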

New Algorithm Implementations for Semi-supervised Learning

I plan to implement two or three new algorithms for sklearn.semi_supervised.

One example from the ideas page is the “self-taught learning” algorithm specified in [1]. It is broadly a semi-supervised algorithm because it uses both labeled and unlabeled data for training and predicts on unseen data, though the authors emphasize that the labeled and unlabeled data do not necessarily share the same distribution. This makes the algorithm useful when only arbitrary unlabeled data can be obtained, a situation in which label propagation won't work.
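Roughly, the algorithm learns a sparse-coding basis from the unlabeled data, re-encodes the labeled data in that basis, and trains an ordinary supervised classifier on the codes. A hypothetical sketch using existing scikit-learn building blocks (not a committed design) could look like this:

```python
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

def self_taught_fit(X_unlabeled, X_labeled, y_labeled, n_components=50):
    """Sketch of self-taught learning [1]: learn higher-level features
    from unlabeled data, then train a plain classifier on the codes."""
    coder = MiniBatchDictionaryLearning(n_components=n_components, alpha=1.0)
    coder.fit(X_unlabeled)
    codes = coder.transform(X_labeled)
    classifier = LogisticRegression().fit(codes, y_labeled)
    return coder, classifier
```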

Besides that, my personal preference is the family of semi-supervised SVMs (S3VMs), such as the semi-supervised SVM (Bennett & Demiriz, 1999), the transductive SVM (TSVM) (Joachims, 1999), and the Laplacian SVM (Belkin et al., 2006). This is a very popular class of algorithms in the semi-supervised learning field, and I think users will be very happy to be able to run them with scikit-learn.

We may have to implement S3VMs from scratch because LIBSVM, the current basis of sklearn.svm, does not directly provide solvers for the S3VM objective functions. Moreover, the S3VM objective functions are non-convex, unlike the standard SVM's.
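For concreteness, a common formulation of the S3VM/TSVM objective (hinge loss on the labeled set L plus a symmetric "hat" loss on the unlabeled set U) is:

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
  + C   \sum_{i \in L} \max\bigl(0,\ 1 - y_i (w^\top x_i + b)\bigr)
  + C^* \sum_{j \in U} \max\bigl(0,\ 1 - \lvert w^\top x_j + b \rvert\bigr)
```

The last term is non-convex in (w, b), which is exactly why the convex solvers behind sklearn.svm cannot be reused as-is.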

Other nice candidates are generative models like semi-supervised naive Bayes.
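As a sketch of what a generative semi-supervised estimator could look like, here is a hard-EM (self-training style) variant built around the existing naive Bayes estimator; a soft-EM version (e.g. Nigam et al., 2000) would instead weight unlabeled samples by their class posteriors. The helper below is purely illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def semi_supervised_nb(X, y, n_iter=10):
    """Hard-EM semi-supervised naive Bayes: fit on the labeled samples,
    then repeatedly label the unlabeled samples (y == -1) with the
    current model and refit on the full data."""
    labeled = y != -1
    clf = GaussianNB().fit(X[labeled], y[labeled])
    for _ in range(n_iter):
        y_full = y.copy()
        y_full[~labeled] = clf.predict(X[~labeled])
        clf = GaussianNB().fit(X, y_full)
    return clf
```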

However, the final choice of algorithms to implement should be decided by the community, perhaps after a full discussion on the mailing list in April.

Timeline

  • Week 1 (May 25 - May 31) : API design (may start early, during the community bonding period, in coordination with other participants).
  • Week 2, 3 (Jun 1 - Jun 14) : Implement the part of the new cross validation API related to semi-supervised learning.
  • Week 4, 5 (Jun 15 - Jun 28) : Continue implementation, write tests and update documentation. The new API should be mergeable at the end of this period.
  • Week 6, 7 (Jun 29 - Jul 12) : Improve existing graph-based algorithms.
  • Week 8, 9 (Jul 13 - Jul 26) : Implement new semi-supervised algorithms and write corresponding documentation.
  • Week 10, 11 (Jul 27 - Aug 9) : Continue implementation and write tests.
  • Week 12 (Aug 10 - Aug 16) : Improve documentation.

Links to patches

References

  1. Raina, Rajat, et al. "Self-taught learning: transfer learning from unlabeled data." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
  2. Zhou, Dengyong, et al. "Learning with local and global consistency." Advances in Neural Information Processing Systems 16 (2004): 321-328.
  3. Bengio, Yoshua, Olivier Delalleau, and Nicolas Le Roux. "Label Propagation and Quadratic Criterion." In Semi-Supervised Learning. MIT Press, 2006, pp. 193-216.
  4. https://github.com/scikit-learn/scikit-learn/issues/1243
  5. https://github.com/scikit-learn/scikit-learn/issues/2593
  6. https://github.com/scikit-learn/scikit-learn/issues/4449

Other Schedule Information

The GSoC coding period overlaps with the teaching period (summer semester) in Germany, but I will not have much course workload thanks to the extra credits I earned earlier this year, and I am of course glad to work on weekends for GSoC. I will have some exam(s) in August (or possibly in the last week of July).

On June 8-9, I’ll attend a meeting in Groningen, the Netherlands.
