GSoC 2015 Proposal: Metric Learning module

Name: Artem Sobolev

Email: [github login with dash replaced by dot]@gmail.com

Github: Barmaley-exe

Blog: http://barmaley.exe.name/

Background

I’m an MSc-level student in Computer Science and Software Engineering at Saint Petersburg State University, Russia. I also study Data Mining at the Computer Science Center and the Yandex School of Data Analysis. I have completed several Machine Learning classes and did a project on recommendation systems. I have also done several internships, one of them a research-oriented ML internship.

Proposal

My proposal is to introduce a new module for Metric Learning. This is an established area of research whose methods would be nice to have in scikit-learn. The learned metrics can be used to improve distance-based classifiers (like KNN) and clustering.

Most metric learning models learn a positive-semidefinite matrix A, which corresponds to a Mahalanobis distance. It can be shown that this is equivalent to a (linear) mapping of the data into a new space followed by the Euclidean distance: writing A = L^T L (e.g. via a Cholesky decomposition), we get (x - y)^T A (x - y) = (L(x - y))^T (L(x - y)). Thus all linear metric learners can be implemented as transformers: we just apply L from the decomposition to our data.
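
As a quick illustration (not part of the proposed code), the following sketch checks numerically that the Mahalanobis distance induced by A coincides with the Euclidean distance after mapping x to L x:

import numpy as np
from scipy.linalg import cholesky

rng = np.random.RandomState(0)
B = rng.randn(5, 5)
A = B.T.dot(B)                  # a positive-definite matrix
L = cholesky(A)                 # upper-triangular L with A = L^T L

x, y = rng.randn(5), rng.randn(5)
d_mahalanobis = np.sqrt((x - y).dot(A).dot(x - y))
d_euclidean = np.linalg.norm(L.dot(x - y))
assert np.allclose(d_mahalanobis, d_euclidean)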

When it comes to nonlinear metrics, there's an interesting trick known as the Kernel PCA trick: one can simply Pipeline Kernel PCA with a (linear) metric learning algorithm and get the same effect as training a kernelized version of the latter. Unfortunately, this doesn't work for every algorithm, but of those I'm proposing (LMNN, NCA and ITML, more on that later), LMNN and NCA do work this way. ITML should not be combined with Kernel PCA.
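
A hedged sketch of that composition, using the LMNNTransformer name proposed in the API section below (the class does not exist yet):

from sklearn.decomposition import KernelPCA
from sklearn.pipeline import Pipeline

# Kernel PCA trick: an implicit nonlinear mapping followed by a linear
# metric learner behaves like a kernelized version of that learner.
kernelized_lmnn = Pipeline([
    ('kpca', KernelPCA(kernel='rbf')),
    ('lmnn', LMNNTransformer()),   # proposed transformer, see API below
])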

It's worth noting that there's a nonlinear version of NCA which uses multilayer neural networks (a stack of RBMs) to learn a nonlinear transformation f(x). The method seems quite heavy, and it's not clear how easily the current RBM implementation could be reused (the authors rely on fine-tuning by backpropagation). Therefore I decided it's not worth implementing.

API

The core contribution of this project would be a metric_learning module (all names are preliminary) with several different algorithms. Each of them is a transformer that uses y during fit, where y is the usual vector of training labels, just like in classification. Another possible application is obtaining a similarity matrix according to the learned metric. Thus, there will be two transformers for each algorithm: one maps input data from the original space into a linearly transformed one, and the other maps input data into a square similarity matrix, which can be used, for example, for clustering.

Each transformer will also have a metric_ attribute exposing an instance of DistanceMetric that can be used in KNN. For example:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

ml = LMNNTransformer()   # proposed transformer from the metric_learning module
knn = KNeighborsClassifier()
pl = Pipeline([('ml', ml), ('knn', knn)])
pl.fit(X_train, y_train)
pl.predict(X_test)
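
The metric_ attribute could also be used directly, skipping the transform step. A hedged sketch, assuming KNeighborsClassifier accepts a DistanceMetric object for its metric parameter:

ml = LMNNTransformer().fit(X_train, y_train)
knn = KNeighborsClassifier(metric=ml.metric_)   # proposed metric_ attribute
knn.fit(X_train, y_train)
knn.predict(X_test)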

Similarity learning:

from sklearn.cluster import SpectralClustering
from sklearn.pipeline import Pipeline

ml = LMNNSimilarity()    # proposed similarity transformer
sc = SpectralClustering(affinity="precomputed")
pl = Pipeline([('ml', ml), ('sc', sc)])
labels = pl.fit_predict(X_train, y_train)   # SpectralClustering has no predict

Alternatively, since similarity is just an RBF kernel on top of the usual distance, and to avoid code duplication, all the Similarity transformers can be implemented with an adapter (similar in spirit to OneVsRestClassifier) wrapped around the usual transformers.
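
A minimal sketch of such an adapter; the SimilarityAdapter name, its metric_learner argument and the gamma parameter are illustrative, not part of the proposal:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics.pairwise import rbf_kernel

class SimilarityAdapter(BaseEstimator, TransformerMixin):
    # Wraps a linear metric-learning transformer and outputs an RBF
    # similarity matrix computed in the learned space.
    def __init__(self, metric_learner, gamma=1.0):
        self.metric_learner = metric_learner
        self.gamma = gamma

    def fit(self, X, y=None):
        self.metric_learner.fit(X, y)
        # remember the training data mapped into the learned space
        self.embedded_X_ = self.metric_learner.transform(X)
        return self

    def transform(self, X):
        # similarity between X and the training data under the learned metric
        return rbf_kernel(self.metric_learner.transform(X),
                          self.embedded_X_, gamma=self.gamma)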

Details

I propose to implement several highly recognized and widely cited algorithms:

  • Neighbourhood Components Analysis (NCA)
  • Large Margin Nearest Neighbor (LMNN)
  • Information-Theoretic Metric Learning (ITML)

Timeline

April, 27th — May, 24th

Get a clear understanding of all algorithms, sketch their design.

May, 25th — June, 14th (3 weeks)

  1. Prepare codebase (Base classes, if needed)
  2. Implement NCA
  3. Write tests and documentation for NCA
  4. Submit NCA and the initial metric_learning module for review #1

June, 15th — July, 5th (3 weeks)

  1. Implement LMNN
  2. Pass the mid-term
  3. Tests and documentation for LMNN
  4. Submit LMNN for review #2

July, 6th — July, 26th (3 weeks)

  1. Complete review #1
  2. Implement ITML
  3. Tests and documentation for ITML
  4. Submit ITML for review #3

July, 27th — August, 16th (3 weeks):

  1. Get all reviews completed and ready to merge.

  If time permits:

  2. Kernelized ITML
  3. Tests and documentation for Kernelized ITML

August, 17th — August, 24th (1 week):

  • Pencils down.
  • Get everything merged.
  • Submit everything to Google.
  • Rule the Galaxy.

Prior contributions
