
GSoC 2014 Proposal: Locality Sensitive Hashing for approximate neighbor search


Student Information

Name : Maheshakya Wijewardena

E-mail : pmaheshakya4@gmail.com

Telephone : +94711228855

Time Zone: GMT+5.5

IRC handle: maheshakya@freenode.net

Github : maheshakya

Twitter: Wmaheshakya

Blog: http://espyyonder.blogspot.com/

GSoC Blog RSS feed: http://espyyonder.blogspot.com/feeds/posts/default?alt=rss

My background and programming experience:

I am a third-year undergraduate student at the University of Moratuwa, Sri Lanka, specializing in Computer Science and Engineering in the Faculty of Engineering. I began my programming life with C and later moved on to C++ and Java. In college and during the first few semesters at university, I coded many of the basic algorithms (sorting, searching, shortest paths, etc.) as programming exercises. I have taken mathematics modules such as Linear Algebra, Discrete Mathematics, Numerical Methods and Advanced Statistics. In addition, I have taken courses in Data Structures and Algorithms, Intelligent Systems, Image Processing, Database Systems, Operating Systems and Signals and Systems, whose content can be useful in the context of this project. A couple of years ago Python caught my attention and I witnessed its capability for scientific computing. I am well experienced in using the NumPy and SciPy libraries, and I have recently started learning Cython in order to optimize my Python code. I have used Matlab for several signal processing projects. Apart from those, I have done some projects in C# and have passable experience with MySQL and PostgreSQL databases. I also have a little experience in functional programming with Scala, Lisp, Erlang and Python, and in logic programming with Prolog. I have done a few projects related to machine learning, which are listed in the "Additional information" section of this proposal.

Linkedin: http://www.linkedin.com/pub/maheshakya-wijewardena/35/531/189

University Information

University: University of Moratuwa, Sri Lanka

Major: Computer Science and Engineering

Current Year and Expected Graduation date: Third year, 2015-05-01

Degree: BSc

Project Proposal Information

Title: scikit-learn: Locality Sensitive Hashing for approximate neighbor search

Abstract:

Nearest neighbor search is a well-known problem that can be defined as follows: given a collection of n data points, create a data structure which, given any query point, reports the data point closest to the query. This problem is of major importance in applications such as data mining, databases, data analysis, pattern recognition, similarity search, machine learning, image and video processing, information retrieval and statistics. Several efficient nearest neighbor search algorithms are known for the case where the dimension is low, but those methods suffer from either space or query time that is exponential in the dimension.

In order to address the "Curse of Dimensionality" in large data sets, recent research has focused on approximate nearest neighbor search. It has been proven that in many cases an approximate nearest neighbor is almost as good as the exact one[1]. Locality Sensitive Hashing (LSH) is one of those approximation methods. The key idea of LSH is to hash data points using several hash functions so that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart.
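This collision property has a standard formalization (see, e.g., Andoni and Indyk [1]), reproduced here for reference:

```latex
% A family H of hash functions is (r, cr, p_1, p_2)-sensitive if,
% for any two points p and q and h drawn uniformly from H:
d(p, q) \le r  \;\Rightarrow\; \Pr_{h \in H}[h(p) = h(q)] \ge p_1
d(p, q) \ge cr \;\Rightarrow\; \Pr_{h \in H}[h(p) = h(q)] \le p_2
% Useful families require p_1 > p_2: near points collide strictly
% more often than far ones, which is what makes hashing-based
% candidate retrieval work.
```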

In this project, several variants of LSH-based ANN methods will be prototyped and evaluated. After identifying the best method for scikit-learn, the necessary hashing algorithms will be implemented. Then, using the results obtained from the prototyping stage, the storing and querying structure for ANN will be implemented and integrated into the sklearn.neighbors module. As these activities proceed, tests, examples and documentation will be written, and benchmarking will be done to assess the implementation.

Motivation

scikit-learn currently implements exact nearest neighbor search, but when it comes to higher dimensions it fails to perform efficiently[2]. The goal of this project is an efficient implementation of approximate nearest neighbor search based on Locality Sensitive Hashing. The project idea appeared in the GSoC 2011 ideas list[3] but was never implemented, and it is in this year's ideas list as well. I need to apply similarity search in my final year research project, so I have studied nearest neighbor algorithms; while studying approximate neighbor search, I came across LSH-based ANN methods. I believe that, with the necessary adjustments, an implementation of this technique will be a significant enhancement to nearest neighbor search in high dimensions. It will benefit my projects as well as other applications in the machine learning community. Hence I decided to start on this project after several discussions on the mailing list and the issues page of the scikit-learn GitHub repository. In the following sections, I discuss the plan I have come up with after discussing with Olivier, Gael, Robert and Daniel.

Milestones of the project

  1. Prototyping/evaluating existing LSH-based ANN methods (vanilla and others) in order to find the most appropriate method to have in scikit-learn. There is no point in having methods in scikit-learn that are impractical to use with real data.
  2. Implementation of hashing algorithms.
  3. Implementation of the storing structure to retain trained/hashed data.
  4. Integration of ANN search into the current implementation of neighbor search, so that it can be used with the same API.
  5. Completion of tests, examples, documentation and benchmarking.

Implementation plan

Prototyping/Evaluating

This is the most vital part of the project. Several implementations of LSH-based ANN are available. When choosing the implementation method that best suits scikit-learn, the following criteria will be taken into account.

  1. Accuracy.
  2. Efficiency of queries.
  3. Feasibility of maintenance.
  4. Minimum storage.

After discussions on the mailing list, the following techniques will be tested with the same data set and different parameters. A good candidate data set can be created by extracting components from the movielens datasets[5] for evaluation and benchmarking.

  1. Vanilla LSH-ANN algorithm - Storing hashed training data in multiple hash tables (numpy 2D arrays); a minimal sketch of this scheme appears after this list. As discussed on the mailing list, existing implementations will be used for evaluation and benchmarking purposes so that no extra implementation effort is needed. The following implementations already exist in Python:
  • lshash[10]
  • NearPy[11]
  • Optimal-LSH[12]
  • pybrain[13]
  2. LSH-Forests[6] - Trees will be built using sorted arrays and binary searching (numpy arrays will be used). This is a promising technique, but implementations are not available. Therefore I will prototype this method and thoroughly analyze its performance. This algorithm will be given high priority during the prototyping stage.
  3. Alternative implementations: ANNOY[7] - Here a method similar to the LSH algorithm is used, with a tree structure for storing and querying. Its code is less complex, so I will create a prototype.
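For concreteness, here is a minimal sketch of the multi-table scheme in item 1, in plain Python/NumPy rather than the final Cython, using random-projection hash functions; the class and parameter names are illustrative only, and Python dicts stand in for the numpy-array tables the proposal envisions:

```python
import numpy as np

class VanillaLSH(object):
    """Illustrative multi-table LSH index: L hash tables, k bits each."""

    def __init__(self, n_tables=10, n_bits=8, seed=42):
        self.n_tables = n_tables          # L in [1]
        self.n_bits = n_bits              # k in [1]
        self.rng = np.random.RandomState(seed)

    def _hash(self, X, planes):
        # Sign of the projection onto each random hyperplane -> k-bit key.
        bits = (np.dot(X, planes) > 0).astype(int)
        return ["".join(map(str, row)) for row in bits]

    def fit(self, X):
        self._X = X
        d = X.shape[1]
        # One independent set of k random hyperplanes per table.
        self._planes = [self.rng.randn(d, self.n_bits)
                        for _ in range(self.n_tables)]
        self._tables = []
        for planes in self._planes:
            table = {}
            for i, key in enumerate(self._hash(X, planes)):
                table.setdefault(key, []).append(i)
            self._tables.append(table)
        return self

    def query(self, x, n_neighbors=5):
        # Candidates = union of points colliding with x in any table,
        # then ranked by exact distance (the usual LSH refinement step).
        candidates = set()
        for planes, table in zip(self._planes, self._tables):
            key = self._hash(x.reshape(1, -1), planes)[0]
            candidates.update(table.get(key, []))
        if not candidates:
            return np.array([], dtype=int)
        candidates = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(self._X[candidates] - x, axis=1)
        return candidates[np.argsort(dists)[:n_neighbors]]
```

Increasing k makes collisions rarer but more selective, while increasing L recovers recall at the cost of memory; this is exactly the accuracy/storage trade-off the evaluation criteria above are meant to probe.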

Initially we discussed investigating the approximation methods used in FLANN[8], as it has the advantage of high-speed queries. But the speed difference is not sufficient to justify the added complexity of the implementation[9]. Therefore it will not be used in prototyping, but it will be used to evaluate the prototyped models: results of the implemented models will be compared with the exact neighbor search methods in scikit-learn to evaluate the accuracy of the approximations, and FLANN will be used to compare query speeds.

These will be the main considerations in the prototyping/evaluating stage. The candidates will be tested against the above criteria and the most appropriate method for scikit-learn will be chosen. The sole purpose of this step is to determine the best technique, so it can be done in a Gist (I will create a Gist in my account for this purpose). For this part, documentation, API compliance and benchmarking are not required.

Implementing Hashing algorithms

As this module will be developed in a scalable manner, I have no intention of implementing as many hashing algorithms as possible at this stage. Therefore I will implement the random projections algorithm[4] and the min-wise independent permutations algorithm[4], with the main focus on random projections. All hashing algorithms will be implemented in Cython (for improved speed in training and queries).

Input parameters to hash function:

  • input_vector (cannot be None; only numerical values are allowed): a data point in the training data or the query point.
  • tuning_parameter/parameters: these influence the choice of the random hyperplane onto which the input_vector is projected. They will be decided after the prototyping process.

returns: the hashed value of the individual vector (for most of the hashing algorithms, this will be 1 or 0).
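As a concrete illustration of this interface, here is a minimal pure-Python sketch of the random projections hash; the real version would be written in Cython, and `hyperplane` stands in for the tuning parameter to be decided during prototyping:

```python
import numpy as np

def random_projection_hash(input_vector, hyperplane):
    """Hash one numerical vector to a single bit, as specified above.

    input_vector : 1-D numeric array; cannot be None.
    hyperplane   : normal vector of the random hyperplane (the tuning
                   parameter), drawn once and reused for every call.
    Returns 1 if input_vector falls on the positive side of the
    hyperplane, 0 otherwise.
    """
    if input_vector is None:
        raise ValueError("input_vector cannot be None")
    return int(np.dot(input_vector, hyperplane) > 0)

# Nearby vectors usually receive the same bit:
rng = np.random.RandomState(0)
plane = rng.randn(5)
v = rng.randn(5)
print(random_projection_hash(v, plane),
      random_projection_hash(v + 0.01, plane))   # almost always equal
```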

Implementing the storing structure to retain trained/hashed data and run queries

This will be the core section of the project. Its logical structure will be decided after the prototyping/evaluating stage. A separate class, extended from BaseEstimator, will store and maintain this structure. The following are the basic public methods of this class.

init(estimator=est, params=params) : the initializing function of the estimator, where the LSH algorithm is selected and other tuning parameters are set (if any remain after the prototyping stage). For the vanilla LSH-based ANN algorithm, these parameters will be k and L, as discussed in the paper[1]. For LSH-Forest[6], this will be the number of trees (l). If the approach of Spotify's Annoy[7] is chosen, this will be the index size (f) and the metric.

fit(X) : as in the existing nearest neighbor module, this ANN module will also have a fit method where the training data is fed and the model is trained.

neighbors(x, params) : x is the vector whose neighbors need to be approximated. The parameters can be the number of required neighbors, the distance from the query vector, etc. (this depends on the method chosen during the prototyping stage). This returns the approximated neighbors of the query vector.
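A rough skeleton of how this class might look follows; the class name, parameter names and defaults are placeholders, since the real ones depend on the outcome of the prototyping stage:

```python
from sklearn.base import BaseEstimator

class ApproximateNearestNeighbors(BaseEstimator):
    """Placeholder skeleton for the proposed LSH-based ANN estimator."""

    def __init__(self, algorithm="lsh_forest", n_estimators=10):
        # For vanilla LSH these parameters map to k and L [1]; for
        # LSH-Forest to the number of trees l [6]; for the Annoy-style
        # approach to the index size f and the metric [7].
        self.algorithm = algorithm
        self.n_estimators = n_estimators

    def fit(self, X):
        # Hash each training point with every hash function and build
        # the storing structure chosen after prototyping (hash tables
        # or a forest of trees).
        self._fit_X = X
        return self

    def neighbors(self, x, n_neighbors=5):
        # Collect the candidates that collide with x, rank them by
        # exact distance, and return the approximated neighbors.
        raise NotImplementedError("decided after the prototyping stage")
```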

Integrating ANN search into neighbor search

The implementations of the hashing algorithms will be put in sklearn.metrics. But since approximate neighbor search is closely related to nearest neighbors, the LSH-based ANN implementation should be included in the sklearn.neighbors module so that it can be used with the same API. This will be quite challenging, as the current implementations of nearest neighbor search (kd_tree, ball_tree, brute_force) are significantly different.

Completing tests, examples and documentation

For each component being developed, the corresponding tests, examples and documentation will be written in parallel. Benchmarking will be done as it is done for the other modules in scikit-learn. In this benchmark, the average error and the average precision of the approximated neighbors will be plotted against the query time, for different parameters of the ANN implementation (a sketch of the precision measurement follows the list):

  1. Average error against query time in milliseconds.
  2. Average precision against query time in milliseconds.
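The following sketch shows how the precision measurement could be computed, assuming a hypothetical `ann` object with a `query(x, n_neighbors)` method as in the sketches above; average error would be computed analogously from the distances:

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_and_query_time(ann, X_train, X_queries, n_neighbors=10):
    """Mean precision of approximate vs. exact neighbors, plus the
    mean query time in milliseconds."""
    exact = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
    _, true_idx = exact.kneighbors(X_queries)
    precisions, times = [], []
    for x, truth in zip(X_queries, true_idx):
        start = time.time()
        approx = ann.query(x, n_neighbors)     # hypothetical ANN API
        times.append((time.time() - start) * 1000.0)
        precisions.append(len(set(approx) & set(truth)) / float(n_neighbors))
    return np.mean(precisions), np.mean(times)
```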

Timeline

From Today to April 21st

In this period I will discuss the project further with the prospective mentors and other members of the community who are willing to assist. While doing so, I will contribute to scikit-learn as best I can, continuing to fix issues and becoming more familiar with the code base.

  • Milestone: Getting more familiar with scikit-learn code base.

April 22nd to May 18th

This is the community bonding period. By this time I will have a much better understanding of the code base, but I will keep providing patches for issues I can fix. During this time I will concentrate on studying the algorithms I will have to implement, and I will start the initial prototyping.

  • Milestone: Setting the environment for the project.

Week 1 (May 19th - May 25th)

  • Milestone: Adjusting vanilla LSH implementations and ANNOY for evaluation.

Week 2 (May 26th - June 1st)

  • Milestone: Prototyping LSH-Forest

Week 3 (June 2nd-June 8th)

  • Milestone: Evaluating prototypes and choosing the best method of implementation

Week 4, 5 (June 9th-June 22nd)

  • Milestone: Implementing Hashing algorithms

Week 6, 7, 8 (June 23rd-July 13th)

Mid-term evaluation (June 23rd-June 27th): during this period, I will start implementing the storing structure.

  • Milestone: Completing the implementation of storing structure

Week 9 (July 14th-July 20th)

  • Milestone: Integrating ANN search into neighbor search

Week 10 (July 21st-July 27th)

  • Milestone: Completing tests

Week 11 (July 28th-August 3rd)

  • Milestone: Completing benchmarking

Week 12 (August 4th-August 10th)

  • Milestone: Completing documentation

Week 13 (August 11th-August 17th)

The suggested “pencils down” date is August 11th. In this week I will finalize the tests, documentation and benchmarks.

August 18th - Firm “pencils down” date

Other relevant information:

During this time period I do not have any other commitments.

Patches/Code samples/Pull requests

I have recently started contributing to scikit-learn, beginning with fixes for minor issues. I look forward to continuing to contribute to the best of my ability.

Merged pull requests

  1. Implemented median and constant strategies in DummyRegressor: https://github.com/scikit-learn/scikit-learn/pull/2886
  2. Added reference to the function fetch_olivetti_faces in the narrative documentation of olivetti face dataset: https://github.com/scikit-learn/scikit-learn/pull/2945
  3. Added _LearntSelectorMixin in BaseGradientBoosting : https://github.com/scikit-learn/scikit-learn/pull/2966

Additional information:

I have done the following projects related to machine learning.

  1. Time series forecasting using neural networks: a sequential supervised learning technique (a feed-forward network with a moving window) is used to predict time series data. A multilayer perceptron neural network is used as the learning algorithm; any other learning algorithm can be substituted with simple changes. https://github.com/maheshakya/Time-Series-Forcasting-with-Neural-Networks

  2. Bootstrap aggregating module for scikit-learn: done as a university project.

References

  1. A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions", Available [online]: http://people.csail.mit.edu/indyk/p117-andoni.pdf
  2. R. Rehurek, "Performance Shootout of Nearest Neighbours: Contestants", Available [online]: http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
  3. scikit-learn, "A list of topics for a Google summer of code (GSOC) 2011", Available [online]: https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-Google-summer-of-code-(GSOC)-2011#locality-sensitive-hashing
  4. Wikipedia, "Locality Sensitive Hashing", Available [online]: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
  5. movielens datasets, Available [online]: http://grouplens.org/datasets/movielens/
  6. M. Bawa, T. Condie and P. Ganesan, "LSH Forest: Self-Tuning Indexes for Similarity Search", Available [online]: http://www2005.org/docs/p651.pdf
  7. E. Bernhardsson, "Annoy", Available [online]: https://github.com/spotify/annoy
  8. FLANN - Fast Library for Approximate Nearest Neighbors, Available [online]: http://www.cs.ubc.ca/research/flann/
  9. R. Rehurek, "Performance Shootout of Nearest Neighbours: Querying", Available [online]: http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
  10. lshash, Available [online]: https://pypi.python.org/pypi/lshash
  11. NearPy, Available [online]: http://nearpy.io/
  12. Optimal-LSH, Available [online]: https://github.com/yahoo/Optimal-LSH
  13. pybrain, Available [online]: https://github.com/pybrain/pybrain/tree/master/pybrain/supervised/knn/lsh