
GSoC 2015 Proposal: Global-optimization-based hyper-parameter optimization


Name: Christof Angermueller

Web : http://cangermueller.com

Github : cangermueller

Twitter : @cangermueller

1 About me

I am a PhD student in Bioinformatics at the University of Cambridge and the European Bioinformatics Institute (EBI/EMBL), supervised by Oliver Stegle and Zoubin Ghahramani. My research is about machine learning algorithms for analyzing high-dimensional biological data, including Bayesian matrix factorization methods and deep neural networks. I have been using Python regularly for more than five years, both for my research and as a general-purpose programming language, have implemented a Python package for variational Bayesian matrix factorization (vbmfa), and aim to contribute to open source development in the long run. To start with, I want to work on the following GSoC topic with scikit-learn. I decided to contribute to scikit-learn since a) I use it on a daily basis for various machine learning tasks, b) it has a very active community, and c) it allows me to combine my interests in machine learning and software engineering.

2 Abstract

Scikit-learn is a suite of various machine learning algorithms. These algorithms often have hyper-parameters, which are not learnt from the data but specified by the user. Optimizing hyper-parameters is critical for generalization performance and often time consuming, since it requires retraining the machine learning algorithm and evaluating its fitness for different hyper-parameter settings.

Currently, scikit-learn only provides the module GridSearchCV [1] for exhaustive grid search and RandomizedSearchCV [2] for randomized hyper-parameter optimization. While the former is intractable if the hyper-parameter space is large, the latter has been outperformed by more sophisticated sequential model-based optimization (SMBO) algorithms [3], which take previous evaluations of the fitness function into account to choose the hyper-parameters with the highest chance of improvement.

The goal of the proposed GSoC topic is to implement an SMBO algorithm based on Gaussian Processes (GPs) as described by Snoek et al. [4], which a) explores the hyper-parameter space more efficiently and accurately than GridSearchCV and RandomizedSearchCV, b) is well integrated into the scikit-learn infrastructure, and c) is easy to use for transparently optimizing various scikit-learn algorithms.
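
To make the approach concrete, below is a minimal sketch of such a GP-based SMBO loop for minimizing a validation loss. The callables `fitness` and `sample_candidates` are assumed names, not part of scikit-learn, and the `GaussianProcessRegressor`/`Matern` API used here for illustration was added to scikit-learn after this proposal was written; the eventual GPSearchCV will differ in detail.

```python
# Minimal sketch of a GP-based SMBO loop in the spirit of Snoek et al. [4].
# `fitness` and `sample_candidates` are assumed user-supplied callables.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best):
    # Expected improvement for minimizing a loss.
    z = (best - mu) / np.maximum(sigma, 1e-9)
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def smbo(fitness, sample_candidates, n_iter=30, n_init=5):
    X = list(sample_candidates(n_init))              # random initial design
    y = [fitness(x) for x in X]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(np.asarray(X), np.asarray(y))         # model the fitness function
        cand = sample_candidates(1000)               # random candidate configurations
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, min(y)))]
        X.append(x_next)                             # evaluate the most promising point
        y.append(fitness(x_next))
    return X[int(np.argmin(y))]                      # best configuration found
```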

3 Milestones

3.1 GPSearchCV

GPSearchCV (tentative name) will be a new scikit-learn module for hyper-parameter optimization based on Gaussian Processes [4]. It will be similar to the spearmint python package [5], but more lightweight and consistent with the scikit-learn framework. It will have the same interface as GridSearchCV and RandomizedSearchCV, and will make it easy to optimize the hyper-parameters of different scikit-learn algorithms such as SVMs and random forests. A Gaussian Process with a Matérn 5/2 kernel will be used as the prior distribution over the fitness function, and the expected improvement acquisition function will be used to choose the next hyper-parameters. In the first version, kernel parameters will be estimated by maximum likelihood, since integrating over them is currently not supported by the Gaussian Process scikit-learn module and would also be computationally more costly. Implementing slice sampling for a Bayesian treatment of the parameters is one of my later goals (see below).
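
A hypothetical usage sketch of the intended interface, mirroring GridSearchCV; the GPSearchCV class, its constructor arguments, and the `(low, high)` range format do not exist yet and are illustrative only:

```python
# Hypothetical usage of the proposed GPSearchCV; the class and the
# (low, high) range notation are part of this proposal, not of scikit-learn.
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
param_space = {'C': (1e-3, 1e3), 'gamma': (1e-4, 1e1)}  # continuous search ranges
search = GPSearchCV(SVC(), param_space, n_iter=50, cv=5)
search.fit(X, y)                                        # runs the SMBO loop internally
print(search.best_params_, search.best_score_)
```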

3.2 Unit tests

I will write unit tests to ensure that my implementation works correctly in different scenarios.

3.3 Benchmarks

The new GPSearchCV module will only become popular if it yields superior results compared to existing modules. I will therefore compare its performance with GridSearchCV, RandomizedSearchCV, and spearmint, evaluating how well each optimizes the hyper-parameters of different classification and regression algorithms.

3.4 Documentation

I will follow the scikit-learn conventions to document all interfaces, and write code examples showing how to use GPSearchCV. Links to my benchmark results will allow users to judge which optimization module (GridSearchCV, RandomizedSearchCV, GPSearchCV) is most appropriate for their task.

4 Additional features

If time remains after I have accomplished all the milestones described above, I will work on the following points in the given order:

4.1 Marginalizing over kernel parameters

A Bayesian treatment of the kernel parameters, marginalizing over them instead of maximizing the likelihood, is computationally more costly, but improves generalization performance. I will implement slice sampling to approximate the integral over these parameters, which might also be incorporated into the scikit-learn Gaussian Process module.
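
As an illustration, here is a minimal univariate slice sampler with stepping out (Neal, 2003); `log_density` would evaluate the log posterior of a single kernel parameter, and all names are assumed:

```python
# Minimal univariate slice sampler (stepping-out variant, Neal 2003).
# `log_density` is an assumed callable returning the unnormalized log posterior.
import numpy as np

def slice_sample(log_density, x0, width=1.0, n_samples=100, rng=None):
    rng = rng or np.random.default_rng()
    samples, x = [], x0
    for _ in range(n_samples):
        log_u = log_density(x) + np.log(rng.uniform())  # slice level under the density
        left = x - width * rng.uniform()                # randomly placed initial bracket
        right = left + width
        while log_density(left) > log_u:                # step out to the left
            left -= width
        while log_density(right) > log_u:               # step out to the right
            right += width
        while True:                                     # shrink bracket until accepted
            x_new = rng.uniform(left, right)
            if log_density(x_new) > log_u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return np.array(samples)
```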

4.2 Parallelization of hyper-parameter search

Instead of waiting until the fitness function has been evaluated for one hyper-parameter configuration before trying out new parameters, several configurations can be evaluated in parallel to speed up the optimization. I will use all completed evaluations and integrate over pending ones to approximate the expected acquisition function, from which multiple hyper-parameter configurations can be chosen in parallel.
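
A sketch of one way to integrate over pending evaluations, following the fantasizing strategy of [4]; `gp` is assumed to be a fitted GaussianProcessRegressor, and `expected_improvement` is the function sketched in the abstract above:

```python
# Sketch of handling pending evaluations by "fantasizing" their outcomes [4];
# `gp` is an assumed fitted GaussianProcessRegressor, `expected_improvement`
# the function sketched earlier in this proposal.
import numpy as np
from sklearn.base import clone

def ei_with_pending(gp, candidates, pending, y_best, n_fantasies=10):
    ei = np.zeros(len(candidates))
    for seed in range(n_fantasies):
        y_fake = gp.sample_y(pending, random_state=seed).ravel()  # fantasized results
        gp_f = clone(gp).fit(np.vstack([gp.X_train_, pending]),
                             np.concatenate([gp.y_train_, y_fake]))
        mu, sigma = gp_f.predict(candidates, return_std=True)
        ei += expected_improvement(mu, sigma, y_best)
    return ei / n_fantasies                                       # averaged acquisition
```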

4.3 Additional acquisition functions

I will implement two alternative acquisition functions, which might be better suited for certain optimization tasks than the expected improvement: 1) the probability of improvement, and 2) the GP upper confidence bound [6], which has an additional parameter to trade off exploitation versus exploration.
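
For illustration, minimal sketches of both acquisition functions, written for minimizing a loss; `kappa` is an assumed name for the GP-UCB trade-off parameter:

```python
# Sketches of the two alternative acquisition functions, for minimization.
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, best):
    # Probability that a candidate improves on the best loss seen so far.
    return norm.cdf((best - mu) / np.maximum(sigma, 1e-9))

def gp_confidence_bound(mu, sigma, kappa=2.0):
    # [6] maximizes mu + kappa*sigma; when minimizing a loss, the equivalent
    # rule picks the candidate with the smallest mu - kappa*sigma, so we
    # negate it to keep the argmax convention of the other acquisitions.
    return -(mu - kappa * sigma)
```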

4.4 Modeling the cost of evaluating the fitness function

The time for evaluating the fitness function can vary considerably for different hyper-parameters, e.g. when choosing a different number of hidden layers in a neural network. I will therefore use a second Gaussian Process to model the time for evaluating the fitness function, and compute the expected improvement per second.
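
A minimal sketch of the resulting acquisition, assuming a second fitted GP `gp_time` that models log evaluation times and the `expected_improvement` function sketched above:

```python
# Sketch of expected improvement per second, following [4]; `gp_time` is an
# assumed second GP fitted on log evaluation times.
import numpy as np

def ei_per_second(mu, sigma, best, gp_time, candidates):
    log_time = gp_time.predict(candidates)        # predicted log duration
    return expected_improvement(mu, sigma, best) / np.exp(log_time)
```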

5 Tentative Timeline

Now – 25 May: Learning period

  • Submit several pull requests, if possible related to Gaussian Processes and hyper-parameter optimization
  • Familiarize myself with the implementations of GridSearchCV, RandomizedSearchCV, and spearmint
  • Read more about hyper-parameter optimization and Gaussian Processes

Week 1, 2: Gaussian Processes

  • Implement the Matérn 5/2 kernel and maximum likelihood parameter estimation (see the kernel sketch after this list)
  • Extend the existing Gaussian Process module and improve its performance if necessary
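
For reference, a minimal sketch of the Matérn 5/2 kernel for two hyper-parameter vectors; the function and parameter names are illustrative:

```python
# Sketch of the Matérn 5/2 kernel; length_scale and variance are the kernel
# parameters to be estimated by maximum likelihood.
import numpy as np

def matern52(x1, x2, length_scale=1.0, variance=1.0):
    r = np.linalg.norm(np.asarray(x1) - np.asarray(x2)) / length_scale
    return variance * (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)
```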

Week 3, 4: Acquisition function and optimization pipeline

  • Implement expected improvement acquisition function
  • Implement search over hyper-parameter space
  • Build the optimization pipeline from the parts implemented so far
  • 1 July: Submit mid-term evaluation

Week 5, 6: First working prototype GPSearchCV

  • Sanity check of first prototype by comparing results with spearmint
  • Refactor code
  • Test basic features outlined in 3.1

Week 7–9: Additional Features

  • Slice sampling of hyper-parameters
  • Parallelization

Week 10, 11: Wrap-up

  • Profile implementation and improve performance
  • Perform benchmarking

Week 12: Documentation

  • Document interfaces
  • Write usage examples
  • 28 August: Submit code examples to Google

References

[1] GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV

[2] RandomizedSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html#sklearn.grid_search.RandomizedSearchCV

[3] Bergstra et al., “Algorithms for Hyper-Parameter Optimization.”

[4] Snoek, Larochelle, and Adams, “Practical Bayesian Optimization of Machine Learning Algorithms.”

[5] Spearmint: https://github.com/JasperSnoek/spearmint

[6] Srinivas et al., “Gaussian Process Optimization in the Bandit Setting.”
