
GSoC 2015 Proposal: Global optimization based Hyperparameter optimization (SMAC RF and GP) Hamzeh Alsalhi

Hamzeh Alsalhi edited this page Mar 27, 2015 · 2 revisions

Student Information

Name: Hamzeh Alsalhi

Email: ha258@cornell.edu

Time zone: Eastern Time (ET)

Github: hamsal

Skype: hamzeh.al

Alternate email: 93hamsal@gmail.com

blog: http://hamzehal.blogspot.com/

University Information

University: Cornell University

Major: Computer Science

Current Year and Expected Graduation date: 4th year, expected graduation May 2015

Degree (e.g. BSc, PhD): BA

I will be attending Cornell University for an MEng degree in Computer Science starting Fall 2015

Qualifications

My qualifications for successful completion of this project include experience writing machine learning algorithms in Python from my coursework in "Introduction to Machine Learning" and "Machine Learning for Data Science," in which I implemented decision trees, SVMs, PCA, k-means clustering, and spectral clustering. In addition, my coursework has familiarized me with the machine learning literature and given me the ability to refer to papers and extract the information necessary to implement the algorithms they describe. Finally, I successfully completed a GSoC project with scikit-learn in the summer of 2014.

Project Proposal

Proposal Title

Scikit-learn: Global optimization based Hyperparameter optimization (SMAC RF and GP)

Proposal Abstract

Scikit-learn is a leading Python machine learning library that makes it easy to use cutting-edge machine learning tools. Human intervention is often necessary to tune classifiers so that they perform their best on specific tasks. Parameter searches reduce the guesswork involved by automatically finding the most appropriate parameter settings. The goal of this project is to expand scikit-learn's parameter search tools by implementing two variations of a global-optimization-based hyperparameter optimizer: SMAC using random forests, and SMAC using Gaussian processes.
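
For context, the baseline being expanded is exhaustive search over a fixed grid of candidate settings. The sketch below is a minimal, self-contained illustration of that idea; the objective function, the parameter names `C` and `gamma`, and the grid values are all hypothetical stand-ins for an expensive cross-validated score:

```python
from itertools import product

# Toy objective standing in for cross-validated accuracy; in practice this
# would be an expensive call that fits and scores a scikit-learn estimator.
def score(C, gamma):
    return -((C - 1.0) ** 2 + (gamma - 0.1) ** 2)

# Candidate values for each hyperparameter (names and values are illustrative).
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

# Exhaustive search: evaluate every combination and keep the best.
best_params, best_score = None, float("-inf")
for C, gamma in product(grid["C"], grid["gamma"]):
    s = score(C, gamma)
    if s > best_score:
        best_params, best_score = {"C": C, "gamma": gamma}, s
```

Exhaustive search evaluates every combination, which is exactly the cost a model-based optimizer like SMAC tries to avoid.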

Implementation Plan

Sequential Model-based Algorithm Configuration (SMAC) is a recent method developed to optimize the parameters of any automated process, and it has proven useful in machine learning for optimizing classifier parameters. It has been shown to scale better to high-dimensional and discrete input spaces than other parameter optimization methods.

The heart of this algorithm will be a model that computes a probability distribution over the function mapping a parameter configuration to a score from a performance metric. This will include randomly sampling examples on which to evaluate the classifier, so as to approximate its performance over the entire data set. SMAC can be configured to use either RF or GP as this model. To guide exploration of the parameter space, a desirability (acquisition) function will be used to calculate the expected improvement gained from evaluating the classifier with a particular configuration.
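
The expected-improvement calculation mentioned above can be sketched as follows. This is the standard closed form for a Gaussian predictive distribution (the quantity both the RF and GP variants would maximize over candidate configurations); all numeric values here are illustrative assumptions:

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for maximization: how much a configuration with predicted mean
    `mu` and predictive uncertainty `sigma` is expected to beat `best`."""
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

# A confident, slightly worse configuration vs. an uncertain one:
# high uncertainty can make an apparently worse point worth exploring.
ei_safe = expected_improvement(mu=0.80, sigma=0.01, best=0.82)
ei_risky = expected_improvement(mu=0.78, sigma=0.10, best=0.82)
```

Because EI rewards uncertainty as well as a high predicted mean, `ei_risky` exceeds `ei_safe` even though its predicted score is lower; this is how the acquisition function balances exploration against exploitation.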

SMAC using RF

Random forests were chosen because SMAC with an RF surrogate has been shown to improve performance on discrete optimization tasks. This will be a valuable addition to scikit-learn because it will be useful in different contexts than GP-based hyperparameter optimizers. The plan is to develop this variant of the algorithm first because it can be done independently of the current ongoing work on GPs.
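
One way an RF surrogate supplies the predictive mean and uncertainty SMAC needs is to look at the spread of the individual trees' predictions. In the toy sketch below, 1-nearest-neighbour predictors fitted on bootstrap resamples stand in for real regression trees, and the (configuration, score) history is made up:

```python
import random
import statistics

random.seed(0)

# Observed (configuration, score) pairs; values are illustrative.
history = [(0.1, 0.62), (0.5, 0.71), (0.9, 0.55), (0.3, 0.68), (0.7, 0.66)]

def nearest_score(data, x):
    # A single "tree" here is just a 1-nearest-neighbour predictor
    # fitted on a bootstrap resample of the history.
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

def rf_predict(x, n_trees=50):
    """Mean and spread of per-tree predictions: the RF analogue of the
    posterior mean and standard deviation a GP surrogate would provide."""
    preds = []
    for _ in range(n_trees):
        boot = [random.choice(history) for _ in history]
        preds.append(nearest_score(boot, x))
    return statistics.mean(preds), statistics.pstdev(preds)

mu, sigma = rf_predict(0.4)
```

Because each member only sees a resample of the data, the ensemble naturally disagrees more in sparsely observed regions of the configuration space, which is what drives exploration there.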

SMAC using GP

The GP variant of SMAC will be implemented after SMAC RF. It is valuable because it will provide an additional hyperparameter optimizer, one that works best for continuous parameter searches.
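
A minimal sketch of the GP side, assuming a zero-mean GP with a squared-exponential kernel: the posterior mean and standard deviation computed here are the quantities a GP-based SMAC would feed into the expected-improvement function. The observed configurations, scores, and kernel length scale are illustrative:

```python
import numpy as np

def rbf(a, b, length_scale=0.3):
    # Squared-exponential kernel between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Observed configurations and scores (illustrative values).
X = np.array([0.1, 0.4, 0.8])
y = np.array([0.60, 0.72, 0.58])

def gp_posterior(x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP
    conditioned on the observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(x_new, X)
    K_inv = np.linalg.inv(K)
    mean = k_star @ K_inv @ y
    var = rbf(x_new, x_new).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Near an observed point the GP is confident; far away it reverts
# to the prior (mean 0, standard deviation 1).
mu, sigma = gp_posterior(np.array([0.4, 2.0]))
```

The smooth, well-calibrated uncertainty of the GP posterior is what makes this variant well suited to continuous parameter spaces, complementing the discrete strengths of the RF surrogate.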

Deliverables

There are ultimately three deliverables for this project:

A pull request for SMAC RF that includes the algorithm implementation, extensive unit testing, and appropriate documentation.

A pull request to finalize PR #4270 as a prerequisite of Spearmint if necessary.

A pull request for SMAC GP that includes the algorithm implementation, extensive unit testing, and appropriate documentation.

Timeline

  • Week 1 – May 25th - Begin implementing SMAC RF optimization algorithm in a new PR with included unit tests

  • Week 2 – June 1st – Address first week of feedback and continue to make improvements to the SMAC RF PR

  • Week 3 – June 8th - Address second week of feedback on PR, finalize SMAC RF, and mark ready for merge

  • Week 4 – June 15th - Begin implementing required GP prerequisites for SMAC GP by picking up work on PR #4270 if necessary

  • Week 5 – June 22nd - Address first week of feedback and finalize work on GP prerequisites

Midterm Evaluations

  • Week 6 – June 29th - Begin implementing SMAC GP optimization algorithm in a new PR with included unit tests

  • Week 7 – July 6th - Address first week of feedback and continue to make improvements to the SMAC GP PR

  • Week 8 – July 13th - Address second week of feedback on PR, finalize SMAC GP, and mark ready for merge

  • Week 9 – July 20th – Begin patching any bugs, errors, or backward compatibility issues that have arisen from my contributions to the code base.

  • Week 10 – July 27th – Finalize patches and mark ready for merge.

  • Week 11 – Aug 3rd & Week 12 – Aug 10th - Two-week buffer to allow for unexpected delays

Link to a patch

My main patches to scikit-learn from GSoC 2014:

https://github.com/scikit-learn/scikit-learn/pull/3486

https://github.com/scikit-learn/scikit-learn/pull/3438

https://github.com/scikit-learn/scikit-learn/pull/3276

https://github.com/scikit-learn/scikit-learn/pull/3203

https://github.com/scikit-learn/scikit-learn/pull/3161

References

Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An Evaluation of Sequential Model-Based Optimization for Expensive Blackbox Functions. In GECCO 2013 Blackbox Optimization Benchmarking Workshop (BBOB'13), 2013.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems, 2012.
