
GSoC 2015 Proposal: Improve GMM module


Student Information

Name Wei Xue

IRC WeiXue

Source control username GitHub: xuewei4d

Instant Messaging information GTalk: xuewei4d@gmail.com

Blog http://xuewei4d.github.io

Student Background

I am a PhD student in the School of Computing and Information Sciences at Florida International University, US. My main research interests are Machine Learning and Text Mining. I have passed several machine-learning-related courses, including Machine Learning, Graphical Models, and Computer Vision, and I received certificates for two MOOC courses, Convex Optimization and Machine Learning Techniques. In the course projects, I used Python and Matlab to solve convex optimization problems and implemented prototypes of Bagging, AdaBoost, CART, Random Forest and Neural Networks. I have been using Python as my primary programming language for 3 years. In my research projects, I wrote crawlers to collect user comments from travel websites, processed the text data with scikit-learn, and built topic models in C++. I also have experience implementing Gibbs sampling for Dirichlet process Gaussian mixture models in Matlab.

University Information

University Florida International University

Major Computer Science

Current Year and Expected Graduation date Current year 2015; expected graduation May 16, 2016

Degree PhD (in progress)

Project Proposal Information

Proposal Title scikit-learn: Improve GMM module

Abstract

The current GMM, VBGMM and DPGMM implementations in scikit-learn have several problems: inconsistent interfaces, potential bugs and incomplete testing. In this project, I propose to clean up the GMM API, re-implement VBGMM and DPGMM based on well-established literature, and add comprehensive test coverage.

Motivation

The Gaussian mixture model (GMM) is a popular unsupervised clustering method. A GMM represents the probability distribution of the observations as a linear combination of several Gaussian distributions. In the classic GMM, the number of Gaussian components is fixed in advance, and a set of parameters is estimated to represent the distribution of the training data: the means, the covariances, and the mixing coefficients of the linear combination. Expectation Maximization (EM) is usually used to infer these parameters. Variational inference generalizes EM by treating the parameters in a fully Bayesian way instead of using point estimates in the M-step; a GMM fitted with variational Bayesian inference is called VBGMM. On the other hand, the Dirichlet process Gaussian mixture model (DPGMM) allows a mixture of infinitely many Gaussian components, so users do not have to fix the number of components ahead of time. A DPGMM can be approximated by a truncated stick-breaking representation and estimated by truncated variational inference [12].
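For reference, the density that the classic GMM fits is the standard mixture of Gaussians (notation as in PRML [13]):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
```

where the mixing coefficients \pi_k, the means \boldsymbol{\mu}_k and the covariances \boldsymbol{\Sigma}_k are the parameters estimated by EM.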

In scikit-learn, GMM, VBGMM and DPGMM are all implemented. However, their source code suffers from several problems.

Theory and Implementation

Mathematical Derivations

The PRML book [13] gives detailed update equations for Gaussian mixture models and for variational inference in GMM, although some steps for updating the parameters of the Normal-Wishart distribution are skipped. Blei's paper [12] gives the variational inference steps for Dirichlet process mixtures of exponential-family distributions. Although the Gaussian distribution belongs to the exponential family, the detailed update equations still need to be derived explicitly. The existing implementations estimate several types of covariance matrices (spherical, diagonal, tied, and full). In this step, I will present detailed update equations for all types of models and keep the notation clean and consistent.
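As a sketch of the derivation for the full-covariance case (following Chapter 10 of PRML [13]), the variational posterior is assumed to factorize over the latent assignments and the parameters:

```latex
q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})
  = q(\mathbf{Z}) \, q(\boldsymbol{\pi}) \prod_{k=1}^{K} q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k),
```

where q(\boldsymbol{\pi}) is a Dirichlet distribution and each q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k) is a Gaussian-Wishart distribution; the spherical, diagonal and tied cases need analogous factors, which is exactly the part of the derivation I will write out.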

Cleaning GMM API

The GMM interface is inconsistent with other components of scikit-learn. In this step, I will

  • Fix Issue 2473, Issue 3813 and Issue 4062 by implementing a DensityEstimatorMixin that provides a density method and a score method: density returns the probability density (PDF) of the input X, and score returns the sum of the log-likelihoods of the samples in X (see the sketch after this list).
  • Add a responsibility_sample method that returns the responsibility (posterior component membership probability) of every sample.
  • Implement a user-friendly interface for initializing the parameters of GMM (Issue 4429).
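A minimal sketch of what the proposed mixin could look like; the name DensityEstimatorMixin comes from the issues above, but the exact API, and the assumption that the estimator exposes a per-sample log-density via score_samples, are still open for discussion:

```python
import numpy as np


class DensityEstimatorMixin(object):
    """Mixin for estimators that model a probability density p(x).

    Assumes the concrete estimator implements ``score_samples(X)`` and that
    it returns the per-sample log-density as a 1-D array.
    """

    def density(self, X):
        # Probability density (PDF) of each sample.
        return np.exp(self.score_samples(X))

    def score(self, X, y=None):
        # Total log-likelihood of X as a single float (addresses Issue 2473,
        # where GMM.score returned an array instead of a scalar).
        return np.sum(self.score_samples(X))
```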

Implementing VBGMM

In GMM, the parameters are estimated with the EM algorithm. In each iteration, the E-step computes the responsibilities, i.e. the posterior probabilities of the components for each sample, and the M-step uses maximum likelihood to update the mixture weights and the parameters of all Gaussian components, namely the means and precision matrices. In variational Bayesian inference, the M-step is generalized in a fully Bayesian way, where the means and precision matrices are governed by a Gaussian-Wishart prior and the mixture weights by a Dirichlet prior. In this step, I will follow the standard mathematical derivation in Chapter 10 of PRML [13].
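Concretely, for the full-covariance case the conjugate prior from Chapter 10 of PRML [13] is

```latex
p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_0), \qquad
p(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)
  = \mathcal{N}\!\left(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \boldsymbol{\Lambda}_k)^{-1}\right)
    \mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0),
```

and the variational M-step updates the hyperparameters (\alpha_k, \beta_k, \mathbf{m}_k, \mathbf{W}_k, \nu_k) of each factor rather than computing point estimates of the parameters.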

Implementing DPGMM

DPGMM is another mixture model, which uses a Dirichlet process as a nonparametric prior over the mixture components. It is more flexible than GMM, since the effective number of components can grow with the data. The simplest way to do inference in DPGMM is Markov chain Monte Carlo (MCMC), but it is generally slow to converge. Blei's paper [12] proposes a truncated variational inference that converges much faster. In this step, I will implement the variational inference following Blei's paper [12].
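The truncated stick-breaking construction used in [12] represents the mixture weights as

```latex
v_k \sim \mathrm{Beta}(1, \alpha), \qquad
\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \quad k = 1, \dots, T,
```

where the truncation level T is fixed only in the variational approximation (with v_T = 1 so the weights sum to one), while the model itself remains a mixture with an unbounded number of components.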

Testing

As Issue 2454 points out, there are potential bugs in the current implementation: it cannot even pass toy tests. Moreover, no test cases have been written for VBGMM and DPGMM. The new tests will be comprehensive and focus on the following aspects, and several IPython notebooks will demonstrate how to use the three models GMM, VBGMM and DPGMM (an example of the kind of correctness test I have in mind is sketched after this list).

  • Interface coherence
  • Correctness
  • Numerical stability
  • Speed
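
A sketch of such a correctness test against the current sklearn.mixture.GMM API; the data, tolerances and assertions are illustrative only:

```python
import numpy as np
from sklearn.mixture import GMM


def test_gmm_recovers_well_separated_means():
    # Toy data: two well-separated spherical clusters in 2-D.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2),
                   rng.randn(200, 2) + np.array([10.0, 10.0])])

    model = GMM(n_components=2, covariance_type='full', random_state=0)
    model.fit(X)

    # The estimated means should match the true cluster centers,
    # regardless of the component ordering.
    means = model.means_[np.argsort(model.means_[:, 0])]
    assert np.allclose(means[0], [0.0, 0.0], atol=0.5)
    assert np.allclose(means[1], [10.0, 10.0], atol=0.5)
```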

Timeline

  • Week 1 (May 25 - May 31) : Mathematical Derivation
  • Week 2, 3 (Jun 1 - Jun 14) : Clean GMM
  • Week 4, 5 (Jun 15 - Jun 28) : Implement VBGMM and Test
  • Week 6, 7 (Jun 29 - July 12) : Implement VBGMM, DPGMM and Test
  • Week 8, 9 (July 13 - July 26) : Implement DPGMM and Test
  • Week 10, 11 (July 27 - Aug 9) : Test
  • Week 12 (Aug 10 - Aug 16) : Documentation

References

  1. Issue 1528 Consistency in GMM, _get_covars
  2. Issue 1637 dpgmm sample not working
  3. Issue 1764 DPGMM - _update_concentrations fail implementation
  4. Issue 2473 Bug: GMM score() returns an array, not a value
  5. Issue 2454 Scaling kills DPGMM
  6. Issue 3813 log-responsibilities in GMM
  7. Issue 4062 KernelDensity and GMM interfaces are unnecessarily confusing
  8. Issue 4202 GMM and score_samples(X) back to probabilities
  9. Issue 4267 Density doesn't normalize in VBGMM and DPGMM
  10. Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM
  11. VBGMM DPGMM derivation
  12. Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian Analysis 1.1 (2006): 121-143.
  13. Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
  14. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  15. Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.

Past Work
