
GSoC 2015 Proposal: Improve GMM module


Student Information

Name Wei Xue

IRC WeiXue

Source control username GitHub: xuewei4d

Instant Messaging information GTalk: xuewei4d@gmail.com

Blog http://xuewei4d.github.io

Student Background

I am a PhD student in the School of Computing and Information Sciences at Florida International University, US. My main research interests are Machine Learning and Text Mining. I have passed several machine-learning-related courses, including Machine Learning, Graphical Models, and Computer Vision, and I received certificates for two MOOC courses, Convex Optimization and Machine Learning Techniques. In the course projects, I used Python and Matlab to solve convex optimization problems and implemented prototypes of Bagging, AdaBoost, CART, Random Forest and Neural Networks. I have been using Python as my primary programming language for 3 years. In my research projects, I wrote crawlers to collect user comments from travel websites, processed the text data with scikit-learn, and built topic models in C++. I also have experience implementing Gibbs sampling for Dirichlet process Gaussian mixture models in Matlab.

University Information

University Florida International University

Major Computer Science

Current Year and Expected Graduation date Current year 2015; expected graduation May 16, 2016

Degree PhD (in progress)

Project Proposal Information

Proposal Title scikit-learn: Improve GMM module

Abstract

The current GMM, VBGMM and DPGMM implementations in scikit-learn have several problems: inconsistent interfaces, potential bugs and incomplete testing. In this project, I propose to clean up the GMM API, re-implement VBGMM and DPGMM based on well-established literature, and add comprehensive test coverage.

Motivation

The Gaussian mixture model (GMM) is a popular unsupervised clustering method. A GMM represents the probability distribution of the observations as a linear combination of several Gaussian distributions. In the classic GMM, the number of Gaussian components is fixed in advance, and a set of parameters is estimated to represent the distribution of the training data: the means, the covariances, and the mixing coefficients of the linear combination. Expectation Maximization (EM) is usually used to infer these parameters. Variational inference generalizes EM by treating the parameters in a fully Bayesian way instead of using point estimates in the M-step; a GMM fitted with variational Bayesian inference is called VBGMM. On the other hand, the Dirichlet process Gaussian mixture model (DPGMM) allows a mixture of infinitely many Gaussian components, so users do not have to fix the number of components ahead of time. A DPGMM can be approximated by a truncated stick-breaking representation and estimated by truncated variational inference [12].
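For reference, the density that the classic GMM fits is the standard mixture of Gaussians (notation as in PRML [13]):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
```

where the mixing coefficients \pi_k, the means \boldsymbol{\mu}_k and the covariances \boldsymbol{\Sigma}_k are the parameters estimated by EM.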

In scikit-learn, GMM, VBGMM and DPGMM are all implemented. However, their source code suffers from several problems.

Theory and Implementation

Mathematical Derivations

The PRML book [13] gives detailed update equations for Gaussian mixture models and for variational inference in GMM, although some steps for updating the parameters of the Normal-Wishart distribution are skipped. Blei's paper [12] gives the variational inference steps for Dirichlet process mixtures of exponential-family distributions. Although the Gaussian distribution belongs to the exponential family, the detailed update equations still need to be derived explicitly. The existing implementations estimate several types of covariance matrices (spherical, diagonal, tied, and full). In this step, I will present detailed update equations for all types of models and keep the notation clean and consistent.
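As a sketch of the derivation for the full-covariance case (following Chapter 10 of PRML [13]), the variational posterior is assumed to factorize over the latent assignments and the parameters:

```latex
q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})
  = q(\mathbf{Z}) \, q(\boldsymbol{\pi}) \prod_{k=1}^{K} q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k),
```

where q(\boldsymbol{\pi}) is a Dirichlet distribution and each q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k) is a Gaussian-Wishart distribution; the spherical, diagonal and tied cases need analogous factors, which is exactly the part of the derivation I will write out.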

Cleaning GMM API

The GMM interface is inconsistent with other components of scikit-learn. In this step, I will

  • Fix Issue 2473, Issue 3813 and Issue 4062 by implementing a DensityEstimatorMixin that provides a density method and a score method: density returns the probability density (PDF) of the input X, and score returns the sum of the log-likelihoods of the samples in X (see the sketch after this list).
  • Add a responsibility_sample method that returns the responsibility (posterior component membership probability) of every sample.
  • Implement a user-friendly interface for initializing the parameters of GMM (Issue 4429).
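A minimal sketch of what the proposed mixin could look like; the name DensityEstimatorMixin comes from the issues above, but the exact API, and the assumption that the estimator exposes a per-sample log-density via score_samples, are still open for discussion:

```python
import numpy as np


class DensityEstimatorMixin(object):
    """Mixin for estimators that model a probability density p(x).

    Assumes the concrete estimator implements ``score_samples(X)`` and that
    it returns the per-sample log-density as a 1-D array.
    """

    def density(self, X):
        # Probability density (PDF) of each sample.
        return np.exp(self.score_samples(X))

    def score(self, X, y=None):
        # Total log-likelihood of X as a single float (addresses Issue 2473,
        # where GMM.score returned an array instead of a scalar).
        return np.sum(self.score_samples(X))
```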

Implementing VBGMM

In GMM, the parameters are estimated with the EM algorithm. In each iteration, the E-step computes the responsibilities, i.e. the posterior probabilities of the components for each sample, and the M-step uses maximum likelihood to update the mixture weights and the parameters of all Gaussian components, namely the means and precision matrices. In variational Bayesian inference, the M-step is generalized in a fully Bayesian way, where the means and precision matrices are governed by a Gaussian-Wishart prior and the mixture weights by a Dirichlet prior. In this step, I will follow the standard mathematical derivation in Chapter 10 of PRML [13].
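Concretely, for the full-covariance case the conjugate prior from Chapter 10 of PRML [13] is

```latex
p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_0), \qquad
p(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)
  = \mathcal{N}\!\left(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \boldsymbol{\Lambda}_k)^{-1}\right)
    \mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0),
```

and the variational M-step updates the hyperparameters (\alpha_k, \beta_k, \mathbf{m}_k, \mathbf{W}_k, \nu_k) of each factor rather than computing point estimates of the parameters.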

Implementing DPGMM

DPGMM is another mixture model, which uses a Dirichlet process as a nonparametric prior over the mixture components. It is more flexible than GMM, since the effective number of components can grow with the data. The simplest way to do inference in DPGMM is Markov chain Monte Carlo (MCMC), but it is generally slow to converge. Blei's paper [12] proposes a truncated variational inference that converges much faster. In this step, I will implement the variational inference following Blei's paper [12].
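The truncated stick-breaking construction used in [12] represents the mixture weights as

```latex
v_k \sim \mathrm{Beta}(1, \alpha), \qquad
\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \quad k = 1, \dots, T,
```

where the truncation level T is fixed only in the variational approximation (with v_T = 1 so the weights sum to one), while the model itself remains a mixture with an unbounded number of components.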

Testing

As Issue 2454 points out, there are potential bugs in the current implementation: it cannot even pass toy tests. Moreover, no test cases have been written for VBGMM and DPGMM. The new tests will be comprehensive and focus on the following aspects, and several IPython notebooks will demonstrate how to use the three models GMM, VBGMM and DPGMM (an example of the kind of correctness test I have in mind is sketched after this list).

  • Interface coherence
  • Correctness
  • Numerical stability
  • Speed
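
A sketch of such a correctness test against the current sklearn.mixture.GMM API; the data, tolerances and assertions are illustrative only:

```python
import numpy as np
from sklearn.mixture import GMM


def test_gmm_recovers_well_separated_means():
    # Toy data: two well-separated spherical clusters in 2-D.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2),
                   rng.randn(200, 2) + np.array([10.0, 10.0])])

    model = GMM(n_components=2, covariance_type='full', random_state=0)
    model.fit(X)

    # The estimated means should match the true cluster centers,
    # regardless of the component ordering.
    means = model.means_[np.argsort(model.means_[:, 0])]
    assert np.allclose(means[0], [0.0, 0.0], atol=0.5)
    assert np.allclose(means[1], [10.0, 10.0], atol=0.5)
```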

Timeline

  • Week 1 (May 25 - May 31) : Mathematical Derivation
  • Week 2, 3 (Jun 1 - Jun 14) : Clean GMM
  • Week 4, 5 (Jun 15 - Jun 28) : Implement VBGMM and Test
  • Week 6, 7 (Jun 29 - July 12) : Implement VBGMM, DPGMM and Test
  • Week 8, 9 (July 13 - July 26) : Implement DPGMM and Test
  • Week 10, 11 (July 27 - Aug 9) : Test
  • Week 12 (Aug 10 - Aug 16) : Documentation

References

  1. Issue 1528 Consistency in GMM, _get_covars
  2. Issue 1637 dpgmm sample not working
  3. Issue 1764 DPGMM - _update_concentrations fail implementation
  4. Issue 2473 Bug: GMM score() returns an array, not a value
  5. Issue 2454 Scaling kills DPGMM
  6. Issue 3813 log-responsibilities in GMM
  7. Issue 4062 KernelDensity and GMM interfaces are unnecessarily confusing
  8. Issue 4202 GMM and score_samples(X) back to probabilities
  9. Issue 4267 Density doesn't normalize in VBGMM and DPGMM
  10. Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM
  11. VBGMM DPGMM derivation
  12. Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian Analysis 1.1 (2006): 121-143.
  13. Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
  14. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  15. Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.

Past Work
