GSoC 2015 Proposal: Improve GMM module
Name Wei Xue
Email xuewei4d@gmail.com
Time Zone Eastern Daylight Time (UTC -4:00)
IRC WeiXue
Source control username Github: xuewei4d
Instant Messaging information GTalk: xuewei4d@gmail.com
Blog http://xuewei4d.github.io
I am a PhD student in the School of Computing and Information Sciences, Florida International University, US. My main research interests are machine learning and text mining. I have passed several machine-learning-related courses, including Machine Learning, Graphical Models and Computer Vision. I also received certificates for two MOOC courses, Convex Optimization and Machine Learning Techniques. In the course projects, I used Python and Matlab to solve convex optimization problems and implemented prototypes of Bagging, AdaBoost, CART, Random Forest and neural networks. I have been using Python as my primary programming language for 3 years. In my research projects, I programmed crawlers to collect user comments from travel websites, processed text data with scikit-learn, and built topic models with C++. I also have experience implementing Gibbs sampling for Dirichlet process Gaussian mixture models in Matlab.
University Florida International University
Major Computer Science
Current Year and Expected Graduation date 3rd year, expected to graduate in May 2016
Degree PhD (in progress)
Proposal Title scikit-learn: Improve GMM module
The current GMM, VBGMM and DPGMM implementations in scikit-learn have many problems: they suffer from interface incompatibility, potential bugs and incomplete testing. In this project, I propose to clean up the API of GMM, re-implement VBGMM and DPGMM based on well-acknowledged literature, and build a comprehensive set of test cases.
This proposal is organized as follows. The Model Introduction section gives a brief overview of the Gaussian mixture model, the variational Bayesian Gaussian mixture model and the Dirichlet process Gaussian mixture model. Next, the Motivation section states the existing problems in scikit-learn. The Plan section outlines the detailed steps for solving these problems during my GSoC participation.
The Gaussian mixture model (GMM) is a popular unsupervised clustering method. A GMM represents the probability distribution of the observations as a linear combination of several Gaussian distributions. In the classic GMM, with a fixed number of Gaussian components, a set of parameters is estimated to represent the distribution of the training data: the means, the covariances and the coefficients of the linear combination. Expectation Maximization (EM) is usually used to find the maximum-likelihood parameters of the mixture model. In each iteration, the E-step estimates the conditional distribution of the latent variables, and the M-step finds the model parameters that maximize the likelihood.
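The E-step/M-step loop described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation under my own choices (the function name `em_gmm`, the initialization from random data points, and the small diagonal regularization are assumptions), not scikit-learn's actual code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, n_components, n_iter=50, seed=0):
    """Toy EM loop for a GMM with full covariance matrices."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    # Initialize means from random samples; covariances from the data covariance.
    means = X[rng.choice(n, n_components, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * n_components)
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to weight_k * N(x_i | mean_k, cov_k)
        r = np.array([w * multivariate_normal.pdf(X, m, c)
                      for w, m, c in zip(weights, means, covs)]).T
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances from responsibilities.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for k in range(n_components):
            diff = X - means[k]
            covs[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs
```

On well-separated clusters this recovers the component means; a production implementation would additionally monitor the log-likelihood for convergence and restart from several initializations.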
In the variational Bayesian Gaussian mixture model (VBGMM), the M-step is generalized to full Bayesian estimation: the parameters are represented by a posterior distribution instead of a single point estimate as in maximum-likelihood estimation [14, 15].
On the other hand, the Dirichlet process Gaussian mixture model (DPGMM) allows a mixture of infinitely many Gaussian components. It uses a Dirichlet process as a nonparametric prior over the distribution parameters, so the number of components can vary with the data and one does not have to fix it ahead of time. The simplest way to perform inference in a DPGMM is Markov chain Monte Carlo (MCMC), but it is generally slow to converge. Blei's paper [13] proposes truncated variational inference, which converges faster.
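The Dirichlet process prior is commonly written via the stick-breaking (GEM) construction, which is also the representation the truncated variational approximation in [13] builds on. A minimal sketch of sampling truncated stick-breaking weights (the function name is my own):

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Sample mixture weights from a truncated stick-breaking process."""
    # v_k ~ Beta(1, alpha); weight_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining
```

Smaller concentration values `alpha` put most of the mass on the first few sticks, which is why the truncation level can be kept modest in practice.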
GMM, VBGMM and DPGMM are very popular in various machine learning applications [21]. However, the scikit-learn implementation suffers from many problems, which prevents wide use of these models. Every year since 2012, an "Improve GMM" topic has appeared on scikit-learn's GSoC to-do list.
- Incompatible Interface
  - Issue 1528 Consistency in GMM, `_get_covars`: covariance and precision matrices are used interchangeably in different literature, and the covariance matrices that GMM uses are not provided in VBGMM and DPGMM.
  - Issue 2473 Bug: GMM `score()` returns an array, not a value: `GMM.score` returns the per-sample likelihood, whereas other models return the sum of log-likelihoods.
  - Issue 3813 log-responsibilities in GMM: `GMM.score_samples` returns exponentiated log-responsibilities, which risks precision loss and overflow.
  - Issue 4062 KernelDensity and GMM interfaces are unnecessarily confusing
  - Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM: in this issue, I found that GMM does not provide an explicit way for users to initialize `means`, `covariances` and `weights`.
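The precision loss behind Issue 3813 comes from exponentiating log-responsibilities that lie far below the floating-point underflow threshold. A small sketch of the standard fix, normalizing in the log domain with `logsumexp` before exponentiating (the toy log values here are my own example):

```python
import numpy as np
from scipy.special import logsumexp

# Per-component weighted log densities log(w_k * N(x | mean_k, cov_k))
# for one very unlikely sample; these are plausible toy values.
log_weighted = np.array([-1000.0, -1001.0, -1002.0])

# Naive approach: exponentiating directly underflows every term to 0,
# so normalizing would divide 0 by 0.
naive = np.exp(log_weighted)

# Stable approach: subtract logsumexp in the log domain, then exponentiate.
log_resp = log_weighted - logsumexp(log_weighted)
resp = np.exp(log_resp)  # valid responsibilities summing to 1
```

Keeping `score_samples` in the log domain and normalizing this way avoids both the underflow above and overflow for very dense regions.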
- Potential bugs
  - Issue 1637 dpgmm sample not working
  - Issue 1764 DPGMM - _update_concentrations fail implementation: a bug when updating `gamma_`. It has been fixed.
  - Issue 2454 Scaling kills DPGMM: DPGMM and VBGMM return incorrect parameter estimates even on simple toy data sets.
  - Issue 4267 Density doesn't normalize in VBGMM and DPGMM: the `score_samples` method of GMM, VBGMM and DPGMM returns the probability density at given values, but these densities do not integrate to 1.
  - Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM: users might set `params` that break the GMM algorithms.
- Documentation
- VBGMM DPGMM derivation The derivations in these pages are not consistent with textbooks such as PRML [14] and MLAPP [15].
- Testing
- VBGMM and DPGMM have not been tested comprehensively.
To save the GMM, VBGMM and DPGMM modules from decay, I present the following detailed plan for this GSoC program.
A large part of the documentation for these modules is mathematical derivation. The textbook PRML [14] gives detailed update equations for the Gaussian mixture model and variational inference, though some derivation steps are skipped. Blei's paper [13] gives the variational-inference update equations for Dirichlet process mixtures of exponential-family distributions; the Gaussian distribution belongs to the exponential family, but its specific update equations still need to be derived. The existing derivation documents for VBGMM and DPGMM are not consistent with the research literature, which may be the cause of incorrect algorithm output. I will document the update equations for all models in detail. The goal of this part is to guarantee the derivations are correct, clean and easy to understand.
Other documentation, including docstrings and examples, will also be updated to follow scikit-learn conventions. Several IPython Notebooks will demonstrate how to use these three models, with intuitive pictures.
The API inconsistency needs to be resolved first for all three classes. The discussions in Issue 2473, Issue 3813, Issue 4062 and Issue 4429 shed light on the new interface. In this stage, after group discussion, I will nail down a clean and consistent interface shared by GMM, VBGMM and DPGMM. KDE, another density estimator, should also be taken into account.
Based on the discussion of the previous issues, the main problem in GMM lies in the interface. This part consists of the following prospective tasks:
- Rename GMM to GaussianMixture in order to make it consistent with naming convention
- Implement `DensityEstimatorMixin`, which supports a `density` method and a `score` method: `density` returns the PDF of input X, and `score` returns the sum of the log-likelihoods of input X
- Add a `responsibility_sample` method, which returns the responsibility of every sample
- Implement a friendly interface to initialize the parameters of GMM
- Add new parameter estimation method, maximum a posterior (MAP) [Optional] [15]
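A hypothetical sketch of what the proposed `DensityEstimatorMixin` could look like. Only the method names `density` and `score` come from the task list above; the method bodies, and the assumption that subclasses provide a per-sample `score_samples`, are my own:

```python
import numpy as np

class DensityEstimatorMixin:
    """Hypothetical mixin sketching the interface proposed above.

    Subclasses are assumed to implement score_samples(X), returning
    the per-sample log density of X.
    """

    def density(self, X):
        # PDF of each sample: exponentiate the per-sample log density.
        return np.exp(self.score_samples(X))

    def score(self, X):
        # Sum of log-likelihoods over all samples, matching what other
        # scikit-learn estimators return from score().
        return np.sum(self.score_samples(X))
```

Sharing this mixin across GMM, VBGMM, DPGMM and KDE would make `score` return a scalar everywhere, resolving the inconsistency raised in Issue 2473.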
The problems reported in VBGMM and DPGMM are the tip of an iceberg. The current evidence points to nonstandard update equations. There is some uncertainty in this part, because this stage depends on the mathematical derivation and intensive testing. The tasks generally are:
- Reshape the interface
- Check the parameter updating methods against the settled mathematical derivations. Find out why existing implementation fails on simple cases
- Fix the emerged bugs and refactor problematic methods
- Better initialization strategy [Optional] [15]
As Issue 2454 points out, there are potential bugs in the current implementation: it cannot even pass toy tests. Moreover, no test cases have been written for VBGMM and DPGMM. In this part, each feature will be accompanied by a comprehensive set of tests. The test data sets include, for example, randomly generated toy examples, the Iris data set and the Old Faithful data set. The test cases consist of the following parts:
- Test helper functions and class methods in the modules
- Test friendly exception messages
- Benchmark with regard to n_samples, n_features, n_components, n_iterations, tolerance and the four covariance types
- Plot the likelihood (or lower-bound) against the above variables
- Use the model as a classifier to test accuracy, recall and F1 score
- Do profiling on memory usage and runtime. If acceleration is considered, compare the improved against current implementation in terms of performance
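As a concrete example of the kind of test motivated by Issue 4267, the density returned by a fitted model can be numerically integrated over a grid and compared to 1. The helper below is an illustrative sketch (the function name is mine) that works with any callable returning per-sample log densities, standing in for a fitted model's `score_samples`:

```python
import numpy as np

def check_density_normalizes(log_density, grid, spacing, tol=1e-3):
    """Check that exp(log_density) integrates to approximately 1 on a
    1-D grid, via a simple Riemann sum."""
    integral = np.sum(np.exp(log_density(grid))) * spacing
    return abs(integral - 1.0) < tol

def log_std_normal(x):
    # Log density of a 1-D standard normal, used as a toy model.
    return -0.5 * (x ** 2 + np.log(2 * np.pi))

# A fine grid covering essentially all of the probability mass.
xs = np.linspace(-10.0, 10.0, 20001)
```

A model whose `score_samples` exponentiates incorrectly, as in the issue, would fail this check immediately on such a toy density.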
- Week 1
- Derive the mathematical equations
- Improve documents and examples for GMM
- Week 2, 3, 4
- Discuss the consistent interface
- Clean the API of GMM
- Test and Documentation
- Week 5, 6, 7, 8
- Debug and implement VBGMM
- Test and Documentation
- Pass the midterm
- Week 9, 10
- Debug and implement DPGMM
- Test and Documentation
- Week 11, 12
- Add more test cases, improve documents and fix bugs
- Week 13
- Pencils down
- Wrap up
- Issue 393 DPGMM and VBGMM clustering weights not updating
- Issue 1528 Consistency in GMM, _get_covars
- Issue 1637 dpgmm sample not working
- Issue 1764 DPGMM - _update_concentrations fail implementation
- Issue 2473 Bug: GMM `score()` returns an array, not a value
- Issue 2454 Scaling kills DPGMM
- Issue 3813 log-responsibilities in GMM
- Issue 4062 KernelDensity and GMM interfaces are unnecessarily confusing
- Issue 4202 GMM and score_samples(X) back to probabilities
- Issue 4267 Density doesn't normalize in VBGMM and DPGMM
- Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM
- VBGMM DPGMM derivation in scikit-learn
- Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian analysis 1.1 (2006): 121-143.
- Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
- Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
- Hastie, Trevor, et al. The Elements of Statistical Learning. New York: Springer, 2009.
- mclust. An R package for normal mixture modeling via EM, model-based clustering, classification, and density estimation.
- Matlab gmdistribution
- BayesPy
- Variational Dirichlet Process Gaussian Mixture Model
- Wikipedia Mixture Model
- [MRG + 1] Deprecate estimator_params in RFE and RFECV [PR MERGED]
- [MRG + 1] Friendly error on KNeighbors(n_neighbors=0 or not convertible to int) [PR]
- [MRG] Friendly error when lvmpdf has non positive definite covariance [PR]
- incorrect estimated means lead to non positive definite covariance in GMM [Issue]
- [MRG] implement balanced_accuracy_score [PR]