GSoC 2015 Proposal: Improve GMM module
- Name: Wei Xue
- IRC: WeiXue
- Source control username: xuewei4d (GitHub)
- Instant messaging: xuewei4d@gmail.com (GTalk)
- Blog: http://xuewei4d.github.io
I am a PhD student in the School of Computing and Information Sciences at Florida International University, US. My main research interests are machine learning and text mining. I have passed several machine-learning-related courses, including Machine Learning, Graphical Models, and Computer Vision, and I received certificates for two MOOC courses, Convex Optimization and Machine Learning Techniques. In the course projects, I used Python and Matlab to solve convex optimization problems and implemented prototypes of Bagging, AdaBoost, CART, Random Forest and neural networks. I have been using Python as my primary programming language for 3 years. In my research projects, I programmed crawlers to collect user comments from travel websites, processed text data with scikit-learn, and built topic models in C++. I also have experience implementing Gibbs sampling for Dirichlet process Gaussian mixture models in Matlab.
- University: Florida International University
- Major: Computer Science
- Current year and expected graduation date: current year 2015; expected graduation May 16, 2016
- Degree: PhD (in progress)
- Proposal title: scikit-learn: Improve GMM module
The current GMM, VBGMM and DPGMM implementations in scikit-learn have many problems: they suffer from an incompatible interface, potential bugs and incomplete testing. In this project, I propose to clean up the API of GMM, re-implement VBGMM and DPGMM based on well-acknowledged literature, and make them pass various test cases.
The Gaussian mixture model (GMM) is a popular unsupervised clustering method. A GMM uses a linear combination of several Gaussian distributions to represent the probability distribution of the observations. In the classic GMM, with a fixed number of Gaussian components, a set of parameters is estimated to represent the distribution of the training data: the means, the covariances, and the mixing coefficients of the linear combination. Expectation Maximization (EM) is usually used to infer these parameters. Variational inference generalizes EM by performing fully Bayesian estimation in the M-step; a GMM estimated by variational Bayesian inference is called VBGMM. The Dirichlet process Gaussian mixture model (DPGMM), on the other hand, allows a mixture of infinitely many Gaussian distributions, so users do not have to set the number of components ahead of time. A DPGMM can be approximated by a truncated stick-breaking representation and estimated by truncated variational inference [12].
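For concreteness, here is the model just described in the standard notation of PRML [12] (a reference sketch; `r_{nk}` denotes the responsibility of component `k` for sample `n`):

```latex
% Mixture density with K components and mixing weights \pi_k
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K} \pi_k = 1

% E-step of EM: posterior responsibility of component k for sample n
r_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
              {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```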
In scikit-learn, GMM, VBGMM and DPGMM are all implemented. However, their source code suffers from several problems.
- Incompatible interface
  - Issue 1528 (Consistency in GMM, `_get_covars`): covariance and precision matrices are used interchangeably in different literature, and the covariance matrices that GMM uses are not provided in VBGMM and DPGMM.
  - Issue 2473 (Bug: GMM `score()` returns an array, not a value): `GMM.score` returns the per-sample likelihood, whereas other models return the summed log-likelihood.
  - Issue 3813 (log-responsibilities in GMM): `GMM.score_samples` returns exponentiated log-responsibilities, which risks precision loss and overflow.
  - Issue 4062 (KernelDensity and GMM interfaces are unnecessarily confusing)
  - Issue 4429 (incorrect estimated means lead to non positive definite covariance in GMM): in this issue, I found that GMM does not provide an explicit way for users to initialize `means`, `covariances` and `weights`.
- Potential bugs
  - Issue 1637 (dpgmm sample not working)
  - Issue 1764 (DPGMM - `_update_concentrations` fail implementation): a bug when updating `gamma_`, although it has since been fixed.
  - Issue 2454 (Scaling kills DPGMM): DPGMM and VBGMM return incorrect parameter estimates even on simple toy data.
  - Issue 4267 (Density doesn't normalize in VBGMM and DPGMM): the `score_samples` method of GMM, VBGMM and DPGMM returns the probability density at the given values, but these densities do not integrate to 1.
  - Issue 4429 (incorrect estimated means lead to non positive definite covariance in GMM): users might set `params` that break the GMM algorithms.
- Documentation
- VBGMM DPGMM derivation: the derivations on these pages are not consistent with textbooks such as PRML [12] and ESL [14].
- Testing
- VBGMM and DPGMM have not been tested comprehensively.
The PRML book [12] gives detailed update equations for Gaussian mixture models and variational inference for GMM, although some steps for updating the parameters of the Normal-Wishart distribution are skipped. Blei's paper [11] gives the variational inference steps for Dirichlet process mixtures of exponential-family distributions; although the Gaussian distribution belongs to the exponential family, the detailed update equations still have to be derived. The existing implementations estimate several types of covariance matrices (`spherical`, `tied`, `diag`, and `full`). In this step, I will present detailed update equations for all types of models and keep the notation clean and consistent.
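As an example of the equations to be presented, the variational M-step updates of the Gaussian-Wishart posterior in the `full` covariance case take the following form (transcribed from PRML [12], Chapter 10, in that chapter's notation):

```latex
% Sufficient statistics computed from the responsibilities r_{nk}
N_k = \sum_{n=1}^{N} r_{nk}, \qquad
\bar{\mathbf{x}}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \mathbf{x}_n, \qquad
\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}
    (\mathbf{x}_n - \bar{\mathbf{x}}_k)(\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\top}

% Updates of the Dirichlet and Gaussian-Wishart hyperparameters
\alpha_k = \alpha_0 + N_k, \qquad
\beta_k = \beta_0 + N_k, \qquad
\nu_k = \nu_0 + N_k

\mathbf{m}_k = \frac{1}{\beta_k} \left( \beta_0 \mathbf{m}_0 + N_k \bar{\mathbf{x}}_k \right)

\mathbf{W}_k^{-1} = \mathbf{W}_0^{-1} + N_k \mathbf{S}_k
    + \frac{\beta_0 N_k}{\beta_0 + N_k}
      (\bar{\mathbf{x}}_k - \mathbf{m}_0)(\bar{\mathbf{x}}_k - \mathbf{m}_0)^{\top}
```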
GMM has an interface that is inconsistent with other components of scikit-learn. In this step, I will:
- Fix Issue 2473, Issue 3813 and Issue 4062. Implement a `DensityEstimatorMixin` that supports a `density` method and a `score` method: `density` returns the PDF of the input X, and `score` returns the sum of the log-likelihoods of the input X (a sketch follows this list).
- Add a method `responsibility_sample` that returns the responsibility of every sample.
- Implement a user-friendly interface for users to initialize the parameters of GMM (Issue 4429).
In GMM, the parameters are estimated with the EM method. In each iteration, the E-step computes the posterior responsibility of each component for each sample, and the M-step uses maximum likelihood to estimate the mixture weights of the linear combination and the parameters of all Gaussian distributions, i.e., the means and precision matrices. In variational Bayesian inference, the M-step is generalized to a fully Bayesian treatment, where the means and precision matrices are governed by a Gaussian-Wishart distribution. In this step, I will follow the standard mathematical derivation of Chapter 10 in PRML [12]. A toy sketch of the EM loop follows.
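The sketch below illustrates the alternation of the two steps on a full-covariance GMM; the function name `em_gmm`, the random initialization and the fixed `1e-6` ridge are my choices for illustration, and a real implementation also needs a convergence check on the log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal


def em_gmm(X, n_components, n_iter=100, seed=0):
    """Toy EM for a full-covariance GMM (illustration only)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    weights = np.full(n_components, 1.0 / n_components)
    means = X[rng.choice(n, n_components, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * n_components)

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to
        # pi_k * N(x_n | mu_k, Sigma_k), normalized over k.
        resp = np.column_stack([
            w * multivariate_normal.pdf(X, m, c)
            for w, m, c in zip(weights, means, covs)])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximum-likelihood updates of the weights, means
        # and covariances given the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = resp.T.dot(X) / nk[:, np.newaxis]
        for k in range(n_components):
            diff = X - means[k]
            covs[k] = (resp[:, k, np.newaxis] * diff).T.dot(diff) / nk[k]
            covs[k] += 1e-6 * np.eye(d)  # keep positive definite

    return weights, means, covs
```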
DPGMM is another mixture model, which uses a Dirichlet process as a nonparametric prior over the mixture parameters. It is more flexible than GMM, since the number of components can vary with the data. The simplest way to infer a DPGMM is Markov chain Monte Carlo (MCMC), but it is generally slow to converge. Blei's paper [11] proposes a truncated variational inference that converges faster. In this step, I will implement variational inference following Blei's paper [11].
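The key construction in [11] is the stick-breaking representation of the mixture weights, which the truncated variational approximation cuts off at a fixed level (notation follows that paper):

```latex
% Stick-breaking: each weight is a piece broken off the remaining stick
v_k \sim \operatorname{Beta}(1, \alpha), \qquad
\pi_k(\mathbf{v}) = v_k \prod_{j=1}^{k-1} (1 - v_j)

% Truncated variational approximation: fix a truncation level T and
% set v_T = 1, so that \pi_k = 0 for all k > T
```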
As Issue 2454 points out, there are potential bugs in the current implementation: it cannot pass tests even on toy data, and no test cases have been written for VBGMM and DPGMM. The new test cases will be comprehensive and will focus on the following aspects (a sample test sketch follows the list). Several IPython Notebooks will also demonstrate how to use the three models GMM, VBGMM and DPGMM.
- Interface coherence
- Correctness
- Numerical stability
- Speed
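For example, correctness of the density in the 1-D case (Issue 4267) can be verified by numerical integration. A sketch, assuming the fixed `score_samples` returns per-sample log-densities as targeted above; the helper name and the integration bounds are mine:

```python
import numpy as np
from scipy.integrate import quad


def check_density_normalizes(model, lo=-50.0, hi=50.0, tol=1e-2):
    """Assert that exp(score_samples) integrates to ~1 for a model
    fit on 1-D data (sketch of a planned test for Issue 4267)."""
    pdf = lambda x: float(np.exp(model.score_samples(np.array([[x]])))[0])
    mass, _ = quad(pdf, lo, hi)
    assert abs(mass - 1.0) < tol, "density integrates to %.4f" % mass
```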
- Week 1 (May 25 - May 31) : Mathematical Derivation
- Week 2, 3 (Jun 1 - Jun 14) : Clean GMM
- Week 4, 5 (Jun 15 - Jun 28) : Implement VBGMM and Test
- Week 6, 7 (Jun 29 - July 12) : Implement VBGMM, DPGMM and Test
- Week 8, 9 (July 13 - July 26) : Implement DPGMM and Test
- Week 10, 11 (July 27 - Aug 9) : Test
- Week 12 (Aug 10 - Aug 16) : Documentation
- Issue 1528 Consistency in GMM, _get_covars
- Issue 1637 dpgmm sample not working
- Issue 1764 DPGMM - _update_concentrations fail implementation
- Issue 2473 Bug: GMM `score()` returns an array, not a value
- Issue 2454 Scaling kills DPGMM
- Issue 3813 log-responsibilities in GMM
- Issue 4062 KernelDensity and GMM interfaces are unnecessarily confusing
- Issue 4202 GMM and score_samples(X) back to probabilities
- Issue 4267 Density doesn't normalize in VBGMM and DPGMM
- Issue 4429 incorrect estimated means lead to non positive definite covariance in GMM
- VBGMM DPGMM derivation
- [11] Blei, David M., and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian Analysis 1.1 (2006): 121-143.
- [12] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
- [13] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
- [14] Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.
- [MRG + 1] Deprecate estimator_params in RFE and RFECV -- PR MERGED
- [MRG + 1] Friendly error on KNeighbors(n_neighbors=0 or not convertible to int) -- PR
- [MRG] Friendly error when lvmpdf has non positive definite covariance -- PR
- incorrect estimated means lead to non positive definite covariance in GMM -- Issue
- [MRG] implement balanced_accuracy_score -- PR