Model-Based Clustering


The Experts below are selected from a list of 413,931 Experts worldwide, ranked by the ideXlab platform.

Luca Scrucca - One of the best experts on this subject based on the ideXlab platform.

  • Model-Based Clustering with sparse covariance matrices
    Statistics and Computing, 2018
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for Model-Based Clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.
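    A usage sketch in R is given below. Caveat: the function and argument names (mixGGM(), K, model) are recalled from the mixggm documentation and may not match the current CRAN release exactly; treat this as an illustration of the intended workflow, not verified code.

      # Sketch: fit a mixture of Gaussian covariance graph models with mixggm;
      # argument names here are assumptions -- check the CRAN documentation
      library(mixggm)
      X <- scale(iris[, 1:4])
      fit <- mixGGM(X, K = 3, model = "covariance")
      fit  # inspect the selected graphs and the clustering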

  • Model-Based Clustering with Sparse Covariance Matrices
    arXiv: Methodology, 2017
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality.

  • Model-Based Clustering with Sparse Covariance Matrices
    2017
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    We introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. The framework allows a parsimonious Model-Based Clustering of the data, where clusters are characterized by sparse covariance matrices and the associated dependence structures are represented by graphs. The graphical models impose a set of pairwise independence restrictions on the covariance matrices, resulting in sparsity and a flexible model for the joint distribution of the variables. The model is estimated employing a penalised likelihood approach, whose maximisation is carried out using a genetic algorithm embedded in a structural-EM algorithm. The method extends naturally to allow for Bayesian regularization in the case of high-dimensional data.

  • Genetic Algorithms for Subset Selection in Model-Based Clustering
    Unsupervised Learning Algorithms, 2016
    Co-Authors: Luca Scrucca
    Abstract:

    Model-Based Clustering assumes that the observed data can be represented by a finite mixture model, where each cluster is described by a parametric distribution; the Gaussian distribution is often employed in the multivariate continuous case. Identifying the subset of relevant Clustering variables reduces the number of unknown parameters, yielding more efficient estimates, a clearer interpretation and often improved Clustering partitions. This paper discusses variable or feature selection for Model-Based Clustering. Following the approach of Raftery and Dean (J Am Stat Assoc 101(473):168–178, 2006), the problem of subset selection is recast as a model comparison problem, and BIC is used to approximate Bayes factors. The proposed criterion is the BIC difference between a candidate Clustering model for a given subset and a model which assumes no Clustering for the same subset. The problem thus amounts to finding the feature subset which maximises this criterion. A search over the potentially vast solution space is performed using genetic algorithms, stochastic search methods inspired by evolutionary biology and natural selection. Numerical experiments on real data applications are presented and discussed.
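    The criterion and search can be sketched in R by pairing mclust with the binary genetic algorithm in the GA package (also by Scrucca). This is a simplified illustration of the idea, not the chapter's exact implementation.

      # GA search over variable subsets, scored by the BIC difference between
      # a clustering model (G >= 2) and a no-clustering model (G = 1)
      library(mclust)
      library(GA)

      X <- scale(iris[, 1:4])

      fitness <- function(bits) {
        vars <- which(bits == 1)
        if (length(vars) == 0) return(-Inf)
        Xs <- X[, vars, drop = FALSE]
        clust   <- Mclust(Xs, G = 2:5, verbose = FALSE)  # candidate clustering model
        noclust <- Mclust(Xs, G = 1,   verbose = FALSE)  # no-clustering reference
        clust$bic - noclust$bic                          # criterion to maximise
      }

      ga_fit <- ga(type = "binary", fitness = fitness, nBits = ncol(X),
                   popSize = 20, maxiter = 30, seed = 1)
      which(ga_fit@solution[1, ] == 1)  # selected variable subset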

  • Visualization of Model-Based Clustering Structures
    Data Analysis and Classification, 2009
    Co-Authors: Luca Scrucca
    Abstract:

    Model-Based Clustering based on a finite mixture of Gaussian components is an effective method for finding groups of observations in a dataset. In this paper we propose a dimension reduction method, called MCLUSTSIR, which is able to show Clustering structures depending on the selected Gaussian mixture model. The method aims at finding those directions which display both variation in cluster means and variation in cluster covariances. The resulting MCLUSTSIR variables are defined by a linear mapping which projects the data onto a suitable subspace.
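    A closely related projection is implemented in the mclust package as MclustDR(); the sketch below uses it as a stand-in for MCLUSTSIR, on the assumption that both seek directions separating cluster means and cluster covariances.

      # Directions displaying differences in cluster means and covariances
      library(mclust)
      mod <- Mclust(iris[, 1:4], G = 3)
      dr  <- MclustDR(mod)
      summary(dr)
      plot(dr, what = "scatterplot")  # data projected onto the estimated directions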

Michael Fop - One of the best experts on this subject based on the ideXlab platform.

  • Model-Based Clustering with sparse covariance matrices
    Statistics and Computing, 2018
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for Model-Based Clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.

  • Variable Selection Methods for Model-Based Clustering
    Statistics Surveys, 2018
    Co-Authors: Michael Fop, Thomas Brendan Murphy
    Abstract:

    Model-Based Clustering is a popular approach for Clustering multivariate data which has seen applications in numerous fields. High-dimensional data are increasingly common, and the Model-Based Clustering approach has adapted to deal with the growing dimensionality. In particular, the development of variable selection techniques has received considerable attention and research effort in recent years. Even for small-scale problems, variable selection has been advocated to facilitate the interpretation of the Clustering results. This review summarises the methods developed for variable selection in Model-Based Clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.

  • Model-Based Clustering with Sparse Covariance Matrices
    arXiv: Methodology, 2017
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality.

  • Model-Based Clustering with Sparse Covariance Matrices
    2017
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    We introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. The framework allows a parsimonious Model-Based Clustering of the data, where clusters are characterized by sparse covariance matrices and the associated dependence structures are represented by graphs. The graphical models impose a set of pairwise independence restrictions on the covariance matrices, resulting in sparsity and a flexible model for the joint distribution of the variables. The model is estimated employing a penalised likelihood approach, whose maximisation is carried out using a genetic algorithm embedded in a structural-EM algorithm. The method extends naturally to allow for Bayesian regularization in the case of high-dimensional data.

Thomas Brendan Murphy - One of the best experts on this subject based on the ideXlab platform.

  • Model-Based Clustering of Count Processes
    Journal of Classification, 2020
    Co-Authors: Tin Lok James Ng, Thomas Brendan Murphy
    Abstract:

    A Model-Based Clustering method based on a Gaussian Cox process is proposed to address the problem of Clustering count process data. The model allows for nonparametric estimation of the intensity functions of the Poisson processes while simultaneously Clustering the count process observations. A logistic Gaussian process transformation is imposed on the intensity functions to enforce smoothness. Maximum likelihood parameter estimation is carried out via the EM algorithm, while model selection is addressed using a cross-validated likelihood approach. The proposed model and methodology are applied to two datasets.

  • Model-Based Clustering with sparse covariance matrices
    Statistics and Computing, 2018
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for Model-Based Clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.

  • Variable Selection Methods for Model-Based Clustering
    Statistics Surveys, 2018
    Co-Authors: Michael Fop, Thomas Brendan Murphy
    Abstract:

    Model-Based Clustering is a popular approach for Clustering multivariate data which has seen applications in numerous fields. High-dimensional data are increasingly common, and the Model-Based Clustering approach has adapted to deal with the growing dimensionality. In particular, the development of variable selection techniques has received considerable attention and research effort in recent years. Even for small-scale problems, variable selection has been advocated to facilitate the interpretation of the Clustering results. This review summarises the methods developed for variable selection in Model-Based Clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.

  • Model-Based Clustering with Sparse Covariance Matrices
    arXiv: Methodology, 2017
    Co-Authors: Michael Fop, Thomas Brendan Murphy, Luca Scrucca
    Abstract:

    Finite Gaussian mixture models are widely used for Model-Based Clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models are easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or by assuming local independence. However, these remedies do not allow direct estimation of sparse covariance matrices, nor do they take into account that the structure of association among the variables can vary from one cluster to another. To this end, we introduce mixtures of Gaussian covariance graph models for Model-Based Clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation, and a general penalty term on the graph configurations can be used to induce different levels of sparsity and to incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameter and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are obtained directly. The framework results in a parsimonious Model-Based Clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and applications to illustrative datasets show that the method attains good classification performance and model quality.

  • On Estimation of Parameter Uncertainty in Model-Based Clustering
    arXiv: Computation, 2015
    Co-Authors: Adrian O'hagan, Thomas Brendan Murphy, Isobel Claire Gormley
    Abstract:

    Mixture models are a popular tool in Model-Based Clustering. Such a model is often fitted by a procedure that maximizes the likelihood, such as the EM algorithm. At convergence, the maximum likelihood parameter estimates are typically reported, but in most cases little emphasis is placed on the variability associated with these estimates. In part this may be because standard errors are not directly calculated by the model-fitting algorithm, either because they are not required to fit the model or because they are difficult to compute. The examination of standard errors in Model-Based Clustering is therefore typically neglected. The widely used R package mclust has recently introduced bootstrap and weighted likelihood bootstrap methods to facilitate standard error estimation. This paper provides an empirical comparison of these methods (along with the jackknife method) for producing standard errors and confidence intervals for mixture parameters. The methods are illustrated and contrasted in a simulation study and on the classic Old Faithful data set.
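    In R, the resampling schemes compared here are exposed through mclust's MclustBootstrap(); a minimal sketch on the Old Faithful data:

      library(mclust)
      data(faithful)
      mod <- Mclust(faithful)

      # Nonparametric bootstrap ("bs") and weighted likelihood bootstrap ("wlbs");
      # a jackknife variant (type = "jk") is also available in recent versions
      boot_np <- MclustBootstrap(mod, nboot = 199, type = "bs")
      boot_wl <- MclustBootstrap(mod, nboot = 199, type = "wlbs")

      summary(boot_np, what = "se")  # bootstrap standard errors
      summary(boot_wl, what = "ci")  # percentile confidence intervals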

Charles Bouveyron - One of the best experts on this subject based on the ideXlab platform.

  • Model-Based Clustering of High-Dimensional Data: A Review
    Computational Statistics & Data Analysis, 2014
    Co-Authors: Charles Bouveyron, Camille Brunet-Saumard
    Abstract:

    Model-Based Clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, high-dimensional data are nowadays increasingly frequent and, unfortunately, classical Model-Based Clustering techniques perform poorly in high-dimensional spaces, mainly because Model-Based Clustering methods are dramatically over-parametrized in this case. Nevertheless, high-dimensional spaces have specific characteristics which are useful for Clustering, and recent techniques exploit those characteristics. After recalling the bases of Model-Based Clustering, dimension reduction approaches, regularization-based techniques, parsimonious modeling, subspace Clustering methods and Clustering methods based on variable selection are reviewed. Existing software for Model-Based Clustering of high-dimensional data is also reviewed, and its practical use is illustrated on real-world data sets.

  • Model-Based Clustering of high-dimensional data streams with online mixture of probabilistic PCA
    Advances in Data Analysis and Classification, 2013
    Co-Authors: Anastasios Bellas, Charles Bouveyron, Marie Cottrell, Jérôme Lacaille
    Abstract:

    Model-Based Clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, Model-Based Clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of Model-Based Clustering, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.

  • Model-Based Clustering of High-Dimensional Data: A review
    Computational Statistics and Data Analysis, 2013
    Co-Authors: Charles Bouveyron, Camille Brunet
    Abstract:

    Model-Based Clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, high-dimensional data are nowadays increasingly frequent and, unfortunately, classical Model-Based Clustering techniques perform poorly in high-dimensional spaces, mainly because Model-Based Clustering methods are dramatically over-parametrized in this case. Nevertheless, high-dimensional spaces have specific characteristics which are useful for Clustering, and recent techniques exploit those characteristics. After recalling the bases of Model-Based Clustering, this article reviews dimension reduction approaches, regularization-based techniques, parsimonious modeling, subspace Clustering methods and Clustering methods based on variable selection. Existing software for Model-Based Clustering of high-dimensional data is also reviewed, and its practical use is illustrated on real-world data sets.

  • HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data
    2012
    Co-Authors: Laurent Bergé, Charles Bouveyron, Stéphane Girard
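    A minimal usage sketch of the package's two main entry points, hddc() for clustering and hdda() for discriminant analysis, on the wine data that (if memory serves) ships with HDclassif:

      library(HDclassif)
      data(wine)                 # class label in column 1
      X <- scale(wine[, -1])

      fit <- hddc(X, K = 1:6)    # subspace clustering; K selected by BIC
      table(fit$class, wine[, 1])

      da <- hdda(X, wine[, 1])   # high-dimensional discriminant analysis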

Adrian E Raftery - One of the best experts on this subject based on the ideXlab platform.

  • Bayesian model averaging in Model-Based Clustering and density estimation
    arXiv: Computation, 2015
    Co-Authors: Niamh Russell, Thomas Brendan Murphy, Adrian E Raftery
    Abstract:

    We propose Bayesian model averaging (BMA) as a method for postprocessing the results of Model-Based Clustering. Given a number of competing models, appropriate model summaries are averaged, using the posterior model probabilities, instead of being taken from a single "best" model. We demonstrate the use of BMA in Model-Based Clustering for a number of datasets. We show that BMA provides a useful summary of the Clustering of observations while taking model uncertainty into account. Further, we show that BMA in conjunction with Model-Based Clustering gives a competitive method for density estimation in a multivariate setting. Applying BMA in the Model-Based context is fast and can give enhanced modeling performance.
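    A worked sketch of the core averaging step: with mclust's BIC convention (larger is better), posterior model probabilities under equal prior odds are approximately proportional to exp(0.5 * BIC), and density estimates from competing fits can be averaged with those weights. This illustrates the idea only; it is not the authors' code.

      library(mclust)
      X <- faithful

      # Fit competing density models (different covariance structures)
      fits <- lapply(c("EII", "VVI", "VVV"),
                     function(m) densityMclust(X, modelNames = m))

      # BMA weights: w_m proportional to exp(0.5 * (BIC_m - max BIC))
      bics <- sapply(fits, function(f) f$bic)
      w <- exp(0.5 * (bics - max(bics)))
      w <- w / sum(w)

      # Model-averaged density estimate at the observed points
      dens <- sapply(fits, function(f) predict(f, newdata = X))
      bma_density <- as.vector(dens %*% w)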

  • Model-Based Clustering With Dissimilarities: A Bayesian Approach
    Journal of Computational and Graphical Statistics, 2007
    Co-Authors: Adrian E Raftery
    Abstract:

    A Bayesian Model-Based Clustering method is proposed for Clustering objects on the basis of dissimilarities. This combines two basic ideas. The first is that the objects have latent positions in a Euclidean space, and that the observed dissimilarities are measurements of the Euclidean distances with error. The second is that the latent positions are generated from a mixture of multivariate normal distributions, each one corresponding to a cluster. We estimate the resulting model in a Bayesian way using Markov chain Monte Carlo. The method carries out multidimensional scaling and Model-Based Clustering simultaneously, and yields good object configurations and good Clustering results with reasonable measures of Clustering uncertainties. In the examples we study, the Clustering results based on low-dimensional configurations were almost as good as those based on high-dimensional ones. Thus, the method can be used as a tool for dimension reduction when Clustering high-dimensional objects, which may be useful.
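    In symbols, the two-layer model described in the abstract can be written as follows (notation chosen here; the exact error distribution, which must respect d_{ij} > 0, follows the paper):

      d_{ij} = \lVert x_i - x_j \rVert + \varepsilon_{ij}, \qquad
      x_i \sim \sum_{g=1}^{G} \pi_g \, \mathcal{N}(\mu_g, \Sigma_g), \quad i = 1, \dots, n,

    so that MCMC estimation recovers the latent configuration x_1, ..., x_n and the mixture (cluster) parameters simultaneously.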

  • Variable Selection for Model-Based Clustering
    Journal of the American Statistical Association, 2006
    Co-Authors: Adrian E Raftery, Nema Dean
    Abstract:

    We consider the problem of variable or feature selection for Model-Based Clustering. The problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the Clustering model simultaneously. We applied the method to several simulated and real examples and found that removing irrelevant variables often improved performance. Compared with methods based on all of the variables, our variable selection method consistently yielded more accurate estimates of the number of groups and lower classification error rates, as well as more parsimonious Clustering models and easier visualization of results.
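    This greedy search is implemented in the R package clustvarsel (by Scrucca and Raftery); a minimal sketch follows, with output component names recalled from the package documentation and therefore worth double-checking:

      library(clustvarsel)             # implements the Raftery-Dean greedy search
      X <- scale(iris[, 1:4])
      out <- clustvarsel(X, G = 1:5)   # selects variables, G and model via BIC
      out$subset                       # indices of the selected variables
      summary(out$model)               # Mclust model on the chosen subset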

  • Incremental Model-Based Clustering for Large Datasets with Small Clusters
    Journal of Computational and Graphical Statistics, 2005
    Co-Authors: Chris Fraley, Adrian E Raftery, Ron Wehrens
    Abstract:

    Clustering is often useful for analyzing and summarizing information within large datasets. Model-Based Clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best Clustering method in datasets that are small to moderate in size. For large datasets, current Model-Based Clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a Clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally.
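    The first two stages can be sketched with mclust: fit on a random sample, then extend the model to the full dataset. The sketch below extends by classification with predict() only; the paper instead runs additional EM iterations and then adds clusters incrementally, which is omitted here.

      library(mclust)
      set.seed(1)
      X <- as.matrix(faithful)

      # Stage 1: model selection and fitting on a random sample
      idx <- sample(nrow(X), size = 100)
      mod <- Mclust(X[idx, ])

      # Stage 2: extend the sample model to the full dataset
      cls <- predict(mod, newdata = X)$classification
      table(cls)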

  • Model-Based Clustering, discriminant analysis, and density estimation
    Journal of the American Statistical Association, 2002
    Co-Authors: Chris Fraley, Adrian E Raftery
    Abstract:

    Cluster analysis is the automated search for groups of related observations in a dataset. Most Clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most Clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters there are, which Clustering method should be used, and how outliers should be handled. We review a general methodology for Model-Based Clustering that provides a principled statistical approach to these issues. We also show that this approach can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent developments.
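    The methodology reviewed here underlies the mclust R package (mentioned elsewhere in this listing); a compact sketch of the three corresponding entry points:

      library(mclust)

      # Clustering: number of clusters and covariance model selected by BIC
      cl <- Mclust(iris[, 1:4])

      # Discriminant analysis on labelled data
      da <- MclustDA(iris[, 1:4], class = iris$Species)

      # Multivariate density estimation
      de <- densityMclust(iris[, 1:2])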