The Experts below are selected from a list of 97,266 Experts worldwide ranked by the ideXlab platform.
Sanjoy Dasgupta - One of the best experts on this subject based on the ideXlab platform.
-
ICML - Interactive Bayesian Hierarchical Clustering
2016. Co-Authors: Sharad Vikram, Sanjoy Dasgupta. Abstract: Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user's needs. To address this, several methods incorporate constraints obtained from users into Clustering algorithms, but these unfortunately do not apply to Hierarchical Clustering. We design an interactive Bayesian algorithm that incorporates user interaction into Hierarchical Clustering while still utilizing the geometry of the data, by sampling from a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.
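The abstract above describes incorporating user constraints into Hierarchical Clustering. A minimal sketch of the constraint idea, assuming simple cannot-link pairs and single linkage (this is not the authors' Bayesian posterior sampler, only an illustration of how user constraints can restrict merges):

```python
# Hypothetical sketch: constraint-aware agglomerative clustering.
# NOT the paper's Bayesian algorithm; it only illustrates merging
# under user-supplied cannot-link constraints.
import numpy as np

def constrained_agglomerative(X, cannot_link):
    """Single-linkage agglomeration that skips any merge joining a
    cannot-link pair (i, j) of original points."""
    clusters = [{i} for i in range(len(X))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Respect user constraints: skip forbidden merges.
                if any((i in clusters[a] and j in clusters[b]) or
                       (j in clusters[a] and i in clusters[b])
                       for i, j in cannot_link):
                    continue
                # Single-linkage distance between the two clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:          # every remaining merge is forbidden
            break
        _, a, b = best
        clusters[b] |= clusters[a]
        del clusters[a]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
clusters = constrained_agglomerative(X, cannot_link=[(0, 2)])
# points 0-1 and 2-3 merge; the final merge would join 0 and 2, so it
# is blocked and two clusters remain
```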
-
Interactive Bayesian Hierarchical Clustering
arXiv: Learning, 2016. Co-Authors: Sharad Vikram, Sanjoy Dasgupta. Abstract: Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user's needs. To address this, several methods incorporate constraints obtained from users into Clustering algorithms, but these unfortunately do not apply to Hierarchical Clustering. We design an interactive Bayesian algorithm that incorporates user interaction into Hierarchical Clustering while still utilizing the geometry of the data, by sampling from a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.
-
Performance guarantees for Hierarchical Clustering
Journal of Computer and System Sciences, 2005. Co-Authors: Sanjoy Dasgupta, Philip M. Long. Abstract: We show that for any data set in any metric space, it is possible to construct a Hierarchical Clustering with the guarantee that, for every k, the induced k-Clustering has cost at most eight times that of the optimal k-Clustering. Here the cost of a Clustering is taken to be the maximum radius of its clusters. Our algorithm is similar in simplicity and efficiency to popular agglomerative heuristics for Hierarchical Clustering, and we show that these heuristics have unbounded approximation factors.
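A building block behind guarantees of this kind is the farthest-first traversal, which for every k yields centers whose k-Clustering radius is within a factor 2 of optimal (Gonzalez); the paper's level construction turns such a traversal into a single hierarchy with its factor-8 guarantee. A minimal sketch of the traversal alone, not the paper's full algorithm:

```python
# Sketch of farthest-first traversal. For every k, the first k points
# of the order serve as centers whose max-radius clustering is within
# a factor 2 of the optimal k-clustering; it is only the starting
# point for the paper's hierarchical construction, not the whole of it.
import numpy as np

def farthest_first(X):
    """Return the traversal order and each point's distance to the
    nearest earlier point in the order."""
    n = len(X)
    order = [0]                        # arbitrary first center
    radii = [np.inf]                   # first point has no predecessor
    dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(n - 1):
        i = int(np.argmax(dist))       # farthest remaining point
        radii.append(float(dist[i]))
        order.append(i)
        # Update nearest-center distances with the new center.
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, radii

order, radii = farthest_first(np.array([[0.0], [10.0], [1.0]]))
```

The decreasing `radii` sequence is what the paper's construction groups into geometric levels to obtain a nested family of clusterings.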
-
COLT - Performance Guarantees for Hierarchical Clustering
Lecture Notes in Computer Science, 2002. Co-Authors: Sanjoy Dasgupta. Abstract: We show that for any data set in any metric space, it is possible to construct a Hierarchical Clustering with the guarantee that, for every k, the induced k-Clustering has cost at most eight times that of the optimal k-Clustering. Here the cost of a Clustering is taken to be the maximum radius of its clusters. Our algorithm is similar in simplicity and efficiency to common heuristics for Hierarchical Clustering, and we show that these heuristics have poorer approximation factors.
Ruben H. Zamar - One of the best experts on this subject based on the ideXlab platform.
-
Multi-rank Sparse Hierarchical Clustering
arXiv: Machine Learning, 2014. Co-Authors: Hongyang Zhang, Ruben H. Zamar. Abstract: There has been a surge in the number of large and flat data sets (data sets containing a large number of features and a relatively small number of observations) due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework that clusters the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, we propose Multi-rank Sparse Hierarchical Clustering (MrSHC). Using simulation studies and real data examples, we show that MrSHC produces superior feature selection and Clustering performance compared with classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.
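The Witten and Tibshirani (2010) framework referenced above selects features by choosing nonnegative weights w that maximize the weighted total dissimilarity subject to ||w||2 <= 1 and ||w||1 <= s; the L1 budget s drives noise-feature weights to zero. A minimal sketch of that weight step, assuming per-feature dissimilarity scores d are already computed (the function name and the soft-threshold binary search are illustrative, not the paper's code):

```python
# Illustrative sketch of the feature-weighting step behind sparse
# hierarchical clustering: maximize w . d subject to ||w||_2 <= 1,
# ||w||_1 <= s, w >= 0. The solution soft-thresholds d and then
# L2-normalizes; `s` is a sparsity budget with 1 <= s <= sqrt(len(d)).
import numpy as np

def sparse_feature_weights(d, s):
    """d: nonnegative per-feature total dissimilarity scores."""
    w = d / np.linalg.norm(d)
    if w.sum() <= s:              # L1 constraint inactive: no shrinkage
        return w
    lo, hi = 0.0, float(d.max())  # binary-search the soft threshold
    for _ in range(100):
        mid = (lo + hi) / 2
        v = np.maximum(d - mid, 0.0)
        if v.sum() / np.linalg.norm(v) > s:
            lo = mid              # still too dense: raise threshold
        else:
            hi = mid
    v = np.maximum(d - (lo + hi) / 2, 0.0)
    return v / np.linalg.norm(v)
```

The resulting weights reweight the pairwise dissimilarity matrix, and ordinary Hierarchical Clustering is then run on the weighted dissimilarities; features with weight zero drop out entirely.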
-
A natural framework for sparse Hierarchical Clustering.
arXiv: Machine Learning, 2014. Co-Authors: Hongyang Zhang, Ruben H. Zamar. Abstract: There has been a surge in the number of large and flat data sets (data sets containing a large number of features and a relatively small number of observations) due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework that clusters the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, another sparse Hierarchical Clustering (SHC) framework is proposed. Using simulation studies and real data examples, we show that the proposed framework produces superior feature selection and Clustering performance compared with classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.
Hongyang Zhang - One of the best experts on this subject based on the ideXlab platform.
-
Multi-rank Sparse Hierarchical Clustering
arXiv: Machine Learning, 2014. Co-Authors: Hongyang Zhang, Ruben H. Zamar. Abstract: There has been a surge in the number of large and flat data sets (data sets containing a large number of features and a relatively small number of observations) due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework that clusters the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, we propose Multi-rank Sparse Hierarchical Clustering (MrSHC). Using simulation studies and real data examples, we show that MrSHC produces superior feature selection and Clustering performance compared with classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.
-
A natural framework for sparse Hierarchical Clustering.
arXiv: Machine Learning, 2014. Co-Authors: Hongyang Zhang, Ruben H. Zamar. Abstract: There has been a surge in the number of large and flat data sets (data sets containing a large number of features and a relatively small number of observations) due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework that clusters the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, another sparse Hierarchical Clustering (SHC) framework is proposed. Using simulation studies and real data examples, we show that the proposed framework produces superior feature selection and Clustering performance compared with classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.
Jian Chen - One of the best experts on this subject based on the ideXlab platform.
-
Towards understanding Hierarchical Clustering: A data distribution perspective
Neurocomputing, 2009. Co-Authors: Hui Xiong, Jian Chen. Abstract: Hierarchical Clustering is a very important category of Clustering methods. Considerable research effort has focused on algorithm-level improvements of the Hierarchical Clustering process. In this paper, our goal is to provide a systematic understanding of Hierarchical Clustering from a data distribution perspective. Specifically, we investigate how the "true" cluster distribution impacts Clustering performance, and what the relationship is between Hierarchical Clustering schemes and validation measures with respect to different data distributions. To this end, we provide an organized study to illustrate these issues. Indeed, one of our key findings reveals that Hierarchical Clustering tends to produce clusters with high variation in cluster sizes regardless of the "true" cluster distribution. Our results also show that the F-measure, an external Clustering validation measure, is biased towards Hierarchical Clustering algorithms that tend to increase the variation in cluster sizes. In light of this, we propose Fnorm, a normalized version of the F-measure, to address the cluster validation problem for Hierarchical Clustering. Experimental results show that Fnorm is indeed more suitable than the unnormalized F-measure for evaluating Hierarchical Clustering results across data sets with different data distributions.
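For context, a minimal sketch of the standard set-matching F-measure the abstract critiques: each true class is matched to its best cluster, and per-class F-scores are averaged with class-size weights. This sketch shows only the unnormalized measure; the paper's Fnorm correction is not reproduced here.

```python
# Sketch of the standard (unnormalized) external F-measure for
# cluster validation: match each true class to its best-scoring
# cluster and average the per-class F-scores weighted by class size.
import numpy as np

def clustering_f_measure(true_labels, pred_labels):
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    score = 0.0
    for c in np.unique(true_labels):
        cls = true_labels == c
        best = 0.0
        for k in np.unique(pred_labels):
            clu = pred_labels == k
            inter = int(np.sum(cls & clu))
            if inter == 0:
                continue
            prec = inter / clu.sum()      # precision of cluster k for class c
            rec = inter / cls.sum()       # recall of cluster k for class c
            best = max(best, 2 * prec * rec / (prec + rec))
        score += cls.sum() / n * best     # class-size weighting
    return score

perfect = clustering_f_measure([0, 0, 1, 1], [0, 0, 1, 1])  # -> 1.0
```

The class-size weighting and best-match step are what allow clusterings with highly uneven cluster sizes to score well, which is the bias the paper's normalization targets.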
Chenjian - One of the best experts on this subject based on the ideXlab platform.
-
Towards understanding Hierarchical Clustering
Neurocomputing, 2009. Co-Authors: Wujunjie, Xionghui, Chenjian. Abstract: Hierarchical Clustering is a very important category of Clustering methods. Considerable research effort has focused on algorithm-level improvements of the Hierarchical clust...