Hierarchical Clustering

The Experts below are selected from a list of 97,266 Experts worldwide, ranked by the ideXlab platform.

Sanjoy Dasgupta - One of the best experts on this subject based on the ideXlab platform.

  • ICML - Interactive Bayesian Hierarchical Clustering
    2016
    Co-Authors: Sharad Vikram, Sanjoy Dasgupta
    Abstract:

    Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user's needs. To address this, several methods incorporate constraints obtained from users into Clustering algorithms, but these methods unfortunately do not apply to Hierarchical Clustering. We design an interactive Bayesian algorithm that incorporates user interaction into Hierarchical Clustering while still utilizing the geometry of the data, by sampling from a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.
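
A minimal sketch of the interactive loop the abstract describes, assuming triplet queries of the form "which two of a, b, c belong together?". The paper samples from a constrained posterior over hierarchies; the code below only illustrates collecting user answers as triplet constraints and checking them against an off-the-shelf average-linkage dendrogram, with a simulated oracle standing in for the user. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def subtree_leaf_sets(root):
    """Yield the leaf-index set of every subtree in the dendrogram."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield set(node.pre_order(lambda n: n.id))
        if not node.is_leaf():
            stack.extend([node.get_left(), node.get_right()])

def satisfies(root, triplet):
    """Triplet (a, b | c): a and b should merge before either meets c,
    i.e. some subtree contains a and b but not c."""
    a, b, c = triplet
    return any(a in s and b in s and c not in s
               for s in subtree_leaf_sets(root))

def query_user(oracle, i, j, k):
    """Ask which pair belongs together; record the answer as a triplet."""
    pair = oracle(i, j, k)
    odd = ({i, j, k} - set(pair)).pop()
    return (*pair, odd)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
root = to_tree(linkage(X, method="average"))

def oracle(i, j, k):
    # Simulated user: points 0-14 vs 15-29 are the "true" groups.
    same = lambda a, b: (a < 15) == (b < 15)
    if same(i, j): return (i, j)
    if same(i, k): return (i, k)
    return (j, k)

constraints = [query_user(oracle, *rng.choice(30, 3, replace=False))
               for _ in range(10)]
violated = [t for t in constraints if not satisfies(root, t)]
print(f"{len(violated)} of {len(constraints)} constraints violated")
```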

  • Interactive Bayesian Hierarchical Clustering
    arXiv: Learning, 2016
    Co-Authors: Sharad Vikram, Sanjoy Dasgupta
    Abstract:

    Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user's needs. To address this, several methods incorporate constraints obtained from users into Clustering algorithms, but these methods unfortunately do not apply to Hierarchical Clustering. We design an interactive Bayesian algorithm that incorporates user interaction into Hierarchical Clustering while still utilizing the geometry of the data, by sampling from a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.

  • Performance guarantees for Hierarchical Clustering
    Journal of Computer and System Sciences, 2005
    Co-Authors: Sanjoy Dasgupta, Philip M. Long
    Abstract:

    We show that for any data set in any metric space, it is possible to construct a Hierarchical Clustering with the guarantee that for every k, the induced k-Clustering has cost at most eight times that of the optimal k-Clustering. Here the cost of a Clustering is taken to be the maximum radius of its clusters. Our algorithm is similar in simplicity and efficiency to popular agglomerative heuristics for Hierarchical Clustering, and we show that these heuristics have unbounded approximation factors.
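
A minimal sketch of the standard primitive behind this construction: farthest-first traversal. Taking the first k traversal points as centers yields, for every k at once, a k-Clustering whose maximum radius is at most twice optimal (the classic k-center guarantee). These prefix clusterings are not nested; the paper's contribution is an extra level-and-parent step, omitted here, that turns the same ordering into a genuine hierarchy with the stated factor-of-eight guarantee. Function names are illustrative.

```python
import numpy as np

def farthest_first(X):
    """Order points so each new point is farthest from those chosen;
    radii[k] is the max radius when the first k points serve as centers."""
    n = len(X)
    order = [0]
    dist = np.linalg.norm(X - X[0], axis=1)
    radii = [np.inf]
    for _ in range(n - 1):
        i = int(np.argmax(dist))
        order.append(i)
        radii.append(float(dist[i]))
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, radii

def k_clustering(X, order, k):
    """Induce a k-Clustering: assign each point to its nearest center
    among the first k traversal points."""
    centers = X[np.array(order[:k])]
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
order, radii = farthest_first(X)
labels = k_clustering(X, order, k=5)
print("k=5 max radius:", radii[5])  # within 2x of the optimal 5-center radius
```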

  • COLT - Performance Guarantees for Hierarchical Clustering
    Lecture Notes in Computer Science, 2002
    Co-Authors: Sanjoy Dasgupta
    Abstract:

    We show that for any data set in any metric space, it is possible to construct a Hierarchical Clustering with the guarantee that for every k, the induced k-Clustering has cost at most eight times that of the optimal k-Clustering. Here the cost of a Clustering is taken to be the maximum radius of its clusters. Our algorithm is similar in simplicity and efficiency to common heuristics for Hierarchical Clustering, and we show that these heuristics have unbounded approximation factors.

Ruben H. Zamar - One of the best experts on this subject based on the ideXlab platform.

  • Multi-rank Sparse Hierarchical Clustering
    arXiv: Machine Learning, 2014
    Co-Authors: Hongyang Zhang, Ruben H. Zamar
    Abstract:

    There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow for better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework to cluster the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, we propose Multi-rank sparse Hierarchical Clustering (MrSHC). Using simulation studies and real data examples, we show that MrSHC produces superior feature selection and Clustering performance compared to classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.
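
For context, a minimal sketch of the Witten and Tibshirani (2010) style framework both this paper and the next start from: nonnegative feature weights, constrained to unit L2 norm and an L1 (lasso) budget s, are fit to the per-feature pairwise dissimilarities by soft-thresholding, and the tree is then grown on the weighted dissimilarity. This is a plausible reading of that baseline under stated assumptions, not the MrSHC method itself; function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def soft(x, c):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - c, 0)

def feature_weights(X, s):
    """Weights w >= 0 with ||w||_2 = 1 and ||w||_1 <= s, favoring
    features with large total pairwise dissimilarity."""
    p = X.shape[1]
    d = np.array([np.abs(X[:, j, None] - X[None, :, j]).sum()
                  for j in range(p)])
    lo, hi = 0.0, float(d.max())   # binary search on the threshold
    for _ in range(50):
        mid = (lo + hi) / 2
        w = soft(d, mid)
        w = w / (np.linalg.norm(w) + 1e-12)
        if w.sum() > s:
            lo = mid
        else:
            hi = mid
    return w

def sparse_hclust(X, s):
    """Average-linkage tree on the w-weighted dissimilarity matrix."""
    w = feature_weights(X, s)
    n, p = X.shape
    D = np.zeros((n, n))
    for j in range(p):
        D += w[j] * np.abs(X[:, j, None] - X[None, :, j])
    return w, linkage(squareform(D, checks=False), method="average")

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 200))
X[:20, :5] += 3.0                  # only the first 5 features cluster
w, Z = sparse_hclust(X, s=2.0)
print("selected features:", np.flatnonzero(w > 1e-8))
```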

  • A natural framework for sparse Hierarchical Clustering.
    arXiv: Machine Learning, 2014
    Co-Authors: Hongyang Zhang, Ruben H. Zamar
    Abstract:

    There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow for better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework to cluster the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, another sparse Hierarchical Clustering (SHC) framework is proposed. Using simulation studies and real data examples, we show that the proposed framework produces superior feature selection and Clustering performance compared to classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.

Hongyang Zhang - One of the best experts on this subject based on the ideXlab platform.

  • Multi-rank Sparse Hierarchical Clustering
    arXiv: Machine Learning, 2014
    Co-Authors: Hongyang Zhang, Ruben H. Zamar
    Abstract:

    There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow for better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework to cluster the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, we propose Multi-rank sparse Hierarchical Clustering (MrSHC). Using simulation studies and real data examples, we show that MrSHC produces superior feature selection and Clustering performance compared to classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.

  • A natural framework for sparse Hierarchical Clustering.
    arXiv: Machine Learning, 2014
    Co-Authors: Hongyang Zhang, Ruben H. Zamar
    Abstract:

    There has been a surge in the number of large and flat data sets - data sets containing a large number of features and a relatively small number of observations - due to the growing ability to collect and store information in medical research and other fields. Hierarchical Clustering is a widely used Clustering tool. In Hierarchical Clustering, large and flat data sets may allow for better coverage of Clustering features (features that help explain the true underlying clusters), but such data sets usually include a large fraction of noise features (non-Clustering features) that may hide the underlying clusters. Witten and Tibshirani (2010) proposed a sparse Hierarchical Clustering framework to cluster the observations using an adaptively chosen subset of the features; however, we show that this framework has some limitations when the data sets contain Clustering features with complex structure. In this paper, another sparse Hierarchical Clustering (SHC) framework is proposed. Using simulation studies and real data examples, we show that the proposed framework produces superior feature selection and Clustering performance compared to classical (off-the-shelf) Hierarchical Clustering and the existing sparse Hierarchical Clustering framework.

Jian Chen - One of the best experts on this subject based on the ideXlab platform.

  • Towards understanding Hierarchical Clustering: A data distribution perspective
    Neurocomputing, 2009
    Co-Authors: Hui Xiong, Jian Chen
    Abstract:

    Hierarchical Clustering is a very important category of Clustering methods. Considerable research effort has been focused on algorithm-level improvements of the Hierarchical Clustering process. In this paper, our goal is to provide a systematic understanding of Hierarchical Clustering from a data distribution perspective. Specifically, we investigate how the "true" cluster distribution impacts Clustering performance, and how Hierarchical Clustering schemes relate to validation measures under different data distributions. To this end, we provide an organized study to illustrate these issues. Indeed, one of our key findings reveals that Hierarchical Clustering tends to produce clusters with high variation in cluster sizes regardless of the "true" cluster distribution. Also, our results show that the F-measure, an external Clustering validation measure, is biased toward Hierarchical Clustering algorithms that increase the variation in cluster sizes. In light of this, we propose F_norm, a normalized version of the F-measure, to address the cluster validation problem for Hierarchical Clustering. Experimental results show that F_norm is indeed more suitable than the unnormalized F-measure for evaluating Hierarchical Clustering results across data sets with different data distributions.
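
A minimal sketch of the external F-measure at issue, assuming the common class-to-best-cluster matching combined with class-size weights; the exact normalization used by F_norm is defined in the paper and not reproduced here. The toy comparison makes the bias concrete: of two deliberately imperfect Clusterings of the same labeled data, the one with highly skewed cluster sizes scores higher.

```python
import numpy as np

def f_measure(truth, labels):
    """Match each true class to its best-F cluster; weight by class size."""
    total = 0.0
    for c in np.unique(truth):
        in_c = truth == c
        best = 0.0
        for k in np.unique(labels):
            in_k = labels == k
            tp = np.sum(in_c & in_k)
            if tp == 0:
                continue
            prec, rec = tp / in_k.sum(), tp / in_c.sum()
            best = max(best, 2 * prec * rec / (prec + rec))
        total += in_c.mean() * best    # in_c.mean() is the class weight
    return total

truth = np.array([0] * 50 + [1] * 50)
balanced = np.array(([0] * 25 + [1] * 25) * 2)  # even cluster sizes
skewed = np.array([0] * 95 + [1] * 5)           # high size variation
print(f_measure(truth, balanced))   # 0.50
print(f_measure(truth, skewed))     # ~0.66: size skew is rewarded
```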

Chenjian - One of the best experts on this subject based on the ideXlab platform.