Latent Dirichlet Allocation

The experts below are selected from a list of 10,998 experts worldwide, ranked by the ideXlab platform.

Max Welling - One of the best experts on this subject based on the ideXlab platform.

  • Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation
    2013
    Co-Authors: James R Foulds, Padhraic Smyth, Levi Boyles, Christopher Dubois, Max Welling
    Abstract:

    In the Internet era there has been an explosion in the amount of digital text available, leading to difficulties of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state-of-the-art method. We show connections between collapsed variational Bayesian inference and MAP estimation for LDA, and leverage these connections to prove convergence properties of the proposed algorithm. In experiments on large-scale text corpora, the algorithm was found to converge faster, and often to a better solution, than the previous method. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.
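
    As a rough illustration of the scheme described above, the sketch below (Python, with assumed variable names; not the authors' reference implementation) shows the flavor of a stochastic collapsed update: a CVB0-style variational distribution is computed for one token from the expected count statistics, and the statistics are then blended toward the one-token minibatch estimate with decaying step sizes.

        import numpy as np

        def scvb_token_update(w, d, N_phi, N_theta, N_Z, alpha, eta,
                              rho_phi, rho_theta, C_d, C):
            """One stochastic collapsed update for a token of type w in
            document d. N_phi (W x K), N_theta (D x K), N_Z (K,) hold
            expected counts; C_d and C are the token counts of document d
            and of the whole corpus."""
            W = N_phi.shape[0]
            # CVB0-style variational distribution over the K topics.
            gamma = (N_phi[w] + eta) * (N_theta[d] + alpha) / (N_Z + W * eta)
            gamma /= gamma.sum()
            # Stochastic step toward the statistics this token alone implies.
            N_theta[d] = (1 - rho_theta) * N_theta[d] + rho_theta * C_d * gamma
            N_phi[w] = (1 - rho_phi) * N_phi[w] + rho_phi * C * gamma
            N_Z[:] = (1 - rho_phi) * N_Z + rho_phi * C * gamma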

  • Fast collapsed Gibbs sampling for latent Dirichlet allocation
    2008
    Co-Authors: Alexander Ihler, Ian Porteous, Arthur Asuncion, Padhraic Smyth, Max Welling, David Newman
    Abstract:

    In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires, on average, significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, with a required computation time of 6 CPU months for LDA, our speedup of 5.7 can save 5 CPU months of computation.
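
    For context, here is a minimal sketch of the standard collapsed Gibbs update whose O(K) inner computation FastLDA shortens (assumed notation, not the paper's FastLDA code; FastLDA draws from the same conditional but visits topics in a data-driven order and stops early using bounds on the remaining probability mass):

        import numpy as np

        def gibbs_resample_token(w, d, k_old, n_wk, n_dk, n_k, alpha, beta):
            """Standard collapsed Gibbs step for one token: O(K) per sample."""
            W = n_wk.shape[0]
            # Remove the token's current assignment from the count matrices.
            n_wk[w, k_old] -= 1; n_dk[d, k_old] -= 1; n_k[k_old] -= 1
            # Full conditional over all K topics: the O(K) work per token.
            p = (n_wk[w] + beta) * (n_dk[d] + alpha) / (n_k + W * beta)
            k_new = np.random.choice(p.size, p=p / p.sum())
            # Record the new assignment.
            n_wk[w, k_new] += 1; n_dk[d, k_new] += 1; n_k[k_new] += 1
            return k_new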

  • Distributed inference for latent Dirichlet allocation
    2007
    Co-Authors: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
    Abstract:

    We investigate the problem of learning a widely used latent-variable model – the latent Dirichlet allocation (LDA) or "topic" model – using distributed computation, where each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors and speedup experiments on learning topics in a 100-million-word corpus using 16 processors.
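
    The periodic update of the first (approximate) scheme can be pictured as a simple count reconciliation; this is a sketch with assumed names, not the authors' code:

        import numpy as np

        def adlda_global_update(global_nwk, local_nwk_list):
            """Merge step of approximate distributed LDA: each of the P
            processors Gibbs-samples its shard starting from a shared copy
            of the word-topic counts, then the global counts absorb every
            processor's local increment."""
            merged = global_nwk.copy()
            for local_nwk in local_nwk_list:
                merged += local_nwk - global_nwk
            return merged  # broadcast back to all processors for the next sweep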

  • A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation
    2006
    Co-Authors: Yee Whye Teh, David Newman, Max Welling
    Abstract:

    Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large-scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement, and significantly more accurate than standard variational Bayesian inference for LDA.
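
    A widely used zeroth-order simplification of this update (often called CVB0; the paper's algorithm applies a second-order correction to the collapsed variational bound) can be sketched as follows, where gamma[d][i] is the variational topic distribution for token i of document d:

        import numpy as np

        def cvb0_sweep(docs, gamma, N_phi, N_theta, N_Z, alpha, eta):
            """One pass of zeroth-order collapsed variational updates."""
            W = N_phi.shape[0]
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    g = gamma[d][i]
                    # Exclude this token's own contribution from the statistics.
                    N_phi[w] -= g; N_theta[d] -= g; N_Z -= g
                    # Recompute its variational distribution from the rest.
                    g = (N_phi[w] + eta) * (N_theta[d] + alpha) / (N_Z + W * eta)
                    g /= g.sum()
                    gamma[d][i] = g
                    # Restore the statistics with the updated distribution.
                    N_phi[w] += g; N_theta[d] += g; N_Z += g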

Naoki Hayashi - One of the best experts on this subject based on the ideXlab platform.

  • The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation
    2021
    Co-Authors: Naoki Hayashi
    Abstract:

    Latent Dirichlet allocation (LDA) obtains essential information from data by using Bayesian inference. It is applied to knowledge discovery via dimension reduction and clustering in many fields. However, its generalization error had not yet been clarified, since LDA is a singular statistical model, where there is no one-to-one mapping from parameters to probability distributions. In this paper, we give the exact asymptotic form of its generalization error and marginal likelihood, by theoretical analysis of its learning coefficient using algebraic geometry. The theoretical result shows that the Bayesian generalization error in LDA is expressed in terms of that in matrix factorization and a penalty from the simplex restriction of LDA's parameter region. A numerical experiment is consistent with the theoretical result.

  • The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation
    2020
    Co-Authors: Naoki Hayashi
    Abstract:

    Latent Dirichlet allocation (LDA) obtains essential information from data by using Bayesian inference. It is applied to knowledge discovery via dimension reduction and clustering in many fields. However, its generalization error had not yet been clarified, since LDA is a singular statistical model, where there is no one-to-one mapping from parameters to probability distributions. In this paper, we give the exact asymptotic form of its generalization error and marginal likelihood, by theoretical analysis of its learning coefficient using algebraic geometry. The theoretical result shows that the Bayesian generalization error in LDA is expressed in terms of that in matrix factorization and a penalty from the simplex restriction of LDA's parameter region.

  • Asymptotic Bayesian generalization error in latent Dirichlet allocation and stochastic matrix factorization
    2020
    Co-Authors: Naoki Hayashi, Sumio Watanabe
    Abstract:

    Latent Dirichlet allocation (LDA) is useful in document analysis, image processing, and many information systems; however, its generalization performance has been left unknown because it is a singular learning machine to which regular statistical theory cannot be applied. Stochastic matrix factorization (SMF) is a restricted matrix factorization in which the matrix factors are stochastic; each column of the matrix lies in a simplex. SMF is applied to image recognition and text mining. We can understand SMF as a statistical model by which a stochastic matrix of given data is represented by a product of two stochastic matrices, and whose generalization performance has also been left unknown because of non-regularity. In this paper, by using an algebraic and geometric method, we show the analytic equivalence of LDA and SMF: both have the same real log canonical threshold (RLCT), and consequently they asymptotically have the same Bayesian generalization error and the same log marginal likelihood. Moreover, we derive an upper bound on the RLCT and prove that it is smaller than the dimension of the parameter space divided by two; hence the Bayesian generalization errors of both models are smaller than those of regular statistical models.
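
    For reference, these abstracts instantiate the standard asymptotic forms of singular learning theory, stated here in LaTeX with λ the real log canonical threshold (the learning coefficient), m its multiplicity, n the sample size, S_n the empirical entropy, and d the parameter dimension; the exact value of λ for LDA is the papers' contribution and is not reproduced here:

        \mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
        \qquad
        F_n = n S_n + \lambda \log n - (m - 1)\log\log n + O_p(1),
        \qquad
        \lambda < \frac{d}{2}.

    The final inequality is the bound proved above; it implies that the Bayesian generalization error of LDA (and SMF) falls below the d/(2n) rate of a regular statistical model.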

Mihai Datcu - One of the best experts on this subject based on the ideXlab platform.

  • Multisensor Earth observation image classification based on a multimodal latent Dirichlet allocation model
    2018
    Co-Authors: Reza Bahmanyar, Daniela Espinozamolina, Mihai Datcu
    Abstract:

    Much previous research has already shown the advantages of multisensor land-cover classification. Here, we propose an innovative land-cover classification approach based on learning a joint latent model of synthetic aperture radar (SAR) and multispectral satellite images using multimodal latent Dirichlet allocation (mmLDA), a probabilistic generative model that has already been applied successfully to various other problems dealing with multimodal data. For our experiments, we chose overlapping SAR and multispectral images of two regions of interest. The images were tiled into patches and their local primitive features were extracted. Each image patch is then represented by SAR and multispectral bag-of-words (BoW) models. Both BoW representations are fed to the mmLDA, resulting in a joint latent data model. A qualitative and quantitative validation of the topics against ground-truth data demonstrates that the land-cover categories of the regions are correctly classified, outperforming the topics obtained using single-modality data alone.
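
    As a crude illustration of feeding two bag-of-words models to one topic model (a stand-in only: mmLDA proper keeps separate per-modality word distributions tied together by shared topic proportions, rather than concatenating vocabularies), a patch's two codebooks can be viewed as one combined count vector:

        import numpy as np

        def joint_patch_bow(sar_words, ms_words, V_sar, V_ms):
            """One count vector over the union of the SAR and multispectral
            codebooks, so both modalities of a patch face a single model."""
            x = np.zeros(V_sar + V_ms, dtype=int)
            for w in sar_words:
                x[w] += 1
            for w in ms_words:
                x[V_sar + w] += 1  # offset into the multispectral half
            return x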

  • Latent Dirichlet allocation for spatial analysis of satellite images
    2013
    Co-Authors: Corina Văduva, Inge Gavăt, Mihai Datcu
    Abstract:

    This paper describes research that seeks to supersede human inductive learning and reasoning in high-level scene understanding and content extraction. Searching for relevant knowledge with a semantic meaning consists mostly of visual human inspection of the data, regardless of the application. The method presented in this paper is an innovation in the field of information retrieval. It aims to discover latent semantic classes containing pairs of objects characterized by a certain spatial positioning. A hierarchical structure is recommended for the image content. The approach is based on a method initially developed for topic discovery in text, applied this time to invariant descriptors of image regions or object configurations. First, invariant spatial signatures are computed for pairs of objects, based on a measure of their interaction, as attributes for describing spatial arrangements inside the scene. Spatial visual words are then defined through a simple classification, extracting new patterns of similar object configurations. Further, the scene is modeled according to these new patterns (spatial visual words) using the latent Dirichlet allocation model, as a finite mixture over an underlying set of topics. Finally, statistics are computed to achieve a better understanding of the spatial distributions inside the discovered semantic classes.
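
    The spatial-visual-word step can be sketched as a plain quantization; the paper says only "a simple classification", so the k-means choice below is an assumption for illustration:

        import numpy as np
        from sklearn.cluster import KMeans

        def spatial_visual_words(pair_signatures, n_words=50):
            """pair_signatures: one invariant spatial-signature vector per
            object pair. Clustering yields a codebook; each pair's cluster
            id becomes its discrete word for the downstream LDA stage."""
            km = KMeans(n_clusters=n_words, n_init=10).fit(pair_signatures)
            return km.labels_, km.cluster_centers_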

  • Semantic annotation of satellite images using latent Dirichlet allocation
    2010
    Co-Authors: Marie Liénou, Henri Maitre, Mihai Datcu
    Abstract:

    In this letter, we are interested in the annotation of large satellite images, using semantic concepts defined by the user. This annotation task combines a step of supervised classification of patches of the large image and the integration of the spatial information between these patches. Given a training set of images for each concept, learning is based on the latent Dirichlet allocation (LDA) model. This hierarchical model represents each item of a collection as a random mixture of latent topics, where each topic is characterized by a distribution over words. The LDA-based image representation is obtained using simple features extracted from image words. We then exploit the capability of the LDA model to assign probabilities to unseen images, in order to classify the patches of the large image into the semantic concepts, using the maximum-likelihood method. We conduct experiments on panchromatic QuickBird images with 60-cm resolution. Taking into account the spatial information between the patches is shown to improve the annotation performance.
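
    The labeling step (one LDA model per concept, each patch assigned to the concept whose model scores it highest) might look like the following sketch; scikit-learn is an assumed stand-in here, its score method returning an approximate log-likelihood from the variational bound:

        from sklearn.decomposition import LatentDirichletAllocation

        def classify_patch(patch_bow, concept_models):
            """concept_models: dict mapping concept name to a fitted
            LatentDirichletAllocation; patch_bow: a 1 x V count matrix."""
            return max(concept_models,
                       key=lambda name: concept_models[name].score(patch_bow))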

David Newman - One of the best experts on this subject based on the ideXlab platform.

  • Online learning for latent Dirichlet allocation
    2014
    Co-Authors: David M. Blei, Jey Han Lau, Karl Grieser, Michael I. Jordan, Andrew Y. Ng, David Newman, Timothy Baldwin
    Abstract:

    We develop an online variational Bayes (VB) algorithm for latent Dirichlet allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time.
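
    The natural-gradient step at the heart of online LDA is a Robbins-Monro blend of the current topic parameters with the estimate the latest minibatch alone would give; a minimal sketch with assumed names:

        def online_vb_step(lam, lam_hat, t, tau0=1.0, kappa=0.7):
            """lam: current variational topic parameters (lambda); lam_hat:
            the value of lambda suggested by the t-th minibatch on its own.
            A rate with kappa in (0.5, 1] satisfies the Robbins-Monro
            conditions, giving convergence to a local optimum of the VB
            objective."""
            rho = (tau0 + t) ** (-kappa)
            return (1.0 - rho) * lam + rho * lam_hat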

  • Understanding errors in approximate distributed latent Dirichlet allocation
    2012
    Co-Authors: Alexander Ihler, David Newman
    Abstract:

    Latent Dirichlet allocation (LDA) is a popular algorithm for discovering semantic structure in large collections of text or other data. Although its complexity is linear in the data size, its use on increasingly massive collections has created considerable interest in parallel implementations. "Approximate distributed" LDA, or AD-LDA, approximates the popular collapsed Gibbs sampling algorithm for LDA models while running on a distributed architecture. Although this algorithm often appears to perform well in practice, its quality is not well understood theoretically or easily assessed on new data. In this work, we theoretically justify the approximation, and modify AD-LDA to track an error bound on performance. Specifically, we upper bound the probability of making a sampling error at each step of the algorithm (compared to an exact, sequential Gibbs sampler), given the samples drawn thus far. We show empirically that our bound is sufficiently tight to give a meaningful and intuitive measure of approximation error in AD-LDA, allowing the user to track the tradeoff between accuracy and efficiency while executing in parallel.
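
    The kind of per-step discrepancy being bounded can be illustrated (as a proxy only; the paper derives an analytical bound rather than computing the exact conditional, which would defeat the purpose of distribution) by the total variation distance between the conditional AD-LDA actually samples from and the exact sequential one:

        import numpy as np

        def per_step_error(p_stale, p_exact):
            """Total variation distance between the topic conditional built
            from stale distributed counts and the exact sequential one."""
            p_stale = p_stale / p_stale.sum()
            p_exact = p_exact / p_exact.sum()
            return 0.5 * np.abs(p_stale - p_exact).sum()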

  • Fast collapsed Gibbs sampling for latent Dirichlet allocation
    2008
    Co-Authors: Alexander Ihler, Ian Porteous, Arthur Asuncion, Padhraic Smyth, Max Welling, David Newman
    Abstract:

    In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires, on average, significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents, with a required computation time of 6 CPU months for LDA, our speedup of 5.7 can save 5 CPU months of computation.

  • Distributed inference for latent Dirichlet allocation
    2007
    Co-Authors: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
    Abstract:

    We investigate the problem of learning a widely used latent-variable model – the latent Dirichlet allocation (LDA) or "topic" model – using distributed computation, where each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors and speedup experiments on learning topics in a 100-million-word corpus using 16 processors.

  • A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation
    2006
    Co-Authors: Yee Whye Teh, David Newman, Max Welling
    Abstract:

    Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large-scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement, and significantly more accurate than standard variational Bayesian inference for LDA.

Yee Whye Teh - One of the best experts on this subject based on the ideXlab platform.

  • A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation
    2006
    Co-Authors: Yee Whye Teh, David Newman, Max Welling
    Abstract:

    Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large-scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement, and significantly more accurate than standard variational Bayesian inference for LDA.