Text Categorization

The experts below are selected from a list of 16137 experts worldwide, ranked by the ideXlab platform.

Xian Jia-yang - One of the best experts on this subject based on the ideXlab platform.

  • Text Categorization Algorithm Based on Centroid
    Computer Engineering, 2009
    Co-Authors: Xian Jia-yang
    Abstract:

    Centroid-based text categorization performs poorly when the documents are dispersed or their distribution has more than one peak. To address this problem, this paper proposes an improved text categorization algorithm that outperforms the classical centroid-based algorithm. Experimental results on the document set provided by Wisers Information Limited show that the algorithm achieves satisfactory efficiency and precision.
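
For reference, the classical centroid-based classifier that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration, not the paper's improved algorithm: the raw term-frequency weighting, the cosine similarity, the toy documents and every function name are assumptions made here.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_docs):
    """Average the term vectors of each class into one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training set, invented for this sketch.
train = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores late goal", "sports"),
]
centroids = train_centroids(train)
predicted = classify("football team scores goal", centroids)
print(predicted)
```

Because each class is summarized by a single averaged vector, a class whose documents fall into several separate clusters is represented by a point that may lie near none of them, which is the weakness the abstract describes.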

Qin Gang - One of the best experts on this subject based on the ideXlab platform.

  • Active Learning Based Text Categorization
    Computer Science, 2003
    Co-Authors: Qin Gang
    Abstract:

    In text categorization, unlabeled documents generally far outnumber labeled ones. Text categorization is a classification problem in a high-dimensional vector space, and more training samples generally improve the accuracy of a text classifier. How to add unlabeled documents to the training set so as to expand it is therefore a valuable problem. This paper introduces the theory of active learning and applies it to text categorization, exploring how unlabeled documents can be used to improve classifier accuracy. The expectation is that drawing on a relatively large number of unlabeled document samples will raise the classifier's accuracy. We propose an active-learning-based algorithm for text categorization; experiments on the Reuters news corpus show that, when enough training samples are available, the algorithm effectively improves classifier accuracy by adopting unlabeled document samples.
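
The abstract does not say how documents are chosen for labeling, so the sketch below uses pool-based uncertainty sampling, a common active learning strategy, purely as an assumed stand-in. The word-overlap model, the margin score, the `oracle` callback and the toy data are all hypothetical.

```python
from collections import Counter


def train(labeled):
    """Count word occurrences per class (a deliberately simple toy model)."""
    model = {}
    for text, label in labeled:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def scores(model, text):
    """Per-class score: fraction of the document's words seen in that class."""
    words = text.lower().split()
    return {
        label: sum(1 for w in words if w in counts) / len(words)
        for label, counts in model.items()
    }


def margin(model, text):
    """Gap between the two best class scores; a small gap means uncertain."""
    ranked = sorted(scores(model, text).values(), reverse=True)
    return ranked[0] - ranked[1]


def active_learning(labeled, pool, oracle, rounds):
    """Repeatedly label the pool document the current model is least sure of."""
    labeled, pool = list(labeled), list(pool)
    queried = []
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        pick = min(pool, key=lambda doc: margin(model, doc))
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))
        queried.append(pick)
    return train(labeled), queried


seed = [("stock market falls", "finance"), ("team wins match", "sports")]
pool = [
    "market shares rise",
    "football team loses",
    "team shares stock prize",
]
# Hypothetical oracle standing in for a human annotator.
answers = {
    "market shares rise": "finance",
    "football team loses": "sports",
    "team shares stock prize": "sports",
}
model, queried = active_learning(seed, pool, answers.get, rounds=1)
print(queried)
```

With the seed set above, the third pool document matches both classes equally, so it has the smallest margin and is the one sent to the oracle.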

Michael R Lyu - One of the best experts on this subject based on the ideXlab platform.

  • Semi-supervised Text Categorization by active search
    Proceeding of the 17th ACM conference on Information and knowledge management - CIKM '08, 2008
    Co-Authors: Rong Jin, Michael R Lyu, Kaizhu Huang, Irwin King
    Abstract:

    In automated text categorization, given only a small number of labeled documents, it is very challenging, if not impossible, to build a reliable classifier that achieves high classification accuracy. To address this problem, this paper proposes a novel web-assisted text categorization framework. Important keywords are first identified automatically from the available labeled documents to form queries. Search engines are then used to retrieve a multitude of relevant documents from the Web, which are in turn exploited by a semi-supervised framework. To the best of our knowledge, this work is the first study of its kind. An extensive experimental study shows encouraging results: using Google as the web search engine, the proposed framework reduces the classification error by 30% compared with the state-of-the-art supervised text categorization method.
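
The first step of the framework, turning labeled documents into search queries, might look roughly like the sketch below. The abstract does not state how keywords are identified, so the scoring rule (count inside the class minus count in all other classes), the `build_queries` name and the toy documents are assumptions; the search engine call and the semi-supervised learner are left out because they depend on external services and on details not given in the abstract.

```python
from collections import Counter


def build_queries(labeled_docs, top_k=2):
    """Form one search query per class from its most distinctive words.

    Distinctiveness is (count in class) - (count in other classes),
    an assumed scoring rule chosen only for illustration.
    """
    per_class = {}
    for text, label in labeled_docs:
        per_class.setdefault(label, Counter()).update(text.lower().split())
    total = Counter()
    for counts in per_class.values():
        total.update(counts)

    queries = {}
    for label, counts in per_class.items():
        # c - (total - c) simplifies to 2c - total.
        score = {w: 2 * c - total[w] for w, c in counts.items()}
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        queries[label] = " ".join(ranked[:top_k])
    return queries


# Toy labeled documents, invented for this sketch.
docs = [
    ("daily stock market news daily stock market update", "finance"),
    ("daily football match report daily football match result", "sports"),
]
queries = build_queries(docs)
print(queries)
```

The word "daily" occurs equally often in both classes, so its score is zero and it is kept out of both queries.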

  • Large-scale Text Categorization by batch mode active learning
    Proceedings of the 15th international conference on World Wide Web - WWW '06, 2006
    Co-Authors: Steven Chu Hong Hoi, Rong Jin, Michael R Lyu
    Abstract:

    Large-scale text categorization is an important research topic for Web data mining. One of the challenges in large-scale text categorization is how to reduce the human effort of labeling text documents for building reliable classification models. In the past, there have been many studies on applying active learning methods to automatic text categorization, which try to select the most informative documents for manual labeling. Most of these studies focused on selecting a single unlabeled document in each iteration. As a result, the text categorization model has to be retrained after each labeled document is solicited. In this paper, we present a novel active learning algorithm that selects a batch of text documents for manual labeling in each iteration. The key to batch mode active learning is reducing the redundancy among the selected examples so that each example provides unique information for model updating. To this end, we use the Fisher information matrix as the measurement of model uncertainty and choose the set of documents that effectively maximizes the Fisher information of a classification model. Extensive experiments with three different datasets have shown that our algorithm is more effective than state-of-the-art active learning techniques for text categorization and can be a promising tool for large-scale text categorization of World Wide Web documents.
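
The redundancy problem that motivates batch selection can be illustrated with a much simpler heuristic than the one in the paper. The sketch below does not compute the Fisher information matrix; it greedily picks documents that are uncertain and unlike those already in the batch. The Jaccard overlap penalty, the `redundancy_weight` parameter and the hand-written uncertainty scores are all assumptions.

```python
def jaccard(a, b):
    """Word-set overlap between two documents, in [0, 1]."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def select_batch(uncertainty, batch_size, redundancy_weight=1.0):
    """Greedily pick documents that are uncertain and unlike earlier picks.

    `uncertainty` maps each unlabeled document to a score in [0, 1] that a
    classifier would supply; here the scores are given by hand.
    """
    remaining = dict(uncertainty)
    batch = []
    while remaining and len(batch) < batch_size:
        def gain(doc):
            # Penalize a candidate by its overlap with the closest chosen doc.
            overlap = max((jaccard(doc, chosen) for chosen in batch), default=0.0)
            return remaining[doc] - redundancy_weight * overlap

        best = max(remaining, key=gain)
        batch.append(best)
        del remaining[best]
    return batch


uncertainty = {
    "stock market crash today": 0.9,
    "stock market crash now": 0.85,
    "football cup final tonight": 0.6,
}
batch = select_batch(uncertainty, batch_size=2)
print(batch)
```

Ranking by uncertainty alone would select the two near-identical stock market documents; with the overlap penalty, the second slot goes to the football document instead.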

Rong Jin - One of the best experts on this subject based on the ideXlab platform.

  • Semi-supervised Text Categorization by active search
    Proceeding of the 17th ACM conference on Information and knowledge management - CIKM '08, 2008
    Co-Authors: Rong Jin, Michael R Lyu, Kaizhu Huang, Irwin King

  • Large-scale Text Categorization by batch mode active learning
    Proceedings of the 15th international conference on World Wide Web - WWW '06, 2006
    Co-Authors: Steven Chu Hong Hoi, Rong Jin, Michael R Lyu

Enhong Chen - One of the best experts on this subject based on the ideXlab platform.

  • On the strength of hyperclique patterns for Text Categorization
    Information Sciences, 2007
    Co-Authors: Tieyun Qian, Hui Xiong, Yuanzhen Wang, Enhong Chen
    Abstract:

    The use of association patterns for text categorization has attracted great interest, and a variety of useful methods have been developed. However, the key characteristics of pattern-based text categorization remain unclear. Indeed, there are still no concrete answers to the following two questions: First, what kind of association pattern is the best candidate for pattern-based text categorization? Second, what is the most desirable way to use patterns for text categorization? In this paper, we focus on answering these two questions. More specifically, we show that hyperclique patterns are more desirable than frequent patterns for text categorization. Along this line, we develop an algorithm for text categorization using hyperclique patterns. As demonstrated by our experimental results on various real-world text documents, our method provides much better computational performance than state-of-the-art methods while retaining classification accuracy.
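
Hyperclique patterns are usually defined through the h-confidence measure: the support of a term set divided by the largest support of any single term in it, with a set counting as a hyperclique pattern when that ratio reaches a chosen threshold. The sketch below computes the measure on toy data; it is not the paper's categorization algorithm, and the documents and function names are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of documents that contain every term in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def h_confidence(itemset, transactions):
    """Support of the itemset divided by its largest single-term support."""
    largest = max(support({term}, transactions) for term in itemset)
    return support(itemset, transactions) / largest if largest else 0.0


# Each document is reduced to its set of terms (toy data).
docs = [
    {"neural", "network", "training"},
    {"neural", "network", "layer"},
    {"neural", "network", "the"},
    {"the", "market"},
    {"the", "report"},
]
tight = h_confidence({"neural", "network"}, docs)
loose = h_confidence({"neural", "the"}, docs)
print(tight, loose)
```

Here "neural" and "network" always occur together, giving an h-confidence of 1, while "neural" and "the" co-occur in only one of the three documents containing either term, giving about 0.33. A frequent-pattern miner with a low support threshold would keep both pairs; an h-confidence threshold of, say, 0.5 (a value chosen here only as an example) keeps just the first.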

  • Adapting association patterns for Text Categorization: weaknesses and enhancements
    Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006
    Co-Authors: Tieyun Qian, Hui Xiong, Yuanzhen Wang, Enhong Chen
    Abstract:

    The use of association patterns for text categorization has attracted great interest, and a variety of useful methods have been developed. However, the key characteristics of pattern-based text categorization remain unclear. Indeed, there are still no concrete answers to the following two questions: First, what kind of association pattern is the best candidate for pattern-based text categorization? Second, what is the most desirable way to use patterns for text categorization? In this paper, we focus on answering these two questions. Specifically, we show that hyperclique patterns are more desirable than frequent patterns for text categorization. Along this line, we develop an algorithm for text categorization using hyperclique patterns. The experimental results show that our method outperforms state-of-the-art methods in both computational performance and classification accuracy.