Text Representation

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 3519 Experts worldwide, ranked by the ideXlab platform

Qing Li - One of the best experts on this subject based on the ideXlab platform.

  • ICONIP (5) - Fuzzy Bag-of-Topics Model for Short Text Representation
    Neural Information Processing, 2020
    Co-Authors: Qing Li
    Abstract:

    Text Representation is the keystone of many NLP tasks. For short Text Representation learning, the traditional Bag-of-Words (BoW) model is often criticized for sparseness and for neglecting semantic information. The Fuzzy Bag-of-Words (FBoW) and Fuzzy Bag-of-Words Cluster (FBoWC) models improve on BoW and can learn dense and meaningful document vectors. However, the word clusters in FBoWC are obtained by the K-means algorithm, which is unstable and may produce incoherent word clusters if not initialized properly. In this paper, we propose the Fuzzy Bag-of-Topics (FBoT) model to learn short Text vectors. In FBoT, word communities, which are more coherent than the word clusters in FBoWC, are used as basis terms in the Text vector. Experimental results on short Text classification over two datasets show that FBoT achieves the highest classification accuracies.
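
    As a rough illustration of the fuzzy bag-of-words idea this abstract builds on (not the authors' exact FBoT formulation), each basis term can receive a fuzzy membership weight from every document word via embedding similarity, instead of a hard 0/1 count. The toy embeddings below are invented for the sketch:

```python
import numpy as np

# Toy 2-d word embeddings, invented for this sketch; a real system
# would use pretrained vectors (e.g. word2vec or GloVe).
EMB = {
    "cat":  np.array([1.0, 0.1]),
    "dog":  np.array([0.9, 0.2]),
    "car":  np.array([0.1, 1.0]),
    "road": np.array([0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuzzy_bow(doc_words, basis_terms):
    """Fuzzy counts: every document word contributes its (clipped)
    embedding similarity to each basis term, instead of a hard 0/1 hit."""
    vec = np.zeros(len(basis_terms))
    for w in doc_words:
        for j, t in enumerate(basis_terms):
            vec[j] += max(cosine(EMB[w], EMB[t]), 0.0)
    return vec

v = fuzzy_bow(["cat", "dog"], ["cat", "car"])  # strong on "cat", weak on "car"
```

    FBoT replaces the basis terms with word communities, but the membership mechanism sketched here is the same in spirit.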

  • A multi-level Text Representation model within background knowledge based on human cognitive process for big data analysis
    Cluster Computing, 2016
    Co-Authors: Jun Zhang, Daniel Dajun Zeng, Qing Li
    Abstract:

    Text Representation is among the most fundamental work in Text comprehension, processing, and search. Various approaches have been proposed to mine the semantics in Texts and then represent them. However, most of them focus only on mining semantics from the Text itself, while few take into consideration the background knowledge that is essential to Text understanding. In this paper, on the basis of the human cognitive process, we propose a multi-level Text Representation model within background knowledge, called TRMBK. It is composed of three levels: machine surface code, machine Text base, and machine situational model. All of them can be constructed automatically to acquire semantics both inside and outside the Texts. Simultaneously, we also propose a method to establish background knowledge automatically and offer support for the current Text comprehension. Finally, experiments and comparisons are presented to show the better performance of TRMBK.

  • A multi-level Text Representation model within background knowledge based on human cognitive process
    2013 IEEE 12th International Conference on Cognitive Informatics and Cognitive Computing, 2013
    Co-Authors: Jun Zhang, Qing Li
    Abstract:

    Text Representation is one of the most fundamental tasks in Text comprehension, processing, and search. Various approaches have been proposed to mine the semantics in Texts and then represent them. However, most of them focus only on mining semantics from the Text itself, while the background knowledge that is essential to Text understanding is not taken into consideration. In this paper, on the basis of the human cognitive process, we propose a multi-level Text Representation model within background knowledge, called TRMBK. It is composed of three levels: machine surface code (MSC), machine Text base (MTB), and machine situational model (MSM). All three can be constructed automatically to acquire semantics both inside and outside the Text. Simultaneously, we also propose a method to establish background knowledge automatically and offer support for the current Text comprehension. Finally, experiments and comparisons are presented to show the better performance of TRMBK.

Omar Mohammed Barukub - One of the best experts on this subject based on the ideXlab platform.

  • Graph-based Text Representation and Matching: A Review of the State of the Art and Future Challenges
    IEEE Access
    Co-Authors: Ahmed Hamza Osman, Omar Mohammed Barukub
    Abstract:

    Graph-based Text Representation is one of the important preprocessing steps in data and Text mining, Natural Language Processing (NLP), and information retrieval. Graph-based methods focus on representing Text documents as graphs so as to exploit the best features of their characteristics. This study reviews and lists the advantages and disadvantages of the methods employed or developed for graph-based Text Representation. The literature shows that some of the proposed graph-based methods fail to represent Texts adequately in certain situations. Several techniques are currently in common use for graph-based Text Representation; however, these techniques and tools still have weaknesses and shortcomings that significantly affect the success of graph Representation and graph matching. In this review, we conduct an inclusive survey of the state of the art in graph-based Text Representation and learning. We provide a formal description of the problem of graph-based Text Representation and introduce some basic concepts. More significantly, this study proposes a new taxonomy of graph-based Text Representation, categorizing the existing studies by Representation characteristics and scheme techniques. In terms of the Representation scheme taxonomy, we introduce four main types of conceptual graph schemes and summarize the challenges faced in each scheme. The main issues of graph Representation, such as research topics and the sub-taxonomy of graph models for web documents, are introduced and categorized. This research also covers natural language processing tasks that depend on different types of graph structures. In addition, the graph-matching taxonomy comprises three main categories based on the matching approach: structural-, semantic-, and similarity-based approaches. Moreover, a deep comparison of these approaches is discussed and reported in terms of methods and tools, the concepts of matching and locality, and the application domains that use these tools. Finally, the paper recommends seven promising directions for future study in the graph-based Text Representation field. These recommendations are summarized and highlighted as open problems and challenges in graph-based Text Representation and learning, to help fill the research gaps for scientific researchers in this field.

Rohini K Srihari - One of the best experts on this subject based on the ideXlab platform.

  • Graph-based Text Representation and knowledge discovery
    ACM Symposium on Applied Computing, 2007
    Co-Authors: Rohini K Srihari
    Abstract:

    For information retrieval and Text mining, a robust, scalable framework is required to represent the information extracted from documents and to enable visualization and querying of that information. One widely used model is the vector space model, which is based on the bag-of-words approach. However, it loses important information about the original Text, such as the order of the terms or the boundaries between sentences and paragraphs. In this paper, we propose a graph-based Text Representation that captures (i) term order, (ii) term frequency, (iii) term co-occurrence, and (iv) term context in documents. We also apply the graph model to our Text mining task, which is to discover unapparent associations between two or more concepts (e.g. individuals) from a large Text corpus. A counterterrorism corpus is used to evaluate the performance of various retrieval models, demonstrating the feasibility and effectiveness of graph-based Text Representation in information retrieval and Text mining.
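
    A minimal sketch of the kind of graph this abstract describes (assumed detail: a simple sliding co-occurrence window, which may differ from the paper's actual construction). Nodes carry term frequencies, and directed edges record term order and co-occurrence, both of which a bag-of-words vector discards:

```python
from collections import defaultdict

def build_text_graph(tokens, window=2):
    """Nodes are terms with frequency counts; a directed edge u -> v
    counts how often v follows u within the window, so term order and
    co-occurrence survive (a plain bag-of-words discards both)."""
    freq = defaultdict(int)
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        freq[u] += 1
        for v in tokens[i + 1 : i + 1 + window]:
            edges[(u, v)] += 1
    return freq, edges

freq, edges = build_text_graph("the cat sat on the mat".split())
```

    Because the edges are directed, "cat sat" and "sat cat" produce different graphs, which is exactly the order information the vector space model loses.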

Duoqian Miao - One of the best experts on this subject based on the ideXlab platform.

  • Smoothing Text Representation Models Based on Rough Set
    Quantitative Semantics and Soft Computing Methods for the Web, 2020
    Co-Authors: Duoqian Miao, Ruizhi Wang, Zhifei Zhang
    Abstract:

    Text Representation is a prerequisite of various document processing tasks, such as information retrieval, Text classification, and Text clustering. It has been studied intensively in the past few years, and many excellent models have been designed. However, the performance of these models is affected by data sparseness. Existing smoothing techniques usually rely on statistical theory or linguistic information to assign a uniform distribution to absent words; they neither consider the real word distribution nor distinguish between words. In this chapter, we propose a method based on Tolerance Rough Set theory, a kind of soft computing, which uses the upper and lower approximations of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, our algorithms can estimate a smoothing value for an absent word according to its relation to the words that are present. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that our algorithms greatly improve the performance of Text Representation models, especially on unbalanced corpora.
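
    The tolerance-rough-set idea can be sketched as follows. This is a simplified assumption of the method, not the chapter's exact algorithm: the tolerance relation is given as a plain co-occurrence map, and the smoothing weights `base` and `boost` are illustrative constants. Absent words in the upper approximation of the document (those related to present words) receive a larger smoothing value than unrelated absent words:

```python
def smooth_vector(doc_terms, vocab, cooccur, base=0.01, boost=0.1):
    """Sketch of tolerance-rough-set smoothing: present terms keep their
    observed frequency; absent terms in the upper approximation (related
    to a present term via the tolerance/co-occurrence relation) get a
    larger smoothing weight than completely unrelated absent terms."""
    present = set(doc_terms)
    upper = present | {v for t in present for v in cooccur.get(t, set())}
    vec = {}
    for w in vocab:
        if w in present:
            vec[w] = doc_terms.count(w)  # observed frequency
        elif w in upper:
            vec[w] = boost               # related absent word
        else:
            vec[w] = base                # unrelated absent word
    return vec

vec = smooth_vector(["ball", "ball"], ["ball", "goal", "bank"], {"ball": {"goal"}})
```

    The point of the construction is that "goal" (related to "ball") is smoothed differently from "bank", rather than all absent words sharing one uniform value.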

  • Comparing different Text Representation and feature selection methods on Chinese Text classification using Character n-grams
    2020
    Co-Authors: Jean-hugues Chauchat, Duoqian Miao
    Abstract:

    In this paper, we perform Chinese Text categorization using n-gram Text Representation on TanCorpV1.0, a new corpus built for Chinese Text classification with more than 14,000 Texts divided into 12 classes. We use a combination of inter-class feature reduction methods and cross-class feature selection methods, together with the C-SVC classifier (with a linear kernel), the SVM algorithm for multi-class classification. We perform our experiments on the TANAGRA platform. Our experiments concern: (1) the performance comparison between using 1- and 2-grams and using 1-, 2-, and 3-grams in Chinese Text Representation; (2) the performance comparison between different feature Representations: absolute Text frequency, relative Text frequency, absolute n-gram frequency, and relative n-gram frequency; (3) the comparison of the sparseness of the “Text*feature” matrix between using n-gram frequency and Text frequency in feature selection; (4) the performance comparison between two Text coding methods: the 0/1 logical value and the n-gram frequency numeric value. We found that when using fewer than 3,000 features, the feature selection methods based on n-gram frequency (absolute or relative) always yield better results.
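
    Character n-gram extraction, the representation both of these Chinese-classification papers rely on, can be sketched in a few lines (the indexing scheme below is a generic one, not the papers' exact pipeline):

```python
from collections import Counter

def char_ngrams(text, nmax=2):
    """Count all character n-grams of length 1..nmax. For Chinese Text,
    characters rather than space-separated words are the base units,
    which sidesteps word segmentation entirely."""
    grams = Counter()
    for n in range(1, nmax + 1):
        for i in range(len(text) - n + 1):
            grams[text[i : i + n]] += 1
    return grams

grams = char_ngrams("文本分类")  # a 4-character string: 4 unigrams + 3 bigrams
```

    Feature selection then ranks these grams by frequency (absolute or relative) before building the Text-by-feature matrix.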

  • N-grams based feature selection and Text Representation for Chinese Text Classification
    International Journal of Computational Intelligence Systems, 2009
    Co-Authors: Duoqian Miao, Jean-hugues Chauchat, Rui Zhao, Wen Li
    Abstract:

    In this paper, Text Representation and feature selection strategies for Chinese Text classification based on n-grams are discussed. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four different feature selection methods and three Text Representation weights are compared by exhaustive experiments. Both the C-SVC classifier and the Naive Bayes classifier are adopted to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which includes more than 14,000 Texts divided into 12 classes. Our experiments concern: (1) the performance comparison among different feature selection strategies: absolute Text frequency, relative Text frequency, absolute n-gram frequency, and relative n-gram frequency; (2) the comparison of the sparseness and feature correlation in the “Text by feature” matrices produced by the four feature selection methods; (3) the performance comparison among three term weights: 0/1 logical value, n-gr...

Suhang Wang - One of the best experts on this subject based on the ideXlab platform.

  • Privacy preserving Text Representation learning
    ACM Conference on Hypertext, 2019
    Co-Authors: Ghazaleh Beigi, Suhang Wang
    Abstract:

    Online users generate tremendous amounts of Textual information by participating in different online activities. This data provides opportunities for researchers and business partners to understand individuals. However, user-generated Textual data not only can reveal the identity of the user but also may contain an individual's private attribute information. Publishing the Textual data thus compromises the privacy of users. It is challenging to design effective anonymization techniques for Textual information that minimize the chances of re-identification and do not contain private information, while retaining the Textual semantic meaning. In this paper, we study this problem and propose a novel double privacy preserving Text Representation learning framework, DPText. We show the effectiveness of DPText in preserving privacy and utility.

  • I am not what I write: privacy preserving Text Representation learning
    arXiv: Cryptography and Security, 2019
    Co-Authors: Ghazaleh Beigi, Suhang Wang
    Abstract:

    Online users generate tremendous amounts of Textual information by participating in different activities, such as writing reviews and sharing tweets. This Textual data provides opportunities for researchers and business partners to study and understand individuals. However, user-generated Textual data not only can reveal the identity of the user but also may contain an individual's private information (e.g., age, location, gender). Hence, "you are what you write," as the saying goes. Publishing the Textual data thus compromises the privacy of the individuals who provided it. The need arises for data publishers to protect people's privacy by anonymizing the data before publishing it. It is challenging to design effective anonymization techniques for Textual information that minimize the chances of re-identification and do not contain users' sensitive information (high privacy), while retaining the semantic meaning of the data for given tasks (high utility). In this paper, we study this problem and propose a novel double privacy preserving Text Representation learning framework, DPText, which learns a Textual Representation that (1) is differentially private, (2) does not contain private information, and (3) retains high utility for the given task. Evaluating on two natural language processing tasks, sentiment analysis and part-of-speech tagging, we show the effectiveness of this approach in preserving both privacy and utility.
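
    One standard way to obtain property (1), a differentially private representation, is the Laplace mechanism: add calibrated noise to the representation vector before release. The abstract does not specify DPText's exact mechanism, so treat this as an assumed sketch rather than the authors' method:

```python
import numpy as np

def dp_embedding(vec, epsilon, sensitivity=1.0, rng=None):
    """Laplace-mechanism sketch: add per-coordinate Laplace noise with
    scale sensitivity/epsilon to a Text representation before release.
    Smaller epsilon means more noise, i.e. stronger privacy, less utility."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility
    noise = rng.laplace(0.0, sensitivity / epsilon, size=len(vec))
    return np.asarray(vec, dtype=float) + noise

out = dp_embedding([0.0, 0.0, 0.0, 0.0], epsilon=1.0)
```

    The privacy-utility trade-off the abstract describes is governed by `epsilon`: the downstream task (sentiment analysis, part-of-speech tagging) sees only the noised vector.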