Project Gutenberg

The experts below are selected from a list of 822 experts worldwide ranked by the ideXlab platform.

Sarana Nutanong - One of the best experts on this subject based on the ideXlab platform.

  • An Effective and Scalable Framework for Authorship Attribution Query Processing
    IEEE Access, 2018
    Co-Authors: Raheem Sarwar, Chenyun Yu, Dickson Chow, Ninad Tungare, Kanatip Chitavisutthivong, Sukrit Sriratanawilai, Yaohai Xu, Thanawin Rakthanmanon, Sarana Nutanong
    Abstract:

    Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have only a small number of text samples, e.g., 5-10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios where the number of candidate authors is limited to 50. These solutions generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method achieves higher than 95% accuracy.

  • A Scalable Framework for Stylometric Analysis Query Processing
    2016 IEEE 16th International Conference on Data Mining (ICDM), 2016
    Co-Authors: Sarana Nutanong, Chenyun Yu, Raheem Sarwar, Peter Xu, Dickson Chow
    Abstract:

    Stylometry is the statistical analysis of variations in the author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution by modeling authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.
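
    The abstract above frames authorship attribution as a set similarity problem but does not spell out the representation or the query technique. The sketch below is only a generic illustration of that framing, not the authors' method: it assumes each text is reduced to a set of character trigrams, compares sets with Jaccard similarity, and assigns the query to the author of the single most similar known text. The function names (char_ngrams, jaccard, attribute) and the choice of trigrams are assumptions made for this example.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercase character n-grams in `text`.

    Whitespace runs are collapsed to one space first. Character n-grams
    are an assumed feature choice, not taken from the paper.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute(query: str, corpus: dict[str, list[str]], n: int = 3) -> str:
    """Return the candidate author whose closest known text best matches `query`.

    `corpus` maps an author name to that author's known texts. Ties are
    resolved in favour of the author seen first.
    """
    query_set = char_ngrams(query, n)
    best_author, best_score = "", -1.0
    for author, texts in corpus.items():
        # Score an author by the most similar of their texts (nearest neighbour).
        score = max(
            (jaccard(query_set, char_ngrams(text, n)) for text in texts),
            default=0.0,
        )
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```

    With a toy corpus such as {"A": ["the cat sat on the mat"], "B": ["quantum flux capacitors hum loudly"]}, calling attribute("the cat sat on the hat", corpus) picks "A", because the query shares most of its trigrams with that text. A nearest-neighbour rule like this one compares the query against every stored text, so it is a baseline for the idea rather than a scalable query processor.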

Graeme Hirst - One of the best experts on this subject based on the ideXlab platform.

  • Using models of lexical style to quantify free indirect discourse in modernist fiction
    Digital Scholarship in the Humanities, 2016
    Co-Authors: Julian Brooke, Adam Hammond, Graeme Hirst
    Abstract:

    Modernist authors such as Virginia Woolf and James Joyce greatly expanded the use of ‘free indirect discourse’, a form of third-person narration that is strongly influenced by the language of a viewpoint character. Unlike traditional approaches to analyzing characterization using common words, such as those based on Burrows (1987), the nature of free indirect discourse and the sparseness of our data require that we understand the stylistic connotations of rarer words and expressions which cannot be gleaned directly from our target texts. To this end, we apply methods introduced in our recent work to derive information with regard to six stylistic aspects from a large corpus of texts from Project Gutenberg. We thus build high-coverage, finely grained lexicons that include common multiword collocations. Using this information along with student annotations of two modernist texts, Woolf’s To the Lighthouse and Joyce’s The Dead, we confirm that free indirect discourse does, at a stylistic level, reflect a mixture of narration and direct speech, and we investigate the extent to which social attributes of the various characters (in particular age, class, and gender) are reflected in their lexical stylistic profile.

  • GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus
    Proceedings of the Fourth Workshop on Computational Linguistics for Literature (CLfL@NAACL-HLT), North American Chapter of the Association for Computational Linguistics, 2015
    Co-Authors: Julian Brooke, Adam Hammond, Graeme Hirst
    Abstract:

    This paper introduces a software tool, GutenTag, which is aimed at giving literary researchers direct access to NLP techniques for the analysis of texts in the Project Gutenberg corpus. We discuss several facets of the tool, including the handling of formatting and structure, the use and expansion of metadata which is used to identify relevant subcorpora of interest, and a general tagging framework which is intended to cover a wide variety of future NLP modules. Our hope is that the shared ground created by this tool will help create new kinds of interaction between the computational linguistics and digital humanities communities, to the benefit of both.

Dickson Chow - One of the best experts on this subject based on the ideXlab platform.

  • An Effective and Scalable Framework for Authorship Attribution Query Processing
    IEEE Access, 2018
    Co-Authors: Raheem Sarwar, Chenyun Yu, Dickson Chow, Ninad Tungare, Kanatip Chitavisutthivong, Sukrit Sriratanawilai, Yaohai Xu, Thanawin Rakthanmanon, Sarana Nutanong
    Abstract:

    Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have only a small number of text samples, e.g., 5-10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios where the number of candidate authors is limited to 50. These solutions generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method achieves higher than 95% accuracy.

  • A Scalable Framework for Stylometric Analysis Query Processing
    2016 IEEE 16th International Conference on Data Mining (ICDM), 2016
    Co-Authors: Sarana Nutanong, Chenyun Yu, Raheem Sarwar, Peter Xu, Dickson Chow
    Abstract:

    Stylometry is the statistical analysis of variations in the author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution by modeling authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.

Raheem Sarwar - One of the best experts on this subject based on the ideXlab platform.

  • An Effective and Scalable Framework for Authorship Attribution Query Processing
    IEEE Access, 2018
    Co-Authors: Raheem Sarwar, Chenyun Yu, Dickson Chow, Ninad Tungare, Kanatip Chitavisutthivong, Sukrit Sriratanawilai, Yaohai Xu, Thanawin Rakthanmanon, Sarana Nutanong
    Abstract:

    Authorship attribution aims at identifying the original author of an anonymous text from a given set of candidate authors and has a wide range of applications. The main challenge in the authorship attribution problem is that real-world applications tend to have hundreds of authors, while each author may have only a small number of text samples, e.g., 5-10 texts per author. As a result, building a predictive model that can accurately identify the author of an anonymous text is a challenging task. In fact, existing authorship attribution solutions based on long texts focus on application scenarios where the number of candidate authors is limited to 50. These solutions generally report a significant performance reduction as the number of authors increases. To overcome this challenge, we propose a novel data representation model that captures stylistic variations within each document, which transforms the problem of authorship attribution into a similarity search problem. Based on this data representation model, we also propose a similarity query processing technique that can effectively handle outliers. We assess the accuracy of our proposed method against state-of-the-art authorship attribution methods using real-world data sets extracted from Project Gutenberg. Our data set contains 3000 novels from 500 authors. Experimental results show that our method significantly outperforms all competitors. Specifically, for both the closed-set and open-set authorship attribution problems, our method achieves higher than 95% accuracy.

  • A Scalable Framework for Stylometric Analysis Query Processing
    2016 IEEE 16th International Conference on Data Mining (ICDM), 2016
    Co-Authors: Sarana Nutanong, Chenyun Yu, Raheem Sarwar, Peter Xu, Dickson Chow
    Abstract:

    Stylometry is the statistical analysis of variations in the author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution by modeling authorship attribution as a set similarity problem to overcome the two stated limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.

Jeannet Molopyane - One of the best experts on this subject based on the ideXlab platform.