Natural Language Toolkit

The experts below are selected from a list of 297 experts worldwide, ranked by the ideXlab platform.

Steven Bird - One of the best experts on this subject based on the ideXlab platform.

  • NLTK: The Natural Language Toolkit
    Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002
    Co-Authors: Steven Bird, Edward Loper
    Abstract:

    NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials, and problem sets providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

  • Curating lexical databases for minority languages
    2009
    Co-Authors: Greg Aumann, Steven Bird
    Abstract:

    One of the biggest challenges in compiling a dictionary of a minority language is managing the large quantity of lexical data. Decisions about the format and content of the dictionary or the orthography typically evolve over the years that such projects usually take, resulting in inconsistencies between older and newer entries. Revising the data for publication as a dictionary introduces further inconsistencies, as does having multiple contributors and/or editors. Proofreading a lexical database takes a great deal of time, and the richer its structure, the more this is the case. The tools described in this presentation significantly reduce this effort. Tools developed for checking the consistency of the lexical database in the Iu Mien—Chinese—English dictionary project have proven extremely helpful. Two basic approaches are used: 1) a program scans the lexical database for likely errors and produces an error report that a lexicographer uses to make appropriate corrections; 2) the lexical data is output in alternate forms that make it easier for the lexicographer to spot problem areas, including reverse indexes and views structured according to semantic domains. The Iu Mien—Chinese—English dictionary project, like many minority language dictionary projects, uses SIL's Toolbox software. Toolbox is very flexible, but its capabilities for enforcing consistency are quite limited. Some parts of the approach described here are specific to MDF (Multi-Dictionary Formatter) lexical databases in Toolbox but will be equally useful for other MDF databases. Other parts are specific to each of the three languages involved but will still be useful for non-Toolbox lexical databases. Every dictionary is unique, and this applies not only to the content of the entries but also to decisions about how entries should be arranged to suit the languages involved; other dictionaries, even of the same languages, are likely to make these structural decisions differently. It is the way each dictionary combines themes found in many dictionaries that makes it unique, e.g. whether or not it is root-based, or whether it includes subentries. We therefore take a toolkit-based approach to curating lexical databases, which allows checking techniques to be mixed and matched to suit the unique aspects of a lexical project. The checking software is written in Python and relies on the toolbox module in NLTK (the Natural Language Toolkit, http://nltk.sourceforge.net).
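
    A minimal sketch of the first approach (the automated error report), assuming an MDF-style lexicon: the file name iumien.db is hypothetical, and we check only one kind of likely error, a headword entry (\lx) lacking an English gloss (\ge).

    ```python
    # Sketch: scan a Toolbox/MDF lexicon and report entries without a gloss.
    # The file name 'iumien.db' and the \lx / \ge markers are assumptions.
    from nltk import toolbox

    lexicon = toolbox.StandardFormat()
    lexicon.open('iumien.db')

    headword, has_gloss, report = None, True, []
    for marker, value in lexicon.fields():
        if marker == 'lx':                      # a new entry begins
            if headword is not None and not has_gloss:
                report.append('missing \\ge gloss: ' + headword)
            headword, has_gloss = value, False
        elif marker == 'ge':                    # English gloss present
            has_gloss = True
    if headword is not None and not has_gloss:  # check the final entry
        report.append('missing \\ge gloss: ' + headword)

    print('\n'.join(report))                    # report for the lexicographer
    ```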

  • Natural Language Processing with Python
    O'Reilly Media, 2009
    Co-Authors: Steven Bird, Ewan Klein, Edward Loper
    Abstract:

    This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you'll learn how to write Python programs that work with large collections of unstructured text. You'll access richly annotated datasets using a comprehensive range of linguistic data structures, and you'll understand the main algorithms for analyzing the content and structure of written communication. Packed with examples and exercises, Natural Language Processing with Python will help you: extract information from unstructured text, either to guess the topic or identify "named entities"; analyze linguistic structure in text, including parsing and semantic analysis; access popular linguistic databases, including WordNet and treebanks; and integrate techniques drawn from fields as diverse as linguistics and artificial intelligence. This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you're interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages - or if you're simply curious to have a programmer's perspective on how human language works - you'll find Natural Language Processing with Python both fascinating and immensely useful.
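
    As a taste of the kind of program the book teaches (the sentence and lookups below are illustrative examples, not taken from the book), a few lines of NLTK cover tokenization, part-of-speech tagging, named entity recognition, and a WordNet query:

    ```python
    # Requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker,
    # words and wordnet data packages (install via nltk.download(...)).
    import nltk
    from nltk.corpus import wordnet

    sentence = "NLTK was created by Steven Bird and Edward Loper."
    tokens = nltk.word_tokenize(sentence)    # ['NLTK', 'was', 'created', ...]
    tagged = nltk.pos_tag(tokens)            # [('NLTK', 'NNP'), ...]
    print(nltk.ne_chunk(tagged))             # tree with PERSON chunks

    # WordNet: the first few senses of the word "language"
    for synset in wordnet.synsets('language')[:3]:
        print(synset.name(), '-', synset.definition())
    ```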

  • Multidisciplinary instruction with the Natural Language Toolkit
    Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, 2008
    Co-Authors: Steven Bird, Edward Loper, Ewan Klein, Jason Baldridge
    Abstract:

    The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issues: getting started with a course, delivering interactive demonstrations in the classroom, and organizing assignments and projects. In each case, we report on practical experience and make recommendations on how to use NLTK to maximum effect.
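
    An interactive classroom demonstration of the kind the paper discusses can be run in a few lines at the Python prompt; the corpus and query word here are our own illustrative choices:

    ```python
    # A live-demo staple: word frequencies and a concordance over the
    # Brown corpus (part of NLTK's data distribution).
    import nltk
    from nltk.corpus import brown

    news_words = brown.words(categories='news')
    fdist = nltk.FreqDist(w.lower() for w in news_words)
    print(fdist.most_common(10))             # ten most frequent word forms

    text = nltk.Text(news_words)
    text.concordance('government', lines=5)  # keyword in context
    ```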

  • Managing fieldwork data with Toolbox and the Natural Language Toolkit
    Language Documentation & Conservation, 2007
    Co-Authors: Stuart Robinson, Greg Aumann, Steven Bird
    Abstract:

    This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and the Toolbox processing capabilities of NLTK are introduced, showing ways in which they can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML. 1. BACKGROUND. One of the oldest and best known software tools for field linguistics is Shoebox, a program produced by SIL International (formerly the Summer Institute of Linguistics) that provides linguists with the ability to maintain a lexicon and use it to produce interlinear glossed texts within an integrated environment. The description of the program provided on the Shoebox homepage (http://www.sil.org/computing/shoebox/) explains its purpose and the motivation for its name:
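
    A sketch of the HTML-export task mentioned above, using NLTK's toolbox module; the file name lexicon.db and the \lx / \ge markers are assumptions about the database layout:

    ```python
    # Parse a Toolbox lexicon into an ElementTree and print it as HTML.
    from nltk import toolbox

    data = toolbox.ToolboxData()
    data.open('lexicon.db')           # hypothetical file name
    tree = data.parse()               # one <record> element per \lx entry

    print('<html><body><dl>')
    for record in tree.findall('record'):
        lexeme = record.findtext('lx')
        gloss = record.findtext('ge') or ''
        print('<dt><b>%s</b></dt><dd>%s</dd>' % (lexeme, gloss))
    print('</dl></body></html>')
    ```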

Edward Loper - One of the best experts on this subject based on the ideXlab platform.

  • NLTK: The Natural Language Toolkit
    Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002
    Co-Authors: Steven Bird, Edward Loper
    Abstract:

    NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials, and problem sets providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

  • Natural Language Processing with Python
    O'Reilly Media, 2009
    Co-Authors: Steven Bird, Ewan Klein, Edward Loper
    Abstract:

    This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you'll learn how to write Python programs that work with large collections of unstructured text. You'll access richly annotated datasets using a comprehensive range of linguistic data structures, and you'll understand the main algorithms for analyzing the content and structure of written communication. Packed with examples and exercises, Natural Language Processing with Python will help you: extract information from unstructured text, either to guess the topic or identify "named entities"; analyze linguistic structure in text, including parsing and semantic analysis; access popular linguistic databases, including WordNet and treebanks; and integrate techniques drawn from fields as diverse as linguistics and artificial intelligence. This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you're interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages - or if you're simply curious to have a programmer's perspective on how human language works - you'll find Natural Language Processing with Python both fascinating and immensely useful.

  • Multidisciplinary instruction with the Natural Language Toolkit
    Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, 2008
    Co-Authors: Steven Bird, Edward Loper, Ewan Klein, Jason Baldridge
    Abstract:

    The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issues: getting started with a course, delivering interactive demonstrations in the classroom, and organizing assignments and projects. In each case, we report on practical experience and make recommendations on how to use NLTK to maximum effect.

  • NLTK: The Natural Language Toolkit
    Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004
    Co-Authors: Steven Bird, Edward Loper
    Abstract:

    The Natural Language Toolkit is a suite of program modules, data sets, tutorials and exercises, covering symbolic and statistical natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past three years, NLTK has become popular in teaching and research. We describe the toolkit and report on its current state of development.
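
    A small illustration of the statistical side (our example, not the paper's): training and scoring a unigram part-of-speech tagger on the Brown corpus takes only a few lines.

    ```python
    # Train a unigram POS tagger on Brown news text and evaluate it on
    # held-out sentences.
    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents(categories='news')
    train, test = tagged_sents[:4000], tagged_sents[4000:]

    tagger = nltk.UnigramTagger(train)
    print('accuracy: %.3f' % tagger.evaluate(test))
    ```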

  • NLTK: The Natural Language Toolkit
    arXiv: Computation and Language, 2002
    Co-Authors: Edward Loper, Steven Bird
    Abstract:

    NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials, and problem sets providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

Eric Atwell - One of the best experts on this subject based on the ideXlab platform.

  • Using an Islamic Question and Answer Knowledge Base to answer questions about the Holy Quran
    2017
    Co-Authors: Bothaina Hamoud, Eric Atwell
    Abstract:

    This paper presents QAEQAS, the Quranic Arabic/English Question Answering System, which relies on a specialized search corpus and data redundancy. Our corpus is composed of questions along with their answers. The questions are phrased in many different ways in differing contexts to optimize question answering (QA) performance. As a complete question answering solution, the Python Natural Language Toolkit (NLTK) has been used to process the user question, to implement the search engine that retrieves candidate results, and to extract the best answer. The system accepts a natural language (NL) question in English or Arabic from the user through a GUI, matches this question against the knowledge base questions, and returns the corresponding answer. A keyword-based search was used: first the user question was tokenized to get the keywords, and then the stop words were removed. The remaining keywords were used to search the corpus for matching questions. The system then used scoring and ranking to find the best matched question and returned the corresponding answer. QAEQAS deals with a wide range of question types, including facts and definitions. It produces both short and long answers with a precision of 79% and a recall of 76% for the Arabic version, and a precision of 75% and a recall of 73% for the English version.
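
    A sketch of the keyword pipeline described above (tokenize, drop stop words, rank stored questions by keyword overlap); the two-entry knowledge base is invented for illustration:

    ```python
    # Keyword-overlap question matching with NLTK tokenization and
    # stopword removal. Requires the punkt and stopwords data packages.
    import nltk
    from nltk.corpus import stopwords

    knowledge_base = {
        "How many chapters are in the Quran?":
            "The Quran has 114 chapters (surahs).",
        "What is the first surah of the Quran?":
            "The first surah is Al-Fatihah.",
    }

    stop = set(stopwords.words('english'))

    def keywords(text):
        return {w.lower() for w in nltk.word_tokenize(text)
                if w.isalnum() and w.lower() not in stop}

    def answer(question):
        q_keys = keywords(question)
        # Score each stored question by keyword overlap; return best match
        best = max(knowledge_base, key=lambda kb_q: len(q_keys & keywords(kb_q)))
        return knowledge_base[best]

    print(answer("How many chapters does the Quran contain?"))
    ```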

  • LREC - ProPOSEC: A Prosody and PoS Annotated Spoken English Corpus
    2010
    Co-Authors: C Brierley, Eric Atwell
    Abstract:

    We have previously reported on ProPOSEL, a purpose-built Prosody and PoS English Lexicon compatible with the Python Natural Language Toolkit. ProPOSEC is a new corpus research resource built using this lexicon, intended for distribution with the Aix-MARSEC dataset. ProPOSEC comprises multi-level parallel annotations, juxtaposing prosodic and syntactic information from different versions of the Spoken English Corpus, with canonical dictionary forms, in a query format optimized for Perl, Python, and text processing programs. The order and content of fields in the text file is as follows: (1) Aix-MARSEC file number; (2) word; (3) LOB PoS-tag; (4) C5 PoS-tag; (5) Aix SAM-PA phonetic transcription; (6) SAM-PA phonetic transcription from ProPOSEL; (7) syllable count; (8) lexical stress pattern; (9) default content or function word tag; (10) DISC stressed and syllabified phonetic transcription; (11) alternative DISC representation, incorporating lexical stress pattern; (12) nested arrays of phonemes and tonic stress marks from Aix. As an experimental dataset, ProPOSEC can be used to study correlations between these annotation tiers, where significant findings are then expressed as additional features for phrasing models integral to Text-to-Speech and Speech Recognition. As a training set, ProPOSEC can be used for machine learning tasks in Information Retrieval and Speech Understanding systems.
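
    A reader for records in this layout might look as follows; the field order comes from the abstract, while the tab delimiter, sample access pattern, and file name are assumptions:

    ```python
    # Sketch: read ProPOSEC records, assuming one record per line with
    # tab-separated fields in the order listed in the abstract.
    FIELDS = [
        'file_number', 'word', 'lob_pos', 'c5_pos', 'aix_sampa',
        'proposel_sampa', 'syllable_count', 'stress_pattern',
        'content_or_function', 'disc_syllabified', 'disc_with_stress',
        'phoneme_arrays',
    ]

    def read_proposec(path):
        with open(path, encoding='utf-8') as f:
            for line in f:
                values = line.rstrip('\n').split('\t')   # assumed delimiter
                yield dict(zip(FIELDS, values))

    # for record in read_proposec('proposec.txt'):   # hypothetical file name
    #     print(record['word'], record['stress_pattern'])
    ```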

  • Exploring imagery in literary corpora with the Natural Language Toolkit
    Proceedings of the Corpus Linguistics Conference (CL2009), p. 135, 2009
    Co-Authors: Eric Atwell
    Abstract:

    This paper presents a middle way for corpus linguists between use of "off-the-shelf" corpus analysis software and building tools from scratch, which presupposes competence in a general-purpose programming language. The Python Natural Language Toolkit (NLTK) offers a range of sophisticated natural language processing tools which we have applied to literary analysis, through case studies in Macbeth and Hamlet, with code snippets and experiments that can be replicated for research and research-led teaching with other literary texts.
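
    This kind of case study can be replicated directly against NLTK's Shakespeare corpus sample; the image word "blood" is our own obvious choice for Macbeth:

    ```python
    # Concordance and frequency of an image word in NLTK's XML editions
    # of Macbeth and Hamlet (nltk.download('shakespeare')).
    import nltk
    from nltk.corpus import shakespeare

    macbeth = nltk.Text(shakespeare.words('macbeth.xml'))
    macbeth.concordance('blood', lines=5)    # keyword in dramatic context

    hamlet = nltk.Text(shakespeare.words('hamlet.xml'))
    for name, text in [('Macbeth', macbeth), ('Hamlet', hamlet)]:
        print(name, text.count('blood'))     # compare across the two plays
    ```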

  • ProPOSEL: a human-oriented prosody and PoS English lexicon for machine-learning and NLP
    Proceedings of the workshop on Cognitive Aspects of the Lexicon - COGALEX '08, 2008
    Co-Authors: C Brierley, Eric Atwell
    Abstract:

    ProPOSEL is a prosody and PoS English lexicon, purpose-built to integrate and leverage domain knowledge from several well-established lexical resources for machine learning and NLP applications. The lexicon of 104049 separate entries is in accessible text file format, is human- and machine-readable, and is intended for open source distribution with the Natural Language Toolkit. It is therefore supported by Python software tools which transform ProPOSEL into a Python dictionary or associative array of linguistic concepts mapped to compound lookup keys. Users can also conduct searches on a subset of the lexicon and access entries by word class, phonetic transcription, syllable count and lexical stress pattern. ProPOSEL caters for a range of different cognitive aspects of the lexicon.
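
    A sketch of the compound-lookup-key idea; the four-field layout and tab separator below are assumptions for illustration, not ProPOSEL's actual record structure:

    ```python
    # Build an associative array keyed on (word class, syllable count),
    # in the spirit of ProPOSEL's compound lookup keys.
    from collections import defaultdict

    def load_lexicon(path):
        index = defaultdict(list)
        with open(path, encoding='utf-8') as f:
            for line in f:
                word, word_class, syllables, stress = line.rstrip('\n').split('\t')
                index[(word_class, int(syllables))].append((word, stress))
        return index

    # index = load_lexicon('proposel.txt')   # hypothetical file name
    # index[('NN', 2)] -> two-syllable nouns with their stress patterns
    ```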

  • LREC - ProPOSEL: A prosody and PoS English lexicon for language engineering
    2008
    Co-Authors: C Brierley, Eric Atwell
    Abstract:

    ProPOSEL is a prototype prosody and PoS (part-of-speech) English lexicon for language engineering, derived from the following language resources: the computer-usable dictionary CUVPlus, the CELEX-2 database, the Carnegie-Mellon Pronouncing Dictionary, and the BNC, LOB and Penn Treebank PoS-tagged corpora. The lexicon is designed for the target application of prosodic phrase break prediction but is also relevant to other machine learning and language engineering tasks. It supplements the existing record structure for wordform entries in CUVPlus with syntactic annotations from rival PoS-tagging schemes, mapped to fields for default closed- and open-class word categories and for lexical stress patterns representing the rhythmic structure of wordforms, interpreted as potential new text-based features for automatic phrase break classifiers. The current version of the lexicon comes as a text file of 104052 separate entries and is intended for distribution with the Natural Language Toolkit; it is therefore accompanied by supporting Python software for manipulating the data so that it can be used for natural language processing (NLP) and corpus-based research in speech synthesis and speech recognition.

David Claveau - One of the best experts on this subject based on the ideXlab platform.

  • ICTAI - A System to Convey the Emotional Content of Text Using a Humanoid Robot
    2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), 2016
    Co-Authors: Peter Sylvester, David Claveau
    Abstract:

    Soon we will have anthropomorphic robots that will assist us at home and at work and will be able to recite our text-based communications to us in an animated and engaging manner. This paper describes some steps taken to explore this idea with a simple humanoid robot, the Aldebaran NAO, which is capable of speech of varying pitch, speed and loudness, and has evocative color LED lights around its eyes. To convey the emotional content of text, the proposed system maps ASCII text to these capabilities. The extraction of emotion makes use of the Natural Language Toolkit (NLTK), an open-source tool for computational linguistics in Python. A two-dimensional emotion space is used to map the extracted emotion to robot action. The effectiveness of the system is explored with a simple example of emotional text, and a video of the results is provided along with a time-plot of the actions.
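
    A hedged sketch of the text-to-emotion step: the paper extracts emotion with NLTK, and here we stand in NLTK's VADER sentiment analyzer (our choice, not necessarily the authors') and map its valence score to hypothetical robot speech parameters.

    ```python
    # Map text valence to illustrative robot parameters. Requires the
    # vader_lexicon data package (nltk.download('vader_lexicon')).
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def robot_params(text):
        valence = analyzer.polarity_scores(text)['compound']   # -1 .. +1
        return {
            'pitch': 1.0 + 0.3 * valence,   # hypothetical: happier -> higher
            'speed': 1.0 + 0.2 * valence,   # hypothetical: happier -> faster
            'eye_rgb': (0, 255, 0) if valence > 0 else (0, 0, 255),
        }

    print(robot_params("I can't wait to see you tonight!"))
    ```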

Alina Putintseva - One of the best experts on this subject based on the ideXlab platform.

  • Russian tagging and dependency parsing models for Stanford CoreNLP Natural Language Toolkit
    International Conference on Knowledge Engineering and the Semantic Web, 2017
    Co-Authors: Liubov Kovriguina, Ivan Shilin, Alexander Shipilo, Alina Putintseva
    Abstract:

    The paper concerns implementing a maximum entropy tagging model and a neural network dependency parser model for the Russian language in the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. Russian is a morphologically rich language and demands full morphological analysis, including annotating input texts with POS tags, features, and lemmas (unlike languages without case, person, and similar inflections, where stemming and POS tagging give enough information about the grammatical behavior of a word form). Rich morphology is accompanied by free word order in Russian, which adds indeterminacy to head-finding rules in parsing procedures. In the paper we describe the training data, the linguistic features used to learn the classifiers, and the training and evaluation of the tagging and parsing models.
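
    A hedged usage sketch from the NLTK side: assuming a CoreNLP server has been started locally with the Russian models described in the paper (the server address and setup are assumptions), NLTK's CoreNLP client can query it.

    ```python
    # Query a running CoreNLP server through NLTK's client classes.
    from nltk.parse.corenlp import CoreNLPDependencyParser

    parser = CoreNLPDependencyParser(url='http://localhost:9000')

    # A stock Russian example sentence ("Mother washed the frame.")
    parse, = parser.raw_parse('Мама мыла раму.')
    print(parse.to_conll(4))   # word, POS tag, head index, relation
    ```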

  • KESW - Russian Tagging and Dependency Parsing Models for Stanford CoreNLP Natural Language Toolkit
    Communications in Computer and Information Science, 2017
    Co-Authors: Liubov Kovriguina, Ivan Shilin, Alexander Shipilo, Alina Putintseva
    Abstract:

    The paper concerns implementing a maximum entropy tagging model and a neural network dependency parser model for the Russian language in the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. Russian is a morphologically rich language and demands full morphological analysis, including annotating input texts with POS tags, features, and lemmas (unlike languages without case, person, and similar inflections, where stemming and POS tagging give enough information about the grammatical behavior of a word form). Rich morphology is accompanied by free word order in Russian, which adds indeterminacy to head-finding rules in parsing procedures. In the paper we describe the training data, the linguistic features used to learn the classifiers, and the training and evaluation of the tagging and parsing models.