Language Documentation

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The experts below are selected from a list of 44,673 experts worldwide, ranked by the ideXlab platform

Hilaria Cruz - One of the best experts on this subject based on the ideXlab platform.

  • Endangered Languages Meet Modern NLP
    International Conference on Computational Linguistics, 2020
    Co-Authors: Antonios Anastasopoulos, Graham Neubig, Christopher Cox, Hilaria Cruz
    Abstract:

    This tutorial will focus on NLP for endangered language documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state of the art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work in this very important direction, and also to provide them with the tools and an understanding of the limitations and challenges, both of which are needed in order to have an impact.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    arXiv: Computation and Language, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida, Kwak'wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    Workshop Spoken Language Technologies for Under-resourced Languages, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone-to-orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • Public Access to Research Data in Language Documentation: Challenges and Possible Strategies
    Language Documentation & Conservation, 2019
    Co-Authors: Mandana Seyfeddinipur, Hilaria Cruz, Sebastian Drude, Felix K Ameka, Lissant Bolton, Jonathan Blumtritt, Brian Carpenter, Patience Epps, Vera Ferreira, Ana Vilacy Galucio
    Abstract:

    The Open Access Movement promotes free and unfettered access to research publications and, increasingly, to the primary data which underlie those publications. As the field of documentary linguistics seeks to record and preserve culturally and linguistically relevant materials, the question of how openly accessible these materials should be becomes increasingly important. This paper aims to guide researchers and other stakeholders in finding an appropriate balance between accessibility and confidentiality of data, addressing community questions and the legal, institutional, and intellectual issues that pose challenges to accessible data.

  • Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation
    2018
    Co-Authors: Oliver Adams, Steven Bird, Trevor Cohn, Graham Neubig, Hilaria Cruz, Alexis Michaud
    Abstract:

    Transcribing speech is an important part of language documentation, yet speech recognition technology has not been widely harnessed to aid linguists. We explore the use of a neural network architecture with the connectionist temporal classification loss function for phonemic and tonal transcription in a language documentation setting. In this framework, we explore jointly modelling phonemes and tones versus modelling them separately, and assess the importance of pitch information versus phonemic context for tonal prediction. Experiments on two tonal languages, Yongning Na and Eastern Chatino, show the changes in recognition performance as training data is scaled from 10 minutes up to 50 minutes for Chatino, and up to 224 minutes for Na. We discuss the findings from incorporating this technology into the linguistic workflow for documenting Yongning Na, which show the method's promise in improving efficiency, minimizing typographical errors, and maintaining the transcription's faithfulness to the acoustic signal, while highlighting phonetic and phonemic facts for linguistic consideration.

Laurent Besacier - One of the best experts on this subject based on the ideXlab platform.

  • Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
    2020
    Co-Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
    Abstract:

    For endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate the recordings into a widely spoken language to ensure their interpretability. In this paper we investigate how the choice of translation language affects the subsequent documentation work and the potential automatic approaches which will work on top of the produced bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of language for translation influences word segmentation performance, and that different lexicons are learned by using different aligned translations. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation increases their translation and alignment quality, especially for challenging language pairs.
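    The hybrid approach described above injects boundary clues from the Bayesian model into the neural segmenter's input representation. A toy sketch of that idea, under the assumption that the clue is a per-symbol boundary probability concatenated to the symbol embedding (the embedding table, symbols, and probabilities here are invented for illustration, not the paper's models):

```python
# Augment each phone's input vector with a boundary-probability clue,
# in the spirit of the hybrid segmentation idea; all values are toy data.

def augment_with_boundary_clues(symbols, embeddings, boundary_probs):
    """Concatenate each symbol's embedding with its boundary probability."""
    assert len(symbols) == len(boundary_probs)
    return [embeddings[s] + [p] for s, p in zip(symbols, boundary_probs)]

embeddings = {"a": [0.1, 0.2], "b": [0.3, 0.4]}   # toy 2-d embedding table
symbols = ["a", "b", "a"]
boundary_probs = [0.9, 0.1, 0.8]                   # from a Bayesian segmenter (toy)

augmented = augment_with_boundary_clues(symbols, embeddings, boundary_probs)
print(augmented)  # -> [[0.1, 0.2, 0.9], [0.3, 0.4, 0.1], [0.1, 0.2, 0.8]]
```

    The neural model then consumes these enriched vectors in place of plain embeddings, letting attention-based segmentation exploit the Bayesian model's boundary evidence.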

  • How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages
    arXiv: Computation and Language, 2019
    Co-Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
    Abstract:

    For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take, on average, an hour and a half of a linguist's work (Austin and Sallabank, 2013). Recently, collecting aligned translations in well-resourced languages became a popular solution for ensuring the later interpretability of the recordings (Adda et al. 2016). In this paper we investigate language-related impact in automatic approaches for computational language documentation. We translate the bilingual Mboshi-French parallel corpus (Godard et al. 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery. Our results hint towards an impact of the well-resourced language on the quality of the output. However, by combining the information learned by different bilingual models, we are only able to marginally increase the quality of the segmentation.

  • Unsupervised Word Segmentation from Speech with Attention
    Conference of the International Speech Communication Association, 2018
    Co-Authors: Pierre Godard, François Yvon, Aline Villavicencio, Marcely Zanon Boito, Lucas Ondel, Alexandre Berard, Laurent Besacier
    Abstract:

    We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal of automatically identifying lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL and translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones that is segmented using neural soft alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
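    One simple way to turn soft alignments into a segmentation, sketched here as an illustration rather than the paper's exact procedure, is to assign each pseudo-phone to its highest-attention translation word and place a boundary wherever the assigned word changes:

```python
def segment_from_attention(phones, attention):
    """attention[w][p]: soft alignment weight of translation word w
    on pseudo-phone p. Assign each phone to its argmax word and cut
    wherever the assigned word changes."""
    n_words = len(attention)
    assignment = [max(range(n_words), key=lambda w: attention[w][p])
                  for p in range(len(phones))]
    segments, current = [], [phones[0]]
    for p in range(1, len(phones)):
        if assignment[p] != assignment[p - 1]:
            segments.append(current)
            current = []
        current.append(phones[p])
    segments.append(current)
    return segments

phones = ["m", "b", "o", "s", "i"]
attention = [                      # toy 2-word x 5-phone soft alignment
    [0.9, 0.8, 0.7, 0.2, 0.1],     # word 0 attends to the first three phones
    [0.1, 0.2, 0.3, 0.8, 0.9],     # word 1 attends to the last two
]
print(segment_from_attention(phones, attention))  # -> [['m', 'b', 'o'], ['s', 'i']]
```

    The toy attention matrix is an assumption; in the paper the weights come from a trained neural machine translation model over AUD pseudo-phones.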

  • A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
    2018
    Co-Authors: P. Godard, Guy-Noël Kouarata, Lori F Lamel, Martine Adda-Decker, Gilles Adda, Laurent Besacier, J Benjumea, J Cooper-Leavitt, H Maynard, M. Müller
    Abstract:

    Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages do not have such resources, and some even lack a stable orthography. Building systems under these almost-zero-resource conditions is promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered, unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We detail how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
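    As a rough illustration of the spoken term discovery task on discretized speech (a crude stand-in, not the evaluated system), one can look for pseudo-phone n-grams that recur across utterances:

```python
from collections import defaultdict

def discover_terms(utterances, n=3, min_count=2):
    """Find pseudo-phone n-grams recurring across utterances -- a crude
    stand-in for spoken term discovery on discretized speech."""
    counts = defaultdict(int)
    for utt in utterances:
        seen = set()
        for i in range(len(utt) - n + 1):
            gram = tuple(utt[i:i + n])
            if gram not in seen:        # count each utterance at most once per n-gram
                counts[gram] += 1
                seen.add(gram)
    return {g for g, c in counts.items() if c >= min_count}

# Toy pseudo-phone transcripts (invented); the shared substring "mbosi"
# yields recurring trigrams.
utts = [list("mbosiwa"), list("kambosi"), list("tatawa")]
print(discover_terms(utts))  # -> {('m','b','o'), ('b','o','s'), ('o','s','i')}
```

    Real zero-resource systems match variable-length acoustic segments directly (e.g. by dynamic time warping) rather than exact symbol n-grams, but the goal is the same: surface candidate lexical items without any transcription.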

  • Innovative technologies for under-resourced Language Documentation: The BULB Project
    2017
    Co-Authors: Gilles Adda, Guy-Noël Kouarata, Mark Van De Velde, Laurent Besacier, Annie Rialland, Martine Adda-Decker, François Yvon, Pierre Godard, Hélène Bonneau-Maynard, Emmanuel-Moselly Makasso, Lori F Lamel, Elodie Gauthier, Dmitry Idiatov, Fatima Hamlaoui, David Blachon, Odette Ambouroue, Sebastian Stüker, Sabine Zerbian
    Abstract:

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and technology's capabilities. Data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of different speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

Graham Neubig - One of the best experts on this subject based on the ideXlab platform.

  • Endangered Languages Meet Modern NLP
    International Conference on Computational Linguistics, 2020
    Co-Authors: Antonios Anastasopoulos, Graham Neubig, Christopher Cox, Hilaria Cruz
    Abstract:

    This tutorial will focus on NLP for endangered language documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state of the art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work in this very important direction, and also to provide them with the tools and an understanding of the limitations and challenges, both of which are needed in order to have an impact.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    arXiv: Computation and Language, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida, Kwak'wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    Workshop Spoken Language Technologies for Under-resourced Languages, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone-to-orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation
    2018
    Co-Authors: Oliver Adams, Steven Bird, Trevor Cohn, Graham Neubig, Hilaria Cruz, Alexis Michaud
    Abstract:

    Transcribing speech is an important part of language documentation, yet speech recognition technology has not been widely harnessed to aid linguists. We explore the use of a neural network architecture with the connectionist temporal classification loss function for phonemic and tonal transcription in a language documentation setting. In this framework, we explore jointly modelling phonemes and tones versus modelling them separately, and assess the importance of pitch information versus phonemic context for tonal prediction. Experiments on two tonal languages, Yongning Na and Eastern Chatino, show the changes in recognition performance as training data is scaled from 10 minutes up to 50 minutes for Chatino, and up to 224 minutes for Na. We discuss the findings from incorporating this technology into the linguistic workflow for documenting Yongning Na, which show the method's promise in improving efficiency, minimizing typographical errors, and maintaining the transcription's faithfulness to the acoustic signal, while highlighting phonetic and phonemic facts for linguistic consideration.

  • Integrating Automatic Transcription into the Language Documentation Workflow: Experiments with Na Data and the Persephone Toolkit
    Language Documentation & Conservation, 2018
    Co-Authors: Alexis Michaud, Oliver Adams, Trevor Cohn, Graham Neubig, Severine Guillaume
    Abstract:

    Automatic speech recognition tools have potential for facilitating language documentation, but in practice these tools remain little used by linguists, for a variety of reasons: the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and case studies demonstrating the practical usefulness of automatic recognition in a low-resource setting remain few. This article reports on a success story in integrating automatic transcription into the language documentation workflow, specifically for Yongning Na, a language of Southwest China. Using PERSEPHONE, an open-source toolkit, a single-speaker speech transcription tool was trained on five hours of manually transcribed speech. The experiments found that this method can achieve a remarkably low error rate (on the order of 17%), and that automatic transcriptions were useful as a canvas for the linguist. The present report is intended for linguists with little or no knowledge of speech processing. It aims to provide insights into (i) the way the tool operates and (ii) the process of collaborating with natural language processing specialists. Practical recommendations are offered on how to anticipate the requirements of this type of technology from the early stages of data collection in the field.
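    The error rate cited above is a label error rate, conventionally computed as the edit distance between reference and hypothesis label sequences divided by the reference length. A self-contained sketch (the toy label sequences are invented for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, by dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions
    for j in range(n + 1):
        d[0][j] = j                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

def label_error_rate(ref, hyp):
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = ["n", "a", "X", "b", "o"]   # toy reference labels
hyp = ["n", "a", "Y", "b", "o"]   # toy hypothesis with one substitution
print(label_error_rate(ref, hyp))  # -> 0.2 (one substitution in five labels)
```

    A reported rate "on the order of 17%" thus means roughly one label in six is inserted, deleted, or substituted relative to the manual transcription.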

Alexis Palmer - One of the best experts on this subject based on the ideXlab platform.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    arXiv: Computation and Language, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida, Kwak'wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
    Workshop Spoken Language Technologies for Under-resourced Languages, 2020
    Co-Authors: Graham Neubig, Alexis Palmer, Hilaria Cruz, Shruti Rijhwani, Jordan Mackenzie, Matthew C H Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati
    Abstract:

    Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone-to-orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

  • Computational Strategies for Reducing Annotation Effort in Language Documentation
    Linguistic Issues in Language Technology, 2010
    Co-Authors: Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric W Campbell, Telma Can
    Abstract:

    With the urgent need to document the world's dying languages, it is important to explore ways to speed up language documentation efforts. One promising avenue is to use techniques from computational linguistics to automate some of the process. Here we consider unsupervised morphological segmentation and active learning for creating interlinear glossed text (IGT) for the Mayan language Uspanteko. The practical goal is to produce a fully annotated corpus that is as accurate as possible given limited time for manual annotation. We discuss results from several experiments that suggest there is indeed much promise in these methods but also show that further development is necessary to make them robustly useful for a wide range of conditions and tasks. We also provide a detailed discussion of how two documentary linguists perceived machine support in IGT production and how their annotation performance varied with different levels of machine support.

  • How Well Does Active Learning Actually Work? Time-Based Evaluation of Cost Reduction Strategies for Language Documentation
    Empirical Methods in Natural Language Processing, 2009
    Co-Authors: Jason Baldridge, Alexis Palmer
    Abstract:

    Machine involvement has the potential to speed up language documentation. We assess this potential with timed annotation experiments that consider annotator expertise, example selection methods, and suggestions from a machine classifier. We find that better example selection and label suggestions improve efficiency, but effectiveness depends strongly on annotator expertise. Our expert performed best with uncertainty selection, but gained little from suggestions. Our non-expert performed best with random selection and suggestions. The results underscore both the importance of measuring annotation cost reductions with respect to time and the need for cost-sensitive learning methods that adapt to annotators.
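    Uncertainty selection, the strategy that worked best for the expert annotator, picks for annotation the examples on which the classifier is least confident. A minimal sketch using entropy over predicted label distributions (the example pool and probabilities are invented for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uncertainty_select(pool, k):
    """Pick the k unlabeled examples whose predicted label
    distribution has the highest entropy."""
    return sorted(pool, key=lambda ex: entropy(pool[ex]), reverse=True)[:k]

# Toy pool: example id -> classifier's predicted label distribution (invented)
pool = {
    "ex1": [0.98, 0.01, 0.01],   # classifier is confident
    "ex2": [0.34, 0.33, 0.33],   # classifier is very uncertain
    "ex3": [0.70, 0.20, 0.10],
}
print(uncertainty_select(pool, 1))  # -> ['ex2']
```

    Random selection, by contrast, simply samples the pool uniformly; the paper's point is that which strategy wins depends on who is doing the annotating, once cost is measured in time rather than in number of examples.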

  • Evaluating Automation Strategies in Language Documentation
    North American Chapter of the Association for Computational Linguistics, 2009
    Co-Authors: Alexis Palmer, Taesun Moon, Jason Baldridge
    Abstract:

    This paper presents pilot work integrating machine labeling and active learning with human annotation of data for the language documentation task of creating interlinear glossed text (IGT) for the Mayan language Uspanteko. The practical goal is to produce a fully annotated corpus that is as accurate as possible given limited time for manual annotation. We describe ongoing pilot studies which examine the influence of three main factors on reducing the time spent to annotate IGT: suggestions from a machine labeler, sample selection methods, and annotator expertise.

Antonios Anastasopoulos - One of the best experts on this subject based on the ideXlab platform.

  • Endangered Languages Meet Modern NLP
    International Conference on Computational Linguistics, 2020
    Co-Authors: Antonios Anastasopoulos, Graham Neubig, Christopher Cox, Hilaria Cruz
    Abstract:

    This tutorial will focus on NLP for endangered language documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state of the art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work in this very important direction, and also to provide them with the tools and an understanding of the limitations and challenges, both of which are needed in order to have an impact.

  • Spoken Term Discovery for Language Documentation Using Translations
    Empirical Methods in Natural Language Processing, 2017
    Co-Authors: Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez
    Abstract:

    Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.

  • A Case Study on Using Speech-to-Translation Alignments for Language Documentation
    Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, 2017
    Co-Authors: Antonios Anastasopoulos, David Chiang
    Abstract:

    For many low-resource or endangered languages, spoken language resources are more likely to be annotated with translations than with transcriptions. Recent work exploits such annotations to produce speech-to-translation alignments, without access to any text transcriptions. We investigate whether providing such information can aid in producing better (mismatched) crowdsourced transcriptions, which in turn could be valuable for training speech recognition systems, and we show, through a small-scale proof-of-concept case study, that they can indeed be beneficial. We also present a simple phonetically aware string averaging technique that produces transcriptions of higher quality.
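    The paper's string averaging is phonetically aware; as a simplified stand-in that conveys the basic idea, one can take a position-wise majority vote over equal-length crowdsourced transcriptions (the toy transcriptions below are invented):

```python
from collections import Counter

def majority_vote_average(transcriptions):
    """Position-wise majority vote over equal-length transcriptions --
    a plain-character simplification of phonetically aware string
    averaging; real crowdsourced strings would first need alignment."""
    assert len({len(t) for t in transcriptions}) == 1, "sketch assumes equal lengths"
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*transcriptions))

# Three noisy crowd transcriptions of the same (toy) utterance
crowd = ["patola", "patcla", "patola"]
print(majority_vote_average(crowd))  # -> "patola"
```

    A phonetically aware version would align transcriptions of differing lengths first and weight votes by phonetic similarity rather than exact character identity.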