Language Identification

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 4731 Experts worldwide ranked by the ideXlab platform.

Timothy Baldwin - One of the best experts on this subject based on the ideXlab platform.

  • LREC - Reconsidering Language Identification for Written Language Resources
    2020
    Co-Authors: Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, Andrew Mackinlay
    Abstract:

    The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.

  • Accurate Language Identification of Twitter Messages
    Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), 2014
    Co-Authors: Timothy Baldwin
    Abstract:

    We present an evaluation of "off-the-shelf" language identification systems as applied to microblog messages from Twitter. A key challenge is the lack of an adequate corpus of messages annotated for language that reflects the linguistic diversity present on Twitter. We overcome this through a "mostly-automated" approach to gathering language-labeled Twitter messages for evaluating language identification. We present the method used to construct this dataset, as well as empirical results over existing datasets and off-the-shelf language identifiers. We also test techniques that have been proposed in the literature to boost language identification performance over Twitter messages. We find that simple voting over three specific systems consistently outperforms any of the individual systems, and achieves state-of-the-art accuracy on the task.
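
The voting scheme the abstract describes is straightforward to reproduce. A minimal sketch of majority voting over three identifiers (the systems and their outputs below are hypothetical stand-ins, not the specific systems evaluated in the paper):

```python
from collections import Counter

def vote_language(predictions):
    """Majority vote over per-system language predictions.

    predictions: list of language codes, one per identifier,
    e.g. ["en", "en", "nl"]. With three systems, any label
    predicted at least twice wins; a three-way tie falls back
    to the first system's output.
    """
    label, freq = Counter(predictions).most_common(1)[0]
    return label if freq > 1 else predictions[0]

# Hypothetical outputs from three off-the-shelf identifiers:
print(vote_language(["en", "en", "nl"]))  # -> en
print(vote_language(["en", "nl", "ja"]))  # -> en (tie: first system)
```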

  • langid.py: An Off-the-shelf Language Identification Tool
    Meeting of the Association for Computational Linguistics, 2012
    Co-Authors: Timothy Baldwin
    Abstract:

    We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end users that require language identification without wanting to invest in preparation of in-domain training data.
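
langid.py ships with a pre-trained model, but the underlying idea, naive Bayes over character n-gram counts, is compact enough to sketch. The toy two-language training texts below are illustrative assumptions, not langid.py's actual model or API:

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """samples: {lang: training text}. Returns per-language
    add-one-smoothed log-probabilities over character bigrams."""
    counts = {lang: Counter(ngrams(text)) for lang, text in samples.items()}
    vocab = set().union(*counts.values())
    models = {}
    for lang, c in counts.items():
        total = sum(c.values()) + len(vocab)
        models[lang] = {g: math.log((c[g] + 1) / total) for g in vocab}
        models[lang]["<unk>"] = math.log(1 / total)
    return models

def classify(models, text):
    """Pick the language whose model scores the text highest."""
    def score(lang):
        m = models[lang]
        return sum(m.get(g, m["<unk>"]) for g in ngrams(text))
    return max(models, key=score)

# Toy models built from two one-sentence "corpora" (illustrative only):
models = train({
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
})
print(classify(models, "the dog jumps"))    # -> en
print(classify(models, "der braune hund"))  # -> de
```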

  • Cross-domain Feature Selection for Language Identification
    International Joint Conference on Natural Language Processing, 2011
    Co-Authors: Timothy Baldwin
    Abstract:

    We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains. Our results demonstrate that our method provides improvements in transductive transfer learning for language identification. We provide an implementation of the method and show that our system is faster than popular standalone language identification systems, while maintaining competitive accuracy.
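
The abstract does not spell out the selection criterion, but one plausible realization of "feature selection that generalizes across domains" is to score each feature by its information gain with respect to language, discounted by its information gain with respect to domain. A sketch under that assumption (the toy corpus and the `ld_score` name are illustrative, not the paper's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(docs, labels, feature):
    """Information gain of a feature's presence w.r.t. a labelling."""
    present = [lab for doc, lab in zip(docs, labels) if feature in doc]
    absent = [lab for doc, lab in zip(docs, labels) if feature not in doc]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (present, absent) if part)
    return entropy(labels) - cond

def ld_score(docs, langs, domains, feature):
    """Prefer features informative about language, not about domain."""
    return info_gain(docs, langs, feature) - info_gain(docs, domains, feature)

# Toy corpus: "the" tracks language across domains; "xx" tracks domain.
docs    = ["the cat", "the dog xx", "der hund", "die katze xx"]
langs   = ["en", "en", "de", "de"]
domains = ["news", "tweet", "news", "tweet"]
print(ld_score(docs, langs, domains, "the") >
      ld_score(docs, langs, domains, "xx"))  # -> True
```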

Hagen Soltau - One of the best experts on this subject based on the ideXlab platform.

  • Confidence measure based Language Identification
    2000 IEEE International Conference on Acoustics Speech and Signal Processing. Proceedings (Cat. No.00CH37100), 2000
    Co-Authors: Florian Metze, Thomas Kemp, Thomas Schaaf, T. Schultz, Hagen Soltau
    Abstract:

    In this paper we present a new application for confidence measures in spoken language processing. In today's computerized dialogue systems, language identification (LID) is typically achieved via dedicated modules. In our approach, LID is integrated into the speech recognizer, thereby profiting from high-level linguistic knowledge at very little extra cost. Our new approach is based on a word-lattice-based confidence measure (Kemp and Schaaf, 1997), which was originally devised for unsupervised training. In this work, we show that the confidence-based language identification algorithm outperforms conventional score-based methods. This method is also less dependent on the acoustic characteristics of the transmission channel than score-based methods. By introducing additional parameters, unknown languages can be rejected. The proposed method is compared to a score-based approach on the Verbmobil database, a three-language task.
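
The decision rule this enables can be sketched as follows. The confidence values here are hypothetical; in the paper they are derived from word lattices inside the recognizer:

```python
def identify_language(confidences, reject_threshold=0.5):
    """Pick the language whose recognizer is most confident.

    confidences: {lang: confidence of that language's recognizer
    on the utterance}. Returning None rejects the utterance as an
    unknown language, mirroring the paper's rejection mechanism.
    """
    best = max(confidences, key=confidences.get)
    return best if confidences[best] >= reject_threshold else None

# Hypothetical per-language confidences for one utterance:
print(identify_language({"de": 0.82, "en": 0.41, "ja": 0.37}))  # -> de
print(identify_language({"de": 0.30, "en": 0.28, "ja": 0.25}))  # -> None
```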

T. Schultz - One of the best experts on this subject based on the ideXlab platform.

  • Confidence measure based Language Identification
    2000 IEEE International Conference on Acoustics Speech and Signal Processing. Proceedings (Cat. No.00CH37100), 2000
    Co-Authors: Florian Metze, Thomas Kemp, Thomas Schaaf, T. Schultz, Hagen Soltau
    Abstract:

    In this paper we present a new application for confidence measures in spoken language processing. In today's computerized dialogue systems, language identification (LID) is typically achieved via dedicated modules. In our approach, LID is integrated into the speech recognizer, thereby profiting from high-level linguistic knowledge at very little extra cost. Our new approach is based on a word-lattice-based confidence measure (Kemp and Schaaf, 1997), which was originally devised for unsupervised training. In this work, we show that the confidence-based language identification algorithm outperforms conventional score-based methods. This method is also less dependent on the acoustic characteristics of the transmission channel than score-based methods. By introducing additional parameters, unknown languages can be rejected. The proposed method is compared to a score-based approach on the Verbmobil database, a three-language task.

  • LVCSR-based Language Identification
    1996 IEEE International Conference on Acoustics Speech and Signal Processing Conference Proceedings, 1996
    Co-Authors: T. Schultz, I. Rogina, A. Waibel
    Abstract:

    Automatic language identification is an important problem in building multilingual speech recognition and understanding systems. Building a language identification module for four languages, we studied the influence of applying different levels of knowledge sources on a large vocabulary continuous speech recognition (LVCSR) approach, i.e. phonetic, phonotactic, lexical, and syntactic-semantic knowledge. The resulting language identification (LID) module can identify spontaneous speech input and can be used as a front end for the multilingual speech-to-speech translation system JANUS-II. A comparison of five LID systems showed that incorporating lexical and linguistic knowledge reduces the language identification error on the two-language tests by up to 50%. Based on these results, we built a LID module for German, English, Spanish, and Japanese which yields an 84% identification rate on the spontaneous scheduling task (SST).

Haizhou Li - One of the best experts on this subject based on the ideXlab platform.

  • Using Language cluster models in hierarchical Language Identification
    Speech Communication, 2018
    Co-Authors: Saad Irtza, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li
    Abstract:

    Hierarchical language identification systems can be employed to take advantage of similarities and disparities between languages, organizing them into clusters and decomposing the language identification problem into a tree of potentially simpler sub-problems of language group identification. In this paper, a novel approach is proposed to incorporate knowledge of the language clusters into the front-ends of the classification systems employed at each node of a hierarchical language identification system. This approach investigates the use of feature representations tuned to the particular language cluster identification sub-problem at each node. In addition, we explore a novel decision strategy that incorporates information about language cluster model memberships into the front-ends at each node. Experimental results included in this paper demonstrate that both approaches lead to improved language identification performance of the overall hierarchical system on the NIST LRE 2015 database.
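
The tree decomposition itself can be sketched as a walk from the root cluster down to a leaf language, with a stand-in classifier at each node. The cluster layout, feature dict, and rule-based classifiers below are hypothetical; the paper's per-node front-ends are far richer:

```python
class Node:
    """An internal node classifies among its children (language
    clusters or languages); a leaf is a single language."""
    def __init__(self, name, children=(), classify=None):
        self.name = name
        self.children = list(children)
        self.classify = classify  # fn(features) -> name of a child

def identify(root, features):
    """Descend the hierarchy until a leaf language is reached."""
    node = root
    while node.children:
        choice = node.classify(features)
        node = next(c for c in node.children if c.name == choice)
    return node.name

# Toy two-level hierarchy with rule-based stand-in classifiers:
germanic = Node("germanic", [Node("en"), Node("de")],
                classify=lambda f: f["germanic_best"])
romance = Node("romance", [Node("fr"), Node("es")],
               classify=lambda f: f["romance_best"])
root = Node("root", [germanic, romance],
            classify=lambda f: f["cluster"])

print(identify(root, {"cluster": "germanic", "germanic_best": "de"}))  # -> de
```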

  • End-to-End Hierarchical Language Identification System
    2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2018
    Co-Authors: Saad Irtza, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li
    Abstract:

    Recently, hierarchical language identification systems have shown significant improvement over single-level systems in both closed- and open-set language identification tasks. However, developing such a system requires the feature and classifier selection at each node in the hierarchical structure to be hand-crafted. Motivated by the superior ability of end-to-end deep neural network architectures to jointly optimize the feature extraction and classification process, we propose a novel approach to developing an end-to-end hierarchical language identification system. The proposed approach also demonstrates the in-built ability of end-to-end hierarchical structure training to model an out-of-set language, without using any additional out-of-set language training data. Experiments are conducted on the NIST LRE 2015 data set. The overall results show relative improvements of 18.6% and 27.3% in terms of Cavg in closed- and open-set tasks over the corresponding baseline systems.
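
Cavg, the figure of merit quoted here, is NIST LRE's average per-target detection cost. A deliberately simplified sketch of the metric (unit costs, equal priors; not the official LRE 2015 scoring script):

```python
def c_avg(miss_rate, fa_rate, p_target=0.5):
    """Average detection cost over target languages.

    miss_rate: {lang: P(miss)} for each target language.
    fa_rate:   {lang: {nontarget: P(false alarm)}}.
    Per-target cost is p_target * Pmiss plus (1 - p_target)
    times the mean false-alarm rate against the non-targets.
    """
    costs = []
    for lang, pmiss in miss_rate.items():
        mean_pfa = sum(fa_rate[lang].values()) / len(fa_rate[lang])
        costs.append(p_target * pmiss + (1 - p_target) * mean_pfa)
    return sum(costs) / len(costs)

# Toy error rates for a two-language closed-set task:
cost = c_avg({"en": 0.1, "de": 0.2},
             {"en": {"de": 0.05}, "de": {"en": 0.15}})
print(round(cost, 3))  # -> 0.125
```

Lower is better: a system that never misses and never false-alarms scores 0.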

  • Language Identification: A Tutorial
    IEEE Circuits and Systems Magazine, 2011
    Co-Authors: Eliathamby Ambikairajah, Haizhou Li, Liang Wang, Vidhyasaharan Sethu
    Abstract:

    This tutorial presents an overview of the progression of spoken language identification (LID) systems and current developments. The introduction provides a background on automatic language identification systems using syntactic, morphological, and in particular, acoustic, phonetic, phonotactic, and prosodic level information. Different front-end features that are used in LID systems are presented. Several normalization and language modelling techniques are also presented. We also discuss different LID system architectures that embrace a variety of front-ends and back-ends, and configurations such as hierarchical and fusion classifiers. Evaluations of the LID systems are presented using NIST language recognition evaluation tasks.

  • ICASSP - Tuning phone decoders for Language Identification
    2010 IEEE International Conference on Acoustics Speech and Signal Processing, 2010
    Co-Authors: C. Santhosh Kumar, Haizhou Li, Rong Tong, Pavel Matejka, Lukas Burget, Jan Cernocky
    Abstract:

    The phonotactic approach, in which phone recognition is followed by language modeling, is one of the most popular approaches to language identification (LID). In this work, we explore how the language identification accuracy of a phone decoder can be enhanced by varying the acoustic resolution of the phone decoder, and subsequently how multiresolution versions of the same decoder can be integrated to improve LID accuracy. We use mutual information to select the optimum set of phones for a specific acoustic resolution. Further, we propose strategies for building multilingual systems suitable for LID applications, and subsequently fine-tune these systems to enhance the overall accuracy.
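
The phonotactic pipeline named in the first sentence, decode speech into a phone string, then score it with per-language phone n-gram models, can be outlined as follows. The phone strings below are invented toy data, not output of the paper's decoders:

```python
import math
from collections import Counter

def phone_bigram_model(phone_strings):
    """Add-one-smoothed bigram model over phone tokens; returns a
    scoring function for new space-separated phone sequences."""
    uni, big = Counter(), Counter()
    for s in phone_strings:
        toks = s.split()
        uni.update(toks)
        big.update(zip(toks, toks[1:]))
    v = len(uni)
    def logprob(seq):
        toks = seq.split()
        return sum(math.log((big[a, b] + 1) / (uni[a] + v))
                   for a, b in zip(toks, toks[1:]))
    return logprob

def prlm_identify(models, phone_seq):
    """models: {lang: scoring fn trained on that language's phone
    strings}. One phone recognizer decodes the utterance; the
    language model giving the best score wins."""
    return max(models, key=lambda lang: models[lang](phone_seq))

# Invented toy phone strings (not real decoder output):
models = {"en": phone_bigram_model(["dh ax k ae t", "dh ax d ao g"]),
          "es": phone_bigram_model(["e l p e r o", "l a k a s a"])}
print(prlm_identify(models, "dh ax g"))  # -> en
```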

Eliathamby Ambikairajah - One of the best experts on this subject based on the ideXlab platform.

  • Using Language cluster models in hierarchical Language Identification
    Speech Communication, 2018
    Co-Authors: Saad Irtza, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li
    Abstract:

    Hierarchical language identification systems can be employed to take advantage of similarities and disparities between languages, organizing them into clusters and decomposing the language identification problem into a tree of potentially simpler sub-problems of language group identification. In this paper, a novel approach is proposed to incorporate knowledge of the language clusters into the front-ends of the classification systems employed at each node of a hierarchical language identification system. This approach investigates the use of feature representations tuned to the particular language cluster identification sub-problem at each node. In addition, we explore a novel decision strategy that incorporates information about language cluster model memberships into the front-ends at each node. Experimental results included in this paper demonstrate that both approaches lead to improved language identification performance of the overall hierarchical system on the NIST LRE 2015 database.

  • End-to-End Hierarchical Language Identification System
    2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2018
    Co-Authors: Saad Irtza, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li
    Abstract:

    Recently, hierarchical language identification systems have shown significant improvement over single-level systems in both closed- and open-set language identification tasks. However, developing such a system requires the feature and classifier selection at each node in the hierarchical structure to be hand-crafted. Motivated by the superior ability of end-to-end deep neural network architectures to jointly optimize the feature extraction and classification process, we propose a novel approach to developing an end-to-end hierarchical language identification system. The proposed approach also demonstrates the in-built ability of end-to-end hierarchical structure training to model an out-of-set language, without using any additional out-of-set language training data. Experiments are conducted on the NIST LRE 2015 data set. The overall results show relative improvements of 18.6% and 27.3% in terms of Cavg in closed- and open-set tasks over the corresponding baseline systems.

  • Advances in Feature Extraction and Modelling for Short Duration Language Identification
    2018 IEEE International Conference on Information and Automation for Sustainability (ICIAfS), 2018
    Co-Authors: Sarith Fernando, Saad Irtza, Vidhyasaharan Sethu, Eliathamby Ambikairajah
    Abstract:

    This paper presents an overview of the progression of short-duration spoken language identification systems and current developments. It reviews different language identification architectures, including single-level, hierarchical, fully connected, convolutional neural network, and bidirectional long short-term memory architectures. The work presented in this paper aims to identify the most effective language identification architecture, front-end, and back-end for the short-duration classification task. Specifically, the use of frequency domain linear prediction features is proposed with bidirectional long short-term memory recurrent neural networks for language identification, which aims to model temporal dependencies between past and future frame-based features in short utterances. It shows significant improvements of 26.1% in terms of Cavg compared to a state-of-the-art i-vector approach. Evaluations of the LID systems are presented using AP17-OLR language recognition evaluation tasks.

  • Language Identification: A Tutorial
    IEEE Circuits and Systems Magazine, 2011
    Co-Authors: Eliathamby Ambikairajah, Haizhou Li, Liang Wang, Vidhyasaharan Sethu
    Abstract:

    This tutorial presents an overview of the progression of spoken language identification (LID) systems and current developments. The introduction provides a background on automatic language identification systems using syntactic, morphological, and in particular, acoustic, phonetic, phonotactic, and prosodic level information. Different front-end features that are used in LID systems are presented. Several normalization and language modelling techniques are also presented. We also discuss different LID system architectures that embrace a variety of front-ends and back-ends, and configurations such as hierarchical and fusion classifiers. Evaluations of the LID systems are presented using NIST language recognition evaluation tasks.