Asian Languages

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 9849 Experts worldwide, ranked by the ideXlab platform

Sivaji Bandyopadhyay - One of the best experts on this subject based on the ideXlab platform.

  • A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi
    Linguistic Issues in Language Technology, 2009
    Co-Authors: Asif Ekbal, Sivaji Bandyopadhyay
    Abstract:

    This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian Languages, namely Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system makes use of different types of contextual information along with a variety of features that are helpful in predicting the different named entity (NE) classes. This set of features includes language independent as well as language dependent components. We have used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We have considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments have been carried out to find the most suitable features for NER in Bengali and Hindi. The system has been tested with gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation yields overall f-score values of 81.15% for Bengali and 78.29% for Hindi on the test sets. 10-fold cross validation tests yield f-score values of 83.89% for Bengali and 80.93% for Hindi. ANOVA analysis is performed to show that the performance improvement due to the use of language dependent features is statistically significant.
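
    The abstract above mentions contextual information and language-independent features (such as affixes and digit patterns) feeding the CRF. As a rough illustration only, not the paper's actual feature set, a token-level feature extractor of this kind might look like the following; the window size and feature names here are assumptions for the sketch.

    ```python
    def token_features(tokens, i):
        """Feature dict for tokens[i] using a +/-1 context window.

        Illustrative, language-independent features: the word itself,
        short prefixes/suffixes, a digit test (useful for number
        expressions), and the neighboring tokens as context.
        """
        word = tokens[i]
        return {
            "word": word,
            "prefix3": word[:3],           # affix features need no language knowledge
            "suffix3": word[-3:],
            "is_digit": word.isdigit(),    # helps flag number/time expressions
            "prev": tokens[i - 1] if i > 0 else "<BOS>",
            "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        }

    sent = ["Asif", "Ekbal", "visited", "Kolkata", "in", "2008"]
    feats = token_features(sent, 3)   # features for "Kolkata"
    # feats["prev"] -> "visited"
    ```

    In a CRF pipeline, one such dictionary per token would be passed to the sequence labeler as the observation features.
    
    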

  • Language Independent Named Entity Recognition in Indian Languages
    International Joint Conference on Natural Language Processing, 2008
    Co-Authors: Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka, Sivaji Bandyopadhyay
    Abstract:

    This paper reports on the development of a Named Entity Recognition (NER) system for South and South East Asian Languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.

G Navarro - One of the best experts on this subject based on the ideXlab platform.

  • Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences
    ACM Computing Surveys, 2014
    Co-Authors: G Navarro
    Abstract:

    Document retrieval is one of the best-established information retrieval activities since the ’60s, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian Languages and other scenarios where the “natural language” assumptions do not hold. In this survey, we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
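
    To make the contrast concrete: the inverted-index paradigm the abstract refers to presumes that text splits cleanly into words, which is exactly what fails for scripts without word delimiters. A minimal sketch of such an index (toy code, not from the survey) could be:

    ```python
    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the sorted list of document ids containing it.

        The whitespace split below is the "natural language" assumption:
        it breaks down for scripts that do not delimit words with spaces.
        """
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    def query(index, *terms):
        """Ids of documents containing all query terms (conjunctive query)."""
        postings = [set(index.get(t, [])) for t in terms]
        return sorted(set.intersection(*postings)) if postings else []

    docs = ["document retrieval on sequences",
            "retrieval of text documents",
            "suffix trees and strings"]
    idx = build_inverted_index(docs)
    # query(idx, "retrieval") -> [0, 1]
    ```

    Sequence-based document retrieval, the survey's subject, replaces the term-to-postings map with structures (e.g. over suffix trees) that work on arbitrary sequences without a tokenization step.
    
    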

  • Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences
    arXiv: Information Retrieval, 2013
    Co-Authors: G Navarro
    Abstract:

    Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to "natural language" text collections, where inverted indices are the preferred solution. As successful as this paradigm has been, it fails to properly handle some East Asian Languages and other scenarios where the "natural language" assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many others. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and others.

Barbara Forsyth - One of the best experts on this subject based on the ideXlab platform.

  • Translation of a Tobacco Survey into Spanish and Asian Languages: The Tobacco Use Supplement to the Current Population Survey
    Nicotine & Tobacco Research, 2008
    Co-Authors: Gordon Willis, Anne M Hartman, Deirdre Lawrence, Martha Stapleton Kudela, Kerry Y Levin, Barbara Forsyth
    Abstract:

    Because of the vital need to attain cross-cultural comparability of estimates of tobacco use across subgroups of the U.S. population that differ in primary language use, the National Cancer Institute (NCI) Tobacco Use Special Cessation Supplement to the Current Population Survey (TUSCS-CPS) was translated into Spanish, Chinese (Mandarin and Cantonese), Korean, Vietnamese, and Khmer (Cambodian). The questionnaire translations were extensively tested using an eight-step process that focused on both translation procedures and empirical pretesting. The resulting translations are available on the Internet at http://riskfactor.cancer.gov/studies/tus-cps/translation/questionnaires.html for tobacco researchers to use in their own surveys, either in full, or as material to be selected as appropriate. This manuscript provides information to guide researchers in accessing and using the translations, and describes the empirical procedures used to develop and pretest them (cognitive interviewing and behavior coding). We also provide recommendations concerning the further development of questionnaire translations.

Gurpreet Singh Lehal - One of the best experts on this subject based on the ideXlab platform.

  • A word segmentation system for handling space omission problem
    2013
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Word segmentation is an obligatory first task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian Languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those other Asian Languages, since space is used for word delimitation; however, the space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted for the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to independently solve the space omission problem in Urdu word segmentation. The two major components of our system are: identification of merged words for segmentation, and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision of the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.
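
    The second component described above, splitting a merged token back into words, can be sketched with a simple dictionary-driven dynamic program. This is a toy stand-in, not the paper's method: the actual system uses bilingual corpora and statistical disambiguation, whereas this sketch just finds one left-to-right split into known words.

    ```python
    def segment(merged, lexicon, max_len=20):
        """Split `merged` (a space-omission error) into lexicon words.

        best[i] holds a segmentation of merged[:i], built by extending
        any shorter segmentable prefix with a dictionary word of length
        at most max_len. Returns None when no full split exists.
        """
        n = len(merged)
        best = [None] * (n + 1)
        best[0] = []                       # empty string segments trivially
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                word = merged[j:i]
                if best[j] is not None and word in lexicon:
                    best[i] = best[j] + [word]
                    break                  # one split suffices for the sketch
        return best[n]

    lexicon = {"word", "segmentation", "space", "omission"}
    segment("wordsegmentation", lexicon)   # -> ['word', 'segmentation']
    ```

    A real system would score the candidate splits (here, the paper uses target-language statistics from the Hindi side of the bilingual corpus) rather than accepting the first one found.
    
    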

  • A Word Segmentation System for Handling Space Omission Problem in Urdu Script
    Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing, 2010
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Word segmentation is an obligatory first task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian Languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those other Asian Languages, since space is used for word delimitation; however, the space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted for the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to independently solve the space omission problem in Urdu word segmentation. The two major components of our system are: identification of merged words for segmentation, and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision of the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.

Asif Ekbal - One of the best experts on this subject based on the ideXlab platform.

  • A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi
    Linguistic Issues in Language Technology, 2009
    Co-Authors: Asif Ekbal, Sivaji Bandyopadhyay
    Abstract:

    This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian Languages, namely Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system makes use of different types of contextual information along with a variety of features that are helpful in predicting the different named entity (NE) classes. This set of features includes language independent as well as language dependent components. We have used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We have considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments have been carried out to find the most suitable features for NER in Bengali and Hindi. The system has been tested with gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation yields overall f-score values of 81.15% for Bengali and 78.29% for Hindi on the test sets. 10-fold cross validation tests yield f-score values of 83.89% for Bengali and 80.93% for Hindi. ANOVA analysis is performed to show that the performance improvement due to the use of language dependent features is statistically significant.

  • Language Independent Named Entity Recognition in Indian Languages
    International Joint Conference on Natural Language Processing, 2008
    Co-Authors: Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka, Sivaji Bandyopadhyay
    Abstract:

    This paper reports on the development of a Named Entity Recognition (NER) system for South and South East Asian Languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.