Asian Languages

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 9849 Experts worldwide, ranked by the ideXlab platform

Sivaji Bandyopadhyay - One of the best experts on this subject based on the ideXlab platform.

  • A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi
    Linguistic Issues in Language Technology, 2009
    Co-Authors: Asif Ekbal, Sivaji Bandyopadhyay
    Abstract:

    This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian Languages, namely Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system makes use of different types of contextual information along with a variety of features that are helpful in predicting the different named entity (NE) classes. This set of features includes language independent as well as language dependent components. We have used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We have considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments have been carried out to find the most suitable features for NER in Bengali and Hindi. The system has been tested with gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation yields overall f-score values of 81.15% for Bengali and 78.29% for Hindi on the test sets. 10-fold cross validation tests yield f-score values of 83.89% for Bengali and 80.93% for Hindi. ANOVA analysis is performed to show that the performance improvement due to the use of language dependent features is statistically significant.
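
    The abstract above mentions contextual information and language-independent features (such as affixes and digit patterns) feeding the CRF. As a rough illustration only, not the paper's actual feature set, a token-level feature extractor of this kind might look like the following; the window size and feature names here are assumptions for the sketch.

    ```python
    def token_features(tokens, i):
        """Feature dict for tokens[i] using a +/-1 context window.

        Illustrative, language-independent features: the word itself,
        short prefixes/suffixes, a digit test (useful for number
        expressions), and the neighboring tokens as context.
        """
        word = tokens[i]
        return {
            "word": word,
            "prefix3": word[:3],           # affix features need no language knowledge
            "suffix3": word[-3:],
            "is_digit": word.isdigit(),    # helps flag number/time expressions
            "prev": tokens[i - 1] if i > 0 else "<BOS>",
            "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        }

    sent = ["Asif", "Ekbal", "visited", "Kolkata", "in", "2008"]
    feats = token_features(sent, 3)   # features for "Kolkata"
    # feats["prev"] -> "visited"
    ```

    In a CRF pipeline, one such dictionary per token would be passed to the sequence labeler as the observation features.
    
    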

  • Language Independent Named Entity Recognition in Indian Languages
    International Joint Conference on Natural Language Processing, 2008
    Co-Authors: Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka, Sivaji Bandyopadhyay
    Abstract:

    This paper reports on the development of a Named Entity Recognition (NER) system for South and South East Asian Languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.

G Navarro - One of the best experts on this subject based on the ideXlab platform.

  • Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences
    ACM Computing Surveys, 2014
    Co-Authors: G Navarro
    Abstract:

    Document retrieval is one of the best-established information retrieval activities since the ’60s, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian Languages and other scenarios where the “natural language” assumptions do not hold. In this survey, we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
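
    To make the contrast concrete: the inverted-index paradigm the abstract refers to presumes that text splits cleanly into words, which is exactly what fails for scripts without word delimiters. A minimal sketch of such an index (toy code, not from the survey) could be:

    ```python
    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the sorted list of document ids containing it.

        The whitespace split below is the "natural language" assumption:
        it breaks down for scripts that do not delimit words with spaces.
        """
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    def query(index, *terms):
        """Ids of documents containing all query terms (conjunctive query)."""
        postings = [set(index.get(t, [])) for t in terms]
        return sorted(set.intersection(*postings)) if postings else []

    docs = ["document retrieval on sequences",
            "retrieval of text documents",
            "suffix trees and strings"]
    idx = build_inverted_index(docs)
    # query(idx, "retrieval") -> [0, 1]
    ```

    Sequence-based document retrieval, the survey's subject, replaces the term-to-postings map with structures (e.g. over suffix trees) that work on arbitrary sequences without a tokenization step.
    
    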

  • Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences
    arXiv: Information Retrieval, 2013
    Co-Authors: G Navarro
    Abstract:

    Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to "natural language" text collections, where inverted indices are the preferred solution. As successful as this paradigm has been, it fails to properly handle some East Asian Languages and other scenarios where the "natural language" assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many others. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and others.

Barbara Forsyth - One of the best experts on this subject based on the ideXlab platform.

  • Translation of a Tobacco Survey into Spanish and Asian Languages: The Tobacco Use Supplement to the Current Population Survey
    Nicotine & Tobacco Research, 2008
    Co-Authors: Gordon Willis, Anne M Hartman, Deirdre Lawrence, Martha Stapleton Kudela, Kerry Y Levin, Barbara Forsyth
    Abstract:

    Because of the vital need to attain cross-cultural comparability of estimates of tobacco use across subgroups of the U.S. population that differ in primary language use, the National Cancer Institute (NCI) Tobacco Use Special Cessation Supplement to the Current Population Survey (TUSCS-CPS) was translated into Spanish, Chinese (Mandarin and Cantonese), Korean, Vietnamese, and Khmer (Cambodian). The questionnaire translations were extensively tested using an eight-step process that focused on both translation procedures and empirical pretesting. The resulting translations are available on the Internet at http://riskfactor.cancer.gov/studies/tus-cps/translation/questionnaires.html for tobacco researchers to use in their own surveys, either in full, or as material to be selected as appropriate. This manuscript provides information to guide researchers in accessing and using the translations, and describes the empirical procedures used to develop and pretest them (cognitive interviewing and behavior coding). We also provide recommendations concerning the further development of questionnaire translations.

Gurpreet Singh Lehal - One of the best experts on this subject based on the ideXlab platform.

  • A word segmentation system for handling space omission problem
    2013
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Word segmentation is an obligatory first task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian Languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those other Asian Languages, since space is used for word delimitation; however, the space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted for the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to independently solve the space omission problem in Urdu word segmentation. The two major components of our system are: identification of merged words for segmentation, and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision of the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.
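
    The second component described above, splitting a merged token back into words, can be sketched with a simple dictionary-driven dynamic program. This is a toy stand-in, not the paper's method: the actual system uses bilingual corpora and statistical disambiguation, whereas this sketch just finds one left-to-right split into known words.

    ```python
    def segment(merged, lexicon, max_len=20):
        """Split `merged` (a space-omission error) into lexicon words.

        best[i] holds a segmentation of merged[:i], built by extending
        any shorter segmentable prefix with a dictionary word of length
        at most max_len. Returns None when no full split exists.
        """
        n = len(merged)
        best = [None] * (n + 1)
        best[0] = []                       # empty string segments trivially
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                word = merged[j:i]
                if best[j] is not None and word in lexicon:
                    best[i] = best[j] + [word]
                    break                  # one split suffices for the sketch
        return best[n]

    lexicon = {"word", "segmentation", "space", "omission"}
    segment("wordsegmentation", lexicon)   # -> ['word', 'segmentation']
    ```

    A real system would score the candidate splits (here, the paper uses target-language statistics from the Hindi side of the bilingual corpus) rather than accepting the first one found.
    
    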

  • A Word Segmentation System for Handling Space Omission Problem in Urdu Script
    Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing, 2010
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Word segmentation is an obligatory first task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian Languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those other Asian Languages, since space is used for word delimitation; however, the space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted for the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to independently solve the space omission problem in Urdu word segmentation. The two major components of our system are: identification of merged words for segmentation, and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision of the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.

Asif Ekbal - One of the best experts on this subject based on the ideXlab platform.

  • A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi
    Linguistic Issues in Language Technology, 2009
    Co-Authors: Asif Ekbal, Sivaji Bandyopadhyay
    Abstract:

    This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian Languages, namely Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system makes use of different types of contextual information along with a variety of features that are helpful in predicting the different named entity (NE) classes. This set of features includes language independent as well as language dependent components. We have used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We have considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments have been carried out to find the most suitable features for NER in Bengali and Hindi. The system has been tested with gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation yields overall f-score values of 81.15% for Bengali and 78.29% for Hindi on the test sets. 10-fold cross validation tests yield f-score values of 83.89% for Bengali and 80.93% for Hindi. ANOVA analysis is performed to show that the performance improvement due to the use of language dependent features is statistically significant.

  • Language Independent Named Entity Recognition in Indian Languages
    International Joint Conference on Natural Language Processing, 2008
    Co-Authors: Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka, Sivaji Bandyopadhyay
    Abstract:

    This paper reports on the development of a Named Entity Recognition (NER) system for South and South East Asian Languages, particularly for Bengali, Hindi, Telugu, Oriya and Urdu, as part of the IJCNLP-08 NER Shared Task.