The Experts below are selected from a list of 9,639 Experts worldwide ranked by the ideXlab platform
Keysun Choi - One of the best experts on this subject based on the ideXlab platform.
-
A Machine Transliteration Model Based on Correspondence between Graphemes and Phonemes
ACM Transactions on Asian Language Information Processing, 2006
Co-Authors: Keysun Choi, Hitoshi Isahara
Abstract: Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. There has been growing interest in the use of machine transliteration to assist machine translation and information retrieval. Three types of machine transliteration models (grapheme-based, phoneme-based, and hybrid) have been proposed. Surprisingly, there have been few reports of efforts to utilize the correspondence between source graphemes and source phonemes, although this correspondence plays an important role in machine transliteration. Furthermore, little work has been reported on ways to dynamically handle source graphemes and phonemes. In this paper, we propose a transliteration model that dynamically uses both graphemes and phonemes, particularly the correspondence between them. With this model, we have achieved better performance (improvements of about 15 to 41% in English-to-Korean transliteration and about 16 to 44% in English-to-Japanese transliteration) than has been reported for other models.
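The grapheme-phoneme correspondence idea can be sketched as a lookup keyed on the *pair* rather than on the grapheme alone. This is a minimal illustration, not the paper's statistical model; the `CORR` table, the ARPAbet-style phoneme labels, and the Korean jamo outputs are invented for the example.

```python
# Toy correspondence-based lookup: the target grapheme is chosen from the
# pair (source grapheme, source phoneme), so 'c' maps differently when it
# is pronounced /s/ (as in "city") versus /k/ (as in "cat").
CORR = {
    ("c", "S"): "ㅅ",   # 'c' pronounced /s/ -> Korean sieut
    ("c", "K"): "ㅋ",   # 'c' pronounced /k/ -> Korean kieuk
    ("a", "AE"): "ㅐ",
    ("t", "T"): "ㅌ",
    ("i", "IH"): "ㅣ",
}

def transliterate(graphemes, phonemes):
    """Map pre-aligned (grapheme, phoneme) pairs to target graphemes."""
    return "".join(CORR[(g, p)] for g, p in zip(graphemes, phonemes))
```

For instance, `transliterate("cat", ["K", "AE", "T"])` yields one jamo string, while the same letter `c` paired with phoneme `"S"` yields a different target grapheme, which is exactly the distinction a grapheme-only model cannot make.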
-
A Comparison of Different Machine Transliteration Models
Journal of Artificial Intelligence Research, 2006
Co-Authors: Keysun Choi, Hitoshi Isahara
Abstract: Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models (grapheme-based, phoneme-based, hybrid, and correspondence-based) have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance.
-
An English-Korean Transliteration Model Using Pronunciation and Contextual Rules
International Conference on Computational Linguistics, 2002
Co-Authors: Keysun Choi
Abstract: There has been increasing interest in English-Korean (E-K) transliteration recently. Previous work focused mainly on directly converting English letters into Korean letters. In this paper, we present an E-K transliteration model using pronunciation and contextual rules. Unlike previous work, our method uses phonetic information such as phonemes and their context. We also use word-formation information, such as whether an English word is of Greek origin. With these, our method shows a significant performance increase of about 31% in word accuracy.
-
Automatic Transliteration and Back-Transliteration by Decision Tree Learning
Language Resources and Evaluation, 2000
Co-Authors: Byungju Kang, Keysun Choi
Abstract: Automatic transliteration and back-transliteration across languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic, and English/Chinese, have practical importance in machine translation, crosslingual information retrieval, and automatic bilingual dictionary compilation. In this paper, a bi-directional and largely language-independent methodology for English/Korean transliteration and back-transliteration is described. Our method is composed of character alignment and decision tree learning. We induce transliteration rules for each English letter and back-transliteration rules for each Korean letter. Training the decision trees requires a large set of labeled transliteration and back-transliteration examples, but such resources are generally not available. Our character alignment algorithm is capable of aligning an English word and its Korean transliteration in the desired way with high accuracy.
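The character-alignment component can be approximated with classic edit-distance dynamic programming over the two character strings. This is a hedged sketch, not the authors' alignment algorithm; the `align` function and the example words are illustrative only.

```python
def align(src, tgt):
    """Align two words with edit-distance DP; returns (cost, pairs),
    where '-' marks a null on one side (a character with no counterpart)."""
    n, m = len(src), len(tgt)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (src[i-1] != tgt[j-1]),  # (mis)match
                          D[i-1][j] + 1,                          # deletion
                          D[i][j-1] + 1)                          # insertion
    # backtrace to recover the aligned character pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (src[i-1] != tgt[j-1]):
            pairs.append((src[i-1], tgt[j-1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            pairs.append((src[i-1], "-")); i -= 1
        else:
            pairs.append(("-", tgt[j-1])); j -= 1
    return D[n][m], pairs[::-1]
```

The aligned pairs (e.g. from `align("data", "deita")`) would then serve as training instances for per-letter decision trees.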
A Kumaran - One of the best experts on this subject based on the ideXlab platform.
-
Improving Cross-Language Information Retrieval by Transliteration Mining and Generation
2020
Co-Authors: K Saravanan, Raghavendra Udupa, A Kumaran
Abstract: The retrieval performance of Cross-Language Information Retrieval (CLIR) systems is a function of the coverage of the translation lexicon they use. Unfortunately, most translation lexicons do not provide good coverage of proper nouns and common nouns, which are often the most information-bearing terms in a query. As a consequence, many queries cannot be translated without a substantial loss of information, and the retrieval performance of the CLIR system is less than satisfactory for those queries. However, proper nouns and common nouns very often appear in their transliterated forms in the target-language document collection. In this work, we study two techniques that leverage this fact: Transliteration Mining and Transliteration Generation. The first technique attempts to mine the transliterations of out-of-vocabulary query terms from the document collection, whereas the second generates the transliterations. We systematically study the effectiveness of both techniques in the context of the Hindi-English and Tamil-English ad hoc retrieval tasks at FIRE 2010. The results of our study show that both techniques are effective in addressing the problem posed by out-of-vocabulary terms, with Transliteration Mining giving better results than Transliteration Generation.
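A rough sketch of the mining side, under the simplifying assumption that both the OOV query term and the document terms are already romanized into one script: rank candidate terms by string similarity and keep those above a threshold. The function name, the threshold, and the use of `difflib` similarity are all assumptions; the actual FIRE systems use learned cross-script models.

```python
from difflib import SequenceMatcher

def mine_transliterations(oov_term, doc_terms, threshold=0.7):
    """Return document terms whose surface similarity to the OOV term
    exceeds a threshold, ranked from most to least similar."""
    scored = [(t, SequenceMatcher(None, oov_term, t).ratio()) for t in doc_terms]
    return [t for t, r in sorted(scored, key=lambda x: -x[1]) if r >= threshold]
```

With a toy collection, `mine_transliterations("kolkata", ["cricket", "kolkatta", "calcutta"])` keeps the spelling variant and rejects the unrelated term.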
-
Report of NEWS 2012 Machine Transliteration Shared Task
Meeting of the Association for Computational Linguistics, 2012
Co-Authors: Min Zhang, A Kumaran, Ming Liu
Abstract: This report documents the Machine Transliteration Shared Task conducted as a part of the Named Entities Workshop (NEWS 2012), an ACL 2012 workshop. The shared task features machine transliteration of proper names from English to 11 languages and from 3 languages to English, for 14 tasks in total. 7 teams participated in the evaluations, submitting 57 standard and 1 non-standard runs in which diverse transliteration methodologies are explored and reported on the evaluation data. We report the results with 4 performance metrics. We believe the shared task has achieved its objective of providing a common benchmarking platform for the research community to evaluate state-of-the-art technologies, benefiting future research and development.
-
Compositional Machine Transliteration
ACM Transactions on Asian Language Information Processing, 2010
Co-Authors: A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Abstract: Machine transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation or crosslingual information retrieval systems. In this article, we propose compositional machine transliteration systems, in which multiple transliteration components may be composed either to improve existing transliteration quality, or to enable transliteration functionality between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual transliteration components, say, X → Y and Y → Z systems, to provide transliteration functionality X → Z. In parallel composition, evidence from multiple transliteration paths from X to Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine transliteration framework in English and a set of Indian languages, namely, Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct transliteration system.
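The two forms of composition can be sketched over toy candidate distributions; all names and probabilities below are hypothetical, and the real systems operate over full transliteration models rather than dictionaries.

```python
def serial(xy, yz, x):
    """Chain X->Y and Y->Z candidate distributions:
    P(z|x) = sum over y of P(y|x) * P(z|y)."""
    out = {}
    for y, p in xy.get(x, {}).items():
        for z, q in yz.get(y, {}).items():
            out[z] = out.get(z, 0.0) + p * q
    return out

def parallel(*dists):
    """Aggregate evidence from several transliteration paths into Z
    by averaging each candidate's score across the paths."""
    out = {}
    for d in dists:
        for z, p in d.items():
            out[z] = out.get(z, 0.0) + p / len(dists)
    return out
```

For example, chaining `{"x": {"y1": 0.6, "y2": 0.4}}` with `{"y1": {"z": 1.0}, "y2": {"z": 0.5, "w": 0.5}}` concentrates mass on `z`, and `parallel` can then combine this composed path with a direct X → Z system.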
-
Report of NEWS 2010 Transliteration Mining Shared Task
Meeting of the Association for Computational Linguistics, 2010
Co-Authors: A Kumaran, Mitesh M Khapra
Abstract: This report documents the details of the Transliteration Mining Shared Task that was run as a part of the Named Entities Workshop (NEWS 2010), an ACL 2010 workshop. The shared task featured mining of name transliterations from paired Wikipedia titles in 5 different language pairs, specifically between English and one of Arabic, Chinese, Hindi, Russian, and Tamil. In total, 5 groups took part, participating in multiple mining tasks in different language pairs. The methodology and the data sets used in this shared task are published in the Shared Task White Paper [Kumaran et al., 2010]. We measure and report 3 metrics on the submitted results to calibrate the performance of individual systems on a commonly available Wikipedia dataset. We believe that the significant contribution of this shared task is in (i) assembling a diverse set of participants working in the area of transliteration mining, (ii) establishing baseline performance of transliteration mining systems in a set of diverse languages using commonly available Wikipedia data, and (iii) providing a basis for meaningful comparison and analysis of trade-offs between the various algorithmic approaches used in mining. We believe that this shared task complements the NEWS 2010 transliteration generation shared task in enabling the development of practical systems with a small amount of seed data in a given pair of languages.
-
Whitepaper of NEWS 2010 Shared Task on Transliteration Mining
Meeting of the Association for Computational Linguistics, 2010
Co-Authors: A Kumaran, Mitesh M Khapra
Abstract: Transliteration is generally defined as the phonetic translation of names across languages. Machine transliteration is a critical technology in many domains, such as machine translation and cross-language information retrieval/extraction. Recent research has shown that high-quality machine transliteration systems may be developed in a language-neutral manner, using a reasonably sized, good-quality corpus (~15-25K parallel names) between a given pair of languages. In this shared task, we focus on the acquisition of such good-quality name corpora in many languages, thus complementing the machine transliteration shared task conducted concurrently in the same NEWS 2010 workshop. Specifically, this task focuses on mining the Wikipedia paired-entities data (a.k.a. inter-wiki links) to produce high-quality transliteration data that may be used for transliteration tasks.
Philippe Grange - One of the best experts on this subject based on the ideXlab platform.
-
Training Schemes for the Transliteration of the Balinese Script into the Latin Script on Palm Leaf Manuscript Images
International Conference on Frontiers in Handwriting Recognition, 2018
Co-Authors: Made Windu Antara Kesiman, Jean-christophe Burie, Jean Marc Ogier, Philippe Grange
Abstract: Given the importance of the contents of the Balinese palm leaf manuscripts, a transliteration system has to be developed so that these manuscripts can be read easily. The challenge comes from the fact that Balinese script is a syllabic script, and the mapping between linguistic symbols and images of symbols is not straightforward. In addition, with very limited training data available, some adaptations of LSTM in the transliteration training scheme need to be designed, analyzed, and evaluated. This paper proposes and evaluates adapted segmentation-free training schemes for the transliteration of the Balinese script into the Latin script from palm leaf manuscript images. We describe the generated synthetic dataset and the proposed training schemes at two different levels (word level and text line level) for transliterating real words and text lines from palm leaf manuscript images. For word transliteration, training schemes at the word level generally perform better than those at the text line level. By comparison, the segmentation-based transliteration method gives a very promising result. For text line transliteration, the segmentation-based method outperforms all segmentation-free training schemes on the less degraded collections, while the segmentation-free schemes help in transliterating the text lines of more degraded manuscripts. Training at the text line level with a model pre-trained at the word level can give better word transliteration results while keeping optimal performance for text line transliteration.
-
Knowledge Representation and Phonological Rules for the Automatic Transliteration of Balinese Script on Palm Leaf Manuscript
Computación Y Sistemas, 2018
Co-Authors: Made Windu Antara Kesiman, Jean-christophe Burie, Jean Marc Ogier, Philippe Grange
Abstract: Balinese ancient palm leaf manuscripts record much important knowledge about the history of world civilization. They vary from ordinary texts to Bali's most sacred writings. In reality, the majority of Balinese cannot read them, both because of language obstacles and because of a tradition that perceives handling them as sacrilege. Palm leaf manuscripts attract historians, philologists, and archaeologists seeking to discover more about ancient ways of life, but access to the content of these manuscripts is limited because of the linguistic difficulties. The Balinese palm leaf manuscripts were written in Balinese script in the Balinese language, with the ancient literary texts composed in Kawi (old Javanese) and Sanskrit. Balinese script is considered one of the most complex scripts of Southeast Asia. A transliteration engine for converting the Balinese script of palm leaf manuscripts to a Latin-based script is one of the most needed systems for collections of palm leaf manuscript images. In this paper, we present an implementation of knowledge representation and phonological rules for the automatic transliteration of Balinese script on palm leaf manuscripts. In this system, a rule-based engine for performing transliteration is proposed. Our model is based on phonetics, drawing on traditional linguistic studies of Balinese transliteration. This automatic transliteration system is needed to complete the optical character recognition (OCR) process on palm leaf manuscript images, making the manuscripts more accessible and readable to a wider audience.
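A rule-based transliteration engine of the kind described can be sketched as a cascade of ordered rewrite rules. The rules below are toy phonological rules over Latin placeholders, an assumption made for readability; the actual engine operates on Balinese code points with a much richer rule set. They encode one real property of Brahmic scripts: a consonant carries an inherent vowel /a/ unless a vowel sign or a vowel-killer mark (written `x` here) follows it.

```python
import re

# Ordered rewrite rules over a placeholder transcription (illustrative only).
RULES = [
    # a consonant receives the inherent /a/ unless a vowel sign
    # or the vowel-killer mark 'x' follows it
    (r"([kgcjtdnpbmyrlwsh])(?![aiueox])", r"\1a"),
    # the vowel-killer mark itself produces no output
    (r"x", r""),
]

def transliterate(s):
    """Apply the rewrite rules in order, each over the whole string."""
    for pattern, replacement in RULES:
        s = re.sub(pattern, replacement, s)
    return s
```

For example, `transliterate("bli")` inserts the inherent vowel after the bare consonant (`"bali"`), while a trailing killer mark suppresses it: `transliterate("blx")` gives `"bal"`.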
Helmut Schmid - One of the best experts on this subject based on the ideXlab platform.
-
Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining
Computational Linguistics, 2017
Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid, Hinrich Schutze
Abstract: We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs, i.e., noise. The model is trained on noisy unlabeled data using the EM algorithm. During training, the transliteration sub-model learns to generate transliteration pairs while the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
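The interpolation of the two sub-models can be sketched with a drastically simplified EM in which both sub-models are held fixed and only the mixture weight is re-estimated. This is an assumption for brevity: the published model also learns the transliteration sub-model's parameters, and the similarity-based likelihood and noise constant below are invented stand-ins.

```python
from difflib import SequenceMatcher

def em_mine(pairs, iters=30):
    """Toy EM over an interpolation of a transliteration sub-model and a
    fixed noise sub-model; returns each pair with its posterior probability
    of being a transliteration."""
    def p_t(a, b):                       # stand-in transliteration likelihood
        return max(SequenceMatcher(None, a, b).ratio(), 1e-6)
    p_n = 0.3                            # stand-in fixed noise likelihood
    lam = 0.5                            # mixture weight: P(transliteration)
    post = []
    for _ in range(iters):
        # E-step: posterior that each pair came from the transliteration model
        post = [lam * p_t(a, b) / (lam * p_t(a, b) + (1 - lam) * p_n)
                for a, b in pairs]
        # M-step: re-estimate the mixture weight from the posteriors
        lam = sum(post) / len(post)
    return list(zip(pairs, post))
```

On a toy list the posterior ranks a close pair above a loose one and drives an unrelated pair toward zero, which is exactly how the disambiguation step labels the unlabeled data.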
-
QCRI-MES Submission at WMT13: Using Transliteration Mining to Improve Statistical Machine Translation
Workshop on Statistical Machine Translation, 2013
Co-Authors: Hassan Sajjad, Svetlana Smekalova, Nadir Durrani, Alexander Fraser, Helmut Schmid
Abstract: This paper describes QCRI-MES's submission on the English-Russian dataset to the Eighth Workshop on Statistical Machine Translation. We generate improved word alignments of the training data by incorporating an unsupervised transliteration mining module into GIZA++ and build a phrase-based machine translation system. For tuning, we use a variation of PRO that provides better weights by optimizing BLEU+1 at the corpus level. We transliterate out-of-vocabulary words in a post-processing step, using a transliteration system built on the transliteration pairs extracted with an unsupervised transliteration mining system. For the Russian-to-English translation direction, we apply linguistically motivated pre-processing on the Russian side of the data.
-
A Statistical Model for Unsupervised and Semi-Supervised Transliteration Mining
Meeting of the Association for Computational Linguistics, 2012
Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid
Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language-pair independent, and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.
-
An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
Meeting of the Association for Computational Linguistics, 2011
Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid
Abstract: We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks, achieving improvements in both precision and recall measured against gold standard word alignments.
-
Hindi-to-Urdu Machine Translation through Transliteration
Meeting of the Association for Computational Linguistics, 2010
Co-Authors: Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid
Abstract: We present a novel approach to integrating transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas previous work used transliteration only for out-of-vocabulary (OOV) words. We use transliteration as a tool for disambiguating Hindi homonyms, which, depending on context, may be translated or transliterated, and may be transliterated differently. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model), compared to 14.30 for a baseline phrase-based system and 16.25 for a system that transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than just translating OOV words for language pairs like Hindi-Urdu.
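The homonym-disambiguation idea can be sketched by scoring every candidate for a source word, translated or transliterated alike, by its candidate probability times a target-side language model probability over the context, then taking the argmax. Every entry below is invented for illustration; the paper's models are full SMT formulations, not lookup tables.

```python
# Toy disambiguation of a Hindi homonym: "kal" can translate to
# "yesterday" or "tomorrow", or be transliterated as "kal" (e.g. a name).
CANDIDATES = {
    "kal": [("yesterday", 0.5), ("tomorrow", 0.4), ("kal", 0.1)],
}
BIGRAM = {  # toy target-language bigram language model
    ("come", "tomorrow"): 0.30,
    ("come", "yesterday"): 0.05,
    ("come", "kal"): 0.01,
}

def pick(word, prev, floor=1e-4):
    """Choose the candidate maximizing P(cand | word) * P(cand | prev)."""
    return max(CANDIDATES[word],
               key=lambda c: c[1] * BIGRAM.get((prev, c[0]), floor))[0]
```

After "come", the context term outweighs the higher lexical score, so the model prefers "tomorrow" even though "yesterday" is the more probable candidate in isolation.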
Pushpak Bhattacharyya - One of the best experts on this subject based on the ideXlab platform.
-
Leveraging Orthographic Similarity for Multilingual Neural Transliteration
Transactions of the Association for Computational Linguistics, 2018
Co-Authors: Anoop Kunchukuttan, Mitesh M Khapra, Gurneet Singh, Pushpak Bhattacharyya
Abstract: We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration). This is an instance of multitask learning, where individual tasks (langua...
-
Brahmi-Net: A Transliteration and Script Conversion System for Languages of the Indian Subcontinent
North American Chapter of the Association for Computational Linguistics, 2015
Co-Authors: Anoop Kunchukuttan, Ratish Puduppully, Pushpak Bhattacharyya
Abstract: We present Brahmi-Net, an online system for transliteration and script conversion for all major Indian language pairs (306 pairs). The system covers 13 Indo-Aryan languages, 4 Dravidian languages, and English. To train the transliteration systems, we mined parallel transliteration corpora from parallel translation corpora using an unsupervised method and trained statistical transliteration systems on the mined corpora. Languages that do not have parallel corpora are supported via transliteration through a bridge language. Our script conversion system supports conversion between all Brahmi-derived scripts as well as the ITRANS romanization scheme. For this, we leverage the coordinated Unicode ranges of Indic scripts and use an extended ITRANS encoding for transliterating between English and Indic scripts. The system also provides top-k transliterations and simultaneous transliteration into multiple output languages. We provide a Python API as well as a REST API to access these services. The API and the mined transliteration corpus are made available for research use under an open-source license.
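The coordinated Unicode ranges mentioned above allow a first-cut script converter that simply shifts code points between blocks, since many Indic letters sit at the same offset within each script's 128-code-point block. This is a sketch under that assumption; a real system like Brahmi-Net must also handle the per-script exceptions where the blocks do not line up.

```python
# Starting code points of a few Brahmi-derived Unicode blocks.
BLOCK = {"devanagari": 0x0900, "bengali": 0x0980, "tamil": 0x0B80, "kannada": 0x0C80}

def convert(text, src, tgt):
    """Convert between two Indic scripts by shifting code points between
    their Unicode blocks; characters outside the source block (spaces,
    Latin text, digits) pass through unchanged."""
    lo, delta = BLOCK[src], BLOCK[tgt] - BLOCK[src]
    return "".join(chr(ord(c) + delta) if lo <= ord(c) < lo + 0x80 else c
                   for c in text)
```

For example, the Devanagari letters of "कमल" map directly onto their Kannada counterparts "ಕಮಲ" because ka, ma, and la occupy the same offsets in both blocks.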
-
Compositional Machine Transliteration
ACM Transactions on Asian Language Information Processing, 2010
Co-Authors: A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Abstract: Machine transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation or crosslingual information retrieval systems. In this article, we propose compositional machine transliteration systems, in which multiple transliteration components may be composed either to improve existing transliteration quality, or to enable transliteration functionality between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual transliteration components, say, X → Y and Y → Z systems, to provide transliteration functionality X → Z. In parallel composition, evidence from multiple transliteration paths from X to Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine transliteration framework in English and a set of Indian languages, namely, Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct transliteration system.
-
Hindi-Urdu Machine Transliteration Using Finite State Transducers
International Conference on Computational Linguistics, 2008
Co-Authors: M Abbas G Malik, Christian Boitet, Pushpak Bhattacharyya
Abstract: Finite-state transducers (FSTs) can implement inter-dialectal transliteration very efficiently. We illustrate this on the Hindi-Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for this pair on the basis of their common phonetic repository, in such a way that it can be extended to other languages like Arabic, Chinese, English, and French. We describe a transliteration model based on FSTs and UIT, and evaluate it on Hindi and Urdu corpora.
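The UIT idea, transliterating through a shared intermediate transcription, can be sketched as two tiny symbol tables composed in sequence: Devanagari to UIT, then UIT to the Urdu (Arabic-script) letters. The tables below are hypothetical fragments that ignore contextual FST rules and vowel diacritics; the actual UIT covers the full common phonetic repository of the two languages.

```python
# Hypothetical one-symbol rule tables (illustrative fragments only).
HINDI_TO_UIT = {"क": "k", "म": "m", "ल": "l", "न": "n", "ा": "A"}
UIT_TO_URDU = {"k": "\u06a9", "m": "\u0645", "l": "\u0644",
               "n": "\u0646", "A": "\u0627"}

def transduce(text, table):
    """A trivial symbol-to-symbol 'transducer'; unknown symbols pass
    through unchanged (a real FST would handle context-dependent rules)."""
    return "".join(table.get(c, c) for c in text)

def hindi_to_urdu(text):
    # compose the two transducers through the intermediate UIT transcription
    return transduce(transduce(text, HINDI_TO_UIT), UIT_TO_URDU)
```

Because both directions route through the same intermediate alphabet, adding a new script only requires one new table pair rather than a converter per language pair, which is the extensibility argument the abstract makes.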