Transliteration

The Experts below are selected from a list of 9,639 Experts worldwide, ranked by the ideXlab platform.

Keysun Choi - One of the best experts on this subject based on the ideXlab platform.

  • a machine Transliteration model based on correspondence between graphemes and phonemes
    ACM Transactions on Asian Language Information Processing, 2006
    Co-Authors: Keysun Choi, Hitoshi Isahara
    Abstract:

    Machine Transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. There has been growing interest in the use of machine Transliteration to assist machine translation and information retrieval. Three types of machine Transliteration models---grapheme-based, phoneme-based, and hybrid---have been proposed. Surprisingly, there have been few reports of efforts to utilize the correspondence between source graphemes and source phonemes, although this correspondence plays an important role in machine Transliteration. Furthermore, little work has been reported on ways to dynamically handle source graphemes and phonemes. In this paper, we propose a Transliteration model that dynamically uses both graphemes and phonemes, particularly the correspondence between them. With this model, we have achieved better performance---improvements of about 15 to 41% in English-to-Korean Transliteration and about 16 to 44% in English-to-Japanese Transliteration---than has been reported for other models.
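
    To make the correspondence idea concrete, here is a minimal, illustrative Python sketch (not the authors' implementation): it treats each aligned (grapheme, phoneme) pair as the Transliteration unit and looks it up in a toy mapping to Korean letters. The alignment for the example word and the mapping table are invented assumptions; the real model learns these correspondences and their probabilities from data.

        # Illustrative sketch only: the alignment and the Korean mappings below are
        # toy assumptions for a single word, not learned correspondences.

        # Assumed grapheme-phoneme alignment for the English word "board".
        aligned_units = [("b", "B"), ("oa", "AO"), ("r", "R"), ("d", "D")]

        # Toy mapping from (grapheme, phoneme) correspondence pairs to Korean letters.
        # A real model would learn P(target | grapheme, phoneme, context) from data.
        correspondence_table = {
            ("b", "B"): "ㅂ",
            ("oa", "AO"): "ㅗ",
            ("r", "R"): "ㄹ",
            ("d", "D"): "ㄷ",
        }

        def transliterate(units):
            """Concatenate the target letters chosen for each correspondence unit."""
            return "".join(correspondence_table[u] for u in units)

        print(transliterate(aligned_units))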

  • a comparison of different machine Transliteration models
    Journal of Artificial Intelligence Research, 2006
    Co-Authors: Keysun Choi, Hitoshi Isahara
    Abstract:

    Machine Transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine Transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine Transliteration models - grapheme-based Transliteration model, phoneme-based Transliteration model, hybrid Transliteration model, and correspondence-based Transliteration model - have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple Transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine Transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine Transliteration performance.

  • an english korean Transliteration model using pronunciation and contextual rules
    International Conference on Computational Linguistics, 2002
    Co-Authors: Keysun Choi
    Abstract:

    There has been growing interest in English-Korean (E-K) Transliteration recently. In previous work, direct conversion from English letters to Korean letters was the main research topic. In this paper, we present an E-K Transliteration model using pronunciation and contextual rules. Unlike previous work, our method uses phonetic information such as phonemes and their context. We also use word-formation information, such as whether an English word is of Greek origin. With this information, our method shows a significant performance increase of about 31% in word accuracy.

  • automatic Transliteration and back Transliteration by decision tree learning
    Language Resources and Evaluation, 2000
    Co-Authors: Byungju Kang, Keysun Choi
    Abstract:

    Automatic Transliteration and back-Transliteration across languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic, and English/Chinese, has practical importance in machine translation, cross-lingual information retrieval, and automatic bilingual dictionary compilation. In this paper, a bi-directional and, to some extent, language-independent methodology for English/Korean Transliteration and back-Transliteration is described. Our method is composed of character alignment and decision tree learning. We induce Transliteration rules for each English letter and back-Transliteration rules for each Korean letter. Training the decision trees requires a large set of labeled Transliteration and back-Transliteration examples, but such resources are generally not available. Our character alignment algorithm is capable of aligning an English word and its Korean Transliteration with high accuracy in the desired way.
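
    As a rough illustration of decision-tree learning of per-letter Transliteration rules (a sketch using invented toy data, not the paper's system), the snippet below builds context-window features for each English character and fits a scikit-learn decision tree that predicts a Korean letter for it.

        # Illustrative sketch only: the tiny character-aligned pairs are made up;
        # real training data would come from the character alignment step.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.tree import DecisionTreeClassifier

        aligned = [("bus", ["ㅂ", "ㅜ", "ㅅ"]), ("sun", ["ㅅ", "ㅜ", "ㄴ"])]

        def context_features(word, i, window=2):
            """Features are the characters at relative offsets around position i."""
            feats = {}
            for off in range(-window, window + 1):
                j = i + off
                feats[f"c{off}"] = word[j] if 0 <= j < len(word) else "#"
            return feats

        X, y = [], []
        for word, labels in aligned:
            for i, label in enumerate(labels):
                X.append(context_features(word, i))
                y.append(label)

        vec = DictVectorizer()
        clf = DecisionTreeClassifier().fit(vec.fit_transform(X), y)

        # Transliterate a new word character by character with the learned rules.
        word = "nus"
        feats = [context_features(word, i) for i in range(len(word))]
        print("".join(clf.predict(vec.transform(feats))))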

A Kumaran - One of the best experts on this subject based on the ideXlab platform.

  • Improving Cross-Language Information Retrieval by Transliteration Mining and Generation
    2020
    Co-Authors: K Saravanan, Raghavendra Udupa, A Kumaran
    Abstract:

    The retrieval performance of Cross-Language Information Retrieval (CLIR) systems is a function of the coverage of the translation lexicon they use. Unfortunately, most translation lexicons do not provide good coverage of proper nouns and common nouns, which are often the most information-bearing terms in a query. As a consequence, many queries cannot be translated without a substantial loss of information, and the retrieval performance of the CLIR system is less than satisfactory for those queries. However, proper nouns and common nouns very often appear in their transliterated forms in the target-language document collection. In this work, we study two techniques that leverage this fact to address the problem, namely, Transliteration Mining and Transliteration Generation. The first technique attempts to mine the Transliterations of out-of-vocabulary query terms from the document collection, whereas the second generates the Transliterations. We systematically study the effectiveness of both techniques in the context of the Hindi-English and Tamil-English ad hoc retrieval tasks at FIRE 2010. The results of our study show that both techniques are effective in addressing the problem posed by out-of-vocabulary terms, with Transliteration Mining giving better results than Transliteration Generation.

  • report of news 2012 machine Transliteration shared task
    Meeting of the Association for Computational Linguistics, 2012
    Co-Authors: Min Zhang, A Kumaran, Ming Liu
    Abstract:

    This report documents the Machine Transliteration Shared Task conducted as a part of the Named Entities Workshop (NEWS 2012), an ACL 2012 workshop. The shared task features machine Transliteration of proper names from English to 11 languages and from 3 languages to English. In total, 14 tasks are provided. 7 teams participated in the evaluations, and 57 standard and 1 non-standard runs were submitted, exploring diverse Transliteration methodologies on the evaluation data. We report the results with 4 performance metrics. We believe that the shared task has successfully achieved its objective by providing a common benchmarking platform for the research community to evaluate state-of-the-art technologies that benefit future research and development.

  • compositional machine Transliteration
    ACM Transactions on Asian Language Information Processing, 2010
    Co-Authors: A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
    Abstract:

    Machine Transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation or cross-lingual information retrieval systems. In this article, we propose compositional machine Transliteration systems, where multiple Transliteration components may be composed either to improve existing Transliteration quality, or to enable Transliteration functionality between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual Transliteration components, say, X → Y and Y → Z systems, to provide Transliteration functionality X → Z. In parallel composition, evidence from multiple Transliteration paths between X and Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine Transliteration framework in English and a set of Indian languages, namely, Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional Transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct Transliteration system.
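
    The two composition modes can be sketched in a few lines of Python. This is only an illustration under assumptions: the Transliteration systems are stand-in functions returning (candidate, score) lists, and the score combination (product along a path, sum across paths) is one plausible choice rather than the article's exact formulation.

        # Illustrative sketch only: 'systems' are placeholder functions returning
        # ranked (candidate, score) lists; real systems would be trained models.

        def serial_composition(sys_xy, sys_yz, name, k=3):
            """Chain X->Y and Y->Z: transliterate each intermediate candidate further."""
            scores = {}
            for mid, s1 in sys_xy(name)[:k]:
                for out, s2 in sys_yz(mid)[:k]:
                    scores[out] = max(scores.get(out, 0.0), s1 * s2)  # path score
            return sorted(scores.items(), key=lambda item: -item[1])

        def parallel_composition(direct_sys, pivot_systems, name, k=3):
            """Aggregate evidence from a direct X->Z system and pivoted X->Y->Z paths."""
            votes = {}
            for out, s in direct_sys(name)[:k]:
                votes[out] = votes.get(out, 0.0) + s
            for sys_xy, sys_yz in pivot_systems:
                for out, s in serial_composition(sys_xy, sys_yz, name, k):
                    votes[out] = votes.get(out, 0.0) + s
            return sorted(votes.items(), key=lambda item: -item[1])

        # Toy stand-ins with made-up candidates and scores, just to show the data flow.
        sys_xy = lambda n: [(n + "_y1", 0.7), (n + "_y2", 0.3)]
        sys_yz = lambda n: [(n + "_z", 0.9)]
        direct = lambda n: [(n + "_y1_z", 0.5)]
        print(serial_composition(sys_xy, sys_yz, "salma"))
        print(parallel_composition(direct, [(sys_xy, sys_yz)], "salma"))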

  • report of news 2010 Transliteration mining shared task
    Meeting of the Association for Computational Linguistics, 2010
    Co-Authors: A Kumaran, Mitesh M Khapra
    Abstract:

    This report documents the details of the Transliteration Mining Shared Task that was run as a part of the Named Entities Workshop (NEWS 2010), an ACL 2010 workshop. The shared task featured mining of name Transliterations from paired Wikipedia titles in 5 different language pairs, specifically between English and one of Arabic, Chinese, Hindi, Russian, and Tamil. In total, 5 groups took part in this shared task, participating in multiple mining tasks in different language pairs. The methodology and the data sets used in this shared task are published in the Shared Task White Paper [Kumaran et al., 2010]. We measure and report 3 metrics on the submitted results to calibrate the performance of individual systems on a commonly available Wikipedia dataset. We believe that the significant contribution of this shared task is in (i) assembling a diverse set of participants working in the area of Transliteration mining, (ii) creating a baseline performance of Transliteration mining systems in a set of diverse languages using commonly available Wikipedia data, and (iii) providing a basis for meaningful comparison and analysis of trade-offs between the various algorithmic approaches used in mining. We believe that this shared task complements the NEWS 2010 Transliteration generation shared task by enabling the development of practical systems with a small amount of seed data in a given pair of languages.

  • whitepaper of news 2010 shared task on Transliteration mining
    Meeting of the Association for Computational Linguistics, 2010
    Co-Authors: A Kumaran, Mitesh M Khapra
    Abstract:

    Transliteration is generally defined as phonetic translation of names across languages. Machine Transliteration is a critical technology in many domains, such as machine translation, cross-language information retrieval/extraction, etc. Recent research has shown that high quality machine Transliteration systems may be developed in a language-neutral manner, using a reasonably sized good quality corpus (~15--25K parallel names) between a given pair of languages. In this shared task, we focus on acquisition of such good quality names corpora in many languages, thus complementing the machine Transliteration shared task that is concurrently conducted in the same NEWS 2010 workshop. Specifically, this task focuses on mining the Wikipedia paired entities data (aka, inter-wiki-links) to produce high-quality Transliteration data that may be used for Transliteration tasks.

Philippe Grange - One of the best experts on this subject based on the ideXlab platform.

  • training schemes for the Transliteration of the balinese script into the latin script on palm leaf manuscript images
    International Conference on Frontiers in Handwriting Recognition, 2018
    Co-Authors: Made Windu Antara Kesiman, Jean-christophe Burie, Jean Marc Ogier, Philippe Grange
    Abstract:

    Considering the importance of the contents of the Balinese palm leaf manuscripts, a Transliteration system has to be developed so that these manuscripts can be read easily. The challenge comes from the fact that Balinese script is a syllabic script, so the mapping between linguistic symbols and images of symbols is not straightforward. In addition, with very limited training data available, some adaptations of LSTM-based training schemes for Transliteration need to be designed, analyzed, and evaluated. This paper proposes and evaluates several adapted segmentation-free training schemes for the Transliteration of the Balinese script into the Latin script from palm leaf manuscript images. We describe the generated synthetic dataset and the proposed training schemes at two different levels (word level and text-line level) to transliterate real words and text lines from palm leaf manuscript images. For word Transliteration, training schemes at word level generally perform better than training schemes at text-line level. As a comparison, the segmentation-based Transliteration method gives very promising results. For text-line Transliteration, the segmentation-based method outperforms all segmentation-free training schemes on the less degraded collections, while the segmentation-free training schemes help transliterate the text lines of more degraded manuscripts. Training at text-line level with a model pre-trained at word level can give better results in word Transliteration while still keeping optimal performance for text-line Transliteration.
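
    For readers unfamiliar with segmentation-free training, the PyTorch sketch below shows one common way such a scheme can be set up; it is an illustrative assumption, not the paper's exact architecture. A bidirectional LSTM reads per-column features of a word or text-line image and is trained with CTC loss against the Latin transliteration, so no per-character segmentation of the image is required. The feature dimension, alphabet size, and random toy batch are invented for illustration.

        # Illustrative sketch only: feature size, alphabet size and the toy batch are
        # assumptions; a real system would extract features from manuscript images.
        import torch
        import torch.nn as nn

        NUM_CLASSES = 30   # assumed size of the Latin label set, plus the CTC blank
        FEAT_DIM = 64      # assumed per-column image feature size

        class LineTransliterator(nn.Module):
            def __init__(self):
                super().__init__()
                self.rnn = nn.LSTM(FEAT_DIM, 128, bidirectional=True, batch_first=True)
                self.out = nn.Linear(256, NUM_CLASSES)

            def forward(self, x):                   # x: (batch, width, FEAT_DIM)
                h, _ = self.rnn(x)
                return self.out(h).log_softmax(-1)  # (batch, width, classes)

        model = LineTransliterator()
        ctc = nn.CTCLoss(blank=0)

        # Toy batch: 2 "images" of width 50 with random column features.
        x = torch.randn(2, 50, FEAT_DIM)
        targets = torch.tensor([3, 5, 7, 4, 5])        # concatenated label ids
        target_lengths = torch.tensor([3, 2])
        input_lengths = torch.full((2,), 50, dtype=torch.long)

        log_probs = model(x).permute(1, 0, 2)          # CTC expects (T, N, C)
        loss = ctc(log_probs, targets, input_lengths, target_lengths)
        loss.backward()
        print(float(loss))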

  • Knowledge Representation and Phonological Rules for the Automatic Transliteration of Balinese Script on Palm Leaf Manuscript
    Computación Y Sistemas, 2018
    Co-Authors: Made Windu Antara Kesiman, Jean-christophe Burie, Jean Marc Ogier, Philippe Grange
    Abstract:

    Balinese ancient palm leaf manuscripts record much important knowledge about the history of world civilizations. They vary from ordinary texts to Bali's most sacred writings. In reality, the majority of Balinese cannot read them because of language obstacles as well as a tradition that regards handling them as a sacrilege. Palm leaf manuscripts attract historians, philologists, and archaeologists who wish to discover more about ancient ways of life, but unfortunately there is only limited access to the content of these manuscripts because of these linguistic difficulties. The Balinese palm leaf manuscripts were written in Balinese script in the Balinese language, as well as in ancient literary texts composed in the Old Javanese language of Kawi and in Sanskrit. Balinese script is considered to be one of the most complex scripts of Southeast Asia. A Transliteration engine for transliterating the Balinese script of palm leaf manuscripts into a Latin-based script is one of the most needed systems for the collection of palm leaf manuscript images. In this paper, we present an implementation of knowledge representation and phonological rules for the automatic Transliteration of Balinese script on palm leaf manuscripts. In this system, a rule-based engine for performing Transliteration is proposed. Our model is based on phonetics, following the traditional linguistic study of Balinese Transliteration. This automatic Transliteration system is needed to complete the optical character recognition (OCR) process on the palm leaf manuscript images and to make the manuscripts more accessible and readable to a wider audience.
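
    A rule-based engine of this kind can be pictured as an ordered list of rewrite rules applied to the text. The Python sketch below is purely illustrative; the rules are invented and are not actual Balinese phonological rules. It only shows why rule ordering matters: contextual rules are placed before more general ones.

        # Illustrative sketch only: invented rules, not the paper's Balinese rules.
        import re

        # Each rule is (pattern, replacement); rules are applied in order, so more
        # specific (contextual) rules must precede more general ones.
        RULES = [
            (r"ng(?=[aiueo])", "ŋ"),   # toy contextual rule: 'ng' before a vowel
            (r"aa", "ā"),              # toy rule: long vowel
            (r"a$", "ə"),              # toy rule: word-final 'a' reduction
        ]

        def apply_rules(text):
            for pattern, repl in RULES:
                text = re.sub(pattern, repl, text)
            return text

        print(apply_rules("ngaraa"))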

Helmut Schmid - One of the best experts on this subject based on the ideXlab platform.

  • statistical models for unsupervised semi supervised and supervised Transliteration mining
    Computational Linguistics, 2017
    Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid, Hinrich Schutze
    Abstract:

    We present a generative model that efficiently mines Transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised Transliteration mining. The model interpolates two sub-models, one for the generation of Transliteration pairs and one for the generation of non-Transliteration pairs, i.e., noise. The model is trained on noisy unlabeled data using the EM algorithm. During training, the Transliteration sub-model learns to generate Transliteration pairs and the fixed non-Transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our Transliteration mining system on data from a Transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% Transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
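
    The interpolation can be written as a two-component mixture, p(pair) = λ·p_trans(pair) + (1−λ)·p_noise(pair), trained with EM. The Python sketch below is a simplified illustration under assumptions: the sub-model probabilities are fixed toy functions, the word pairs are made up, and a full implementation would also re-estimate the character-level Transliteration sub-model in the M-step.

        # Illustrative sketch only: p_trans and p_noise are toy stand-ins for the
        # real character-level sub-models; only the mixture weight is re-estimated.

        def em_mixture(pairs, p_trans, p_noise, iters=20, lam=0.5):
            """pairs: word pairs; p_trans/p_noise: pair -> probability under each sub-model."""
            post = []
            for _ in range(iters):
                # E-step: posterior probability that each pair is a transliteration.
                post = [
                    lam * p_trans(p) / (lam * p_trans(p) + (1 - lam) * p_noise(p))
                    for p in pairs
                ]
                # M-step: re-estimate the prior of the transliteration sub-model.
                lam = sum(post) / len(post)
            return lam, post

        # Toy data: pretend the first and third pairs look like transliterations.
        pairs = ["london/landan", "london/shahr", "paris/paris", "paris/ville"]
        p_t = lambda p: 0.9 if p in ("london/landan", "paris/paris") else 0.05
        p_n = lambda p: 0.2
        lam, post = em_mixture(pairs, p_t, p_n)
        print(round(lam, 2), [round(x, 2) for x in post])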

  • qcri mes submission at wmt13 using Transliteration mining to improve statistical machine translation
    Workshop on Statistical Machine Translation, 2013
    Co-Authors: Hassan Sajjad, Svetlana Smekalova, Nadir Durrani, Alexander Fraser, Helmut Schmid
    Abstract:

    This paper describes QCRI-MES's submission on the English-Russian dataset to the Eighth Workshop on Statistical Machine Translation. We generate improved word alignments of the training data by incorporating an unsupervised Transliteration mining module into GIZA++ and build a phrase-based machine translation system. For tuning, we use a variation of PRO which provides better weights by optimizing BLEU+1 at the corpus level. We transliterate out-of-vocabulary words in a post-processing step, using a Transliteration system built on the Transliteration pairs extracted with an unsupervised Transliteration mining system. For the Russian-to-English translation direction, we apply linguistically motivated pre-processing on the Russian side of the data.
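
    The OOV post-processing step can be illustrated in a few lines of Python; this is a sketch under assumptions, with a toy transliteration table, vocabulary, and tokens standing in for the mined pairs and the real MT pipeline. Any source token that the translation model left untranslated is replaced by its best mined transliteration when one exists.

        # Illustrative sketch only: the table, vocabulary and tokens are toy values.
        translit_table = {"москва": "moscow", "иванов": "ivanov"}   # mined pairs (toy)

        def transliterate_oovs(tokens, vocab):
            out = []
            for tok in tokens:
                if tok not in vocab:                          # OOV survived translation
                    out.append(translit_table.get(tok, tok))  # transliterate if possible
                else:
                    out.append(tok)
            return out

        print(transliterate_oovs(["the", "mayor", "of", "москва"],
                                 vocab={"the", "mayor", "of"}))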

  • a statistical model for unsupervised and semi supervised Transliteration mining
    Meeting of the Association for Computational Linguistics, 2012
    Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid
    Abstract:

    We propose a novel model to automatically extract Transliteration pairs from parallel corpora. Our model is efficient and language-pair independent, and it mines Transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model Transliteration mining as an interpolation of Transliteration and non-Transliteration sub-models. We evaluate on the NEWS 2010 shared task data and on parallel corpora with competitive results.

  • an algorithm for unsupervised Transliteration mining with an application to word alignment
    Meeting of the Association for Computational Linguistics, 2011
    Co-Authors: Hassan Sajjad, Alexander Fraser, Helmut Schmid
    Abstract:

    We propose a language-independent method for the automatic extraction of Transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision, and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on Transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the Transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.

  • hindi to urdu machine translation through Transliteration
    Meeting of the Association for Computational Linguistics, 2010
    Co-Authors: Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid
    Abstract:

    We present a novel approach to integrating Transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both Transliteration and translation when translating a particular Hindi word given its context, whereas in previous work Transliteration was only used for translating OOV (out-of-vocabulary) words. We use Transliteration as a tool for the disambiguation of Hindi homonyms, which can be either translated or transliterated, or transliterated differently, depending on the context. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model), compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that Transliteration is useful for more than just translating OOV words for language pairs like Hindi-Urdu.
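
    The disambiguation idea can be sketched as follows; all probability tables, the toy language model, and the placeholder word strings are invented, and the scoring is a simplified stand-in for the paper's conditional and joint formulations. Both a translation candidate and a transliteration candidate are scored for the same Hindi word, and the context decides which one wins.

        # Illustrative sketch only: toy candidate tables and a toy context model.

        def best_urdu(hindi_word, context, p_translate, p_translit, lm):
            """Pick the candidate maximizing p(candidate | word) * lm(candidate | context)."""
            candidates = dict(p_translate.get(hindi_word, {}))
            for cand, p in p_translit.get(hindi_word, {}).items():
                candidates[cand] = max(candidates.get(cand, 0.0), p)
            return max(candidates, key=lambda c: candidates[c] * lm(c, context))

        # Toy tables for one ambiguous word (all values are made up).
        p_translate = {"hindi_w": {"urdu_translation": 0.6}}
        p_translit = {"hindi_w": {"urdu_transliteration": 0.4}}
        lm = lambda cand, ctx: 0.9 if (ctx == "name_context" and
                                       cand == "urdu_transliteration") else 0.5

        print(best_urdu("hindi_w", "name_context", p_translate, p_translit, lm))
        print(best_urdu("hindi_w", "other_context", p_translate, p_translit, lm))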

Pushpak Bhattacharyya - One of the best experts on this subject based on the ideXlab platform.

  • leveraging orthographic similarity for multilingual neural Transliteration
    Transactions of the Association for Computational Linguistics, 2018
    Co-Authors: Anoop Kunchukuttan, Mitesh M Khapra, Gurneet Singh, Pushpak Bhattacharyya
    Abstract:

    We address the task of joint training of Transliteration models for multiple language pairs (multilingual Transliteration). This is an instance of multitask learning, where individual tasks (langua...

  • brahmi net a Transliteration and script conversion system for languages of the indian subcontinent
    North American Chapter of the Association for Computational Linguistics, 2015
    Co-Authors: Anoop Kunchukuttan, Ratish Puduppully, Pushpak Bhattacharyya
    Abstract:

    We present Brahmi-Net - an online system for Transliteration and script conversion for all major Indian language pairs (306 pairs). The system covers 13 Indo-Aryan languages, 4 Dravidian languages, and English. For training the Transliteration systems, we mined parallel Transliteration corpora from parallel translation corpora using an unsupervised method and trained statistical Transliteration systems on the mined corpora. Languages which do not have parallel corpora are supported by Transliteration through a bridge language. Our script conversion system supports conversion between all Brahmi-derived scripts as well as the ITRANS romanization scheme. For this, we leverage coordinated Unicode ranges between Indic scripts and use an extended ITRANS encoding for transliterating between English and Indic scripts. The system also provides top-k Transliterations and simultaneous Transliteration into multiple output languages. We provide a Python API as well as a REST API to access these services. The API and the mined Transliteration corpus are made available for research use under an open-source license.
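
    The "coordinated Unicode ranges" mentioned above refer to the fact that the major Indic script blocks in Unicode are laid out in parallel, so many characters can be converted between Brahmi-derived scripts by adding a constant codepoint offset. The sketch below is not the Brahmi-Net code, just a minimal illustration of that idea; a real converter also has to handle characters that exist in one script but not the other.

        # Illustrative sketch only: naive codepoint shifting between parallel blocks.
        BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "tamil": 0x0B80,
                       "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00}

        def convert(text, src, tgt):
            offset = BLOCK_START[tgt] - BLOCK_START[src]
            out = []
            for ch in text:
                cp = ord(ch)
                if BLOCK_START[src] <= cp < BLOCK_START[src] + 0x80:
                    out.append(chr(cp + offset))   # shift into the target block
                else:
                    out.append(ch)                 # leave other characters untouched
            return "".join(out)

        # Devanagari "नमस्ते" rendered with the corresponding Kannada codepoints.
        print(convert("नमस्ते", "devanagari", "kannada"))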

  • compositional machine Transliteration
    ACM Transactions on Asian Language Information Processing, 2010
    Co-Authors: A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
    Abstract:

    Machine Transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation or cross-lingual information retrieval systems. In this article, we propose compositional machine Transliteration systems, where multiple Transliteration components may be composed either to improve existing Transliteration quality, or to enable Transliteration functionality between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual Transliteration components, say, X → Y and Y → Z systems, to provide Transliteration functionality X → Z. In parallel composition, evidence from multiple Transliteration paths between X and Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine Transliteration framework in English and a set of Indian languages, namely, Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional Transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct Transliteration system.

  • hindi urdu machine Transliteration using finite state transducers
    International Conference on Computational Linguistics, 2008
    Co-Authors: M Abbas G Malik, Christian Boitet, Pushpak Bhattacharyya
    Abstract:

    Finite-state transducers (FSTs) can be very efficient for implementing inter-dialectal Transliteration. We illustrate this on the Hindi and Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for the same pair on the basis of their common phonetic repository, in such a way that it can be extended to other languages like Arabic, Chinese, English, and French. We describe a Transliteration model based on FST and UIT, and evaluate it on Hindi and Urdu corpora.
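
    The two-stage mapping through a universal intermediate transcription can be sketched with plain dictionaries standing in for the FSTs. This is an illustration under assumptions: the character tables cover only a handful of toy mappings, and the real system uses FST composition with contextual rules rather than character-by-character lookup.

        # Illustrative sketch only: tiny toy mapping tables, not the paper's FSTs.

        HINDI_TO_UIT = {"क": "k", "ब": "b", "ा": "A", "त": "t"}   # toy Hindi -> UIT
        UIT_TO_URDU = {"k": "ک", "b": "ب", "A": "ا", "t": "ت"}    # toy UIT -> Urdu

        def map_chars(text, table):
            return "".join(table.get(ch, ch) for ch in text)

        def hindi_to_urdu(text):
            # Composition of the two mappings, mirroring FST composition.
            return map_chars(map_chars(text, HINDI_TO_UIT), UIT_TO_URDU)

        print(hindi_to_urdu("बात"))   # toy example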