Dravidian Languages

The Experts below are selected from a list of 306 Experts worldwide ranked by ideXlab platform

Mccrae, John P. - One of the best experts on this subject based on the ideXlab platform.

  • Multilingual multimodal machine translation for Dravidian Languages utilizing phonetic transcription
    European Association for Machine Translation, 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Priyadharshini Ruba, Stearns Bernardo, Jayapal Arun, Sridevy S., Arcan Mihael, Zarrouk Manel, Mccrae, John P.
    Abstract:

    Multimodal machine translation is the task of translating a source text into the target language using information from other modalities. Existing multimodal datasets have been restricted to highly resourced languages, and they were collected by manually translating English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a multilingual multimodal dataset for under-resourced Dravidian languages. It comprises 30,000 sentences that were created using several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build a Multilingual Multimodal Neural Machine Translation (MMNMT) system for closely related Dravidian languages to take advantage of the multilingual corpus and other modalities. We evaluate the translations generated by the proposed approach against a human-annotated evaluation dataset in terms of BLEU, METEOR, and TER. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves translation quality for the under-resourced languages. This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289, and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015 (ELEXIS - European Lexical Infrastructure) and grant agreement No 825182 (Prêt-à-LLOD).

  • Improving wordnets for under-resourced Languages using machine translation
    Global Wordnet Association, 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Arcan Mihael, Mccrae, John P.
    Abstract:

    Wordnets are extensively used in natural language processing, but current approaches to manually building a wordnet from scratch involve large research groups working for a long period of time, which are typically not available for under-resourced languages. Even when wordnet-like resources exist for under-resourced languages, they are often not easily accessible, which can alter the results of applications that use them. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our methods to improve and extend wordnets for the Dravidian languages Tamil, Telugu, and Kannada, which are severely under-resourced. We report evaluation results for the generated wordnet senses in terms of precision for these languages. In addition, we carried out a manual evaluation of the translations for Tamil, which demonstrates that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

  • Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages
    OASIcs - OpenAccess Series in Informatics. 2nd Conference on Language Data and Knowledge (LDK 2019), 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Arcan Mihael, Mccrae, John P.
    Abstract:

    Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transcription. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in BLEU, METEOR, and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

Chakravarthi, Bharathi Raja - One of the best experts on this subject based on the ideXlab platform.

  • Multilingual multimodal machine translation for Dravidian Languages utilizing phonetic transcription
    European Association for Machine Translation, 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Priyadharshini Ruba, Stearns Bernardo, Jayapal Arun, Sridevy S., Arcan Mihael, Zarrouk Manel, Mccrae, John P.
    Abstract:

    Multimodal machine translation is the task of translating a source text into the target language using information from other modalities. Existing multimodal datasets have been restricted to highly resourced languages, and they were collected by manually translating English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a multilingual multimodal dataset for under-resourced Dravidian languages. It comprises 30,000 sentences that were created using several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build a Multilingual Multimodal Neural Machine Translation (MMNMT) system for closely related Dravidian languages to take advantage of the multilingual corpus and other modalities. We evaluate the translations generated by the proposed approach against a human-annotated evaluation dataset in terms of BLEU, METEOR, and TER. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves translation quality for the under-resourced languages. This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289, and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015 (ELEXIS - European Lexical Infrastructure) and grant agreement No 825182 (Prêt-à-LLOD).

  • Improving wordnets for under-resourced Languages using machine translation
    Global Wordnet Association, 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Arcan Mihael, Mccrae, John P.
    Abstract:

    Wordnets are extensively used in natural language processing, but current approaches to manually building a wordnet from scratch involve large research groups working for a long period of time, which are typically not available for under-resourced languages. Even when wordnet-like resources exist for under-resourced languages, they are often not easily accessible, which can alter the results of applications that use them. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our methods to improve and extend wordnets for the Dravidian languages Tamil, Telugu, and Kannada, which are severely under-resourced. We report evaluation results for the generated wordnet senses in terms of precision for these languages. In addition, we carried out a manual evaluation of the translations for Tamil, which demonstrates that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

  • Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages
    OASIcs - OpenAccess Series in Informatics. 2nd Conference on Language Data and Knowledge (LDK 2019), 2019
    Co-Authors: Chakravarthi, Bharathi Raja, Arcan Mihael, Mccrae, John P.
    Abstract:

    Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transcription. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in BLEU, METEOR, and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

John P Mccrae - One of the best experts on this subject based on the ideXlab platform.

  • Overview of the Track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text
    Forum for Information Retrieval Evaluation, 2020
    Co-Authors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Shardul Suryawanshi, Navya Jose, Elizabeth Sherly, John P Mccrae
    Abstract:

    Sentiment analysis of Dravidian languages has received attention in recent years. However, most social media text is code-mixed, and there is no research available on sentiment analysis of code-mixed Dravidian languages. Dravidian-CodeMix-FIRE 2020, a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem. The track covered two languages: (i) Tamil and (ii) Malayalam. Participants were given a dataset of YouTube comments, and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying it as positive, negative, neutral, or mixed-feeling, or by recognising that the comment is not in the intended language. System performance was evaluated by weighted F1 score.
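
The weighted F1 score mentioned above averages per-class F1 values, weighting each class by its number of gold examples. The following is a minimal sketch of that standard definition; the function name `weighted_f1` and the toy labels are illustrative assumptions, and the track's official scorer may differ in details.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative helper (not taken from the shared task): each class's F1
    is weighted by the share of gold examples that belong to that class.
    """
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n / total) * f1
    return score


# Toy labels, invented for illustration only.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(weighted_f1(gold, pred))
```

Weighting by support means frequent classes dominate the score, which matters when the sentiment classes are imbalanced.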

  • Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages
    International Conference on Computational Linguistics, 2020
    Co-Authors: Bharathi Raja Chakravarthi, Navaneethan Rajasekaran, Mihael Arcan, Kevin Mcguinness, Noel E Oconnor, John P Mccrae
    Abstract:

    Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to inducing them leverage pretrained monolingual word embeddings using supervised or semi-supervised methods. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. For low-resourced languages in particular, seed dictionaries are not readily available, and as a result these methods produce extremely weak results. In this work, we focus on the Dravidian languages Tamil, Telugu, Kannada, and Malayalam, which are even more challenging because each is written in its own script. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates; we demonstrate that the longest common subsequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times over, making bilingual lexicon induction feasible for such under-resourced languages.
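
To make the two string measures named in the abstract concrete, the sketch below computes the Levenshtein edit distance and the longest common subsequence (LCS) length for a pair of romanized word forms. The normalisation in `lcs_ratio` (dividing by the longer word's length) and the example strings are assumptions made for illustration; the abstract does not say how the authors score cognate candidates.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length divided by the longer length (assumed normalisation)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Illustrative romanized forms; any pair of same-script strings works.
print(levenshtein("kan", "kannu"), lcs_length("kan", "kannu"), lcs_ratio("kan", "kannu"))
```

Both measures only make sense once the words share a script, which is why the abstract's step of bringing the languages into a single script comes first.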

  • VarDial@COLING - Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages
    2020
    Co-Authors: Bharathi Raja Chakravarthi, Navaneethan Rajasekaran, Mihael Arcan, Kevin Mcguinness, Noel E. O'connor, John P Mccrae
    Abstract:

    Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to inducing them leverage pretrained monolingual word embeddings using supervised or semi-supervised methods. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. For low-resourced languages in particular, seed dictionaries are not readily available, and as a result these methods produce extremely weak results. In this work, we focus on the Dravidian languages Tamil, Telugu, Kannada, and Malayalam, which are even more challenging because each is written in its own script. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates; we demonstrate that the longest common subsequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times over, making bilingual lexicon induction feasible for such under-resourced languages.

  • Multilingual multimodal machine translation for Dravidian Languages utilizing phonetic transcription
    2019
    Co-Authors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Mihael Arcan, Bernardo Stearns, Arun Jayapal, S. Sridevy, Manel Zarrouk, John P Mccrae
    Abstract:

    This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289, and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015 (ELEXIS - European Lexical Infrastructure) and grant agreement No 825182 (Prêt-à-LLOD).

S. Jothilakshmi - One of the best experts on this subject based on the ideXlab platform.

  • Speech translation system for English to Dravidian Languages
    Applied Intelligence, 2017
    Co-Authors: Jeyabalan Sangeetha, S. Jothilakshmi
    Abstract:

    This paper proposes a Speech-to-Speech Translation (SST) system focused on translation from English to the Dravidian languages Tamil and Malayalam. The three major components of an SST system are automatic continuous speech recognition, machine translation, and text-to-speech synthesis. Automatic Continuous Speech Recognition (CSR) was developed using an Auto Associative Neural Network (AANN), a Support Vector Machine (SVM), and a Hidden Markov Model (HMM). The HMM yields better results than the SVM and the AANN, so the HMM-based speech recognizer is adopted for English. We propose a hybrid Machine Translation (MT) system, combining rule-based and statistical approaches, for converting English text into Dravidian-language text. A syllable-based concatenative Text-To-Speech (TTS) synthesis system is proposed for Tamil and Malayalam. AANN-based prosody prediction is performed for Tamil to improve naturalness and intelligibility. The domain is restricted to sentences that cover announcements at railway stations, bus stops, and airports. This work frames a novel translation method for English to Dravidian languages. The improved performance of each module (HMM-based CSR, hybrid MT, and concatenative TTS) increases overall speech translation performance. The proposed speech translation system can be applied from English to any Indian language, provided a parallel corpus is created for that language and the system is trained on it.

  • Automatic continuous speech recogniser for Dravidian Languages using the auto associative neural network
    International Journal of Computational Vision and Robotics, 2016
    Co-Authors: J. Sangeetha, S. Jothilakshmi
    Abstract:

    With the extensive improvement of computers in recent times, numerous methods of exchanging data between humans and computers have emerged. This work aims to provide an efficient way for humans to communicate with computers, especially for people with disabilities, who face a variety of obstacles when using computers. The paper focuses on developing an efficient speech recognition system for Dravidian languages such as Tamil, Malayalam, Telugu, and Kannada. The proposed continuous speech recognition (CSR) system comprises four steps: pre-processing, feature extraction, automatic continuous speech segmentation, and classification. The widely used short-term energy and zero-crossing rate are used for continuous speech segmentation, and Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and shifted delta cepstrum (SDC) features are extracted for recognition. Experiments are carried out with real-time Dravidian-language speech signals. The results show that, with the AANN classifier, the proposed system gives notably better results with MFCC features than with LPCC and SDC features.

  • An Efficient Continuous Speech Recognition System for Dravidian Languages Using Support Vector Machine
    Advances in Intelligent Systems and Computing, 2014
    Co-Authors: J. Sangeetha, S. Jothilakshmi
    Abstract:

    This paper focuses on developing a novel speech recognition system for Dravidian languages such as Tamil, Malayalam, Telugu, and Kannada. The research aims to provide an efficient way for humans to interact with computers, especially for people with disabilities, who face a variety of obstacles when using computers. This work would be helpful to native speakers in various applications. The proposed continuous speech recognition (CSR) system comprises three steps: preprocessing, feature extraction, and classification. In the preprocessing step, the input signal passes through a pre-emphasis filter, framing, windowing, and band-stop filtering in order to remove background noise and enhance the signal. The filtered and enhanced signal from the preprocessing step is the input to the rest of the CSR system. Speech features are the most essential part of a speech recognition system. The widely used short-term energy (STE) and zero-crossing rate (ZCR) are used for continuous speech segmentation, and Mel-frequency cepstral coefficients (MFCC) and shifted delta cepstrum (SDC) are used for the recognition task. The feature vectors are given as input to a support vector machine (SVM) classifier for classifying and recognizing Dravidian-language speech. Experiments are carried out with real-time Dravidian speech signals, and the results show that the proposed method is competitive with existing methods reported in the literature.
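
The two segmentation features named in the abstract, short-term energy (STE) and zero-crossing rate (ZCR), can be sketched per frame as follows. This is a minimal illustration using common textbook definitions; the frame length, hop size, sample rate, and synthetic test signal are assumptions, and the abstract does not give the authors' parameters or decision thresholds.

```python
import math


def frames(signal, frame_len, hop):
    """Split a sample list into overlapping fixed-length frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Synthetic signal (assumed values): 400 samples of silence followed by
# a 100 Hz tone sampled at 8000 Hz.
silence = [0.0] * 400
tone = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
signal = silence + tone

for frame in frames(signal, frame_len=200, hop=100):
    print(round(short_term_energy(frame), 4),
          round(zero_crossing_rate(frame), 4))
```

Frames with low energy are candidates for silence, while ZCR helps separate voiced speech (low rate) from unvoiced or noisy segments (high rate); a segmenter would compare both values against thresholds tuned on real data.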

  • A Novel Approach for English to Dravidian Language Rule Based Machine Translation
    International Review on Computers and Software, 2013
    Co-Authors: J. Sangeetha, S. Jothilakshmi
    Abstract:

    In this paper, we propose a method for translating text from English to Tamil, one of the Dravidian languages. A rule-based machine translation technique is used, which involves forming rules that reorder the syntactic structure of the source-language sentence, along with its dependency information, to bring it close to the structure of the target sentence. The parser identifies the syntactic elements of English sentences and suggests their Dravidian-language translation, taking into account the various grammatical forms of the target language. The use of the parser in building the syntactic structure plays a major role in the translation process. Two main types of rules are used: transfer link rules and morphological rules. Transfer link rules are used for generating the target structure, and morphological rules are used for assigning morphological features. Context-free grammars (CFG) are used to generate the language structures. With this approach, a given English text can be translated into its Tamil equivalent.
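
As a rough illustration of the reordering idea described above, the sketch below applies a single structural rule that moves an English subject-verb-object clause into the verb-final order used by Tamil. The rule table, role labels, and function name are hypothetical; the paper's actual transfer link rules and morphological rules are not given in the abstract and are not reproduced here.

```python
# Hypothetical rule table: source role pattern -> target role order.
# Only one toy rule is listed; a real system would have many.
TRANSFER_RULES = {
    ("SUBJ", "VERB", "OBJ"): ("SUBJ", "OBJ", "VERB"),
}


def reorder(chunks):
    """Reorder (role, text) chunks of one clause using TRANSFER_RULES.

    Clauses whose role pattern matches no rule are returned unchanged.
    Assumes each role occurs at most once in a matched pattern.
    """
    pattern = tuple(role for role, _ in chunks)
    target = TRANSFER_RULES.get(pattern)
    if target is None:
        return list(chunks)
    by_role = dict(chunks)
    return [(role, by_role[role]) for role in target]


clause = [("SUBJ", "the boy"), ("VERB", "reads"), ("OBJ", "a book")]
print(reorder(clause))
```

The sketch covers structure only: lexical transfer and assigning Tamil morphological features to the reordered chunks would be separate steps.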

Dipti Misra Sharma - One of the best experts on this subject based on the ideXlab platform.

  • Significance of an accurate sandhi splitter in shallow parsing of Dravidian Languages
    Meeting of the Association for Computational Linguistics, 2016
    Co-Authors: Dipti Misra Sharma
    Abstract:

    This paper evaluates the challenges involved in shallow parsing of Dravidian languages, which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string, with morpho-phonemic changes at the point of concatenation. This phenomenon, known as Sandhi, in turn complicates the identification of individual words. Shallow parsing is the task of identifying correlated groups of words in a raw sentence. The current work studies the effect of Sandhi on building shallow parsers for Dravidian languages by evaluating its effect on Malayalam, one of the main languages of the Dravidian family. We provide an in-depth analysis of the effect of Sandhi on developing a robust shallow parser pipeline, with experimental results showing how sensitive the individual components of the shallow parser are to the accuracy of a sandhi splitter. Our work can serve as a guide for building robust text processing systems for Dravidian languages.

  • ACL (Student Research Workshop) - Significance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages
    Proceedings of the ACL 2016 Student Research Workshop, 2016
    Co-Authors: Devadath, Dipti Misra Sharma
    Abstract:

    This paper evaluates the challenges involved in shallow parsing of Dravidian languages, which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string, with morpho-phonemic changes at the point of concatenation. This phenomenon, known as Sandhi, in turn complicates the identification of individual words. Shallow parsing is the task of identifying correlated groups of words in a raw sentence. The current work studies the effect of Sandhi on building shallow parsers for Dravidian languages by evaluating its effect on Malayalam, one of the main languages of the Dravidian family. We provide an in-depth analysis of the effect of Sandhi on developing a robust shallow parser pipeline, with experimental results showing how sensitive the individual components of the shallow parser are to the accuracy of a sandhi splitter. Our work can serve as a guide for building robust text processing systems for Dravidian languages.