Urdu

The Experts below are selected from a list of 11361 Experts worldwide ranked by ideXlab platform

Sarmad Hussain - One of the best experts on this subject based on the ideXlab platform.

  • IJCNLP - Resources for Urdu Language Processing
    2020
    Co-Authors: Sarmad Hussain
    Abstract:

    Urdu is spoken by more than 100 million speakers. This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.

  • Segmentation free Nastalique Urdu OCR
    World Academy of Science Engineering and Technology International Journal of Computer Electrical Automation Control and Information Engineering, 2010
    Co-Authors: Sobia T. Javed, Ameera Maqbool, Samia Asloob, Sarmad Hussain, Sehrish Jamil, Huma Moin
    Abstract:

    Electronically available Urdu data is in image form, which is very difficult to process. Printed Urdu data is the root cause of the problem. For the rapid progress of the Urdu language, we therefore need an OCR system that can help make Urdu data available to the common person. Research has been carried out for years to automate recognition of Arabic and Urdu script, but the biggest hurdle in the development of Urdu OCR is recognizing the Nastalique script, which is taken as the standard for writing Urdu. Nastalique is written diagonally with no fixed baseline, which makes the script somewhat complex. Overlap is present not only between characters but between ligatures as well. This paper proposes a method which allows successful recognition of Nastalique script.

  • Resources for Urdu language processing
    International Joint Conference on Natural Language Processing, 2008
    Co-Authors: Sarmad Hussain
    Abstract:

    Urdu is spoken by more than 100 million speakers. This paper summarizes the corpus and lexical resources being developed for Urdu by the CRULP, in Pakistan.

  • Letter to sound conversion for Urdu text to speech system
    Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004
    Co-Authors: Sarmad Hussain
    Abstract:

    Urdu is spoken by more than 100 million people across a score of countries and is the national language of Pakistan (http://www.ethnologue.com). There is a great need for a text-to-speech system for Urdu because this population has a low literacy rate, so a speech interface would greatly assist in giving them access to information. One of the significant parts of a text-to-speech system is a natural language processor, which takes textual input and converts it into an annotated phonetic string. To enable this, it is necessary to develop models that map textual input onto phonetic content. These models may be very complex for languages with unpredictable behaviour (e.g. English), but Urdu behaves relatively regularly, so Urdu pronunciation may be modelled from Urdu text by defining fairly regular rules. These rules are identified and explained in this paper.
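
The idea of mapping text to a phonetic string with regular rules can be illustrated with a minimal sketch. The rule table below is a toy assumption for illustration (a handful of consonants, alef as a long vowel, and aspiration written with do-chashmi he); it is not the rule set identified in the paper, and the function name `letters_to_phonemes` is hypothetical.

```python
# Toy letter-to-sound table (an assumption for illustration, not the paper's rules).
# Keys are Urdu letter sequences; values are IPA-like phoneme labels.
RULES = {
    "\u06A9\u06BE": "kʰ",  # kaf + do-chashmi he -> aspirated k
    "\u0628\u06BE": "bʱ",  # beh + do-chashmi he -> aspirated b
    "\u06A9": "k",         # kaf
    "\u0628": "b",         # beh
    "\u0644": "l",         # lam
    "\u0645": "m",         # meem
    "\u0627": "aː",        # alef, treated here as a long vowel
}
MAX_RULE_LEN = max(len(key) for key in RULES)


def letters_to_phonemes(word):
    """Apply the longest matching rule at each position, left to right in logical order."""
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(MAX_RULE_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for character {word[i]!r}")
    return phonemes


# kaf + do-chashmi he + alef + lam
print(letters_to_phonemes("\u06A9\u06BE\u0627\u0644"))
```

Longest-match lookup is one simple way to let multi-letter rules (such as aspirates) take priority over single-letter ones; a real system would also need rules for unwritten short vowels, which this sketch ignores.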

Faisal Shafait - One of the best experts on this subject based on the ideXlab platform.

  • A Multi-faceted OCR Framework for Artificial Urdu News Ticker Text Recognition
    2018 13th IAPR International Workshop on Document Analysis Systems (DAS), 2018
    Co-Authors: Burhan Ul Tayyab, Muhammad Ferjad Naeem, Adnan Ul-hasan, Faisal Shafait
    Abstract:

    Content-based information search and retrieval has allowed for easier access to data. While Latin-based scripts have gained attention and support from academia and industry, there is limited support for cursive script languages like Urdu. In this paper, we present the first instance of Urdu news ticker detection and recognition and take a micron-sized step towards the goal of super intelligence. The presented solution allows for automating the transcription, indexing and captioning of Urdu news video content. We present the first comprehensive data set, to our knowledge, for Urdu news ticker recognition, collected from 41 different news channels. The data set covers both high and low quality channels and distorted and blurred news tickers, making it an ideal test case for any future automatic Urdu news recognition system. We identify and address the key challenges in Urdu news ticker text recognition. We further propose an adjustment to the ground-truth labeling strategy focused on improving the readability of recognized output. Finally, we propose and present results from a Bi-Directional Long Short-Term Memory (BDLSTM) network architecture for news ticker text recognition. Our custom trained model outperforms Google's commercial OCR engine in two of the four experiments conducted.

  • Urdu Nasta’liq text recognition using implicit segmentation based on multi-dimensional long short term memory neural networks
    SpringerPlus, 2016
    Co-Authors: Saeeda Naz, Sheikh Faisal Rashid, Arif Iqbal Umar, Muhammad Imran Razzak, Riaz Ahmed, Faisal Shafait
    Abstract:

    The recognition of Arabic script and its derivatives such as Urdu, Persian and Pashto is a difficult task due to the complexity of the script. Urdu text recognition is particularly difficult because of its Nasta’liq writing style. Nasta’liq has a complex calligraphic nature, which presents major issues for the recognition of Urdu text owing to diagonality in writing, high cursiveness, context sensitivity and overlapping of characters. Therefore, the work done for recognition of Arabic script cannot be directly applied to Urdu recognition. We present Multi-dimensional Long Short Term Memory (MDLSTM) Recurrent Neural Networks with an output layer designed for sequence labeling for recognition of printed Urdu text-lines written in the Nasta’liq writing style. Experiments show that MDLSTM attained a recognition accuracy of 98% for unconstrained printed Urdu Nasta’liq text, which significantly outperforms the state-of-the-art techniques.
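
The abstract does not say how recognition accuracy is measured. A common convention for text-line OCR, assumed here, is character-level accuracy derived from the edit distance between the reference transcription and the recognizer output. The function names below are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def char_accuracy(reference, hypothesis):
    """Character accuracy = 1 - edit_distance / reference length (assumed metric)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of four gives 75% accuracy.
print(char_accuracy("abcd", "abxd"))
```

Under this definition, 98% accuracy would correspond to roughly two character errors per hundred reference characters.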

  • A segmentation free approach to Arabic and Urdu OCR
    Document Recognition and Retrieval, 2013
    Co-Authors: Nazly Sabbour, Faisal Shafait
    Abstract:

    In this paper, we present a generic Optical Character Recognition system for Arabic script languages called Nabocr. Nabocr uses OCR approaches specific for Arabic script recognition. Performing recognition on Arabic script text is relatively more difficult than Latin text due to the nature of Arabic script, which is cursive and context sensitive. Moreover, Arabic script has different writing styles that vary in complexity. Nabocr is initially trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts. However, it can be trained by users to be used for other Arabic script languages. We have evaluated our system's performance for both Urdu and Arabic. In order to evaluate Urdu recognition, we have generated a dataset of Urdu text called UPTI (Urdu Printed Text Image Database), which measures different aspects of a recognition system. The performance of our system for Urdu clean text is 91%. For Arabic clean text, the performance is 86%. Moreover, we have compared the performance of our system against Tesseract's newly released Arabic recognition, and the performance of both systems on clean images is almost the same.

  • Layout analysis of Urdu document images
    IEEE International Multitopic Conference, 2006
    Co-Authors: Faisal Shafait, Daniel Keysers, Thomas M Breuel
    Abstract:

    Layout analysis is a key component of an OCR system. In this paper, we present a layout analysis system for extracting text-lines in reading order from Urdu document images. For this purpose, we evaluate an existing system for Roman script text on Urdu documents and describe its methods and the main changes necessary to adapt it to Urdu script. The main changes are: 1) the text-line model for Roman script is modified to adapt to Urdu text, 2) reading order of an Urdu document is defined. The method is applied to a collection of scanned Urdu documents from various books, magazines, and newspapers. The results show high text-line detection accuracy on scanned images of Urdu prose and poetry books and magazines. The algorithm also works reasonably well on newspaper images. We also identify directions for future work which may further improve the accuracy of the system.
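
As a rough illustration of what a reading order for Urdu text-lines might look like, the sketch below assumes a simple multi-column page where columns are read right to left and lines within a column top to bottom. The `Line` type, the overlap-based column grouping and the function name are assumptions for illustration; this is not the method described in the paper.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """Bounding box of a detected text-line (hypothetical representation)."""
    x0: int
    y0: int
    x1: int
    y1: int


def reading_order(lines):
    """Group lines into columns by horizontal overlap, then read columns right to left."""
    columns = []
    for line in sorted(lines, key=lambda box: -box.x1):
        for column in columns:
            # Two boxes share a column if their horizontal extents overlap.
            if any(line.x0 < other.x1 and other.x0 < line.x1 for other in column):
                column.append(line)
                break
        else:
            columns.append([line])
    # Rightmost column first, then top-to-bottom inside each column.
    columns.sort(key=lambda column: -max(box.x1 for box in column))
    return [box for column in columns for box in sorted(column, key=lambda box: box.y0)]


page = [Line(0, 0, 40, 10), Line(60, 20, 100, 30), Line(60, 0, 100, 10), Line(0, 20, 40, 30)]
for box in reading_order(page):
    print(box)
```

This greedy grouping is only adequate for clean column layouts; headings that span several columns or poetry layouts would need additional handling.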

Ali Abidi - One of the best experts on this subject based on the ideXlab platform.

  • An Unconstrained Benchmark Urdu Handwritten Sentence Database with Automatic Line Segmentation
    2012 International Conference on Frontiers in Handwriting Recognition, 2012
    Co-Authors: Ahsen Raza, Imran Siddiqi, Ali Abidi, Fahim Arif
    Abstract:

    In this paper we present a novel off-line sentence database of Urdu handwritten documents, along with a few preprocessing and text line segmentation procedures. Despite increased research interest in Urdu handwritten document analysis in recent years, a standard benchmark dataset that could be used in Urdu handwriting recognition tasks has been missing. Based on our own developed and updated corpus named CENIP-UCCP (Center for Image Processing-Urdu Corpus Construction Project), we have developed an Urdu handwritten database. The corpus is a collection of a variety of Urdu texts that were used to generate forms. These forms were subsequently filled in by native writers in their natural handwriting. Six categories of text were used to generate these forms, with each category using approximately 66 forms. To date, the database comprises 400 digitized forms produced by 200 different writers. The database is completely labeled for content information as well as content detection and supports the evaluation of systems such as Urdu handwriting recognition, line segmentation and writer identification. The proposed Urdu text line segmentation scheme was also evaluated on the database, yielding promising segmentation results.

  • Towards Searchable Digital Urdu Libraries - A Word Spotting Based Retrieval Approach
    2011 International Conference on Document Analysis and Recognition, 2011
    Co-Authors: Ali Abidi, Imran Siddiqi, Khurram Khurshid
    Abstract:

    Libraries in South Asia hold huge collections of valuable printed documents in Urdu, and it is of interest to digitize these collections to make them more accessible. The unavailability of an OCR for Urdu, however, limits the concept of a digital Urdu library to scanning of documents only, offering a very limited search facility based on manually assigned tags. We address this issue by proposing a word spotting based keyword search method for information retrieval in digitized collections of printed Urdu documents. The proposed method is based on segmenting Urdu text into partial words and representing each partial word by a set of features. To search for a specific word (or phrase), the user provides a query in the form of an image. By comparing the features of the partial words in the query image with those already indexed, the system provides the user with a list of documents containing occurrences of the queried word. Evaluated on 50 Urdu documents, the system exhibited a recall of 95.17% and a precision of 94.3%.
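
For readers less familiar with the retrieval metrics quoted above, the following sketch shows how precision, recall and their harmonic mean (F1) are computed. The counts in the first call are hypothetical; only the 95.17% recall and 94.3% precision come from the abstract, and the F1 value is derived here rather than reported by the authors.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard retrieval metrics from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct hits, 2 false alarms, 2 missed occurrences.
print(precision_recall_f1(8, 2, 2))

# F1 implied by the reported precision (94.3%) and recall (95.17%).
print(round(f1_score(0.943, 0.9517), 4))
```

With the reported figures, the implied F1 is about 0.947.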

Gurpreet Singh Lehal - One of the best experts on this subject based on the ideXlab platform.

  • A Hindi to Urdu Transliteration System
    2020
    Co-Authors: Gurpreet Singh Lehal, Tejinder Singh Saini
    Abstract:

    In this paper, we present a high accuracy Hindi to Urdu transliteration system. Hindi and Urdu are variants of the same language, but while Hindi is written in the Devanagari script from left to right, Urdu is written in a script derived from a Persian modification of Arabic script written from right to left. To break this script barrier a Hindi-Urdu transliteration system has been developed. We have tried to overcome the shortcomings of the existing Hindi to Urdu transliteration systems and developed a system which can transliterate any Hindi Unicode text to Urdu at 99.46% accuracy at word level.

  • A word segmentation system for handling space omission problem in Urdu script
    Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing, 2010
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Word segmentation is the foremost obligatory task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. The Urdu word segmentation problem is not as severe as in those languages, since space is used for word delimitation, but space is not used consistently, which gives rise to both space omission and space insertion errors. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted for the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to solve the space omission problem in Urdu word segmentation independently. The two major components of our system are identification of merged words for segmentation and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision for the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.
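
To make the space omission problem concrete, here is a minimal sketch that splits a merged string using unigram probabilities and dynamic programming. This is a simplified monolingual stand-in, not the bilingual approach described in the abstract. The romanized tokens, their probabilities and the function name `split_merged` are all hypothetical.

```python
import math

# Hypothetical unigram probabilities over romanized toy tokens (illustration only).
UNIGRAM = {"yeh": 0.05, "kitab": 0.01, "acchi": 0.01, "hai": 0.06}


def split_merged(text, probs=UNIGRAM, max_word_len=10):
    """Return the most probable split of `text` into known words, or [text] if none exists."""
    n = len(text)
    # best[i] = (log-probability of best split of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [text]  # no split into known words; leave the token untouched
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(split_merged("yehkitabacchihai"))
```

A production system would also need the first component named in the abstract, deciding which tokens are merged at all, before attempting a split.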

  • A two stage word segmentation system for handling space insertion problem in Urdu script
    2009
    Co-Authors: Gurpreet Singh Lehal
    Abstract:

    Hindi and Urdu are variants of the same language, but while Hindi is written in the Devanagari script from left to right, Urdu is written from right to left in a script derived from a Persian modification of Arabic script. To break the script barrier, an Urdu-Devnagri transliteration system has been developed. The transliteration system faced many problems related to word segmentation of Urdu script, as in many cases space is not properly placed between Urdu words. Sometimes the space is omitted, so that several Urdu words run together; at other times an extra space is inserted inside a word, resulting in over-segmentation of that word. In this paper, a two-stage system for handling the extra space insertion problem in Urdu is presented. In the first stage, Urdu grammar rules are applied, while a statistical approach is employed in the second stage. For statistical analysis, lexical resources from both Urdu and Hindi, including Urdu and Hindi unigram and bigram probabilities, are used. In addition, the Urdu-Devnagri transliteration module is executed in parallel to help in decision making. The system was tested on a 1.84 million word Urdu corpus and the success rate was 98.57%. This is the first time such a system has been developed for Urdu script.
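
The statistical stage can be pictured with a small sketch: two adjacent tokens are merged when the joined form is more probable as a single word than the pair is as a bigram. This covers only a simplified version of the second stage; the grammar-rule stage and the parallel transliteration check are omitted. The romanized tokens, probabilities and function name are hypothetical.

```python
# Hypothetical probabilities over romanized toy tokens (illustration only).
UNIGRAM = {"khubsurat": 0.002, "khub": 0.004, "surat": 0.003, "larki": 0.005}
BIGRAM = {("khub", "surat"): 0.00001}


def merge_oversegmented(tokens, unigram=UNIGRAM, bigram=BIGRAM):
    """Merge adjacent tokens when the joined word outscores the two-word reading."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = (tokens[i], tokens[i + 1])
            joined = tokens[i] + tokens[i + 1]
            # Unknown joined forms score 0.0, so they are never merged.
            if unigram.get(joined, 0.0) > bigram.get(pair, 0.0):
                merged.append(joined)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged


print(merge_oversegmented(["khub", "surat", "larki"]))
```

Comparing a unigram probability directly against a bigram probability is a deliberate simplification; a real system would normalize the two scores before comparing them.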

Awais Adnan - One of the best experts on this subject based on the ideXlab platform.

  • Urdu Optical Character Recognition Systems: Present Contributions and Future Directions
    IEEE Access, 2018
    Co-Authors: Naila Habib Khan, Awais Adnan
    Abstract:

    This paper gives a comprehensive review and survey of the most prominent studies in the field of Urdu optical character recognition (OCR). It introduces OCR technology and presents a historical review of OCR systems, comparing the English, Arabic, and Urdu systems. Detailed background and literature are also provided for Urdu script, discussing the script's past, OCR categories, and phases. The paper further reports state-of-the-art studies for the different phases of an Urdu OCR system, namely image acquisition, pre-processing, segmentation, feature extraction, classification/recognition, and post-processing. In the segmentation section, the analytical and holistic approaches for Urdu text are emphasized. In the feature extraction section, a comparison is provided between the feature learning and feature engineering approaches. Deep learning and traditional machine learning approaches are discussed. Urdu numeral recognition systems are also discussed briefly. The paper concludes by identifying some open problems and suggesting some future directions.

  • Urdu Nastaleeq optical character recognition
    World Academy of Science Engineering and Technology International Journal of Computer Electrical Automation Control and Information Engineering, 2007
    Co-Authors: Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, Awais Adnan
    Abstract:

    This paper discusses Urdu script characteristics, Urdu Nastaleeq, and a simple but novel and robust technique for recognizing printed Urdu script without a lexicon. Urdu, being a member of the Arabic script family, is cursive and complex by nature. The main complexity of Urdu compound/connected text is not its connections but the forms/shapes a character takes when it is placed at the initial, middle or final position of a word. The character recognition technique presented here uses the inherent complexity of Urdu script to solve the problem. A word is scanned and analyzed for its level of complexity; the point where the level of complexity changes is marked for a character, which is then segmented and fed to neural networks. A prototype of the system has been tested on Urdu text and currently achieves 93.4% accuracy on average.
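
The position-dependent shapes mentioned above can be seen directly in Unicode, which encodes isolated, final, initial and medial presentation forms of Arabic-script letters as separate compatibility characters. The sketch below uses the letter beh as an example and Python's standard `unicodedata` module; it only illustrates the concept and is unrelated to the recognition technique in the paper. Nastaleeq typography has many more context-dependent shapes than these four.

```python
import unicodedata

# The four Unicode presentation forms of the letter beh (U+0628).
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(presentation_form):
    """Map a presentation form back to its base letter via NFKC normalization."""
    return unicodedata.normalize("NFKC", presentation_form)


for position, glyph in BEH_FORMS.items():
    print(position, unicodedata.name(glyph), hex(ord(base_letter(glyph))))
```

All four forms normalize to the same base letter, which is why a recognizer has to learn several shapes per character class.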