Text Compression


The Experts below are selected from a list of 324 Experts worldwide, ranked by the ideXlab platform

Shmuel T. Klein - One of the best experts on this subject based on the ideXlab platform.

  • Improved Alignment-Based Algorithm for Multilingual Text Compression
    Mathematics in Computer Science, 2013
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual Text alignment, a mapping of words and phrases in one Text to their semantic equivalents in the translation. A new multilingual Text Compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language Text; the incurred Compression loss due to this overhead is smaller than the savings in the compressed target language Texts, for a large enough number of the latter. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.
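
The reference mechanism described above can be illustrated with a small sketch for the bilingual case: aligned words in the target-language Text are replaced by pointers into the source Text, and a bilingual lexicon recovers them at decompression time. The function names, the (kind, value) tuple encoding, and the toy lexicon are illustrative assumptions, not the authors' actual scheme.

```python
# Illustrative sketch of compression by alignment references: words in the
# target-language text that are aligned to the source text are replaced by
# position pointers into the source; the rest stay literal. All names and
# the tuple encoding are hypothetical, not the authors' actual format.

def compress_target(source_words, target_words, alignment):
    """alignment maps a target word index to its aligned source word index."""
    encoded = []
    for i, word in enumerate(target_words):
        if i in alignment:
            encoded.append(("ref", alignment[i]))  # pointer into the source text
        else:
            encoded.append(("lit", word))          # unaligned word kept verbatim
    return encoded

def decompress_target(source_words, encoded, lexicon):
    """lexicon maps a source word to its target-language equivalent."""
    out = []
    for kind, value in encoded:
        out.append(lexicon[source_words[value]] if kind == "ref" else value)
    return out

# English source, French target: "maison" and "verte" are recovered from
# pointers to "house" and "green"; "la" has no alignment and stays literal.
src = ["the", "green", "house"]
tgt = ["la", "maison", "verte"]
enc = compress_target(src, tgt, {1: 2, 2: 1})
assert decompress_target(src, enc, {"house": "maison", "green": "verte"}) == tgt
```

In a real scheme the savings come from entropy-coding the references compactly; this sketch only shows where the redundancy between parallel Texts is exploited.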

  • LATA - Improved alignment based algorithm for multilingual Text Compression
    Language and Automata Theory and Applications, 2011
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual Text alignment, a mapping of words and phrases in one Text to their semantic equivalents in the translation. A new multilingual Text Compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language Text; the incurred Compression loss due to this overhead is smaller than the savings in the compressed target language Texts, for a large enough number of the latter. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.

  • USING ALIGNMENT FOR MULTILINGUAL Text Compression
    International Journal of Foundations of Computer Science, 2008
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. We explore the details of this framework and present experimental results for parallel English and French Texts.

  • Stringology - Using alignment for multilingual Text Compression.
    2006
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. We explore the details of this framework and present experimental results for parallel English and French Texts.

  • SEMI-LOSSLESS Text Compression
    International Journal of Foundations of Computer Science, 2005
    Co-Authors: Yair Kaufman, Shmuel T. Klein
    Abstract:

    A new notion, that of semi-lossless Text Compression, is introduced, and its applicability in various settings is investigated. First results suggest that it might be hard to exploit the additional redundancy of English Texts, but the new methods could be useful in applications where the correct spelling is not important, such as in short emails, and the new notion raises some interesting research problems in several different areas of Computer Science.

Masayuki Takeda - One of the best experts on this subject based on the ideXlab platform.

  • Linear-Time Text Compression by longest-first substitution
    Algorithms, 2009
    Co-Authors: Ryosuke Nakamura, Shunsuke Inenaga, Hideo Bannai, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    We consider grammar-based Text Compression with longest-first substitution (LFS), where non-overlapping occurrences of a longest repeating factor of the input Text are replaced by a new non-terminal symbol. We present the first linear-time algorithm for LFS. Our algorithm employs a new data structure called sparse lazy suffix trees. We also deal with a more sophisticated version of LFS, called LFS2, that allows better Compression. The first linear-time algorithm for LFS2 is also presented.
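
The substitution rule itself can be sketched naively as follows. Note that this illustration is quadratic-time, whereas the paper's contribution is achieving the same result in linear time via sparse lazy suffix trees; the function names and the choice of fresh non-terminal symbols are mine.

```python
def longest_repeat(s):
    """Naively find a longest factor with two non-overlapping occurrences."""
    n = len(s)
    for length in range(n // 2, 1, -1):                # longest candidates first
        first = {}
        for i in range(n - length + 1):
            f = s[i:i + length]
            if f in first and i >= first[f] + length:  # non-overlapping check
                return f
            first.setdefault(f, i)
    return None

def lfs_grammar(text):
    """Repeatedly replace the longest repeating factor by a fresh non-terminal."""
    rules, s, next_id = {}, text, 0
    while True:
        factor = longest_repeat(s)
        if factor is None:
            break
        nonterminal = chr(0x2460 + next_id)            # fresh symbols: ①, ②, ...
        rules[nonterminal] = factor
        s = s.replace(factor, nonterminal)
        next_id += 1
    return s, rules
```

For example, `lfs_grammar("abcabcabc")` yields start string `①①①` with the single rule `① → abc`; the paper's algorithms produce such grammars without rescanning the Text for every rule.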

  • DCC - Simple Linear-Time Off-Line Text Compression by Longest-First Substitution
    2007 Data Compression Conference (DCC'07), 2007
    Co-Authors: Ryosuke Nakamura, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
    Abstract:

    We consider grammar-based Text Compression with longest-first substitution, where non-overlapping occurrences of a longest repeating substring of the input Text are replaced by a new non-terminal symbol. We present a new Text Compression algorithm by simplifying the algorithm presented in S. Inenaga et al. (2003). We give a new formulation of the correctness proof, introducing the sparse lazy suffix tree data structure. We also present another type of longest-first substitution strategy that allows better Compression. We show results of preliminary experiments comparing grammar sizes of the two versions of the longest-first strategy and the most-frequent strategy.

  • Linear-time off-line Text Compression by longest-first substitution
    Lecture Notes in Computer Science, 2003
    Co-Authors: Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    Given a Text, grammar-based Compression is to construct a grammar that generates the Text. There are many kinds of Text Compression techniques of this type. Each Compression scheme is categorized as being either off-line or on-line, according to how a Text is processed. One representative tactic for off-line Compression is to substitute the longest repeated factors of a Text with a production rule. In this paper, we present an algorithm that compresses a Text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a Text and involves technically efficient operations on the structure.

  • SPIRE - Linear-Time Off-Line Text Compression by Longest-First Substitution
    String Processing and Information Retrieval, 2003
    Co-Authors: Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    Given a Text, grammar-based Compression is to construct a grammar that generates the Text. There are many kinds of Text Compression techniques of this type. Each Compression scheme is categorized as being either off-line or on-line, according to how a Text is processed. One representative tactic for off-line Compression is to substitute the longest repeated factors of a Text with a production rule. In this paper, we present an algorithm that compresses a Text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a Text and involves technically efficient operations on the structure.

  • Speeding up Pattern Matching by Text Compression
    International Conference on Algorithms and Complexity, 2000
    Co-Authors: Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, Takuya Kida, Shuichi Fukamachi, Takeshi Shinohara, Setsuo Arikawa
    Abstract:

    Byte pair encoding (BPE) is a simple universal Text Compression scheme. Decompression is very fast and requires little working space. Moreover, it is easy to decompress an arbitrary part of the original Text. However, it has not been popular, since the Compression is rather slow and the Compression ratio is not as good as that of other methods such as Lempel-Ziv type Compression. In this paper, we bring out a potential advantage of BPE Compression. We show that it is very suitable from the practical viewpoint of compressed pattern matching, where the goal is to find a pattern directly in compressed Text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original Text files, in various situations. Experimental results show that pattern matching in BPE compressed Text is even faster than matching in the original Text. Thus BPE Compression reduces not only the disk space but also the searching time.
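
The BPE scheme the abstract builds on is easy to sketch: repeatedly replace the most frequent pair of adjacent symbols with a fresh symbol. A minimal illustration follows; fresh Unicode characters stand in for the unused byte values a real byte-oriented implementation would reuse, and the function names are mine.

```python
from collections import Counter

def bpe_compress(text, max_rules=255):
    """Repeatedly replace the most frequent adjacent symbol pair."""
    s, rules, next_code = list(text), [], 0x2460       # fresh symbols: ①, ②, ...
    for _ in range(max_rules):
        pairs = Counter(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                                  # no pair worth replacing
            break
        fresh = chr(next_code)
        next_code += 1
        rules.append((fresh, a + b))
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                out.append(fresh)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    return "".join(s), rules

def bpe_decompress(compressed, rules):
    """Undo the substitutions in reverse order -- fast, simple, and local."""
    for fresh, pair in reversed(rules):
        compressed = compressed.replace(fresh, pair)
    return compressed
```

The locality of `bpe_decompress` (each rule expands one symbol to a fixed pair) is what makes decompressing an arbitrary part of the Text, and matching patterns directly in the compressed form, practical.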

Thierry Lecroq - One of the best experts on this subject based on the ideXlab platform.

  • Pattern-matching and Text-Compression algorithms
    ACM Computing Surveys, 1996
    Co-Authors: Maxime Crochemore, Thierry Lecroq
    Abstract:

    Pattern matching is the problem of locating a specific pattern inside raw data. The pattern is usually a collection of strings described in some formal language. Applications require two kinds of solution depending upon which string, the pattern or the Text, is given first. Solutions based on the use of automata or combinatorial properties of strings are commonly implemented to preprocess the pattern. The notion of indices, realized by trees or automata, is used in the second kind of solution. The aim of data Compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time. There is no loss of information: the Compression processes are reversible. Pattern-matching and Text-Compression algorithms are two important subjects in the wider domain of Text processing. They apply to the manipulation of Texts (word editors), to the storage of Textual data (Text Compression), and to data retrieval systems (full-Text search). They are basic components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. Although data are recorded in various ways, Text remains the main way to exchange information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries, but it applies as well to computer science, where a large amount of data is stored in linear files. It is also the case, for instance, in molecular biology, because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the quantity of available data in these fields tends to double every 18 months. This is the reason algorithms must be efficient even as the speed and storage capacity of computers increase continuously.
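
The pattern-preprocessing approach mentioned above can be illustrated with the classical Knuth-Morris-Pratt algorithm: a failure function is precomputed over the pattern, after which the Text is scanned exactly once in linear time. This is a standard textbook sketch, not code from the survey.

```python
def kmp_search(text, pattern):
    """Preprocess the pattern (failure function), then scan the text once."""
    # failure[i] = length of the longest proper border of pattern[:i+1]
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = failure[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):                 # full match ending at position i
            hits.append(i - k + 1)
            k = failure[k - 1]                # continue to find overlapping hits
    return hits
```

For instance, `kmp_search("abababa", "aba")` returns `[0, 2, 4]`: the failure function lets the scan resume mid-pattern instead of backing up in the Text.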

Sanjay Misra - One of the best experts on this subject based on the ideXlab platform.

  • Syllable-Based Text Compression: A Language Case Study
    Arabian Journal for Science and Engineering, 2016
    Co-Authors: Stephen A. Adubi, Sanjay Misra
    Abstract:

    Compression of Texts has been widely studied by various researchers, and in the process several algorithms have been proposed. Compression of Texts using the syllabic structure of words in syllable-based languages has emerged as another dimension of the problem. An algorithm for syllable extraction from words should be designed based on the structure of a language, due to the ineffectiveness of the presently existing “universal” algorithms. Several syllable-based Compression methods proposed by different authors are reviewed in this work, including the methodologies used in achieving Text Compression. Finally, an algorithm for syllable extraction from words in the Yoruba language is presented and compared with four universal algorithms, recording the best result (100% accuracy) among the five; the significance of this is that a dictionary of common syllables does not need to be created to achieve syllable-based Text Compression for the Yoruba language.

  • Lossless Text Compression Technique Using Syllable Based Morphology
    The International Arab Journal of Information Technology, 2011
    Co-Authors: Ibrahim Akman, Hakan Bayindir, Serkan Ozleme, Zehra Akin, Sanjay Misra
    Abstract:

    In this paper, we present a new lossless Text Compression technique which utilizes syllable-based morphology of multi-syllabic languages. The proposed algorithm is designed to partition words into its syllables and then to produce their shorter bit representations for Compression. The method has six main components namely source file, filtering unit, syllable unit, Compression unit, dictionary file and target file. The number of bits in coding syllables depends on the number of entries in the dictionary file. The proposed algorithm is implemented and tested using 20 different Texts of different lengths collected from different fields. The results indicated a Compression of up to 43%.

Jan Lansky - One of the best experts on this subject based on the ideXlab platform.

  • DATESO - Genetic Algorithms in Syllable-Based Text Compression
    2007
    Co-Authors: Tomas Kuthan, Jan Lansky
    Abstract:

    Syllable-based Text Compression is a new approach to compression by symbols. In this concept, syllables are used as the compression symbols instead of the more common characters or words. This new technique has proven itself worthy especially on short to middle-length Text files. The effectiveness of the Compression is greatly affected by the quality of the dictionaries of syllables characteristic of a certain language. These dictionaries are usually created by a straightforward analysis of Text corpora. In this paper we introduce another way of obtaining these dictionaries: using a genetic algorithm. We believe that dictionaries built this way may help us lower the compression ratio. We measure this effect on a set of Czech and English Texts.

  • Text Compression syllables
    DATESO, 2005
    Co-Authors: Jan Lansky, Michal Zemlicka
    Abstract:

    There are two basic types of Text Compression by symbols: in the first case symbols are represented by characters, in the second case by whole words. The first case is useful for very short files, the second for very long files or large collections. We suppose that there exists yet another way, where symbols are represented by units shorter than words: syllables. This paper focuses on the specification of syllables, methods for the decomposition of words into syllables, and the use of syllable-based Compression in combination with the principles of LZW and Huffman coding. The above-mentioned syllable-based methods are compared with their counterpart variants for characters and whole words.
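
The decomposition-plus-dictionary idea described above can be sketched as follows. The syllabification rule here is a deliberately crude vowel-group heuristic, and all names are illustrative; real decomposition rules are language-specific, which is exactly the point of the papers in this section.

```python
import re

def syllables(word):
    """Toy syllabification: each syllable is leading consonants plus a vowel
    group, absorbing trailing consonants only when no vowel follows. This is
    an illustration only; real decomposition rules are language-specific."""
    parts = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+(?![aeiou]))?", word)
    return parts if parts else [word]          # vowel-less words stay whole

def syllable_encode(text):
    """Encode each word as indices into a growing dictionary of syllables."""
    dictionary, codes, encoded = [], {}, []
    for word in text.split():
        ids = []
        for syl in syllables(word):
            if syl not in codes:
                codes[syl] = len(dictionary)   # assign the next free index
                dictionary.append(syl)
            ids.append(codes[syl])
        encoded.append(ids)
    return encoded, dictionary

def syllable_decode(encoded, dictionary):
    """Rebuild the text by concatenating the syllables of each word."""
    return " ".join("".join(dictionary[i] for i in ids) for ids in encoded)
```

Repeated syllables such as "na" in "banana" and "bandana" are stored once and referenced by index; an actual compressor would then code those indices with LZW or Huffman coding rather than leave them as integers.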

  • DATESO - Text Compression: Syllables
    2005
    Co-Authors: Jan Lansky, Michal Zemlicka
    Abstract:

    There are two basic types of Text Compression by symbols: in the first case symbols are represented by characters, in the second case by whole words. The first case is useful for very short files, the second for very long files or large collections. We suppose that there exists yet another way, where symbols are represented by units shorter than words: syllables. This paper focuses on the specification of syllables, methods for the decomposition of words into syllables, and the use of syllable-based Compression in combination with the principles of LZW and Huffman coding. The above-mentioned syllable-based methods are compared with their counterpart variants for characters and whole words.