Text Compression


The Experts below are selected from a list of 324 Experts worldwide, ranked by the ideXlab platform

Shmuel T. Klein - One of the best experts on this subject based on the ideXlab platform.

  • Improved Alignment-Based Algorithm for Multilingual Text Compression
    Mathematics in Computer Science, 2013
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual Text alignment, a mapping of words and phrases in one Text to their semantic equivalents in the translation. A new multilingual Text Compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language Text; the incurred Compression loss due to this overhead is smaller than the savings in the compressed target language Texts, for a large enough number of the latter. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.
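
The reference mechanism described above can be illustrated with a small sketch for the bilingual case: aligned words in the target-language Text are replaced by pointers into the source Text, and a bilingual lexicon recovers them at decompression time. The function names, the (kind, value) tuple encoding, and the toy lexicon are illustrative assumptions, not the authors' actual scheme.

```python
# Illustrative sketch of compression by alignment references: words in the
# target-language text that are aligned to the source text are replaced by
# position pointers into the source; the rest stay literal. All names and
# the tuple encoding are hypothetical, not the authors' actual format.

def compress_target(source_words, target_words, alignment):
    """alignment maps a target word index to its aligned source word index."""
    encoded = []
    for i, word in enumerate(target_words):
        if i in alignment:
            encoded.append(("ref", alignment[i]))  # pointer into the source text
        else:
            encoded.append(("lit", word))          # unaligned word kept verbatim
    return encoded

def decompress_target(source_words, encoded, lexicon):
    """lexicon maps a source word to its target-language equivalent."""
    out = []
    for kind, value in encoded:
        out.append(lexicon[source_words[value]] if kind == "ref" else value)
    return out

# English source, French target: "maison" and "verte" are recovered from
# pointers to "house" and "green"; "la" has no alignment and stays literal.
src = ["the", "green", "house"]
tgt = ["la", "maison", "verte"]
enc = compress_target(src, tgt, {1: 2, 2: 1})
assert decompress_target(src, enc, {"house": "maison", "green": "verte"}) == tgt
```

In a real scheme the savings come from entropy-coding the references compactly; this sketch only shows where the redundancy between parallel Texts is exploited.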

  • LATA - Improved alignment based algorithm for multilingual Text Compression
    Language and Automata Theory and Applications, 2011
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. This is done based on bilingual Text alignment, a mapping of words and phrases in one Text to their semantic equivalents in the translation. A new multilingual Text Compression scheme is suggested, which improves over an immediate generalization of bilingual algorithms. The idea is to store the necessary markup data within the source language Text; the incurred Compression loss due to this overhead is smaller than the savings in the compressed target language Texts, for a large enough number of the latter. Experimental results are presented for a parallel corpus in six languages extracted from the EUR-Lex website of the European Union. These results show the superiority of the new algorithm as a function of the number of languages.

  • USING ALIGNMENT FOR MULTILINGUAL Text Compression
    International Journal of Foundations of Computer Science, 2008
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. We explore the details of this framework and present experimental results for parallel English and French Texts.

  • Stringology - Using alignment for multilingual Text Compression.
    2006
    Co-Authors: Ehud S. Conley, Shmuel T. Klein
    Abstract:

    Multilingual Text Compression exploits the existence of the same Text in several languages to compress the second and subsequent copies by reference to the first. We explore the details of this framework and present experimental results for parallel English and French Texts.

  • SEMI-LOSSLESS Text Compression
    International Journal of Foundations of Computer Science, 2005
    Co-Authors: Yair Kaufman, Shmuel T. Klein
    Abstract:

    A new notion, that of semi-lossless Text Compression, is introduced, and its applicability in various settings is investigated. First results suggest that it might be hard to exploit the additional redundancy of English Texts, but the new methods could be useful in applications where the correct spelling is not important, such as in short emails, and the new notion raises some interesting research problems in several different areas of Computer Science.

Masayuki Takeda - One of the best experts on this subject based on the ideXlab platform.

  • Linear-Time Text Compression by longest-first substitution
    Algorithms, 2009
    Co-Authors: Ryosuke Nakamura, Shunsuke Inenaga, Hideo Bannai, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    We consider grammar-based Text Compression with longest-first substitution (LFS), where non-overlapping occurrences of a longest repeating factor of the input Text are replaced by a new non-terminal symbol. We present the first linear-time algorithm for LFS. Our algorithm employs a new data structure called sparse lazy suffix trees. We also deal with a more sophisticated version of LFS, called LFS2, that allows better Compression. The first linear-time algorithm for LFS2 is also presented.
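
The substitution rule itself can be sketched naively as follows. Note that this illustration is quadratic-time, whereas the paper's contribution is achieving the same result in linear time via sparse lazy suffix trees; the function names and the choice of fresh non-terminal symbols are mine.

```python
def longest_repeat(s):
    """Naively find a longest factor with two non-overlapping occurrences."""
    n = len(s)
    for length in range(n // 2, 1, -1):                # longest candidates first
        first = {}
        for i in range(n - length + 1):
            f = s[i:i + length]
            if f in first and i >= first[f] + length:  # non-overlapping check
                return f
            first.setdefault(f, i)
    return None

def lfs_grammar(text):
    """Repeatedly replace the longest repeating factor by a fresh non-terminal."""
    rules, s, next_id = {}, text, 0
    while True:
        factor = longest_repeat(s)
        if factor is None:
            break
        nonterminal = chr(0x2460 + next_id)            # fresh symbols: ①, ②, ...
        rules[nonterminal] = factor
        s = s.replace(factor, nonterminal)
        next_id += 1
    return s, rules
```

For example, `lfs_grammar("abcabcabc")` yields start string `①①①` with the single rule `① → abc`; the paper's algorithms produce such grammars without rescanning the Text for every rule.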

  • DCC - Simple Linear-Time Off-Line Text Compression by Longest-First Substitution
    2007 Data Compression Conference (DCC'07), 2007
    Co-Authors: Ryosuke Nakamura, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
    Abstract:

    We consider grammar-based Text Compression with longest-first substitution, where non-overlapping occurrences of a longest repeating substring of the input Text are replaced by a new non-terminal symbol. We present a new Text Compression algorithm by simplifying the algorithm presented in S. Inenaga et al. (2003). We give a new formulation of the correctness proof, introducing the sparse lazy suffix tree data structure. We also present another type of longest-first substitution strategy that allows better Compression. We show results of preliminary experiments comparing grammar sizes of the two versions of the longest-first strategy and the most-frequent strategy.

  • Linear-time off-line Text Compression by longest-first substitution
    Lecture Notes in Computer Science, 2003
    Co-Authors: Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    Given a Text, grammar-based Compression is to construct a grammar that generates the Text. There are many kinds of Text Compression techniques of this type. Each Compression scheme is categorized as being either off-line or on-line, according to how a Text is processed. One representative tactic for off-line Compression is to substitute the longest repeated factors of a Text with a production rule. In this paper, we present an algorithm that compresses a Text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a Text and involves technically efficient operations on the structure.

  • SPIRE - Linear-Time Off-Line Text Compression by Longest-First Substitution
    String Processing and Information Retrieval, 2003
    Co-Authors: Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara
    Abstract:

    Given a Text, grammar-based Compression is to construct a grammar that generates the Text. There are many kinds of Text Compression techniques of this type. Each Compression scheme is categorized as being either off-line or on-line, according to how a Text is processed. One representative tactic for off-line Compression is to substitute the longest repeated factors of a Text with a production rule. In this paper, we present an algorithm that compresses a Text based on this longest-first principle, in linear time. The algorithm employs a suitable index structure for a Text and involves technically efficient operations on the structure.

  • Speeding up Pattern Matching by Text Compression
    International Conference on Algorithms and Complexity, 2000
    Co-Authors: Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, Takuya Kida, Shuichi Fukamachi, Takeshi Shinohara, Setsuo Arikawa
    Abstract:

    Byte pair encoding (BPE) is a simple universal Text Compression scheme. Decompression is very fast and requires little working space. Moreover, it is easy to decompress an arbitrary part of the original Text. However, it has not been popular, since the Compression is rather slow and the Compression ratio is not as good as that of other methods such as Lempel-Ziv type Compression. In this paper, we bring out a potential advantage of BPE Compression. We show that it is very suitable from the practical viewpoint of compressed pattern matching, where the goal is to find a pattern directly in compressed Text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original Text files, in various situations. Experimental results show that pattern matching in BPE compressed Text is even faster than matching in the original Text. Thus BPE Compression reduces not only the disk space but also the searching time.
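
The BPE scheme the abstract builds on is easy to sketch: repeatedly replace the most frequent pair of adjacent symbols with a fresh symbol. A minimal illustration follows; fresh Unicode characters stand in for the unused byte values a real byte-oriented implementation would reuse, and the function names are mine.

```python
from collections import Counter

def bpe_compress(text, max_rules=255):
    """Repeatedly replace the most frequent adjacent symbol pair."""
    s, rules, next_code = list(text), [], 0x2460       # fresh symbols: ①, ②, ...
    for _ in range(max_rules):
        pairs = Counter(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                                  # no pair worth replacing
            break
        fresh = chr(next_code)
        next_code += 1
        rules.append((fresh, a + b))
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                out.append(fresh)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    return "".join(s), rules

def bpe_decompress(compressed, rules):
    """Undo the substitutions in reverse order -- fast, simple, and local."""
    for fresh, pair in reversed(rules):
        compressed = compressed.replace(fresh, pair)
    return compressed
```

The locality of `bpe_decompress` (each rule expands one symbol to a fixed pair) is what makes decompressing an arbitrary part of the Text, and matching patterns directly in the compressed form, practical.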

Thierry Lecroq - One of the best experts on this subject based on the ideXlab platform.

  • Pattern-matching and Text-Compression algorithms
    ACM Computing Surveys, 1996
    Co-Authors: Maxime Crochemore, Thierry Lecroq
    Abstract:

    Pattern matching is the problem of locating a specific pattern inside raw data. The pattern is usually a collection of strings described in some formal language. Applications require two kinds of solution depending upon which string, the pattern or the Text, is given first. Solutions based on the use of automata or combinatorial properties of strings are commonly implemented to preprocess the pattern. The notion of indices, realized by trees or automata, is used in the second kind of solution. The aim of data Compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time. There is no loss of information: the Compression processes are reversible. Pattern-matching and Text-Compression algorithms are two important subjects in the wider domain of Text processing. They apply to the manipulation of Texts (word editors), to the storage of Textual data (Text Compression), and to data retrieval systems (full-Text search). They are basic components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. Although data are recorded in various ways, Text remains the main way to exchange information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries, but it applies as well to computer science, where a large amount of data is stored in linear files. It is also the case, for instance, in molecular biology, because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the quantity of available data in these fields tends to double every 18 months. This is the reason algorithms must be efficient even as the speed and storage capacity of computers increase continuously.
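
The pattern-preprocessing approach mentioned above can be illustrated with the classical Knuth-Morris-Pratt algorithm: a failure function is precomputed over the pattern, after which the Text is scanned exactly once in linear time. This is a standard textbook sketch, not code from the survey.

```python
def kmp_search(text, pattern):
    """Preprocess the pattern (failure function), then scan the text once."""
    # failure[i] = length of the longest proper border of pattern[:i+1]
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = failure[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):                 # full match ending at position i
            hits.append(i - k + 1)
            k = failure[k - 1]                # continue to find overlapping hits
    return hits
```

For instance, `kmp_search("abababa", "aba")` returns `[0, 2, 4]`: the failure function lets the scan resume mid-pattern instead of backing up in the Text.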

Sanjay Misra - One of the best experts on this subject based on the ideXlab platform.

  • Syllable-Based Text Compression: A Language Case Study
    Arabian Journal for Science and Engineering, 2016
    Co-Authors: Stephen A. Adubi, Sanjay Misra
    Abstract:

    Compression of Texts has been widely studied by various researchers, and in the process several algorithms have been proposed. Compression of Texts using the syllabic structure of words in syllable-based languages has emerged as another dimension of the problem. An algorithm for syllable extraction from words should be designed based on the structure of a language, due to the ineffectiveness of the presently existing “universal” algorithms. Several syllable-based Compression methods proposed by different authors are reviewed in this work, including the methodologies used in achieving Text Compression. Finally, an algorithm for syllable extraction from words in the Yoruba language is presented and compared with four universal algorithms, recording the best result (100% accuracy) among the five; the significance of this is that a dictionary of common syllables does not need to be created to achieve syllable-based Text Compression for the Yoruba language.

  • Lossless Text Compression Technique Using Syllable Based Morphology
    The International Arab Journal of Information Technology, 2011
    Co-Authors: Ibrahim Akman, Hakan Bayindir, Serkan Ozleme, Zehra Akin, Sanjay Misra
    Abstract:

    In this paper, we present a new lossless Text Compression technique which utilizes syllable-based morphology of multi-syllabic languages. The proposed algorithm is designed to partition words into its syllables and then to produce their shorter bit representations for Compression. The method has six main components namely source file, filtering unit, syllable unit, Compression unit, dictionary file and target file. The number of bits in coding syllables depends on the number of entries in the dictionary file. The proposed algorithm is implemented and tested using 20 different Texts of different lengths collected from different fields. The results indicated a Compression of up to 43%.

Jan Lansky - One of the best experts on this subject based on the ideXlab platform.

  • DATESO - Genetic Algorithms in Syllable-Based Text Compression
    2007
    Co-Authors: Tomas Kuthan, Jan Lansky
    Abstract:

    Syllable-based Text Compression is a new approach to compression by symbols. In this concept, syllables are used as the compression symbols instead of the more common characters or words. This new technique has proven itself worthy especially on short to middle-length Text files. The effectiveness of the Compression is greatly affected by the quality of the dictionaries of syllables characteristic of a certain language. These dictionaries are usually created by a straightforward analysis of Text corpora. In this paper we introduce another way of obtaining these dictionaries: using a genetic algorithm. We believe that dictionaries built this way may help us lower the compression ratio. We measure this effect on a set of Czech and English Texts.

  • Text Compression syllables
    DATESO, 2005
    Co-Authors: Jan Lansky, Michal Zemlicka
    Abstract:

    There are two basic types of Text Compression by symbols: in the first case symbols are represented by characters, in the second case by whole words. The first case is useful for very short files, the second for very long files or large collections. We suppose that there exists yet another way, where symbols are represented by units shorter than words: syllables. This paper focuses on the specification of syllables, methods for the decomposition of words into syllables, and the use of syllable-based Compression in combination with the principles of LZW and Huffman coding. The above-mentioned syllable-based methods are compared with their counterpart variants for characters and whole words.
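
The decomposition-plus-dictionary idea described above can be sketched as follows. The syllabification rule here is a deliberately crude vowel-group heuristic, and all names are illustrative; real decomposition rules are language-specific, which is exactly the point of the papers in this section.

```python
import re

def syllables(word):
    """Toy syllabification: each syllable is leading consonants plus a vowel
    group, absorbing trailing consonants only when no vowel follows. This is
    an illustration only; real decomposition rules are language-specific."""
    parts = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+(?![aeiou]))?", word)
    return parts if parts else [word]          # vowel-less words stay whole

def syllable_encode(text):
    """Encode each word as indices into a growing dictionary of syllables."""
    dictionary, codes, encoded = [], {}, []
    for word in text.split():
        ids = []
        for syl in syllables(word):
            if syl not in codes:
                codes[syl] = len(dictionary)   # assign the next free index
                dictionary.append(syl)
            ids.append(codes[syl])
        encoded.append(ids)
    return encoded, dictionary

def syllable_decode(encoded, dictionary):
    """Rebuild the text by concatenating the syllables of each word."""
    return " ".join("".join(dictionary[i] for i in ids) for ids in encoded)
```

Repeated syllables such as "na" in "banana" and "bandana" are stored once and referenced by index; an actual compressor would then code those indices with LZW or Huffman coding rather than leave them as integers.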

  • DATESO - Text Compression: Syllables
    2005
    Co-Authors: Jan Lansky, Michal Zemlicka
    Abstract:

    There are two basic types of Text Compression by symbols: in the first case symbols are represented by characters, in the second case by whole words. The first case is useful for very short files, the second for very long files or large collections. We suppose that there exists yet another way, where symbols are represented by units shorter than words: syllables. This paper focuses on the specification of syllables, methods for the decomposition of words into syllables, and the use of syllable-based Compression in combination with the principles of LZW and Huffman coding. The above-mentioned syllable-based methods are compared with their counterpart variants for characters and whole words.