Hamming Distance

The Experts below are selected from a list of 19419 Experts worldwide ranked by ideXlab platform

Eric Torng - One of the best experts on this subject based on the ideXlab platform.

  • large scale Hamming Distance query processing
    International Conference on Data Engineering, 2011
    Co-Authors: Alex X Liu, Ke Shen, Eric Torng
    Abstract:

    Hamming Distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming Distance range query problems, where the goal is to find all strings in a database that are within a Hamming Distance bound k from a query string. If k is fixed, we have a static Hamming Distance range query problem. If k is part of the input, we have a dynamic Hamming Distance range query problem. For the static problem, the prior art uses a large amount of memory due to its aggressive replication of the database. For the dynamic range query problem, as far as we know, there is no space- and time-efficient solution for arbitrary databases. In this paper, we first propose a static Hamming Distance range query algorithm called HEngines, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming Distance range query algorithm called HEngined, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngines uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngined processes queries 46 times faster than linear scan while using only 1.7 times more space.
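
To make the range query problem concrete, the following is a minimal sketch of the linear-scan baseline that the abstract compares against. It is not HEngines or HEngined; the function names `hamming` and `range_query` and the sample database are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def range_query(database, query, k):
    """Linear scan: return every string within Hamming distance k of query."""
    return [s for s in database if hamming(s, query) <= k]


# Hypothetical toy database of 5-bit strings.
db = ["10110", "10011", "00000", "10111"]
print(range_query(db, "10110", 1))  # ['10110', '10111']
```

The scan touches every database entry for every query, which is the cost that indexing schemes for this problem aim to avoid.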

  • ICDE - Large scale Hamming Distance query processing
    2011 IEEE 27th International Conference on Data Engineering, 2011
    Co-Authors: Alex X Liu, Ke Shen, Eric Torng

Benjamin Sach - One of the best experts on this subject based on the ideXlab platform.

  • tight cell probe bounds for online Hamming Distance computation
    Symposium on Discrete Algorithms, 2013
    Co-Authors: Raphaël Clifford, Markus Jalsenius, Benjamin Sach
    Abstract:

    We show tight bounds for online Hamming Distance computation in the cell-probe model with word size w. The task is to output the Hamming Distance between a fixed string of length n and the last n symbols of a stream. We give a lower bound of Ω((δ/w) log n) time on average per output, where δ is the number of bits needed to represent an input symbol. We argue that this bound is tight within the model. The lower bound holds under randomisation and amortisation.

  • tight cell probe bounds for online Hamming Distance computation
    arXiv: Data Structures and Algorithms, 2012
    Co-Authors: Raphaël Clifford, Markus Jalsenius, Benjamin Sach
    Abstract:

    We show tight bounds for online Hamming Distance computation in the cell-probe model with word size w. The task is to output the Hamming Distance between a fixed string of length n and the last n symbols of a stream. We give a lower bound of Ω((d/w) log n) time on average per output, where d is the number of bits needed to represent an input symbol. We argue that this bound is tight within the model. The lower bound holds under randomisation and amortisation.
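
The online task described in these abstracts can be illustrated with a naive sketch that recomputes the distance from scratch for each arriving symbol. This costs O(n) time per output and says nothing about the cell-probe bound itself; the generator name `online_hamming` is an assumption made for illustration.

```python
from collections import deque


def online_hamming(pattern, stream):
    """Yield the Hamming distance between pattern and the last n stream symbols.

    Nothing is emitted until n = len(pattern) symbols have arrived.
    """
    n = len(pattern)
    window = deque(maxlen=n)  # oldest symbol is dropped automatically
    for symbol in stream:
        window.append(symbol)
        if len(window) == n:
            yield sum(p != s for p, s in zip(pattern, window))


print(list(online_hamming("aba", "abababb")))  # [0, 3, 0, 3, 1]
```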

Raphaël Clifford - One of the best experts on this subject based on the ideXlab platform.

  • ICALP - Approximate Hamming Distance in a stream
    2016
    Co-Authors: Raphaël Clifford, Tatiana Starikovskaya
    Abstract:

    We consider the problem of computing a (1+epsilon)-approximation of the Hamming Distance between a pattern of length n and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem. We show the following:

      - If Alice and Bob both share the pattern and Alice has the first half of the stream and Bob the second half, then there is an O(epsilon^{-4}*log^2(n)) bit randomised one-way communication protocol.
      - If Alice has the pattern, Bob the first half of the stream and Charlie the second half, then there is an O(epsilon^{-2}*sqrt(n)*log(n)) bit randomised one-way communication protocol.

    We then go on to develop small space streaming algorithms for (1+epsilon)-approximate Hamming Distance which give worst case running time guarantees per arriving symbol.

      - For binary input alphabets there is an O(epsilon^{-3}*sqrt(n)*log^2(n)) space and O(epsilon^{-2}*log(n)) time streaming (1+epsilon)-approximate Hamming Distance algorithm.
      - For general input alphabets there is an O(epsilon^{-5}*sqrt(n)*log^4(n)) space and O(epsilon^{-4}*log^3(n)) time streaming (1+epsilon)-approximate Hamming Distance algorithm.

  • approximate Hamming Distance in a stream
    International Colloquium on Automata Languages and Programming, 2016
    Co-Authors: Raphaël Clifford, Tatiana Starikovskaya

  • Approximate Hamming Distance in a stream
    arXiv: Data Structures and Algorithms, 2016
    Co-Authors: Raphaël Clifford, Tatiana Starikovskaya
    Abstract:

    We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming Distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(\epsilon^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for $(1+\epsilon)$-approximate Hamming Distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log{n})$ time streaming $(1+\epsilon)$-approximate Hamming Distance algorithm. (2) For general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^3 {n})$ time streaming $(1+\epsilon)$-approximate Hamming Distance algorithm.

  • tight cell probe bounds for online Hamming Distance computation
    Symposium on Discrete Algorithms, 2013
    Co-Authors: Raphaël Clifford, Markus Jalsenius, Benjamin Sach

  • tight cell probe bounds for online Hamming Distance computation
    arXiv: Data Structures and Algorithms, 2012
    Co-Authors: Raphaël Clifford, Markus Jalsenius, Benjamin Sach

Alex X Liu - One of the best experts on this subject based on the ideXlab platform.

  • large scale Hamming Distance query processing
    International Conference on Data Engineering, 2011
    Co-Authors: Alex X Liu, Ke Shen, Eric Torng

  • ICDE - Large scale Hamming Distance query processing
    2011 IEEE 27th International Conference on Data Engineering, 2011
    Co-Authors: Alex X Liu, Ke Shen, Eric Torng

  • BCB - NcRNA homology search using Hamming Distance seeds
    Proceedings of the 2nd ACM Conference on Bioinformatics Computational Biology and Biomedicine - BCB '11, 2011
    Co-Authors: Osama Aljawad, Alex X Liu, Yanni Sun, Jikai Lei
    Abstract:

    NcRNAs play important roles in many biological processes. Existing genome-scale ncRNA homology search tools identify ncRNAs in local sequence alignments generated by conventional sequence comparison methods. However, some types of ncRNA lack strong sequence conservation and tend to be missed by conventional sequence comparison methods. In this paper, we propose an ncRNA identification framework that is complementary to existing sequence comparison tools. By integrating a filtration step based on Hamming Distance and a local structural alignment program such as FOLDALIGN, we can identify ncRNAs that lack strong sequence conservation. We introduce a coding method by which the Hamming-Distance based filtration can easily distinguish transition from transversion, which show different frequency in functional ncRNAs. Our experiments demonstrate that the carefully designed Hamming Distance seed can achieve better sensitivity in searching for poorly conserved ncRNAs than conventional sequence comparison tools.
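
The abstract does not spell out the coding method. As an illustration only, the sketch below uses a hypothetical 2-bit code (A=00, G=11, C=01, U=10) under which a transition (A↔G, C↔U) flips two bits and a transversion flips one, so the Hamming distance between encoded sequences weights the two substitution types differently. The `CODE` table and the function names are assumptions, not taken from the paper.

```python
# Hypothetical 2-bit code: purines A/G and pyrimidines C/U are each
# bitwise complements, so a transition flips both bits.
CODE = {"A": "00", "G": "11", "C": "01", "U": "10"}


def encode(seq: str) -> str:
    """Map an RNA sequence to a bit string using the code above."""
    return "".join(CODE[base] for base in seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def coded_distance(s: str, t: str) -> int:
    """Hamming distance on encodings: 2 per transition, 1 per transversion."""
    return hamming(encode(s), encode(t))


print(coded_distance("A", "G"))        # 2 (transition)
print(coded_distance("A", "C"))        # 1 (transversion)
print(coded_distance("ACGU", "GCGA"))  # 3 (one transition + one transversion)
```

Which substitution type should cost more is a design choice; swapping the weights would require a different code table.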

Anca L Ralescu - One of the best experts on this subject based on the ideXlab platform.

  • Adaptive measures of similarity---fuzzy Hamming Distance---and its applications to pattern recognition problems
    2006
    Co-Authors: Anca L Ralescu, M Ionescu
    Abstract:

    Similarity measures are the basis of most machine learning and pattern recognition algorithms. The choice of similarity measure determines how effectively an algorithm solves a specific problem, which is why finding a relevant similarity measure is an active area of research in machine learning and pattern recognition. Hamming Distance is a simple and efficient similarity measure, but because it was designed for binary vectors, it cannot be applied to the many problems that use real-valued vectors. This thesis builds upon and extends a generalization of the Hamming Distance, the Fuzzy Hamming Distance (FHD), that can operate on real-valued vectors while keeping the same meaning as the Hamming Distance: the number of different elements. To assess the effectiveness of this measure, FHD is employed in several experiments as the basis for a content-based image retrieval system, a banknote validation system, and a conceptual-spaces-based knowledge discovery system.

  • fuzzy Hamming Distance in a content based image retrieval system
    IEEE International Conference on Fuzzy Systems, 2004
    Co-Authors: M Ionescu, Anca L Ralescu
    Abstract:

    The performance of content-based image retrieval (CBIR) systems depends mainly on the image similarity measure they use. The fuzzy Hamming Distance (D) is an extension of the Hamming Distance for real-valued vectors. Because the feature space of each image is real-valued, the fuzzy Hamming Distance can be successfully used as an image similarity measure. The current study reports on the results of applying D as a similarity measure between the color histograms of two images. The fuzzy Hamming Distance is suitable for this application because it can take into account not only the number of different colors but also the magnitude of this difference.
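
The abstract does not give the formula for the fuzzy Hamming Distance, so the sketch below is not the authors' definition. It is a simplified stand-in that shows the idea of counting differing components "softly": each component contributes a degree of difference between 0 and 1 that grows with the magnitude of the gap. The membership function `1 - exp(-alpha * (a - b)**2)`, the parameter `alpha`, and the function name are all assumptions.

```python
import math


def soft_difference_count(x, y, alpha=1.0):
    """Sum of per-component difference degrees in [0, 1).

    Identical components contribute 0; components that differ by a large
    amount contribute close to 1, as in the ordinary Hamming distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(1.0 - math.exp(-alpha * (a - b) ** 2) for a, b in zip(x, y))


# Two hypothetical 3-bin color histograms (normalized frequencies).
h1 = [0.50, 0.30, 0.20]
h2 = [0.50, 0.10, 0.40]
print(soft_difference_count(h1, h2, alpha=10.0))
```

Larger values of `alpha` make small gaps count as full differences, so the measure moves closer to a plain count of unequal components.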

  • FUZZ-IEEE - Fuzzy Hamming Distance in a content-based image retrieval system
    2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542), 1
    Co-Authors: M Ionescu, Anca L Ralescu

  • FUZZ-IEEE - Fuzzy Hamming Distance Based Banknote Validator
    The 14th IEEE International Conference on Fuzzy Systems 2005. FUZZ '05., 1
    Co-Authors: M Ionescu, Anca L Ralescu
    Abstract:

    Banknote validation systems are used to discriminate between genuine and counterfeit banknotes. The paper proposes a one-class classifier for the genuine class using a new similarity measure based on the fuzzy Hamming Distance. For each banknote, several regions are considered (corresponding to security features), and each region is split into m × n partitions to include position information. The feature space used by the classifier consists of the color histograms of each partition. The fuzzy Hamming Distance proves to have good discriminative power, completely separating the genuine from the counterfeit banknotes.
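
The feature extraction step described above (split a region into m × n partitions, then take a color histogram of each) might be sketched as follows. This is an illustration under assumptions: the region is a 2D list of already-quantized color indices, and the function name `partition_histograms` and the parameter `num_colors` are hypothetical.

```python
from collections import Counter


def partition_histograms(region, m, n, num_colors):
    """Split a 2D grid of color indices into m x n partitions.

    Returns one histogram (list of counts per color index) per partition,
    in row-major order, so position information is kept in the ordering.
    """
    rows, cols = len(region), len(region[0])
    histograms = []
    for i in range(m):
        r0, r1 = i * rows // m, (i + 1) * rows // m
        for j in range(n):
            c0, c1 = j * cols // n, (j + 1) * cols // n
            counts = Counter(
                region[r][c] for r in range(r0, r1) for c in range(c0, c1)
            )
            histograms.append([counts.get(k, 0) for k in range(num_colors)])
    return histograms


# Hypothetical 4x4 region with four color indices, one per quadrant.
region = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
print(partition_histograms(region, 2, 2, 4))
# [[4, 0, 0, 0], [0, 4, 0, 0], [0, 0, 4, 0], [0, 0, 0, 4]]
```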