Wildcard

The Experts below are selected from a list of 3957 Experts worldwide ranked by ideXlab platform

Xing Quan Zhu - One of the best experts on this subject based on the ideXlab platform.

  • Efficient sequential pattern mining with Wildcards for keyphrase extraction
    Knowledge-Based Systems, 2017
    Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
    Abstract:

    A keyphrase (a multi-word unit) in a document denotes one or multiple keywords capturing a main topic of the underlying document. Finding good keyphrases of a document can quickly summarize knowledge for efficient decision making and benefit domains involving intensive text information. To date, existing keyphrase extraction methods cannot be customized to each specific document, mainly because their patterns used to form paraphrases are too restrictive and may not capture flexible keyword relationships inside the text. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, so the flexible wildcard constraints within a pattern can capture semantic relationships between words, and the system will have full flexibility to discover different types of sequential patterns as candidates for keyphrase extraction. To achieve the goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it, and further collect all keyphrases from the document to form a training set. A supervised learning classifier is trained to identify keyphrases from a test document. Because our pattern mining and pattern characterization processes are customized to each single document, keyphrases extracted from our method are highly specific for each document. Experimental results demonstrate that the proposed sequential pattern mining method outperforms existing pattern mining methods in both runtime performance and completeness. Comparisons on keyphrase benchmark datasets also confirm that the proposed document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases.
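
    To make the "wildcard and one-off conditions" in the abstract above concrete, here is a minimal sketch, not the authors' algorithm. It assumes a gap is the number of tokens skipped between two consecutive pattern words, and it reads the one-off condition as "each token position may be used by at most one occurrence". The function name `one_off_occurrences` and its parameters are hypothetical, and the greedy leftmost strategy is a simple heuristic that is not guaranteed to find the maximum number of occurrences.

```python
def one_off_occurrences(tokens, pattern, min_gap, max_gap):
    """Greedily collect occurrences of `pattern` in `tokens`.

    Between two consecutive pattern words, between `min_gap` and `max_gap`
    tokens may be skipped (the wildcards). Under the one-off condition
    assumed here, a token position belongs to at most one occurrence.
    """
    if not pattern:
        return []
    used = [False] * len(tokens)

    def extend(j, last):
        # Try to place pattern[j..] after position `last`; return the
        # chosen positions, or None if no placement exists.
        if j == len(pattern):
            return []
        start = last + 1 + min_gap
        stop = min(len(tokens), last + 2 + max_gap)  # exclusive bound
        for pos in range(start, stop):
            if not used[pos] and tokens[pos] == pattern[j]:
                rest = extend(j + 1, pos)
                if rest is not None:
                    return [pos] + rest
        return None

    occurrences = []
    for i, tok in enumerate(tokens):
        if used[i] or tok != pattern[0]:
            continue
        rest = extend(1, i)
        if rest is not None:
            occ = [i] + rest
            for p in occ:
                used[p] = True  # enforce the one-off condition
            occurrences.append(occ)
    return occurrences


text = "pattern mining with wildcards supports pattern based mining".split()
print(one_off_occurrences(text, ["pattern", "mining"], 0, 1))
# [[0, 1], [5, 7]]
```

    The number of occurrences returned can serve as a support count for a candidate phrase; with `max_gap = 1`, "pattern based mining" still counts as an occurrence of the candidate "pattern mining".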

Jeffrey Scott Vitter - One of the best experts on this subject based on the ideXlab platform.

  • Space efficient string indexing for Wildcard pattern matching
    Symposium on Theoretical Aspects of Computer Science, 2014
    Co-Authors: Moshe Lewenstein, Yakov Nekrich, Jeffrey Scott Vitter
    Abstract:

    In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. For a constant-size alphabet our data structure uses \(O(n \log^{e} n)\) bits for any \(e > 0\) and reports all \(\mathrm{occ}\) occurrences of a wildcard string in \(O(m + s^{g} \cdot M(n) + \mathrm{occ})\) time, where \(M(n) = o(\log\log\log n)\), \(s\) is the alphabet size, \(m\) is the number of alphabet symbols and \(g\) is the number of wildcard symbols in the query string. We also present an \(O(n)\)-bit index with \(O((m + s^{g} + \mathrm{occ}) \log^{e} n)\) query time and an \(O(n (\log\log n)^{2})\)-bit index with \(O((m + s^{g} + \mathrm{occ}) \log\log n)\) query time. These are the first non-trivial data structures for this problem that need \(o(n \log n)\) bits of space.

  • Compressed text indexing with Wildcards
    Journal of Discrete Algorithms, 2013
    Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
    Abstract:

    Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(n H_h + o(n \log \sigma) + O(d \log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).

  • SPIRE - Compressed text indexing with Wildcards
    String Processing and Information Retrieval, 2011
    Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
    Abstract:

    Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(n H_h + o(n \log \sigma) + O(d \log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).
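
    The problem statement above (wildcards inside the text, a plain query pattern) can be illustrated with a naive scan. This is only a sketch of what a correct answer looks like, not the compressed index from the abstract: it takes time proportional to the text length times the pattern length and builds no index at all. The function name `find_occurrences` and the use of `?` as the wildcard symbol are assumptions made for the example.

```python
WILDCARD = "?"  # stands in for the wildcard symbol in the text (an assumption)


def find_occurrences(text, pattern, wildcard=WILDCARD):
    """Return every start position where `pattern` occurs in `text`.

    A wildcard position in the text matches any pattern character,
    which is how an SNP site in a genomic sequence could be modeled.
    """
    m = len(pattern)
    if m == 0:
        return []
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if all(t == wildcard or t == c for t, c in zip(window, pattern)):
            hits.append(i)
    return hits


genome = "AC?TA?G"
print(find_occurrences(genome, "CGT"))  # [1]
print(find_occurrences(genome, "AG"))   # [4, 5]
```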

Xin Dong Wu - One of the best experts on this subject based on the ideXlab platform.

  • Efficient sequential pattern mining with Wildcards for keyphrase extraction
    Knowledge-Based Systems, 2017
    Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
    Abstract:

    A keyphrase (a multi-word unit) in a document denotes one or multiple keywords capturing a main topic of the underlying document. Finding good keyphrases of a document can quickly summarize knowledge for efficient decision making and benefit domains involving intensive text information. To date, existing keyphrase extraction methods cannot be customized to each specific document, mainly because their patterns used to form paraphrases are too restrictive and may not capture flexible keyword relationships inside the text. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, so the flexible wildcard constraints within a pattern can capture semantic relationships between words, and the system will have full flexibility to discover different types of sequential patterns as candidates for keyphrase extraction. To achieve the goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it, and further collect all keyphrases from the document to form a training set. A supervised learning classifier is trained to identify keyphrases from a test document. Because our pattern mining and pattern characterization processes are customized to each single document, keyphrases extracted from our method are highly specific for each document. Experimental results demonstrate that the proposed sequential pattern mining method outperforms existing pattern mining methods in both runtime performance and completeness. Comparisons on keyphrase benchmark datasets also confirm that the proposed document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases.

  • Document specific keyphrase extraction using sequential patterns with Wildcards
    International Conference on Data Mining, 2014
    Co-Authors: Xin Dong Wu
    Abstract:

    Finding good key phrases for a document is beneficial for many applications, such as text summarization, browsing, and indexing. In this paper, we propose a sequential pattern mining based document-specific key phrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, where the flexible wildcard constraints within a pattern can capture semantic relationships between words. To achieve this goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important key phrases to be captured during the mining process. For each extracted key phrase candidate, we use some statistical pattern features to characterize it. A supervised learning classifier is trained to identify key phrases from a test document. Comparisons on key phrase benchmark datasets confirm that our document-specific key phrase extraction method is effective in improving the quality of extracted key phrases.

  • Pattern Matching with Flexible Wildcards
    Journal of Computer Science and Technology, 2014
    Co-Authors: Xin Dong Wu, Jipeng Qiang
    Abstract:

    Pattern matching with wildcards (PMW) has great theoretical and practical significance in bioinformatics, information retrieval, and pattern mining. Due to the uncertainty of wildcards, not only is the number of all matches exponential with respect to the maximal gap flexibility and the pattern length, but the matching positions in PMW are also hard to choose. Counting the maximal number of matches one by one is computationally infeasible. Therefore, rather than solving the generic PMW problem, many research efforts have further defined new problems within PMW according to different application backgrounds. To break through the limitations of either fixing the number or allowing an unbounded number of wildcards, pattern matching with flexible wildcards (PMFW) allows the users to control the ranges of wildcards. In this paper, we provide a survey on the state-of-the-art algorithms for PMFW, with detailed analyses and comparisons, and discuss challenges and opportunities in PMFW research and applications.
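
    Because the number of matches can grow exponentially, one common way to count them without listing each one is dynamic programming over the pattern. The sketch below is an illustration under stated assumptions, not an algorithm taken from the survey: it uses a single user-chosen gap range `[min_gap, max_gap]` between every pair of consecutive pattern characters, counts all matches with no one-off or other restriction, and the name `count_matches` is hypothetical.

```python
def count_matches(seq, pattern, min_gap, max_gap):
    """Count matches of `pattern` in `seq` with flexible wildcard gaps.

    A match is a tuple of increasing positions, one per pattern character,
    where the number of skipped symbols between consecutive positions
    lies in [min_gap, max_gap].
    """
    n = len(seq)
    if not pattern:
        return 0
    # prev[i] = number of ways to match the pattern prefix so far
    # with its last character placed at position i.
    prev = [1 if seq[i] == pattern[0] else 0 for i in range(n)]
    for ch in pattern[1:]:
        cur = [0] * n
        for i in range(n):
            if seq[i] != ch:
                continue
            lo = max(0, i - max_gap - 1)  # farthest allowed predecessor
            hi = i - min_gap - 1          # nearest allowed predecessor
            if hi >= lo:
                cur[i] = sum(prev[lo:hi + 1])
        prev = cur
    return sum(prev)


print(count_matches("abab", "ab", 0, 2))  # 3
print(count_matches("abab", "ab", 1, 2))  # 1
```

    The table has one entry per sequence position, so the count is obtained in polynomial time even when the number of matches itself is very large.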

  • Mining sequential patterns with periodic Wildcard gaps
    Applied Intelligence, 2014
    Co-Authors: Youxi Wu, Lingling Wang, Wei Ding, Xin Dong Wu
    Abstract:

    Mining frequent patterns with periodic wildcard gaps is a critical data mining problem to deal with complex real-world problems. This problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap length with wildcards between every two consecutive letters, the task is to find all frequent patterns with periodic wildcard gaps. State-of-the-art mining algorithms, which use matrices or other linear data structures to solve the problem, not only consume a large amount of memory but also run slowly. In this study, we use an Incomplete Nettree structure (the last layer of a Nettree which is an extension of a tree) of a sub-pattern P to efficiently create Incomplete Nettrees of all its super-patterns with prefix pattern P and compute the numbers of their supports in a one-way scan. We propose two new algorithms, MAPB (Mining sequentiAl Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequentiAl Pattern using incomplete Nettree with Depth first search), to solve the problem effectively with low memory requirements. Furthermore, we design a heuristic algorithm MAPBOK (MAPB for tOp-K) based on MAPB to deal with the Top-K frequent patterns for each length. Experimental results on real-world biological data demonstrate the superiority of the proposed algorithms in running time and space consumption and also show that the pattern matching approach can be employed to mine special frequent patterns effectively.

Fei Xie - One of the best experts on this subject based on the ideXlab platform.

  • Efficient sequential pattern mining with Wildcards for keyphrase extraction
    Knowledge-Based Systems, 2017
    Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
    Abstract:

    A keyphrase (a multi-word unit) in a document denotes one or multiple keywords capturing a main topic of the underlying document. Finding good keyphrases of a document can quickly summarize knowledge for efficient decision making and benefit domains involving intensive text information. To date, existing keyphrase extraction methods cannot be customized to each specific document, mainly because their patterns used to form paraphrases are too restrictive and may not capture flexible keyword relationships inside the text. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, so the flexible wildcard constraints within a pattern can capture semantic relationships between words, and the system will have full flexibility to discover different types of sequential patterns as candidates for keyphrase extraction. To achieve the goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allows important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it, and further collect all keyphrases from the document to form a training set. A supervised learning classifier is trained to identify keyphrases from a test document. Because our pattern mining and pattern characterization processes are customized to each single document, keyphrases extracted from our method are highly specific for each document. Experimental results demonstrate that the proposed sequential pattern mining method outperforms existing pattern mining methods in both runtime performance and completeness. Comparisons on keyphrase benchmark datasets also confirm that the proposed document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases.

Sharma V. Thankachan - One of the best experts on this subject based on the ideXlab platform.

  • Document retrieval with one Wildcard
    Mathematical Foundations of Computer Science, 2014
    Co-Authors: Moshe Lewenstein, Yakov Nekrich, J. Ian Munro, Sharma V. Thankachan
    Abstract:

    In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. We describe a linear space data structure that reports all documents containing a substring \(P\) in \(O(|P|+\sigma \sqrt{\log\log \log n} + \mathtt{docc})\) time, where \(\sigma\) is the alphabet size and \(\mathtt{docc}\) is the number of listed documents. We also describe a succinct solution for this problem.
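
    For orientation, the query semantics described above (a pattern with a wildcard that matches any symbol, and a list of matching documents as the answer) can be sketched with a brute-force scan. This is not the linear-space data structure from the abstract and has none of its time guarantees. The name `list_documents` and the `?` wildcard symbol are assumptions made for the example.

```python
def list_documents(docs, query, wildcard="?"):
    """Return the ids of all documents containing a substring that
    matches `query`, where `wildcard` in the query matches any symbol."""
    m = len(query)
    if m == 0:
        return []
    hits = []
    for doc_id, doc in enumerate(docs):
        for i in range(len(doc) - m + 1):
            window = doc[i:i + m]
            if all(q == wildcard or q == c for q, c in zip(query, window)):
                hits.append(doc_id)  # report each document once
                break
    return hits


docs = ["banana", "bandana", "cabana"]
print(list_documents(docs, "b?n"))  # [0, 1, 2]
print(list_documents(docs, "n?a"))  # [1]
```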

  • Compressed text indexing with Wildcards
    Journal of Discrete Algorithms, 2013
    Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
    Abstract:

    Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(n H_h + o(n \log \sigma) + O(d \log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).

  • SPIRE - Compressed text indexing with Wildcards
    String Processing and Information Retrieval, 2011
    Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
    Abstract:

    Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(n H_h + o(n \log \sigma) + O(d \log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).