The Experts below are selected from a list of 3957 Experts worldwide ranked by ideXlab platform
Xing Quan Zhu - One of the best experts on this subject based on the ideXlab platform.
-
Efficient sequential pattern mining with Wildcards for keyphrase extraction
Knowledge-Based Systems, 2017
Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
Abstract: A keyphrase (a multi-word unit) in a document denotes one or multiple keywords capturing a main topic of the underlying document. Finding good keyphrases of a document can quickly summarize knowledge for efficient decision making and benefit domains involving intensive text information. To date, existing keyphrase extraction methods cannot be customized to each specific document, mainly because their patterns used to form paraphrases are too restrictive and may not capture flexible keyword relationships inside the text. In this paper, we propose a sequential pattern mining based document-specific keyphrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, so the flexible wildcard constraints within a pattern can capture semantic relationships between words, and the system will have full flexibility to discover different types of sequential patterns as candidates for keyphrase extraction. To achieve the goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allow important keyphrases to be captured during the mining process. For each extracted keyphrase candidate, we use some statistical pattern features to characterize it, and further collect all keyphrases from the document to form a training set. A supervised learning classifier is trained to identify keyphrases from a test document. Because our pattern mining and pattern characterization processes are customized to each single document, keyphrases extracted from our method are highly specific for each document. Experimental results demonstrate that the proposed sequential pattern mining method outperforms existing pattern mining methods in both runtime performance and completeness. Comparisons on keyphrase benchmark datasets also confirm that the proposed document-specific keyphrase extraction method is effective in improving the quality of extracted keyphrases.
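To make the "wildcard and one-off conditions" in this abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm: the function name `one_off_occurrences`, the parameters `min_gap` and `max_gap`, and the greedy left-to-right strategy are all assumptions chosen for illustration. It treats a document as a sequence of words, allows between `min_gap` and `max_gap` arbitrary words between consecutive pattern words, and lets each word position be used by at most one occurrence.

```python
def one_off_occurrences(words, pattern, min_gap=0, max_gap=2):
    """Greedy search for occurrences of `pattern` in `words`.

    Between two consecutive pattern words, min_gap..max_gap arbitrary
    words (wildcards) may appear. Under the one-off condition, a position
    in `words` is consumed by at most one occurrence.
    Hypothetical helper for illustration only.
    """
    used = set()          # positions already consumed by an occurrence
    occurrences = []

    def extend(prev_pos, idx, acc):
        # Try to place pattern[idx] inside the gap window after prev_pos.
        if idx == len(pattern):
            return acc
        lo = prev_pos + 1 + min_gap
        hi = min(prev_pos + 1 + max_gap, len(words) - 1)
        for pos in range(lo, hi + 1):
            if pos not in used and words[pos] == pattern[idx]:
                result = extend(pos, idx + 1, acc + [pos])
                if result is not None:
                    return result
        return None

    for start, word in enumerate(words):
        if start in used or word != pattern[0]:
            continue
        occ = extend(start, 1, [start])
        if occ is not None:
            used.update(occ)
            occurrences.append(occ)
    return occurrences


doc = ("pattern mining with wildcards makes sequential pattern mining "
       "flexible and pattern based mining useful").split()
print(one_off_occurrences(doc, ["pattern", "mining"], min_gap=0, max_gap=1))
```

With a gap of up to one word, the phrase "pattern based mining" is counted as an occurrence of the candidate "pattern mining", which is the kind of flexibility the abstract attributes to wildcard constraints. The greedy choice is only a heuristic and may miss occurrences that a more careful one-off matcher would find.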
Jeffrey Scott Vitter - One of the best experts on this subject based on the ideXlab platform.
-
Space-Efficient String Indexing for Wildcard Pattern Matching
Symposium on Theoretical Aspects of Computer Science (STACS), 2014
Co-Authors: Moshe Lewenstein, Yakov Nekrich, Jeffrey Scott Vitter
Abstract: In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. For a constant size alphabet our data structure uses \(O(n\log^{\varepsilon} n)\) bits for any \(\varepsilon>0\) and reports all \(\mathtt{occ}\) occurrences of a wildcard string in \(O(m+\sigma^{g}\cdot\mu(n)+\mathtt{occ})\) time, where \(\mu(n)=o(\log\log\log n)\), \(\sigma\) is the alphabet size, \(m\) is the number of alphabet symbols and \(g\) is the number of wildcard symbols in the query string. We also present an \(O(n)\)-bit index with \(O((m+\sigma^{g}+\mathtt{occ})\log^{\varepsilon} n)\) query time and an \(O(n(\log\log n)^{2})\)-bit index with \(O((m+\sigma^{g}+\mathtt{occ})\log\log n)\) query time. These are the first non-trivial data structures for this problem that need \(o(n\log n)\) bits of space.
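The query model in this abstract (a pattern with m alphabet symbols and g wildcard symbols, matched against an indexed text) can be illustrated with a naive baseline. The sketch below is not the compressed index from the paper: it uses a plain, uncompressed suffix array and simply tries every way of filling the wildcards, so it performs (alphabet size)^g exact lookups. The names `build_suffix_array`, `find_exact`, `find_with_wildcards` and the `?` wildcard character are assumptions for illustration.

```python
from itertools import product

WILDCARD = "?"  # assumed wildcard character for this sketch


def build_suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    # Binary search for the first suffix whose prefix is >= pattern,
    # then collect the run of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return hits


def find_with_wildcards(text, sa, alphabet, query):
    # Expand every wildcard over the alphabet and look up each expansion.
    holes = [i for i, c in enumerate(query) if c == WILDCARD]
    hits = set()
    for fill in product(alphabet, repeat=len(holes)):
        chars = list(query)
        for i, c in zip(holes, fill):
            chars[i] = c
        hits.update(find_exact(text, sa, "".join(chars)))
    return sorted(hits)


text = "abracadabra"
sa = build_suffix_array(text)
print(find_with_wildcards(text, sa, "abcdr", "a?a"))   # start positions of matches
print(find_with_wildcards(text, sa, "abcdr", "?bra"))
```

The exponential dependence on the number of wildcards is visible in the `product(...)` loop; the results in the abstract concern how little space an index can use while still answering such queries quickly.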
-
Compressed text indexing with Wildcards
Journal of Discrete Algorithms, 2013
Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
Abstract: Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(nH_h + o(n\log\sigma) + O(d\log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).
-
SPIRE - Compressed text indexing with Wildcards
String Processing and Information Retrieval, 2011
Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
Abstract: Let \(T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}\) be a text of total length \(n\), where characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that when we are given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(nH_h + o(n\log\sigma) + O(d\log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).
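The problem statement in the two abstracts above places the wildcards in the text rather than in the query. A small Python sketch of that setting follows. It is a linear scan, not the compressed index the papers describe, and it assumes the character `N` plays the role of the wildcard symbol (as is common for unknown bases in DNA strings); `split_segments` and `occurrences` are hypothetical helper names.

```python
import re

PHI = "N"  # assumption: 'N' stands in for the wildcard symbol in the text


def split_segments(text):
    """Decompose text into its wildcard-free segments and wildcard runs.

    Returns (segments, run_lengths): the segments between wildcard groups
    and the length of each wildcard group, in order of appearance.
    """
    segments = re.split(f"{PHI}+", text)
    run_lengths = [len(run) for run in re.findall(f"{PHI}+", text)]
    return segments, run_lengths


def occurrences(text, pattern):
    """Naive scan: a text wildcard matches any pattern character."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(t == PHI or t == p for t, p in zip(text[i:i + m], pattern))
    ]


text = "ACGNTTNNGA"
print(split_segments(text))        # segments and wildcard group lengths
print(occurrences(text, "GAT"))    # the 'N' at position 3 absorbs the 'A'
print(occurrences(text, "TCG"))
```

Here the number of wildcard groups (the length of `run_lengths`) corresponds to the quantity d in the abstracts, and an occurrence may span one or more of those groups.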
Xin Dong Wu - One of the best experts on this subject based on the ideXlab platform.
-
Efficient sequential pattern mining with Wildcards for keyphrase extraction
Knowledge-Based Systems, 2017
Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
-
Document-Specific Keyphrase Extraction Using Sequential Patterns with Wildcards
International Conference on Data Mining (ICDM), 2014
Co-Authors: Xin Dong Wu
Abstract: Finding good key phrases for a document is beneficial for many applications, such as text summarization, browsing, and indexing. In this paper, we propose a sequential pattern mining based document-specific key phrase extraction method. Our key innovation is to use wildcards (or gap constraints) to help extract sequential patterns, where the flexible wildcard constraints within a pattern can capture semantic relationships between words. To achieve this goal, we regard each single document as a sequential dataset, and propose an efficient algorithm to mine sequential patterns with wildcard and one-off conditions that allow important key phrases to be captured during the mining process. For each extracted key phrase candidate, we use some statistical pattern features to characterize it. A supervised learning classifier is trained to identify key phrases from a test document. Comparisons on key phrase benchmark datasets confirm that our document-specific key phrase extraction method is effective in improving the quality of extracted key phrases.
-
Pattern Matching with Flexible Wildcards
Journal of Computer Science and Technology, 2014
Co-Authors: Xin Dong Wu, Jipeng Qiang
Abstract: Pattern matching with wildcards (PMW) has great theoretical and practical significance in bioinformatics, information retrieval, and pattern mining. Due to the uncertainty of wildcards, not only is the number of all matches exponential with respect to the maximal gap flexibility and the pattern length, but the matching positions in PMW are also hard to choose. The objective to count the maximal number of matches one by one is computationally infeasible. Therefore, rather than solving the generic PMW problem, many research efforts have further defined new problems within PMW according to different application backgrounds. To break through the limitations of either fixing the number or allowing an unbounded number of wildcards, pattern matching with flexible wildcards (PMFW) allows the users to control the ranges of wildcards. In this paper, we provide a survey on the state-of-the-art algorithms for PMFW, with detailed analyses and comparisons, and discuss challenges and opportunities in PMFW research and applications.
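The abstract notes that enumerating matches one by one is infeasible because their number can grow exponentially. As a hedged illustration of one way around enumeration, the Python sketch below counts matches of a pattern with user-controlled gap ranges by dynamic programming, without listing them. It covers only the simplest variant (all matches, positions may be reused) and is not taken from any algorithm in the survey; `count_matches` and its argument layout are assumptions.

```python
def count_matches(text, chars, gaps):
    """Count matches of a pattern with flexible wildcard gaps.

    chars: the pattern characters, e.g. "ab"
    gaps:  gaps[i] = (lo, hi), the allowed number of wildcards between
           chars[i] and chars[i + 1]
    Matches are counted, not enumerated, so the running time stays
    polynomial even when the number of matches is very large.
    """
    assert len(gaps) == len(chars) - 1
    n = len(text)
    # ways[j]: number of partial matches whose latest character is at j.
    ways = [1 if text[j] == chars[0] else 0 for j in range(n)]
    for c, (lo, hi) in zip(chars[1:], gaps):
        nxt = [0] * n
        for j in range(n):
            if text[j] != c:
                continue
            for g in range(lo, hi + 1):
                k = j - 1 - g      # position of the previous character
                if k < 0:
                    break
                nxt[j] += ways[k]
        ways = nxt
    return sum(ways)


print(count_matches("aaaa", "aa", [(0, 2)]))
print(count_matches("abcab", "ab", [(0, 3)]))
```

Variants that restrict how often a text position may be reused (such as the one-off condition) need different techniques; this sketch only shows how the gap ranges enter the computation.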
-
Mining sequential patterns with periodic wildcard gaps
Applied Intelligence, 2014
Co-Authors: Youxi Wu, Lingling Wang, Wei Ding, Xin Dong Wu
Abstract: Mining frequent patterns with periodic wildcard gaps is a critical data mining problem to deal with complex real-world problems. This problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap-length with wildcards between each two consecutive letters, the task is to find all frequent patterns with periodic wildcard gaps. State-of-the-art mining algorithms which use matrices or other linear data structures to solve the problem not only consume a large amount of memory but also run slowly. In this study, we use an Incomplete Nettree structure (the last layer of a Nettree which is an extension of a tree) of a sub-pattern P to efficiently create Incomplete Nettrees of all its super-patterns with prefix pattern P and compute the numbers of their supports in a one-way scan. We propose two new algorithms, MAPB (Mining sequentiAl Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequentiAl Pattern using incomplete Nettree with Depth first search), to solve the problem effectively with low memory requirements. Furthermore, we design a heuristic algorithm MAPBOK (MAPB for tOp-K) based on MAPB to deal with the Top-K frequent patterns for each length. Experimental results on real-world biological data demonstrate the superiority of the proposed algorithms in running time and space consumption and also show that the pattern matching approach can be employed to mine special frequent patterns effectively.
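The idea of deriving the supports of all super-patterns of P from a compact summary of P can be sketched in Python as follows. This is not MAPB or MAPD and does not implement a Nettree: it keeps, for each pattern, a dictionary from end position to the number of occurrences ending there, and extends it by one character per step in breadth-first order. The names `extend` and `mine`, the raw-count threshold `min_support`, and the length cap `max_len` are assumptions; the paper's own frequency definition may differ.

```python
from collections import deque


def extend(text, ends, c, min_gap, max_gap):
    """Occurrence counts of pattern + c, keyed by end position.

    ends[k] is the number of occurrences of the current pattern that end
    at position k; the gap between k and the new character must contain
    between min_gap and max_gap wildcards.
    """
    new_ends = {}
    for j, ch in enumerate(text):
        if ch != c:
            continue
        total = sum(ends.get(j - 1 - g, 0) for g in range(min_gap, max_gap + 1))
        if total:
            new_ends[j] = total
    return new_ends


def mine(text, min_gap, max_gap, min_support, max_len):
    """Breadth-first enumeration of patterns with periodic gaps.

    Only patterns with zero occurrences are pruned, because raw occurrence
    counts are not guaranteed to shrink when a pattern grows.
    """
    alphabet = sorted(set(text))
    frequent = {}
    queue = deque(
        (c, {j: 1 for j, ch in enumerate(text) if ch == c}) for c in alphabet
    )
    while queue:
        pattern, ends = queue.popleft()
        support = sum(ends.values())
        if support >= min_support:
            frequent[pattern] = support
        if len(pattern) < max_len:
            for c in alphabet:
                new_ends = extend(text, ends, c, min_gap, max_gap)
                if new_ends:
                    queue.append((pattern + c, new_ends))
    return frequent


print(mine("abab", min_gap=0, max_gap=1, min_support=2, max_len=2))
```

Each extension reads only the per-position summary of the parent pattern, which loosely mirrors the abstract's point that the supports of super-patterns can be computed from the last layer of the parent's structure in a single pass.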
Fei Xie - One of the best experts on this subject based on the ideXlab platform.
-
Efficient sequential pattern mining with Wildcards for keyphrase extraction
Knowledge-Based Systems, 2017
Co-Authors: Fei Xie, Xin Dong Wu, Xing Quan Zhu
Sharma V. Thankachan - One of the best experts on this subject based on the ideXlab platform.
-
Document Retrieval with One Wildcard
Mathematical Foundations of Computer Science (MFCS), 2014
Co-Authors: Moshe Lewenstein, Yakov Nekrich, J. Ian Munro, Sharma V. Thankachan
Abstract: In this paper we extend several well-known document listing problems to the case when documents contain a substring that approximately matches the query pattern. We study the scenario when the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. We describe a linear space data structure that reports all documents containing a substring P in \(O(|P|+\sigma \sqrt{\log\log \log n} + \mathtt{docc})\) time, where σ is the alphabet size and docc is the number of listed documents. We also describe a succinct solution for this problem.
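The document listing query described here can be stated operationally with a brute-force Python sketch. It scans every document and is not the linear-space data structure from the paper; `list_documents`, `matches_at`, and the `?` wildcard character are assumptions for illustration. The point it shows is the output convention: each matching document is reported once, however many matching substrings it contains.

```python
WILDCARD = "?"  # assumed wildcard character for this sketch


def matches_at(doc, pattern, i):
    # True if pattern matches doc starting at position i.
    window = doc[i:i + len(pattern)]
    return len(window) == len(pattern) and all(
        p == WILDCARD or p == d for p, d in zip(pattern, window)
    )


def list_documents(docs, pattern):
    """Indices of documents containing a substring that matches pattern.

    The pattern may contain at most one wildcard, which matches any
    single symbol.
    """
    if pattern.count(WILDCARD) > 1:
        raise ValueError("this sketch supports at most one wildcard")
    m = len(pattern)
    return [
        idx
        for idx, doc in enumerate(docs)
        if any(matches_at(doc, pattern, i) for i in range(len(doc) - m + 1))
    ]


docs = ["banana", "bandana", "cabana", "bonbon"]
print(list_documents(docs, "a?a"))
print(list_documents(docs, "n?a"))
```

A scan like this takes time proportional to the total collection size per query, whereas the bound quoted in the abstract depends on the pattern length, the alphabet size, and the number of reported documents.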
-
Compressed text indexing with Wildcards
Journal of Discrete Algorithms, 2013
Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
-
SPIRE - Compressed text indexing with Wildcards
String Processing and Information Retrieval, 2011
Co-Authors: Tsung-han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter