regular expression

The experts below are selected from a list of 360 experts worldwide, ranked by the ideXlab platform.

Michela Becchi - One of the best experts on this subject based on the ideXlab platform.

  • GPU Acceleration of Regular Expression Matching for Large Datasets: Exploring the Implementation Space
    Computing Frontiers, 2013
    Co-Authors: Michela Becchi
    Abstract:

    Regular expression matching is a central task in several networking (and search) applications and has been accelerated on a variety of parallel architectures, including general purpose multi-core processors, network processors, field programmable gate arrays, and ASIC- and TCAM-based systems. All of these solutions are based on finite automata (either in deterministic or non-deterministic form) and mostly focus on effective memory representations for such automata. More recently, a handful of proposals have exploited the parallelism intrinsic in regular expression matching (i.e., coarse-grained packet-level parallelism and fine-grained data structure parallelism) to propose efficient regex-matching designs for GPUs. However, most GPU solutions aim at achieving good performance on small datasets, which are far less complex and problematic than those used in real-world applications. In this work, we provide a more comprehensive study of regular expression matching on GPUs. To this end, we consider datasets of practical size and complexity and explore advantages and limitations of different automata representations and of various GPU implementation techniques. Our goal is not to show optimal speedup on specific datasets, but to highlight advantages and disadvantages of the GPU hardware in supporting state-of-the-art automata representations and encoding schemes, approaches that have been broadly adopted on other parallel memory-based platforms.
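
    For orientation, here is a minimal software sketch of the table-driven DFA scanning that these engines accelerate: one table lookup per payload byte, which is exactly the memory-access pattern the automata representations above try to compress and parallelize. The literal pattern, the byte-level alphabet, and the helper names (build_substring_dfa, scan_payload) are illustrative assumptions, not the paper's GPU implementation.

```python
# Minimal sketch: table-driven DFA matching of a literal pattern over a byte
# stream -- one transition-table lookup per input byte.

def build_substring_dfa(pattern: bytes):
    """KMP-style automaton: state s = number of pattern bytes matched so far."""
    n = len(pattern)
    table = [[0] * 256 for _ in range(n + 1)]
    for state in range(n + 1):
        for byte in range(256):
            # Longest prefix of `pattern` that is a suffix of the text read so far.
            k = min(state + 1, n)
            while k > 0 and pattern[:k] != (pattern[:state] + bytes([byte]))[-k:]:
                k -= 1
            table[state][byte] = k
    return table

def scan_payload(table, accept_state, payload: bytes):
    """One table lookup per payload byte; report offsets where the pattern ends."""
    state, hits = 0, []
    for offset, byte in enumerate(payload):
        state = table[state][byte]
        if state == accept_state:
            hits.append(offset)
    return hits

table = build_substring_dfa(b"evil")
print(scan_payload(table, 4, b"an EVIL-free payload with evil inside"))  # -> [29]
```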

  • A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation
    ACM Transactions on Architecture and Code Optimization, 2013
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While Deterministic Finite Automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. Kumar et al. [2006a] have proposed Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which in turn affects the memory bandwidth required to evaluate regular expressions. In this article we introduce Amortized time-bandwidth overhead DFAs (A-DFAs), a general compression technique that results in at most N(k + 1)/k state traversals when processing a string of length N, k being a positive integer. In comparison to the D2FA approach, our technique achieves comparable levels of compression with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, the A-DFA algorithm has lower complexity, can be applied during DFA creation, and is suitable for scenarios where a compressed DFA needs to be dynamically built or updated. Finally, we show how to combine A-DFA with alphabet reduction and multistride DFAs, two techniques aimed at reducing the memory space and bandwidth requirement of DFAs, and discuss memory encoding schemes suitable for A-DFAs.
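
    As a rough illustration of the default-transition idea behind D2FA and A-DFA (not the paper's construction or its amortization bound), the sketch below stores only the transitions a state does not share with its default target and recovers the rest by following default edges at match time. The toy automaton and the fall-back-to-start behavior are assumptions.

```python
# Default-transition compression sketch: shared transitions are removed and
# recovered by chasing default edges, at the cost of extra memory accesses.

labeled = {                       # state -> {character: next state}
    0: {"a": 1},
    1: {"a": 1, "b": 2},
    2: {"c": 0},
}
default = {1: 0, 2: 1}            # state -> default target (consumes no input)

def step(state, ch):
    """Follow default edges until an explicit transition on `ch` is found."""
    hops = 0
    while ch not in labeled[state]:
        if state not in default:  # simplification: no default chain -> restart
            return 0, hops
        state = default[state]    # extra memory access, same input character
        hops += 1
    return labeled[state][ch], hops

state, extra = 0, 0
for ch in "ababc":
    state, hops = step(state, ch)
    extra += hops
print(state, extra)               # -> 0 1  (one default-edge traversal in total)
```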

  • Evaluating regular expression matching engines on network and general purpose processors
    Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2009
    Co-Authors: Michela Becchi, Charlie Wiseman, Patrick Crowley
    Abstract:

    In recent years we have witnessed a proliferation of data structure and algorithm proposals for efficient deep packet inspection on memory-based architectures. In parallel, we have observed an increasing interest in network processors as target architectures for high-performance networking applications. In this paper we explore design alternatives in the implementation of regular expression matching architectures on network processors (NPs) and general purpose processors (GPPs). Specifically, we present a performance evaluation on an Intel IXP2800 NP, on an Intel Xeon GPP and on a multiprocessor system consisting of four AMD Opteron 850 cores. Our study shows how to exploit the Intel IXP2800 architectural features in order to maximize system throughput, identifies and evaluates algorithmic and architectural trade-offs and limitations, and highlights how the presence of caches affects the overall performance. We provide an implementation of our NP designs within the Open Network Laboratory (http://www.onl.wustl.edu).

  • Efficient regular expression evaluation: theory to practice
    Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2008
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Several algorithms and techniques have been proposed recently to accelerate regular expression matching and enable deep packet inspection at line rate. This work aims to provide a comprehensive practical evaluation of existing techniques, extending them and analyzing their compatibility. The study focuses on two hardware architectures: memory-based ASICs and FPGAs.

  • An Improved Algorithm to Accelerate Regular Expression Evaluation
    Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2007
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While deterministic finite automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. In [9], Kumar et al. propose Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which corresponds directly to the memory bandwidth required to evaluate regular expressions. In this paper we introduce a general compression technique that results in at most 2N state traversals when processing a string of length N. In comparison to the D2FA approach, our technique achieves comparable levels of compression, with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, our proposed algorithm has lower complexity, is suitable for scenarios where a compressed DFA needs to be dynamically built or updated, and fosters locality in the traversal process. Finally, we also describe a novel alphabet reduction scheme for DFA-based structures that can yield further dramatic reductions in data structure size.
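
    The alphabet-reduction step mentioned at the end can be pictured as merging input symbols that every state treats identically into one equivalence class, so each state's transition row shrinks from 256 entries to one per class. Below is a minimal sketch under a dict-based DFA layout; the partitioning is a simple column-equality grouping, not necessarily the paper's exact scheme.

```python
# Alphabet reduction sketch: symbols with identical behaviour in every state
# share one equivalence class, shrinking each state's transition row.

def reduce_alphabet(dfa, alphabet):
    """dfa: state -> {symbol: next_state}. Returns (symbol->class, compact dfa)."""
    column = {}                                   # behaviour signature per symbol
    for sym in alphabet:
        column[sym] = tuple(dfa[s].get(sym, 0) for s in sorted(dfa))
    classes = {}                                  # signature -> class id
    sym_class = {}
    for sym in alphabet:
        cid = classes.setdefault(column[sym], len(classes))
        sym_class[sym] = cid
    compact = {s: {} for s in dfa}
    for s, row in dfa.items():
        for sym, nxt in row.items():
            compact[s][sym_class[sym]] = nxt
    return sym_class, compact

# Toy DFA over {a, b, c, d}: it never distinguishes c from d.
dfa = {0: {"a": 1, "b": 0, "c": 0, "d": 0},
       1: {"a": 1, "b": 2, "c": 0, "d": 0},
       2: {"a": 1, "b": 0, "c": 2, "d": 2}}
sym_class, compact = reduce_alphabet(dfa, "abcd")
print(sym_class)   # -> {'a': 0, 'b': 1, 'c': 2, 'd': 2}  (c and d share a class)
print(compact[2])  # per-class transition row for state 2
```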

Patrick Crowley - One of the best experts on this subject based on the ideXlab platform.

  • A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation
    ACM Transactions on Architecture and Code Optimization, 2013
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While Deterministic Finite Automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. Kumar et al. [2006a] have proposed Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which in turn affects the memory bandwidth required to evaluate regular expressions. In this article we introduce Amortized time-bandwidth overhead DFAs (A-DFAs), a general compression technique that results in at most N(k + 1)/k state traversals when processing a string of length N, k being a positive integer. In comparison to the D2FA approach, our technique achieves comparable levels of compression with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, the A-DFA algorithm has lower complexity, can be applied during DFA creation, and is suitable for scenarios where a compressed DFA needs to be dynamically built or updated. Finally, we show how to combine A-DFA with alphabet reduction and multistride DFAs, two techniques aimed at reducing the memory space and bandwidth requirement of DFAs, and discuss memory encoding schemes suitable for A-DFAs.

  • Evaluating regular expression matching engines on network and general purpose processors
    Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2009
    Co-Authors: Michela Becchi, Charlie Wiseman, Patrick Crowley
    Abstract:

    In recent years we have witnessed a proliferation of data structure and algorithm proposals for efficient deep packet inspection on memory-based architectures. In parallel, we have observed an increasing interest in network processors as target architectures for high-performance networking applications. In this paper we explore design alternatives in the implementation of regular expression matching architectures on network processors (NPs) and general purpose processors (GPPs). Specifically, we present a performance evaluation on an Intel IXP2800 NP, on an Intel Xeon GPP and on a multiprocessor system consisting of four AMD Opteron 850 cores. Our study shows how to exploit the Intel IXP2800 architectural features in order to maximize system throughput, identifies and evaluates algorithmic and architectural trade-offs and limitations, and highlights how the presence of caches affects the overall performance. We provide an implementation of our NP designs within the Open Network Laboratory (http://www.onl.wustl.edu).

  • Efficient regular expression evaluation: theory to practice
    Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2008
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Several algorithms and techniques have been proposed recently to accelerate regular expression matching and enable deep packet inspection at line rate. This work aims to provide a comprehensive practical evaluation of existing techniques, extending them and analyzing their compatibility. The study focuses on two hardware architectures: memory-based ASICs and FPGAs.

  • An Improved Algorithm to Accelerate Regular Expression Evaluation
    Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, 2007
    Co-Authors: Michela Becchi, Patrick Crowley
    Abstract:

    Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While deterministic finite automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. In [9], Kumar et al. propose Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which corresponds directly to the memory bandwidth required to evaluate regular expressions. In this paper we introduce a general compression technique that results in at most 2N state traversals when processing a string of length N. In comparison to the D2FA approach, our technique achieves comparable levels of compression, with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, our proposed algorithm has lower complexity, is suitable for scenarios where a compressed DFA needs to be dynamically built or updated, and fosters locality in the traversal process. Finally, we also describe a novel alphabet reduction scheme for DFA-based structures that can yield further dramatic reductions in data structure size.

Viktor K Prasanna - One of the best experts on this subject based on the ideXlab platform.

  • High-performance and compact architecture for regular expression matching on FPGA
    IEEE Transactions on Computers, 2012
    Co-Authors: Yi-Hua E. Yang, Viktor K Prasanna
    Abstract:

    We present the design, implementation and evaluation of a high-performance architecture for regular expression matching (REM) on field-programmable gate array (FPGA). Each regular expression (regex) is first parsed into a concise token list representation, then compiled to a modular nondeterministic finite automaton (RE-NFA) using a modified version of the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a compact register-transfer level (RTL) circuit. A number of optimizations are applied to improve the circuit performance: 1) spatial stacking is used to construct an REM circuit processing m ≥ 1 input characters per clock cycle; 2) single-character constrained repetitions are matched efficiently by parallel shift-register lookup tables; 3) complex character classes are matched by a BRAM-based classifier shared across regexes; 4) a multipipeline architecture is used to organize a large number of RE-NFAs into priority groups to limit the I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules (February 2010) in the proposed REM architecture. Based on the place-and-route results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.
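
    Optimization (2) above, matching single-character constrained repetitions with shift registers, has a simple software analogue: keep one match bit per recent input character and report a hit when the whole window is set. The repetition a{5}, the window encoding, and the feed helper below are illustrative assumptions, not the paper's SRL-based circuitry.

```python
# Constrained-repetition sketch: a{5} tracked with a 5-bit shift register of
# per-character match bits instead of five chained automaton states.
from collections import deque

REPEAT, CHAR = 5, "a"
window = deque([0] * REPEAT, maxlen=REPEAT)   # one bit per recent character

def feed(ch):
    """Shift in one match bit per 'clock cycle'; output 1 when a{5} just matched."""
    window.append(1 if ch == CHAR else 0)
    return int(all(window))

text = "baaaaaab"
print([feed(ch) for ch in text])   # -> [0, 0, 0, 0, 0, 1, 1, 0]
```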

  • Space-time tradeoff in regular expression matching with semi-deterministic finite automata
    Proceedings - IEEE INFOCOM, 2011
    Co-Authors: Yi-Hua E. Yang, Viktor K Prasanna
    Abstract:

    Regular expression matching (REM) with nondeterministic finite automata (NFA) can be computationally expensive when a large number of patterns are matched concurrently. On the other hand, converting the NFA to a deterministic finite automaton (DFA) can cause state explosion, where the number of states and transitions in the DFA are exponentially larger than in the NFA. In this paper, we seek to answer the following question: to match an arbitrary set of regular expressions, is there a finite automaton that lies between the NFA and DFA in terms of computation and memory complexities? We introduce the semi-deterministic finite automata (SFA) and the state convolvement test to construct an SFA from a given NFA. An SFA consists of a fixed number (p) of constituent DFAs (c-DFA) running in parallel; each c-DFA is responsible for a subset of states in the original NFA. To match a set of regular expressions with n overlapping symbols (that can match the same input character concurrently), the NFA can require O(n) computation per input character, whereas the DFA can have a state transition table with O(2^n) states. By exploiting the state convolvements during the SFA construction, an equivalent SFA reduces the computation complexity to O(p^2/c^2) per input character while limiting the space requirement to O(|Σ| × p^2 × (n/p)^c) states, where Σ is the alphabet and c ≥ 1 is a small design constant. Although the problem of constructing the optimal (minimum-sized) SFA is shown to be NP-complete, we develop a greedy heuristic to quickly construct a near-optimal SFA in time and space quadratic in the number of states in the original NFA. We demonstrate our SFA construction using real-world regular expressions taken from the Snort IDS.

  • Compact Architecture for High-Throughput Regular Expression Matching on FPGA
    Architectures for Networking and Communications Systems, 2008
    Co-Authors: Yi-Hua E. Yang, Weirong Jiang, Viktor K Prasanna
    Abstract:

    In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REME) of arbitrary regular patterns and character classes in a uniform structure, utilizing both logic slices and block memory (BRAM) available on modern FPGA devices. The resulting circuits take advantage of synthesis and routing optimizations to achieve high operating speed and area efficiency. The uniform structure of our RE-NFA design can be stacked in a simple way to produce multi-character input circuits to scale up throughput further. An n-state m-character input REME takes only O(n × log2 m) time to construct and occupies no more than O(n × m) logic units. The REMEs can be staged and pipelined in large numbers to achieve high parallelism without sacrificing clock frequency. Using the proposed RE-NFA architecture, we are able to implement 3 copies of two-character input REMEs, each with 760 regular expressions, 18715 states and 371 character classes, onto a single Xilinx Virtex 4 LX-100-12 device. Each copy processes 2 characters per clock cycle at 300 MHz, resulting in a concurrent throughput of 14.4 Gbps for 760 REMEs. Compared with the automatic NFA-to-VHDL REME compilation [13], our approach achieves over 9× the throughput efficiency (Gbps*state/LUT). Compared with state-of-the-art REMEs on FPGA, our approach also indicates up to 70% better throughput efficiency.
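
    The multi-character input idea can be pictured, in a DFA-based analogue (the paper stacks NFA stages spatially rather than composing DFA tables), as precomposing two single-character transitions into one stride-2 table so each step consumes two input characters. The toy automaton below (it tracks runs of 'a' modulo 3) and the stride2 helper are assumptions; a real engine must also handle odd-length inputs and matches that complete on the first character of a pair.

```python
# Stride-2 sketch: compose two single-character DFA transitions into one table
# indexed by a pair of characters, doubling the characters consumed per step.

def stride2(dfa, alphabet):
    """dfa: state -> {char: next}. Returns state -> {(c1, c2): next}."""
    return {s: {(c1, c2): dfa[dfa[s][c1]][c2] for c1 in alphabet for c2 in alphabet}
            for s in dfa}

dfa = {0: {"a": 1, "b": 0},     # counts the current run of 'a' modulo 3
       1: {"a": 2, "b": 0},
       2: {"a": 0, "b": 0}}
dfa2 = stride2(dfa, "ab")

state = 0
for pair in zip(*[iter("abaaba")] * 2):   # process the input two characters at a time
    state = dfa2[state][pair]
print(state)                              # -> 1, same state the 1-stride DFA reaches
```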

  • Regular Expression Software Deceleration for Intrusion Detection Systems
    Proceedings - 2006 International Conference on Field Programmable Logic and Applications, FPL, 2006
    Co-Authors: Zachary K Baker, Hong Jip Jung, Viktor K Prasanna
    Abstract:

    The use of reconfigurable hardware for network security applications has recently made great strides as FPGA devices have provided larger and faster resources. Regular expressions have become a necessary and basic capability of intrusion detection systems, but their implementation tends to be expensive in terms of memory cost and time performance. This work provides an architecture that reduces the exponential NFA-to-DFA conversion cost to a linear growth for many expressions. By handling the timing and integration of the regular expression-based rules in a custom microcontroller, the memory costs are reduced and the capabilities are increased over a DFA-only solution. Both the microcontroller and its associated DFA are implemented on the FPGA. The patterns and software are stored using run-time programmable memory tables. This allows on-the-fly modification to the regular expressions. This paper presents the design details of the regular expression microcontroller and its integration to the DFA state machines. The types of expressions that the system can handle efficiently are discussed as well as the outstanding problems that continue to challenge the community.

  • Fast regular expression matching using FPGAs
    Proceedings of the 9th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2001), 2001
    Co-Authors: Reetinder Sidhu, Viktor K Prasanna
    Abstract:

    This paper presents an efficient method for finding matches to a given regular expression in given text using FPGAs. To match a regular expression of length n, a serial machine requires O(2^n) memory and takes O(1) time per text character. The proposed approach requires only O(n^2) space and still processes a text character in O(1) time (one clock cycle). The improvement is due to the Nondeterministic Finite Automaton (NFA) used to perform the matching. As far as the authors are aware, this is the first practical use of a nondeterministic state machine on programmable logic. Furthermore, the paper presents a simple, fast algorithm that quickly constructs the NFA for the given regular expression. Fast NFA construction is crucial because the NFA structure depends on the regular expression, which is known only at runtime. Implementations of the algorithm for conventional FPGAs and the Self-Reconfigurable Gate Array (SRGA) are described. To evaluate performance, the NFA logic was mapped onto the Virtex XCV100 FPGA and the SRGA. Also, the performance of GNU grep for matching regular expressions was evaluated on an 800 MHz Pentium III machine. The proposed approach was faster than best-case grep performance in most cases, and orders of magnitude faster than worst-case grep performance. Logic for the largest NFA considered fit in fewer than 1,000 CLBs, while DFA storage for grep in the worst case consumed a few hundred megabytes.
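
    The core idea, one flip-flop per NFA state with all states updated in parallel every clock cycle, has a direct bit-parallel software analogue. The Shift-And sketch below (for a literal pattern only, not arbitrary regular expressions) uses bit i of an integer as the "flip-flop" for NFA state i; the pattern and text are illustrative.

```python
# Shift-And: bit-parallel NFA simulation in which every state bit is updated
# from its predecessor in one step, mirroring one flip-flop per NFA state.

def shift_and(pattern: str, text: str):
    masks = {}
    for i, ch in enumerate(pattern):          # per-character state-enable masks
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # all "flip-flops" update in parallel from their predecessors
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits

print(shift_and("nfa", "a dfa or an nfa, but an nfa nonetheless"))  # -> [12, 24]
```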

Kubilay Atasu - One of the best experts on this subject based on the ideXlab platform.

  • Hardware-accelerated regular expression matching for high-throughput text analytics
    2013 23rd International Conference on Field programmable Logic and Applications, 2013
    Co-Authors: Kubilay Atasu, Raphael Polig, Christoph Hagleitner, Frederick R. Reiss
    Abstract:

    Advanced text analytics systems combine regular expression (regex) matching, dictionary processing, and relational algebra for efficient information extraction from text documents. Such systems require support for advanced regex matching features, such as start offset reporting and capturing groups. However, existing regex matching architectures based on reconfigurable nondeterministic state machines and programmable deterministic state machines are not designed to support such features. We describe a novel architecture that supports such advanced features using a network of state machines. We also present a compiler that maps the regexes onto such networks that can be efficiently realized on reconfigurable logic. For each regex, our compiler produces a state machine description, statically computes the number of state machines needed, and produces an optimized interconnection network. Experiments on an Altera Stratix IV FPGA, using regexes from a real-life text analytics benchmark, show that a throughput rate of 16 Gb/s can be reached.
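
    The advanced features in question, start-offset reporting and capturing groups, are what software engines expose through match objects; the snippet below uses Python's re module purely to illustrate the behavior a hardware engine must reproduce (the log line and group names are invented).

```python
import re

# Start offsets and capturing groups: what classic accept/reject automata do
# not report, but text-analytics extraction needs.
log_line = "user=alice action=login ts=1718112000"
pattern = re.compile(r"user=(?P<user>\w+)\s+action=(?P<action>\w+)")

m = pattern.search(log_line)
if m:
    print(m.start())                            # start offset of the match -> 0
    print(m.span("action"))                     # offsets of a group -> (18, 23)
    print(m.group("user"), m.group("action"))   # -> alice login
```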

  • Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator
    2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012
    Co-Authors: Jan Van Lunteren, Christoph Hagleitner, Timothy Heil, Giora Biran, Uzi Shvadron, Kubilay Atasu
    Abstract:

    A growing number of applications rely on fast pattern matching to scan data in real-time for security and analytics purposes. The RegX accelerator in the IBM Power Edge of Network (PowerEN) processor supports these applications using a combination of fast programmable state machines and simple processing units to scan data streams against thousands of regular-expression patterns at state-of-the-art Ethernet link speeds. RegX employs a special rule cache and includes several new micro-architectural features that enable various instruction dispatch and execution options for the processing units. The architecture applies RISC philosophy to special-purpose computing: hardware provides fast, simple primitives, typically performed in a single cycle, which are exploited by an intelligent compiler and system software for high performance. This approach provides the flexibility required to achieve good performance across a wide range of workloads. As implemented in the PowerEN processor, the accelerator achieves a theoretical peak scan rate of 73.6 Gbit/s, and a measured scan rate of about 15 to 40 Gbit/s for typical intrusion detection workloads.

Benjamin C. Pierce - One of the best experts on this subject based on the ideXlab platform.

  • Regular Expression Pattern Matching for XML
    Journal of Functional Programming, 2003
    Co-Authors: Haruo Hosoya, Benjamin C. Pierce
    Abstract:

    We propose regular expression pattern matching as a core feature of programming languages for manipulating XML. We extend conventional pattern-matching facilities (as in ML) with regular expression operators such as repetition (*), alternation (|), etc., that can match arbitrarily long sequences of subtrees, allowing a compact pattern to extract data from the middle of a complex sequence. We then show how to check standard notions of exhaustiveness and redundancy for these patterns. Regular expression patterns are intended to be used in languages with type systems based on regular expression types. To avoid excessive type annotations, we develop a type inference scheme that propagates type constraints to pattern variables from the type of input values. The type inference algorithm translates types and patterns into regular tree automata, and then works in terms of standard closure operations (union, intersection, and difference) on tree automata. The main technical challenge is dealing with the interaction of repetition and alternation patterns with the first-match policy, which gives rise to subtleties concerning both the termination and precision of the analysis. We address these issues by introducing a data structure representing these closure operations lazily.
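
    A toy rendering of the idea (not the paper's tree-automata machinery): patterns over sequences of XML-like tags with repetition, alternation, and a greedy first-match policy. The pattern encoding and the match helper below are assumptions; note that the naive greedy policy can reject sequences a backtracking matcher would accept, which hints at the first-match subtleties discussed above.

```python
# Toy regex patterns over sequences of XML-like tags, with a greedy
# first-match policy and no backtracking.

def match(pattern, seq):
    """Return the leftover sequence if pattern matches a prefix of seq, else None."""
    kind = pattern[0]
    if kind == "tag":                      # match one element with this tag
        return seq[1:] if seq and seq[0] == pattern[1] else None
    if kind == "seq":                      # match sub-patterns in order
        rest = seq
        for p in pattern[1:]:
            rest = match(p, rest)
            if rest is None:
                return None
        return rest
    if kind == "alt":                      # first alternative that matches wins
        for p in pattern[1:]:
            rest = match(p, seq)
            if rest is not None:
                return rest
        return None
    if kind == "star":                     # greedy repetition (stop on no progress)
        rest = seq
        while True:
            nxt = match(pattern[1], rest)
            if nxt is None or nxt == rest:
                return rest
            rest = nxt
    raise ValueError(f"unknown pattern kind: {kind}")

# ("a" | "b")* followed by "c", matched against the child sequence a, b, a, c
pat = ("seq", ("star", ("alt", ("tag", "a"), ("tag", "b"))), ("tag", "c"))
print(match(pat, ["a", "b", "a", "c"]))    # -> []  (the whole sequence matches)
```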

  • Regular Expression Pattern Matching for XML
    Symposium on Principles of Programming Languages, 2001
    Co-Authors: Haruo Hosoya, Benjamin C. Pierce
    Abstract:

    We propose regular expression pattern matching as a core feature for programming languages for manipulating XML (and similar tree-structured data formats). We extend conventional pattern-matching facilities with regular expression operators such as repetition (*), alternation (|), etc., that can match arbitrarily long sequences of subtrees, allowing a compact pattern to extract data from the middle of a complex sequence. We show how to check standard notions of exhaustiveness and redundancy for these patterns. Regular expression patterns are intended to be used in languages whose type systems are also based on regular expression types. To avoid excessive type annotations, we develop a type inference scheme that propagates type constraints to pattern variables from the surrounding context. The type inference algorithm translates types and patterns into regular tree automata and then works in terms of standard closure operations (union, intersection, and difference) on tree automata. The main technical challenge is dealing with the interaction of repetition and alternation patterns with the first-match policy, which gives rise to subtleties concerning both the termination and the precision of the analysis. We address these issues by introducing a data structure representing closure operations lazily.

  • Regular Expression Types for XML
    International Conference on Functional Programming, 2000
    Co-Authors: Haruo Hosoya, Jérôme Vouillon, Benjamin C. Pierce
    Abstract:

    We propose regular expression types as a foundation for XML processing languages. Regular expression types are a natural generalization of Document Type Definitions (DTDs), describing structures in XML documents using regular expression operators (i.e., *, ?, |, etc.) and supporting a simple but powerful notion of subtyping. The decision problem for the subtype relation is EXPTIME-hard, but it can be checked quite efficiently in many cases of practical interest. The subtyping algorithm developed here is a variant of Aiken and Murphy's set-inclusion constraint solver, to which are added several optimizations and two new properties: (1) our algorithm is provably complete, and (2) it allows a useful "subtagging" relation between nodes with different labels in XML trees.
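
    To make the DTD analogy concrete, a regular expression type can be read as a regular expression over child element names, with subtyping as language inclusion. The sketch below encodes content models as ordinary string regexes and checks them with Python's re module; the element names and type definitions are invented, and string regexes capture only the flat, single-level case (the paper's types and subtyping work on trees).

```python
import re

# Regular expression types as content models over child element names.
person_type  = r"name (tel )*(email )*"       # Person  = name, Tel*, Email*
contact_type = r"name (tel |email )*"         # Contact = name, (Tel|Email)*

def conforms(child_tags, content_model):
    """Flatten the children to a tag string and validate it against the model."""
    return re.fullmatch(content_model, " ".join(child_tags) + " ") is not None

print(conforms(["name", "tel", "email", "email"], person_type))   # -> True
print(conforms(["name", "email", "tel"], person_type))            # -> False (order)
print(conforms(["name", "email", "tel"], contact_type))           # -> True
# Every Person value is also a Contact value, i.e. Person <: Contact
# (language inclusion between the two content models).
```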