Text Analytics

The Experts below are selected from a list of 6453 Experts worldwide ranked by ideXlab platform

Laura Chiticariu - One of the best experts on this subject based on the ideXlab platform.

  • A hardware compilation framework for Text Analytics queries
    Journal of Parallel and Distributed Computing, 2018
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, Laura Chiticariu, Heiner Giefers, Christoph Hagleitner, Peter Hofstee
    Abstract:

    Unstructured text data is being generated at an unprecedented rate in the form of Twitter feeds, machine logs or medical records. The analysis of this data is an important step to gaining significant insight regarding innovation, security and decision-making. The performance of traditional compute systems struggles to keep up with the rapid data growth and the expected high quality of information extraction. To cope with this situation, a compilation framework is presented that can transform text analytics queries into a hardware description. Deployed on an FPGA, the queries can be executed 60 times faster on average compared to a multi-threaded software implementation. The performance has been evaluated on two generations of high-end server systems including two generations of FPGAs, demonstrating the performance gains from advanced technology.

  • Compiling Text Analytics queries to FPGAs
    2014 24th International Conference on Field Programmable Logic and Applications (FPL), 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Heiner Giefers, Laura Chiticariu
    Abstract:

    Extracting information from unstructured text data is a compute-intensive task. The performance of general-purpose processors cannot keep up with the rapid growth of textual data. Therefore, we discuss the use of FPGAs to perform large-scale text analytics. We present a framework consisting of a compiler and an operator library capable of generating a Verilog processing pipeline from a text analytics query specified in the Annotation Query Language (AQL). The operator library comprises a set of configurable modules capable of performing relational and extraction tasks, which can be assembled by the compiler to represent a full annotation operator graph. Leveraging the nature of text processing, we show that most tasks can be performed in an efficient streaming fashion. We evaluate the performance, power consumption and hardware utilization of our approach for a set of different queries compiled to a Stratix IV FPGA. Measurements show up to a 79-fold improvement in document throughput over a 64-thread software implementation on a POWER7 server. Moreover, the accelerated system's energy efficiency is up to 85 times better.
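
The streaming operator graph described in this abstract can be illustrated in software. The following is a minimal sketch, assuming plain Python generators stand in for the hardware operators; it is not AQL and not the authors' compiler output. The names `Span`, `extract_regex` and `select` are hypothetical, chosen only for this illustration. One extraction operator feeds one relational operator, and each span flows through the pipeline as soon as it is found.

```python
import re
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple


class Span(NamedTuple):
    """An annotation: which document it came from and where it sits."""
    doc_id: int
    begin: int
    end: int
    text: str


def extract_regex(docs: Iterable[Tuple[int, str]], pattern: str) -> Iterator[Span]:
    """Extraction operator (hypothetical): emit one Span per regex match."""
    regex = re.compile(pattern)
    for doc_id, text in docs:
        for m in regex.finditer(text):
            yield Span(doc_id, m.start(), m.end(), m.group())


def select(spans: Iterable[Span], predicate: Callable[[Span], bool]) -> Iterator[Span]:
    """Relational operator (hypothetical): keep spans that satisfy the predicate."""
    for span in spans:
        if predicate(span):
            yield span


# Two toy documents, made up for this example.
docs = [(0, "Call 555-1234 or 555-9876."), (1, "No numbers here.")]

# Operators are chained lazily, so no intermediate list is materialized.
phones = extract_regex(docs, r"\d{3}-\d{4}")
result = list(select(phones, lambda s: s.text.endswith("4")))
```

Because each operator consumes and produces an iterator, the graph processes one span at a time, which is the software analogue of the streaming behavior the abstract attributes to most tasks.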

  • Giving Text Analytics a Boost
    IEEE Micro, 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, H. Peter Hofstee, Laura Chiticariu, Christoph Hagleitner, Eva Sitaridi
    Abstract:

    The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.

  • I can do Text Analytics!: designing development tools for novice developers
    Human Factors in Computing Systems, 2013
    Co-Authors: Huahai Yang, Daina Puponswickham, Yunyao Li, Laura Chiticariu, Benjamin Nguyen, Arnaldo Carrenofuentes
    Abstract:

    Text analytics, an increasingly important application domain, is hampered by a high barrier to entry due to the many conceptual difficulties novice developers encounter. This work addresses the problem by developing a tool to guide novice developers to adopt the best practices employed by expert developers in text analytics and to quickly harness the full power of the underlying system. Taking a user-centered, task-analytical approach, the tool development went through multiple design iterations and evaluation cycles. In the latest evaluation, we found that our tool enables novice developers to develop high-quality extractors on par with the state of the art within a few hours and with minimal training. Finally, we discuss our experience and lessons learned in the context of designing user interfaces to reduce the barriers to entry into complex domains of expertise.

Raphael Polig - One of the best experts on this subject based on the ideXlab platform.

  • A hardware compilation framework for Text Analytics queries
    Journal of Parallel and Distributed Computing, 2018
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, Laura Chiticariu, Heiner Giefers, Christoph Hagleitner, Peter Hofstee
    Abstract:

    Unstructured text data is being generated at an unprecedented rate in the form of Twitter feeds, machine logs or medical records. The analysis of this data is an important step to gaining significant insight regarding innovation, security and decision-making. The performance of traditional compute systems struggles to keep up with the rapid data growth and the expected high quality of information extraction. To cope with this situation, a compilation framework is presented that can transform text analytics queries into a hardware description. Deployed on an FPGA, the queries can be executed 60 times faster on average compared to a multi-threaded software implementation. The performance has been evaluated on two generations of high-end server systems including two generations of FPGAs, demonstrating the performance gains from advanced technology.

  • Compiling Text Analytics queries to FPGAs
    2014 24th International Conference on Field Programmable Logic and Applications (FPL), 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Heiner Giefers, Laura Chiticariu
    Abstract:

    Extracting information from unstructured text data is a compute-intensive task. The performance of general-purpose processors cannot keep up with the rapid growth of textual data. Therefore, we discuss the use of FPGAs to perform large-scale text analytics. We present a framework consisting of a compiler and an operator library capable of generating a Verilog processing pipeline from a text analytics query specified in the Annotation Query Language (AQL). The operator library comprises a set of configurable modules capable of performing relational and extraction tasks, which can be assembled by the compiler to represent a full annotation operator graph. Leveraging the nature of text processing, we show that most tasks can be performed in an efficient streaming fashion. We evaluate the performance, power consumption and hardware utilization of our approach for a set of different queries compiled to a Stratix IV FPGA. Measurements show up to a 79-fold improvement in document throughput over a 64-thread software implementation on a POWER7 server. Moreover, the accelerated system's energy efficiency is up to 85 times better.

  • Giving Text Analytics a Boost
    IEEE Micro, 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, H. Peter Hofstee, Laura Chiticariu, Christoph Hagleitner, Eva Sitaridi
    Abstract:

    The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.

  • Hardware-accelerated regular expression matching for high-throughput Text Analytics
    2013 23rd International Conference on Field Programmable Logic and Applications, 2013
    Co-Authors: Kubilay Atasu, Raphael Polig, Christoph Hagleitner, Frederick R. Reiss
    Abstract:

    Advanced text analytics systems combine regular expression (regex) matching, dictionary processing, and relational algebra for efficient information extraction from text documents. Such systems require support for advanced regex matching features, such as start offset reporting and capturing groups. However, existing regex matching architectures based on reconfigurable nondeterministic state machines and programmable deterministic state machines are not designed to support such features. We describe a novel architecture that supports such advanced features using a network of state machines. We also present a compiler that maps the regexes onto such networks, which can be efficiently realized on reconfigurable logic. For each regex, our compiler produces a state machine description, statically computes the number of state machines needed, and produces an optimized interconnection network. Experiments on an Altera Stratix IV FPGA, using regexes from a real-life text analytics benchmark, show that a throughput rate of 16 Gb/s can be reached.
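
To make the two features named in this abstract concrete, here is a small software illustration using Python's standard `re` module. It shows only what start offset reporting and capturing groups mean for an extractor; it says nothing about how the hardware architecture computes them. The pattern and the sample text are assumptions made up for this example.

```python
import re

# Hypothetical e-mail-like pattern with two named capturing groups.
pattern = re.compile(r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)")
text = "Contact alice@example.com or bob@test.org today."

# For each match, report where it starts and ends (start offset reporting)
# and recover the sub-spans captured by the groups (capturing groups).
matches = [
    (m.start(), m.end(), m.group("user"), m.span("domain"))
    for m in pattern.finditer(text)
]

for start, end, user, domain_span in matches:
    print(start, end, user, domain_span)
```

A matcher that only signals "a match ended here", which is often enough for intrusion detection, could not produce the start offsets or the `domain` sub-spans that the extraction step needs.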

Kubilay Atasu - One of the best experts on this subject based on the ideXlab platform.

  • A hardware compilation framework for Text Analytics queries
    Journal of Parallel and Distributed Computing, 2018
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, Laura Chiticariu, Heiner Giefers, Christoph Hagleitner, Peter Hofstee
    Abstract:

    Unstructured text data is being generated at an unprecedented rate in the form of Twitter feeds, machine logs or medical records. The analysis of this data is an important step to gaining significant insight regarding innovation, security and decision-making. The performance of traditional compute systems struggles to keep up with the rapid data growth and the expected high quality of information extraction. To cope with this situation, a compilation framework is presented that can transform text analytics queries into a hardware description. Deployed on an FPGA, the queries can be executed 60 times faster on average compared to a multi-threaded software implementation. The performance has been evaluated on two generations of high-end server systems including two generations of FPGAs, demonstrating the performance gains from advanced technology.

  • Compiling Text Analytics queries to FPGAs
    2014 24th International Conference on Field Programmable Logic and Applications (FPL), 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Heiner Giefers, Laura Chiticariu
    Abstract:

    Extracting information from unstructured text data is a compute-intensive task. The performance of general-purpose processors cannot keep up with the rapid growth of textual data. Therefore, we discuss the use of FPGAs to perform large-scale text analytics. We present a framework consisting of a compiler and an operator library capable of generating a Verilog processing pipeline from a text analytics query specified in the Annotation Query Language (AQL). The operator library comprises a set of configurable modules capable of performing relational and extraction tasks, which can be assembled by the compiler to represent a full annotation operator graph. Leveraging the nature of text processing, we show that most tasks can be performed in an efficient streaming fashion. We evaluate the performance, power consumption and hardware utilization of our approach for a set of different queries compiled to a Stratix IV FPGA. Measurements show up to a 79-fold improvement in document throughput over a 64-thread software implementation on a POWER7 server. Moreover, the accelerated system's energy efficiency is up to 85 times better.

  • Giving Text Analytics a Boost
    IEEE Micro, 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, H. Peter Hofstee, Laura Chiticariu, Christoph Hagleitner, Eva Sitaridi
    Abstract:

    The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.

  • Resource-efficient regular expression matching architecture for Text Analytics
    2014 IEEE 25th International Conference on Application-Specific Systems Architectures and Processors, 2014
    Co-Authors: Kubilay Atasu
    Abstract:

    Text analytics systems, such as IBM's SystemT software, rely on regular expressions (regexes) and dictionaries for transforming unstructured data into a structured format. Unlike network intrusion detection systems, text analytics systems compute and report precisely where the specific and sensitive information starts and ends in a text document. Therefore, advanced regex matching functions, such as start-offset reporting, capturing groups, and leftmost match computation, are heavily used in text analytics systems. We present a novel regex matching architecture that supports such functions in a resource-efficient way. The resource efficiency is achieved by 1) eliminating state replication, 2) avoiding expensive offset comparison operations in leftmost match computation, and 3) minimizing the number of offset registers. Experiments on regex sets from text analytics and network intrusion detection domains, using an Altera Stratix IV FPGA, show that the proposed architecture achieves a more than threefold reduction of the logic resources used and a more than 1.25-fold increase of the clock frequency with respect to a recently proposed architecture that supports identical features.
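
Leftmost match computation, as mentioned in this abstract, can be sketched naively in software. The example below is an assumption-laden illustration, not the paper's architecture: it runs each pattern separately and then compares start offsets, which is exactly the kind of offset comparison the abstract says the proposed hardware avoids. The function name `leftmost_match` and the tie-breaking rule (prefer the longer match on equal start) are hypothetical choices for this sketch.

```python
import re
from typing import Optional, Sequence, Tuple


def leftmost_match(patterns: Sequence[str], text: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, pattern) for the match that begins earliest in text.

    Ties on the start offset are broken in favor of the longer match.
    Returns None when no pattern matches.
    """
    best: Optional[Tuple[int, int, str]] = None
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is None:
            continue
        candidate = (m.start(), m.end(), pattern)
        # Explicit offset comparison: smaller start wins, then larger end.
        if best is None or (candidate[0], -candidate[1]) < (best[0], -best[1]):
            best = candidate
    return best


# "b+" first matches at offset 2, "ab+" at offset 1, so "ab+" is leftmost.
hit = leftmost_match(["b+", "ab+"], "xabbb")
```

In this software form the cost is one comparison per candidate pattern; the abstract's point is that doing the equivalent with offset registers and comparators in hardware is expensive enough to be worth designing around.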

Zeynep Akkalyoncu Yilmaz - One of the best experts on this subject based on the ideXlab platform.

  • Information Retrieval Meets Scalable Text Analytics: Solr Integration with Spark
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR'19, 2019
    Co-Authors: Ryan Clancy, Zeynep Akkalyoncu Yilmaz
    Abstract:

    Despite the broad adoption of both Apache Spark and Apache Solr, there is little integration between these two platforms to support scalable, end-to-end text analytics. We believe this is a missed opportunity, as there is substantial synergy in building analytical pipelines where the results of potentially complex faceted queries feed downstream text processing components. This demonstration explores exactly such an integration: we evaluate performance under different analytical scenarios and present three simple case studies that illustrate the range of possible analyses enabled by seamlessly connecting Spark to Solr.

Frederick R. Reiss - One of the best experts on this subject based on the ideXlab platform.

  • A hardware compilation framework for Text Analytics queries
    Journal of Parallel and Distributed Computing, 2018
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, Laura Chiticariu, Heiner Giefers, Christoph Hagleitner, Peter Hofstee
    Abstract:

    Unstructured text data is being generated at an unprecedented rate in the form of Twitter feeds, machine logs or medical records. The analysis of this data is an important step to gaining significant insight regarding innovation, security and decision-making. The performance of traditional compute systems struggles to keep up with the rapid data growth and the expected high quality of information extraction. To cope with this situation, a compilation framework is presented that can transform text analytics queries into a hardware description. Deployed on an FPGA, the queries can be executed 60 times faster on average compared to a multi-threaded software implementation. The performance has been evaluated on two generations of high-end server systems including two generations of FPGAs, demonstrating the performance gains from advanced technology.

  • Giving Text Analytics a Boost
    IEEE Micro, 2014
    Co-Authors: Raphael Polig, Kubilay Atasu, Frederick R. Reiss, H. Peter Hofstee, Laura Chiticariu, Christoph Hagleitner, Eva Sitaridi
    Abstract:

    The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.

  • Hardware-accelerated regular expression matching for high-throughput Text Analytics
    2013 23rd International Conference on Field Programmable Logic and Applications, 2013
    Co-Authors: Kubilay Atasu, Raphael Polig, Christoph Hagleitner, Frederick R. Reiss
    Abstract:

    Advanced text analytics systems combine regular expression (regex) matching, dictionary processing, and relational algebra for efficient information extraction from text documents. Such systems require support for advanced regex matching features, such as start offset reporting and capturing groups. However, existing regex matching architectures based on reconfigurable nondeterministic state machines and programmable deterministic state machines are not designed to support such features. We describe a novel architecture that supports such advanced features using a network of state machines. We also present a compiler that maps the regexes onto such networks, which can be efficiently realized on reconfigurable logic. For each regex, our compiler produces a state machine description, statically computes the number of state machines needed, and produces an optimized interconnection network. Experiments on an Altera Stratix IV FPGA, using regexes from a real-life text analytics benchmark, show that a throughput rate of 16 Gb/s can be reached.
