Stream Processing

The Experts below are selected from a list of 58,035 Experts worldwide ranked by the ideXlab platform

Martin Hirzel - One of the best experts on this subject based on the ideXlab platform.

  • Dagstuhl Seminar on Big Stream Processing
    ACM SIGMOD Record, 2019
    Co-Authors: Sherif Sakr, Martin Hirzel, Tilmann Rabl, Paris Carbone, Martin Strohbach
    Abstract:

    Stream Processing can generate insights from big data in real time as it is being produced. This paper reports findings from a 2017 seminar on big Stream Processing, focusing on applications, systems, and languages.

  • Stream Processing languages in the big data era
    International Conference on Management of Data, 2018
    Co-Authors: Martin Hirzel, Sherif Sakr, Emanuele Della Valle, Guillaume Baudart, Angela Bonifati, Akrivi Vlachou
    Abstract:

    This paper is a survey of recent Stream Processing languages, which are programming languages for writing applications that analyze data Streams. Data Streams, or continuous data flows, have been around for decades. But with the advent of the big-data era, the size of data Streams has increased dramatically. Analyzing big data Streams yields immense advantages across all sectors of our society. To analyze Streams, one needs to write a Stream Processing application. This paper showcases several languages designed for this purpose, articulates underlying principles, and outlines open challenges.

  • SPL: An Extensible Language for Distributed Stream Processing
    ACM Transactions on Programming Languages and Systems, 2017
    Co-Authors: Martin Hirzel, Scott Schneider, Bugra Gedik
    Abstract:

    Big data is revolutionizing how all sectors of our economy do business, including telecommunications, transportation, medicine, and finance. Big data comes in two flavors: data at rest and data in motion. Processing data in motion is Stream Processing. Stream Processing for big data analytics often requires scale that can only be delivered by a distributed system, exploiting parallelism on many hosts and many cores. One such distributed Stream Processing system is IBM Streams. Early customer experience with IBM Streams uncovered that another core requirement is extensibility, since customers want to build high-performance domain-specific operators for use in their Streaming applications. Based on these two core requirements of distribution and extensibility, we designed and implemented the Streams Processing Language (SPL). This article describes SPL with an emphasis on the language design, distributed runtime, and extensibility mechanism. SPL is now the gateway for the IBM Streams platform, used by our customers for Stream Processing in a broad range of application domains.
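
The tuple-stream and composable-operator model that languages like SPL expose can be sketched in a few lines. The following is illustrative Python, not SPL syntax; the operator names (functor, filter, aggregate) merely echo SPL's vocabulary:

```python
# Minimal sketch of the tuple-stream / operator model: a pipeline of
# composable operators, each consuming a stream of tuples and producing
# a new one. Illustrative Python, not SPL.

def source(values):
    """Source operator: turn an iterable into a stream of tuples."""
    for v in values:
        yield {"value": v}

def functor(stream, fn):
    """Functor operator: transform each tuple (stateless map)."""
    for t in stream:
        yield fn(t)

def filt(stream, predicate):
    """Filter operator: drop tuples that fail the predicate."""
    for t in stream:
        if predicate(t):
            yield t

def aggregate(stream, size):
    """Aggregate operator: tumbling window of `size` tuples -> sum."""
    window = []
    for t in stream:
        window.append(t["value"])
        if len(window) == size:
            yield {"sum": sum(window)}
            window = []

# Compose a small dataflow graph: source -> functor -> filter -> aggregate.
pipeline = aggregate(
    filt(
        functor(source(range(10)), lambda t: {"value": t["value"] * 2}),
        lambda t: t["value"] > 2,
    ),
    size=4,
)
print(list(pipeline))  # [{'sum': 28}, {'sum': 60}]
```

A real system like IBM Streams additionally distributes such a graph over hosts and cores; generators here stand in for the queues between operators.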

  • River: an intermediate language for Stream Processing
    Software: Practice and Experience, 2015
    Co-Authors: Robert Soulé, Martin Hirzel, Bugra Gedik, Robert Grimm
    Abstract:

    This paper presents both a calculus for Stream Processing, named Brooklet, and its realization as an intermediate language, named River. Because River is based on Brooklet, it has a formal semantics that enables reasoning about the correctness of source translations and optimizations. River builds on Brooklet by addressing the real-world details that the calculus elides. We evaluated our system by implementing front-ends for three Streaming languages, and three important optimizations, and a back-end for the System S distributed Streaming runtime. Overall, we significantly lower the barrier to entry for new Stream-Processing languages and thus grow the ecosystem of this crucial style of programming. Copyright © 2015 John Wiley & Sons, Ltd.

  • Stream Processing with a spreadsheet
    European Conference on Object-Oriented Programming, 2014
    Co-Authors: Mandana Vaziri, Olivier Tardieu, Rodric Rabbah, Philippe Suter, Martin Hirzel
    Abstract:

    Continuous data Streams are ubiquitous and represent such a high volume of data that they cannot be stored to disk, yet it is often crucial for them to be analyzed in real time. Stream Processing is a programming paradigm that processes these Streams as they arrive and enables continuous analytics. Our objective is to make it easier for analysts with little programming experience to develop continuous analytics applications directly. We propose enhancing a spreadsheet, a pervasive tool, to obtain a programming platform for Stream Processing. We present the design and implementation of an enhanced spreadsheet that enables visualizing live Streams, live programming to compute new Streams, and exporting computations to be run on a server, where they can be shared with other users and persisted beyond the life of the spreadsheet. We formalize our core language and present case studies that cover a range of Stream Processing applications.
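
The core idea can be illustrated with a toy model (a hypothetical API, not the authors' implementation): spreadsheet cells whose formulas recompute each time a new stream tuple lands in an input cell, so the sheet always shows a live result.

```python
# Toy live spreadsheet: formula cells recompute on every stream arrival.
# Hypothetical API for illustration, not the paper's implementation.

class Sheet:
    def __init__(self):
        self.values = {}     # cell name -> current value
        self.formulas = {}   # cell name -> function of the sheet values

    def set_formula(self, name, fn):
        self.formulas[name] = fn

    def feed(self, cell, value):
        """Deliver one stream tuple into an input cell, then recompute
        every formula cell against the new state."""
        self.values[cell] = value
        for name, fn in self.formulas.items():
            self.values[name] = fn(self.values)

sheet = Sheet()
history = []
# B1 tracks a running maximum of the live input arriving in A1.
sheet.set_formula("B1", lambda v: max(v.get("A1", 0), v.get("B1", 0)))
for reading in [3, 7, 5, 9, 2]:       # the incoming data stream
    sheet.feed("A1", reading)
    history.append(sheet.values["B1"])
print(history)  # [3, 7, 7, 9, 9] -- running maximum after each arrival
```

The "export to server" feature of the paper would correspond to shipping the formula functions to a long-running host so the computation outlives the sheet.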

Rajkumar Buyya - One of the best experts on this subject based on the ideXlab platform.

  • Distributed Data Stream Processing and Edge Computing
    Journal of Network and Computer Applications, 2018
    Co-Authors: Marcos Dias de Assunção, Alexandre Da Silva Veith, Rajkumar Buyya
    Abstract:

    Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructure, wearable assistance, and the Internet of Things, continuous data Streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for Processing unbounded data Streams in a scalable and efficient manner. More recently, architectures have been proposed to use edge computing for data Stream Processing. This paper surveys the state of the art on Stream Processing engines and mechanisms for exploiting resource elasticity features of cloud computing in Stream Processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, Stream Processing poses challenges for achieving elastic systems that can make efficient resource management decisions based on current load. Elasticity becomes even more challenging in highly distributed environments comprising edge and cloud computing resources. This work examines some of these challenges and discusses solutions proposed in the literature to address them.
    Highlights: The paper surveys the state of the art on Stream Processing engines and mechanisms. It describes how existing solutions exploit resource elasticity features of cloud computing in Stream Processing. It presents a gap analysis and future directions for Stream Processing in heterogeneous environments.
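
The scale-out/scale-in behavior that resource elasticity enables can be sketched as a simple threshold controller. The thresholds and replica bounds below are illustrative choices, not values from the survey:

```python
# Sketch of an elasticity controller: scale operator replicas out when
# observed per-replica load is high, in when it is low. Thresholds and
# bounds are illustrative, not from the surveyed systems.

def elastic_controller(load_samples, high=0.8, low=0.3,
                       min_replicas=1, max_replicas=8):
    """Yield the replica count chosen after each load sample.
    `load_samples` are per-replica utilizations in [0, 1]."""
    replicas = min_replicas
    for load in load_samples:
        if load > high:
            replicas = min(max_replicas, replicas * 2)   # scale out
        elif load < low:
            replicas = max(min_replicas, replicas // 2)  # scale in
        yield replicas

trace = list(elastic_controller([0.9, 0.95, 0.85, 0.5, 0.2, 0.1]))
print(trace)  # [2, 4, 8, 8, 4, 2]
```

Real elastic stream systems face the harder problems the survey discusses: state migration when replicas change, and making such decisions under edge/cloud heterogeneity rather than from a single load signal.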

  • A Taxonomy and Survey of Stream Processing Systems
    Software Architecture for Big Data and the Cloud, 2017
    Co-Authors: Xinwei Zhao, Saurabh Kumar Garg, Carlos Queiroz, Rajkumar Buyya
    Abstract:

    In the era of big data, an unprecedented amount of data is generated every second. Real-time analytics has become a transformative force for organizations seeking to grow their consumer base and profits. As a result, real-time Stream Processing systems have gained a lot of attention, particularly within social media companies such as Twitter and LinkedIn. To identify the open challenges in the area of Stream Processing and facilitate future advancements, it is essential to synthesize and categorize current Stream Processing systems. In this chapter, we propose a taxonomy that characterizes and classifies various Stream systems. Based on the taxonomy, we present a survey and comparison study of state-of-the-art open source Stream computing platforms. The taxonomy and survey are intended to help researchers by providing insights into the capabilities of existing Stream platforms, and businesses by providing criteria that can be leveraged to identify the most suitable Stream Processing solution for developing their domain-specific applications.

  • Stream Processing in IoT: foundations, state-of-the-art, and future directions
    Internet of Things, 2016
    Co-Authors: Xunyun Liu, Amir Vahid Dastjerdi, Rajkumar Buyya
    Abstract:

    The emerging Stream-Processing paradigm has become an enabling technology that powers time-critical applications for the Internet of Things. It offers the ability to collect, integrate, analyze, and visualize continuous data Streams in real time, with a scalable, highly available, and fault-tolerant architecture. This chapter outlines the relationship between Stream Processing and IoT, provides a discussion of the idea and formal concepts of Stream Processing, and further categorizes the various use cases of the Stream model into the Data Stream Management and Complex Event Processing domains. We also present a detailed analysis of the characteristics of IoT Stream data with a focus on their Processing requirements, based on which an abstract architecture of a Stream-Processing system is proposed to satisfy the Processing needs of both IoT domains. This chapter concludes with an outlook that explores future directions and trending topics in the development of Stream Processing to better support IoT applications.
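
The Data Stream Management side of the Stream model can be shown in miniature as a continuous query: a sliding-window average over an IoT sensor stream, emitting a fresh result as each reading arrives once the window fills. The window size and readings are made up for illustration:

```python
# A classic continuous query over an IoT stream: sliding-window average.
# Window size and sensor readings are illustrative.
from collections import deque

def sliding_avg(readings, window=3):
    """Yield the average of the last `window` readings once the
    window is full, updating on every new arrival."""
    buf = deque(maxlen=window)   # old readings fall out automatically
    for r in readings:
        buf.append(r)
        if len(buf) == window:
            yield sum(buf) / window

temps = [20.0, 21.0, 22.0, 23.0, 24.0]   # a temperature sensor stream
print(list(sliding_avg(temps)))  # [21.0, 22.0, 23.0]
```

The Complex Event Processing side of the model would instead match patterns across tuples (e.g., "three rising readings in a row") rather than aggregate them.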

Bugra Gedik - One of the best experts on this subject based on the ideXlab platform.

  • SPL: An Extensible Language for Distributed Stream Processing
    ACM Transactions on Programming Languages and Systems, 2017
    Co-Authors: Martin Hirzel, Scott Schneider, Bugra Gedik
    Abstract:

    Big data is revolutionizing how all sectors of our economy do business, including telecommunications, transportation, medicine, and finance. Big data comes in two flavors: data at rest and data in motion. Processing data in motion is Stream Processing. Stream Processing for big data analytics often requires scale that can only be delivered by a distributed system, exploiting parallelism on many hosts and many cores. One such distributed Stream Processing system is IBM Streams. Early customer experience with IBM Streams uncovered that another core requirement is extensibility, since customers want to build high-performance domain-specific operators for use in their Streaming applications. Based on these two core requirements of distribution and extensibility, we designed and implemented the Streams Processing Language (SPL). This article describes SPL with an emphasis on the language design, distributed runtime, and extensibility mechanism. SPL is now the gateway for the IBM Streams platform, used by our customers for Stream Processing in a broad range of application domains.

  • River: an intermediate language for Stream Processing
    Software: Practice and Experience, 2015
    Co-Authors: Robert Soulé, Martin Hirzel, Bugra Gedik, Robert Grimm
    Abstract:

    This paper presents both a calculus for Stream Processing, named Brooklet, and its realization as an intermediate language, named River. Because River is based on Brooklet, it has a formal semantics that enables reasoning about the correctness of source translations and optimizations. River builds on Brooklet by addressing the real-world details that the calculus elides. We evaluated our system by implementing front-ends for three Streaming languages, and three important optimizations, and a back-end for the System S distributed Streaming runtime. Overall, we significantly lower the barrier to entry for new Stream-Processing languages and thus grow the ecosystem of this crucial style of programming. Copyright © 2015 John Wiley & Sons, Ltd.

  • A catalog of Stream Processing optimizations
    ACM Computing Surveys, 2014
    Co-Authors: Martin Hirzel, Scott Schneider, Bugra Gedik, Robert Soulé, Robert Grimm
    Abstract:

    Various research communities have independently arrived at Stream Processing as a programming model for efficient and parallel computing. These communities include digital signal Processing, databases, operating systems, and complex event Processing. Since each community faces applications with challenging performance requirements, each of them has developed some of the same optimizations, but often with conflicting terminology and unstated assumptions. This article presents a survey of optimizations for Stream Processing. It is aimed both at users who need to understand and guide the system’s optimizer and at implementers who need to make engineering tradeoffs. To consolidate terminology, this article is organized as a catalog, in a style similar to catalogs of design patterns or refactorings. To make assumptions explicit and help understand tradeoffs, each optimization is presented with its safety constraints (when does it preserve correctness?) and a profitability experiment (when does it improve performance?). We hope that this survey will help future Streaming system builders to stand on the shoulders of giants from not just their own community.
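
One entry from such a catalog can be sketched concretely: operator fusion, which combines adjacent stateless operators into one, eliminating per-tuple queue transfers at the cost of pipeline parallelism. The code below is an illustrative sketch, not the article's formulation; its safety constraint is that both operators are stateless and applied in the same order:

```python
# Operator fusion, sketched: two stateless per-tuple operators become
# one, removing the intermediate queue between them. Safety constraint:
# both operators are stateless and composed in the original order.

def fuse(f, g):
    """Fuse stateless operators f-then-g into a single operator."""
    return lambda t: g(f(t))

double = lambda t: t * 2
inc = lambda t: t + 1

stream = [1, 2, 3]
unfused = [inc(double(t)) for t in stream]   # two logical operators
fused_op = fuse(double, inc)
fused = [fused_op(t) for t in stream]        # one fused operator
assert fused == unfused                      # fusion preserved semantics
print(fused)  # [3, 5, 7]
```

The profitability side of the catalog's framing would be measured, not proven: fusion wins when queue overhead dominates, and loses when the two operators could have run in parallel on separate cores.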

  • Autopipelining for Data Stream Processing
    IEEE Transactions on Parallel and Distributed Systems, 2013
    Co-Authors: Yuzhe Tang, Bugra Gedik
    Abstract:

    Stream Processing applications use online analytics to ingest high-rate data sources, process them on-the-fly, and generate live results in a timely manner. The data flow graph representation of these applications facilitates the specification of Stream computing tasks with ease, and also lends itself to possible runtime exploitation of parallelization on multicore processors. While the data flow graphs naturally contain a rich set of parallelization opportunities, exploiting them is challenging due to the combinatorial number of possible configurations. Furthermore, the best configuration is dynamic in nature; it can differ across multiple runs of the application, and even during different phases of the same run. In this paper, we propose an autopipelining solution that can take advantage of multicore processors to improve throughput of Streaming applications, in an effective and transparent way. The solution is effective in the sense that it provides good utilization of resources by dynamically finding and exploiting sources of pipeline parallelism in Streaming applications. It is transparent in the sense that it does not require any hints from the application developers. As a part of our solution, we describe a lightweight runtime profiling scheme to learn resource usage of operators comprising the application, an optimization algorithm to locate the best places in the data flow graph to explore additional parallelism, and an adaptive control scheme to find the right level of parallelism. We have implemented our solution in an industrial-strength Stream Processing system. Our experimental evaluation based on microbenchmarks, synthetic workloads, as well as real-world applications confirms that our design is effective in optimizing the throughput of Stream Processing applications without requiring any changes to the application code.
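
The pipeline parallelism that autopipelining discovers automatically can be shown manually in miniature: two operator stages on separate threads connected by queues, so tuple N can pass through the first stage while tuple N-1 is in the second. The stages and data here are illustrative, not from the paper:

```python
# Pipeline parallelism in miniature: each operator stage runs on its own
# thread, connected by thread-safe queues, with a None sentinel to shut
# the pipeline down. Stages and data are illustrative.
import queue
import threading

def stage(in_q, out_q, fn):
    """Run one operator on its own thread until the sentinel arrives."""
    while True:
        item = in_q.get()
        if item is None:          # sentinel: propagate downstream, stop
            out_q.put(None)
            return
        out_q.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(q1, q2, lambda x: x * x)),
    threading.Thread(target=stage, args=(q2, q3, lambda x: x + 1)),
]
for t in threads:
    t.start()
for item in [1, 2, 3]:            # feed the source queue
    q1.put(item)
q1.put(None)

results = []
while (item := q3.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # [2, 5, 10]
```

Autopipelining's contribution is deciding where to place such thread boundaries, and how many, from runtime profiles rather than by hand.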

  • DEBS - From a calculus to an execution environment for Stream Processing
    Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems - DEBS '12, 2012
    Co-Authors: Robert Soulé, Martin Hirzel, Bugra Gedik, Robert Grimm
    Abstract:

    At one level, this paper is about River, a virtual execution environment for Stream Processing. Stream Processing is a paradigm well-suited for many modern data Processing systems that ingest high-volume data Streams from the real world, such as audio/video Streaming, high-frequency trading, and security monitoring. One attractive property of Stream Processing is that it lends itself to parallelization on multicores, and even to distribution on clusters when extreme scale is required. Stream Processing has been co-evolved by several communities, leading to diverse languages with similar core concepts. Providing a common execution environment reduces language development effort and increases portability. We designed River as a practical realization of Brooklet, a calculus for Stream Processing. So at another level, this paper is about a journey from theory (the calculus) to practice (the execution environment). The challenge is that, by definition, a calculus abstracts away all but the most central concepts. Hence, there are several research questions in concretizing the missing parts, not to mention a significant engineering effort in implementing them. But the effort is well worth it, because using a calculus as a foundation yields clear semantics and proven correctness results.

Magdalena Balazinska - One of the best experts on this subject based on the ideXlab platform.

  • Fault-tolerant Stream Processing using a distributed, replicated file system
    2015
    Co-Authors: Yongchul Kwon, Magdalena Balazinska, Albert Greenberg
    Abstract:

    We present SGuard, a new fault-tolerance technique for distributed Stream Processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal Stream Processing and leaves more resources available for normal Stream Processing than previous proposals. Like several previous schemes, SGuard is based on rollback recovery [18]: it checkpoints the state of Stream Processing nodes periodically and restarts failed nodes from their most recent checkpoints. In contrast to previous proposals, however, SGuard performs checkpoints asynchronously: i.e., operators continue Processing Streams during the checkpoint, thus reducing the potential disruption due to the checkpointing activity. Additionally, SGuard saves the checkpointed state into a new type of distributed and replicated file system (DFS) such as GFS [22] or HDFS [9], leaving more memory resources available for normal Stream Processing. To manage resource contention due to simultaneous checkpoints by different SPE nodes, SGuard adds a scheduler to the DFS. This scheduler coordinates large batches of write requests in a manner that reduces individual checkpoint times while maintaining good overall resource utilization. We demonstrate the effectiveness of the approach through measurements of a prototype implementation in the Borealis [2] open-source SPE using HDFS [9] as the DFS.
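
The asynchronous-checkpointing idea can be sketched as snapshot-then-persist-in-background: the operator pauses only long enough to copy its state, then keeps processing while a background thread writes the copy out. The class below and the local-file target (standing in for a replicated DFS) are illustrative, not SGuard's implementation:

```python
# Asynchronous checkpointing, sketched: take a cheap in-memory snapshot
# of operator state, persist it on a background thread, keep processing.
# A local temp file stands in for a replicated DFS such as HDFS.
import copy
import json
import tempfile
import threading

class Operator:
    def __init__(self):
        self.state = {"count": 0}

    def process(self, tup):
        self.state["count"] += tup

    def checkpoint_async(self, path):
        """Snapshot state, then write it on a background thread so the
        (slow) persistent write does not block tuple processing."""
        snapshot = copy.deepcopy(self.state)   # only brief pause here
        def persist():
            with open(path, "w") as f:
                json.dump(snapshot, f)
        t = threading.Thread(target=persist)
        t.start()
        return t

op = Operator()
for tup in [1, 2, 3]:
    op.process(tup)
path = tempfile.mktemp()
writer = op.checkpoint_async(path)
op.process(10)            # processing continues during the write
writer.join()
with open(path) as f:
    print(json.load(f))   # {'count': 6} -- the checkpointed state
print(op.state)           # {'count': 16} -- live state has moved on
```

On failure, a node would reload the last persisted snapshot and replay the stream from there; SGuard's DFS-side scheduler then addresses the contention when many nodes checkpoint at once.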

  • High Availability Algorithms for Distributed Stream Processing
    International Conference on Data Engineering, 2005
    Co-Authors: Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uğur Çetintemel, Michael Stonebraker, Stanley B. Zdonik
    Abstract:

    Stream-Processing systems are designed to support an emerging class of applications that require sophisticated and timely Processing of high-volume data Streams, often originating in distributed environments. Unlike traditional data-Processing applications that require precise recovery for correctness, many Stream-Processing applications can tolerate and benefit from weaker recovery guarantees. In this paper, we study various recovery guarantees and pertinent recovery techniques that can meet the correctness and performance requirements of Stream-Processing applications. We discuss the design and algorithmic challenges associated with the proposed recovery techniques and describe how each can provide different guarantees with proper combinations of redundant Processing, checkpointing, and remote logging. Using analysis and simulations, we quantify the cost of our recovery guarantees and examine the performance and applicability of the recovery techniques. We also analyze how the knowledge of query network properties can help decrease the cost of high availability.

  • The Design of the Borealis Stream Processing Engine
    Conference on Innovative Data Systems Research, 2005
    Co-Authors: Daniel J. Abadi, Mitch Cherniack, Magdalena Balazinska, Jeong-Hyon Hwang, Alexander Rasin, Yanif Ahmad, Wolfgang Lindner, Anurag S. Maskey, Esther Ryvkina, Nesime Tatbul
    Abstract:

    Borealis is a second-generation distributed Stream Processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core Stream Processing functionality from Aurora [14] and distribution functionality from Medusa [51]. Borealis modifies and extends both systems in non-trivial and critical ways to provide advanced capabilities that are commonly required by newly-emerging Stream Processing applications. In this paper, we outline the basic design and functionality of Borealis. Through sample real-world applications, we motivate the need for dynamically revising query results and modifying query specifications. We then describe how Borealis addresses these challenges through an innovative set of features, including revision records, time travel, and control lines. Finally, we present a highly flexible and scalable QoS-based optimization model that operates across server and sensor networks and a new fault-tolerance model with flexible consistency-availability trade-offs.

  • Scalable Distributed Stream Processing
    Conference on Innovative Data Systems Research, 2003
    Co-Authors: Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Ying Xing, Stan Zdonik
    Abstract:

    Stream Processing fits a large class of new applications for which conventional DBMSs fall short. Because many Stream-oriented systems are inherently geographically distributed and because distribution offers scalable load management and higher availability, future Stream Processing systems will operate in a distributed fashion. They will run across the Internet on computers typically owned by multiple cooperating administrative domains. This paper describes the architectural challenges facing the design of large-scale distributed Stream Processing systems, and discusses novel approaches for addressing load management, high availability, and federated operation issues. We describe two Stream Processing systems, Aurora* and Medusa, which are being designed to explore complementary solutions to these challenges. We begin in Section 2 with a brief description of our centralized Stream Processing system, Aurora [4]. We then discuss two complementary efforts to extend Aurora to a distributed environment: Aurora* and Medusa. Aurora* assumes an environment in which all nodes fall under a single administrative domain. Medusa provides the infrastructure to support federated operation of nodes across administrative boundaries. After describing the architectures of these two systems in Section 3, we consider three design challenges common to both: infrastructures and protocols supporting communication amongst nodes (Section 4), load sharing in response to variable network conditions (Section 5), and high availability in the presence of failures (Section 6). We also discuss high-level policy specifications employed by the two systems in Section 7. For all of these issues, we believe that the push-based nature of Stream-based applications not only raises new challenges but also offers the possibility of new domain-specific solutions.

Jesus Arias-fisteus - One of the best experts on this subject based on the ideXlab platform.

  • Patterns for Distributed Real-Time Stream Processing
    IEEE Transactions on Parallel and Distributed Systems, 2017
    Co-Authors: Pablo Basanta-val, Norberto Fernandez-garcia, Luis Sanchez-fernandez, Jesus Arias-fisteus
    Abstract:

    In recent years, big data systems have become an active area of research and development. Stream Processing is one of the potential application scenarios of big data systems, where the goal is to process a continuous, high-velocity flow of information items. High-frequency trading (HFT) in stock markets and trending topic detection in Twitter are examples of Stream Processing applications. In some cases (for instance, in HFT), these applications have end-to-end quality-of-service requirements and may benefit from the usage of real-time techniques. Taking this into account, the present article analyzes, from the point of view of real-time systems, a set of patterns that can be used when implementing a Stream Processing application. For each pattern, we discuss its advantages and disadvantages, as well as its impact on application performance, measured as response time, maximum input frequency, and changes in utilization demands due to the pattern.