Apache Spark

The Experts below are selected from a list of 360 Experts worldwide, ranked by the ideXlab platform.

Giuliano Casale - One of the best experts on this subject based on the ideXlab platform.

  • Artificial neural networks based techniques for anomaly detection in Apache Spark
    Cluster Computing, 2019
    Co-Authors: Ahmad Alnafessah, Giuliano Casale
    Abstract:

    Late detection and manual resolution of performance anomalies in Cloud Computing and Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose an artificial-neural-network-based methodology for anomaly detection tailored to the Apache Spark in-memory processing platform. Apache Spark is widely adopted by industry because of its speed and generality; however, there is still a shortage of comprehensive performance anomaly detection methods applicable to this platform. We propose an artificial-neural-network-driven methodology to quickly sift through Spark log data and operating system monitoring metrics to accurately detect and classify anomalous behaviors based on the characteristics of Spark resilient distributed datasets. The proposed method is evaluated against three popular machine learning algorithms (decision trees, nearest neighbor, and support vector machines), as well as against four variants that consider different monitoring datasets. The results show that our method outperforms the alternatives, typically achieving 98–99% F-scores, and offers much greater accuracy in detecting both the period in which anomalies occurred and their type.
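    A minimal sketch of the kind of pipeline this abstract describes, using PySpark MLlib: assemble monitoring metrics into feature vectors, train a small feed-forward network, and report the F-score. This is an illustration, not the authors' code; the column names, input path, and layer sizes are assumptions.

```python
# Hedged sketch: neural-network anomaly classification over Spark
# monitoring metrics. Column names and the input path are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("spark-anomaly-sketch").getOrCreate()

# Per-interval monitoring samples with an integer anomaly label,
# e.g. 0 = normal, 1 = CPU contention, 2 = memory pressure.
df = spark.read.parquet("monitoring_metrics.parquet")  # hypothetical path

features = ["cpu_pct", "mem_pct", "gc_time_ms", "disk_io_kbps"]
data = VectorAssembler(inputCols=features, outputCol="features").transform(df)
train, test = data.randomSplit([0.8, 0.2], seed=42)

# One hidden layer of 16 units; the output width matches the class count.
mlp = MultilayerPerceptronClassifier(layers=[len(features), 16, 3],
                                     maxIter=200, labelCol="label", seed=42)
model = mlp.fit(train)

f1 = MulticlassClassificationEvaluator(
    labelCol="label", metricName="f1").evaluate(model.transform(test))
print(f"Held-out F1: {f1:.3f}")
```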

Paul Chow - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating Apache Spark with FPGAs
    Concurrency and Computation: Practice and Experience, 2019
    Co-Authors: Ehsan Ghasemi, Paul Chow
    Abstract:

    Apache Spark has become one of the most popular engines for big data processing. Spark provides a platform-independent, high-abstraction programming paradigm for large-scale data processing by leveraging the Java framework. Though Java provides software portability across various machines, it also limits the performance of distributed environments such as Spark. While it may be unrealistic to rewrite platforms like Spark in a faster language, a more viable approach to mitigating this performance cost is to accelerate the computations while still working within the Java-based framework. This paper demonstrates the feasibility of incorporating Field-Programmable Gate Array (FPGA) acceleration into Spark and presents the performance benefits and bottlenecks of our FPGA-accelerated Spark environment using a MapReduce implementation of the k-means clustering algorithm, showing that acceleration is possible even on a hardware platform that is not well optimized for performance. An important feature of our approach is that the use of FPGAs is completely transparent to the user: it is exposed through library functions, which is a common way for users to access functionality provided by Spark. Power users can further develop other computations using high-level synthesis.
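    The "completely transparent" library pattern the paper highlights can be sketched as follows: the map step of MapReduce k-means assigns points to centers through a function that uses an accelerated native binding when one is available and a software path otherwise. The fpga_kernels module is a hypothetical stand-in for such a binding; everything else is plain PySpark/NumPy.

```python
# Hedged sketch of a transparently accelerated k-means map step.
import numpy as np

def _software_assign(points, centers):
    # NumPy fallback: index of the nearest center for each point.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

try:
    # Hypothetical native binding to an FPGA kernel; not a real package.
    from fpga_kernels import kmeans_assign as _accel_assign
except ImportError:
    _accel_assign = None

def assign_partition(iterator, centers):
    """Map step of MapReduce k-means over one RDD partition: emit
    (cluster_id, (coordinate_sum, count)) pairs for the reduce step."""
    pts = np.array(list(iterator), dtype=float)
    if pts.size == 0:
        return
    labels = (_accel_assign or _software_assign)(pts, centers)
    for k in range(len(centers)):
        mask = labels == k
        if mask.any():
            yield k, (pts[mask].sum(axis=0), int(mask.sum()))

# One k-means iteration over an RDD of points (usage sketch):
# sums = rdd.mapPartitions(lambda it: assign_partition(it, centers)) \
#           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
# centers = np.array([s / n for _, (s, n) in sorted(sums.collect())])
```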

  • Accelerating Apache Spark Big Data Analysis with FPGAs
    Field-Programmable Custom Computing Machines, 2016
    Co-Authors: Ehsan Ghasemi, Paul Chow
    Abstract:

    Apache Spark has become one of the most popular engines for big data processing. Spark provides a platform-independent, high-abstraction programming paradigm for large-scale data processing by leveraging the Java framework. Though Java provides software portability across various machines, it also limits the performance of distributed environments such as Spark. While it may be unrealistic to rewrite platforms like Spark in a faster language, a more viable approach to mitigating this performance cost is to accelerate the computations while still working within the Java-based framework. This work demonstrates the feasibility of incorporating FPGA acceleration into Spark and uses a MapReduce implementation of the k-means clustering algorithm to show that acceleration is possible even on a hardware platform that is not well optimized for performance. An important feature of our approach is that the use of FPGAs is completely transparent to the user: it is exposed through library functions, which is a common way for users to access functionality provided by Spark. Power users can further develop other computations using high-level synthesis.

Ahmad Alnafessah - One of the best experts on this subject based on the ideXlab platform.

  • Artificial neural networks based techniques for anomaly detection in Apache Spark
    Cluster Computing, 2019
    Co-Authors: Ahmad Alnafessah, Giuliano Casale
    Abstract: see the identical entry under Giuliano Casale above.

Ehsan Ghasemi - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating Apache Spark with FPGAs
    Concurrency and Computation: Practice and Experience, 2019
    Co-Authors: Ehsan Ghasemi, Paul Chow
    Abstract: see the identical entry under Paul Chow above.

  • Accelerating Apache Spark Big Data Analysis with FPGAs
    Field-Programmable Custom Computing Machines, 2016
    Co-Authors: Ehsan Ghasemi, Paul Chow
    Abstract: see the identical entry under Paul Chow above.

Zhu Han - One of the best experts on this subject based on the ideXlab platform.

  • Mobile Big Data Analytics Using Deep Learning and Apache Spark
    IEEE Network, 2016
    Co-Authors: Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hweepink Tan, Zhu Han
    Abstract:

    The proliferation of mobile devices, such as smartphones and Internet of Things gadgets, has resulted in the recent mobile big data era. Collecting mobile big data is unprofitable unless suitable analytics and learning methods are used to extract meaningful information and hidden patterns. This article presents an overview and brief tutorial on deep learning in mobile big data analytics and discusses a scalable learning framework over Apache Spark. Specifically, distributed deep learning is executed as iterative MapReduce computing on many Spark workers. Each Spark worker learns a partial deep model on a partition of the overall mobile big data, and a master deep model is then built by averaging the parameters of all partial models. This Spark-based framework speeds up the learning of deep models consisting of many hidden layers and millions of parameters. We use a context-aware activity recognition application with a real-world dataset containing millions of samples to validate our framework and assess its speedup effectiveness.
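    The iterative MapReduce scheme the article outlines, partial models per worker followed by parameter averaging at the master, can be sketched in a few lines of PySpark. The tiny logistic-regression "partial model" below is a stand-in chosen for brevity, not the authors' deep architecture.

```python
# Hedged sketch of one parameter-averaging round: broadcast current
# weights, run local SGD on each partition, average results on the driver.
import numpy as np

def train_partial(samples, weights, lr=0.01, epochs=1):
    """Local SGD on one partition; a logistic regression stands in
    for the partial deep model to keep the sketch small."""
    w = weights.copy()
    for _ in range(epochs):
        for x, y in samples:  # x: np.ndarray feature vector, y: 0/1 label
            p = 1.0 / (1.0 + np.exp(-w @ x))
            w -= lr * (p - y) * x
    return w

def averaged_round(rdd, weights):
    """One iterative-MapReduce round over an RDD of (x, y) samples."""
    bc = rdd.context.broadcast(weights)
    partials = rdd.mapPartitions(
        lambda it: [train_partial(list(it), bc.value)]).collect()
    return np.mean(partials, axis=0)

# Usage (sketch): repeat rounds until the master model converges.
# weights = np.zeros(num_features)
# for _ in range(20):
#     weights = averaged_round(samples_rdd, weights)
```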