Hadoop Ecosystem

The experts below are selected from a list of 1,023 experts worldwide, ranked by the ideXlab platform.

Karan Gupta - One of the best experts on this subject based on the ideXlab platform.

  • Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem
    IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2011
    Co-Authors: Guanying Wang, Ali R Butt, Henry M Monti, Karan Gupta
    Abstract:

    Designing cloud computing setups is a challenging task. It involves understanding the impact of a plethora of parameters, ranging from cluster configuration, partitioning, and networking characteristics to the targeted applications' behavior. The size of the design space and the scale of the clusters make it cumbersome and error-prone to test different cluster configurations on real setups. Thus, the community increasingly relies on simulations and models of cloud setups to infer system behavior and the impact of design choices. The accuracy of the results from such approaches depends on how accurate and realistic the employed workload traces are. Unfortunately, few cloud workload traces are available in the public domain. In this paper, we present the key steps towards analyzing the traces that have been made public, e.g., from Google, and inferring lessons that can be used to design realistic cloud workloads as well as to enable thorough quantitative studies of Hadoop design. Moreover, we leverage the lessons learned from the traces to undertake two case studies: (i) evaluating Hadoop job schedulers, and (ii) quantifying the impact of shared storage on Hadoop system performance.
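
    The kind of trace synthesis the abstract describes can be sketched as sampling job inter-arrival times, task counts, and input sizes from distributions fitted to a public trace. The minimal Python sketch below follows that shape; the distribution parameters, bucket boundaries, and record fields are illustrative assumptions, not values taken from the paper or from the Google trace.

      import random

      # Illustrative parameters only; a real generator would fit these
      # distributions to a public trace (e.g., the Google cluster trace).
      MEAN_INTERARRIVAL_S = 30.0              # assumed mean job inter-arrival time
      MAP_TASK_BUCKETS = [(0.6, (1, 10)),     # assume 60% small jobs: 1-10 map tasks
                          (0.3, (11, 200)),   # 30% medium jobs
                          (0.1, (201, 5000))] # 10% large jobs

      def sample_map_tasks():
          """Sample a job's map-task count from a bucketed empirical distribution."""
          r, cumulative = random.random(), 0.0
          for prob, (lo, hi) in MAP_TASK_BUCKETS:
              cumulative += prob
              if r <= cumulative:
                  return random.randint(lo, hi)
          return random.randint(*MAP_TASK_BUCKETS[-1][1])

      def synthesize_trace(num_jobs):
          """Generate a list of synthetic MapReduce job records."""
          trace, t = [], 0.0
          for job_id in range(num_jobs):
              t += random.expovariate(1.0 / MEAN_INTERARRIVAL_S)  # Poisson arrivals
              maps = sample_map_tasks()
              trace.append({
                  "job_id": job_id,
                  "submit_time_s": round(t, 2),
                  "map_tasks": maps,
                  "reduce_tasks": max(1, maps // 10),           # assumed map/reduce ratio
                  "input_mb": maps * random.randint(32, 256),   # assumed per-split input size
              })
          return trace

      if __name__ == "__main__":
          for job in synthesize_trace(5):
              print(job)

    A trace of this form can then be replayed against a simulator or a test cluster to compare, for example, different job schedulers under the same synthetic load.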

Guanying Wang - One of the best experts on this subject based on the ideXlab platform.

  • Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem (MASCOTS, 2011)
    Co-Authors: Guanying Wang, Ali R Butt, Henry M Monti, Karan Gupta
    See the full entry under Karan Gupta above.

Henry M Monti - One of the best experts on this subject based on the ideXlab platform.

  • Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem (MASCOTS, 2011)
    Co-Authors: Guanying Wang, Ali R Butt, Henry M Monti, Karan Gupta
    See the full entry under Karan Gupta above.

Ali R Butt - One of the best experts on this subject based on the ideXlab platform.

  • Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem (MASCOTS, 2011)
    Co-Authors: Guanying Wang, Ali R Butt, Henry M Monti, Karan Gupta
    See the full entry under Karan Gupta above.

Ravi Sandhu - One of the best experts on this subject based on the ideXlab platform.

  • An Attribute-Based Access Control Model for Secure Big Data Processing in Hadoop Ecosystem
    Proceedings of the Third ACM Workshop on Attribute-Based Access Control (ABAC '18), 2018
    Co-Authors: Maanak Gupta, Farhan Patwa, Ravi Sandhu
    Abstract:

    Apache Hadoop is a predominant software framework for distributed compute and storage, with the capability to handle huge amounts of data, usually referred to as Big Data. This data, collected from different enterprises and government agencies, often includes private and sensitive information that needs to be secured from unauthorized access. This paper proposes extensions to the current authorization capabilities offered by Hadoop core and other Ecosystem projects, specifically Apache Ranger and Apache Sentry. We present a fine-grained attribute-based access control model, referred to as HeABAC, catering to the security and privacy needs of a multi-tenant Hadoop Ecosystem. The paper reviews the current multi-layered access control model used primarily in Hadoop core (2.x), Apache Ranger (version 0.6), and Apache Sentry (version 1.7.0), as well as a previously proposed RBAC extension (OT-RBAC). It then presents a formal attribute-based access control model for the Hadoop Ecosystem, including the novel concept of cross-Hadoop-services trust. It further highlights different trust scenarios, presents an implementation approach for HeABAC using Apache Ranger, and discusses the administration requirements of the HeABAC operational model. Comprehensive, real-world use cases are also discussed to reflect the application and enforcement of the proposed HeABAC model in the Hadoop Ecosystem.
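
    The HeABAC model itself is defined formally in the paper; purely as an illustration of the attribute-based style of decision it describes, the Python sketch below combines user attributes, object attributes, the requesting Hadoop service, and a simple cross-service trust relation. All attribute names, rules, and the trust table here are hypothetical and are not taken from HeABAC or from Apache Ranger.

      from dataclasses import dataclass

      @dataclass
      class Request:
          user_attrs: dict      # e.g., {"clearance": "secret", "project": "genomics"}
          object_attrs: dict    # e.g., {"sensitivity": "secret", "project": "genomics"}
          service: str          # Hadoop service issuing the request, e.g., "hive"
          operation: str        # e.g., "read"

      # Hypothetical cross-service trust: a service may act on objects owned by
      # services it trusts (a toy stand-in for the cross-Hadoop-services trust
      # concept mentioned in the abstract).
      TRUSTS = {"hive": {"hdfs", "hive"}, "storm": {"kafka", "storm"}}

      CLEARANCE_ORDER = ["public", "internal", "secret"]

      def permits(req: Request, object_owner_service: str) -> bool:
          """Attribute-based decision: every rule must hold for the request."""
          rules = [
              # user clearance must dominate object sensitivity
              CLEARANCE_ORDER.index(req.user_attrs.get("clearance", "public"))
              >= CLEARANCE_ORDER.index(req.object_attrs.get("sensitivity", "public")),
              # user and object must belong to the same project
              req.user_attrs.get("project") == req.object_attrs.get("project"),
              # requesting service must trust the service that owns the object
              object_owner_service in TRUSTS.get(req.service, {req.service}),
              # only read/write are covered by this toy policy
              req.operation in {"read", "write"},
          ]
          return all(rules)

      if __name__ == "__main__":
          req = Request(
              user_attrs={"clearance": "secret", "project": "genomics"},
              object_attrs={"sensitivity": "secret", "project": "genomics"},
              service="hive",
              operation="read",
          )
          print(permits(req, object_owner_service="hdfs"))  # True under these toy rules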

  • Object-Tagged RBAC Model for the Hadoop Ecosystem
    Data and Applications Security and Privacy XXXI (DBSec), 2017
    Co-Authors: Maanak Gupta, Farhan Patwa, Ravi Sandhu
    Abstract:

    The Hadoop Ecosystem provides a highly scalable, fault-tolerant and cost-effective platform for storing and analyzing a variety of data formats. Apache Ranger and Apache Sentry are two predominant frameworks used to provide authorization capabilities in the Hadoop Ecosystem. In this paper we present a formal multi-layer access control model (called HeAC) for the Hadoop Ecosystem, as an academic-style abstraction of Ranger, Sentry and native Apache Hadoop access-control capabilities. We further extend the HeAC base model into a cohesive object-tagged role-based access control (OT-RBAC) model, consistent with generally accepted academic concepts of RBAC. Besides inheriting the advantages of RBAC, OT-RBAC offers a novel method for combining RBAC with attributes (beyond the NIST-proposed strategies). Additionally, a proposed implementation approach for OT-RBAC in Apache Ranger is presented. We further outline attribute-based extensions to OT-RBAC.
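
    As a rough illustration of the object-tagged idea (permissions attached to object tags rather than to individual objects, with roles granting those tag-level permissions), here is a small Python sketch. The users, roles, tags, and permissions are invented for illustration and do not reproduce the OT-RBAC formalism or Apache Ranger's tag-based policies.

      # Hypothetical OT-RBAC-style relations: users -> roles,
      # roles -> (operation, tag) permissions, objects -> tags.
      USER_ROLES = {"alice": {"data_analyst"}, "bob": {"etl_operator"}}

      ROLE_PERMS = {
          "data_analyst": {("select", "pii_masked"), ("select", "public")},
          "etl_operator": {("insert", "staging"), ("select", "staging")},
      }

      OBJECT_TAGS = {
          "hive:sales_db.customers": {"pii_masked"},
          "hive:sales_db.raw_events": {"staging"},
      }

      def authorized(user, operation, obj):
          """Allow if any role of the user grants the operation on any tag of the object."""
          tags = OBJECT_TAGS.get(obj, set())
          for role in USER_ROLES.get(user, set()):
              perms = ROLE_PERMS.get(role, set())
              if any((operation, tag) in perms for tag in tags):
                  return True
          return False  # default deny

      if __name__ == "__main__":
          print(authorized("alice", "select", "hive:sales_db.customers"))   # True
          print(authorized("alice", "insert", "hive:sales_db.raw_events"))  # False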

  • Multi-Layer Authorization Framework for a Representative Hadoop Ecosystem Deployment
    ACM Symposium on Access Control Models and Technologies (SACMAT), 2017
    Co-Authors: Maanak Gupta, Farhan Patwa, James Benson, Ravi Sandhu
    Abstract:

    Apache Hadoop is a predominant software framework to store and process vast amounts of data, produced in varied formats. Data stored in a Hadoop multi-tenant data lake often includes sensitive data such as social security numbers, intelligence sources and medical particulars, which should only be accessed by legitimate users. Apache Ranger and Apache Sentry are important authorization systems providing fine-grained access control across several Hadoop Ecosystem services. In this paper, we provide a comprehensive explanation of the authorization framework offered by the Hadoop Ecosystem, incorporating the native access control features of core Hadoop 2.x and the capabilities offered by Apache Ranger, with a prime focus on data services including Apache Hive and the Hadoop 2.x core services. A multi-layer authorization system is discussed and demonstrated, reflecting access control for services, data, applications and infrastructure resources inside a representative Hadoop Ecosystem instance. A concrete use case is discussed to underline the application of the aforementioned access control points. We use the Hortonworks Hadoop distribution HDP 2.5 to exhibit this multi-layer access control framework.
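
    To make the multi-layer idea concrete, the Python sketch below chains three hypothetical enforcement points, loosely mirroring the layering the abstract describes (service-level access, a Ranger-style resource policy, and file-level permissions). The layer names and policy data are assumptions for illustration; they are not the HDP 2.5 configuration or the Apache Ranger policy format used in the paper.

      # Each layer answers allow/deny; access is granted only if every layer
      # allows it, mimicking stacked enforcement points in a Hadoop cluster.
      SERVICE_ACL = {"hive": {"alice", "bob"}}                   # who may reach the service
      RESOURCE_POLICY = {("alice", "select", "sales_db"): True}  # table-level policy
      HDFS_PERMS = {"/warehouse/sales_db": {"alice"}}            # file-level read permission

      def service_layer(user, service):
          return user in SERVICE_ACL.get(service, set())

      def policy_layer(user, operation, database):
          return RESOURCE_POLICY.get((user, operation, database), False)

      def storage_layer(user, path):
          return user in HDFS_PERMS.get(path, set())

      def check_access(user, service, operation, database, path):
          """Conjunction of all layers: a deny at any layer denies the request."""
          return all([
              service_layer(user, service),
              policy_layer(user, operation, database),
              storage_layer(user, path),
          ])

      if __name__ == "__main__":
          print(check_access("alice", "hive", "select", "sales_db", "/warehouse/sales_db"))  # True
          print(check_access("bob", "hive", "select", "sales_db", "/warehouse/sales_db"))    # False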

  • POSTER: Access Control Model for the Hadoop Ecosystem
    ACM Symposium on Access Control Models and Technologies (SACMAT), 2017
    Co-Authors: Maanak Gupta, Farhan Patwa, Ravi Sandhu
    Abstract:

    Apache Hadoop is an important framework for fault-tolerant and distributed storage and processing of Big Data. The Hadoop core platform, along with other open-source tools such as Apache Hive, Storm and HBase, offers an Ecosystem that enables users to fully harness the potential of Big Data. Apache Ranger and Apache Sentry provide access control capabilities to several Ecosystem components by offering centralized policy administration and enforcement through plugins. In this work we discuss the access control model for the Hadoop Ecosystem (referred to as HeAC) used by Apache Ranger (release 0.6) and Apache Sentry (release 1.7.0), along with the native authorization capabilities of Hadoop 2.x. This multi-layer model provides several access enforcement points that restrict unauthorized users from reaching cluster resources. We further outline some preliminary approaches to extend the HeAC model consistent with widely accepted access control models.
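
    The abstract's point about centralized policy administration with per-service enforcement can be sketched as a single shared policy store consulted by lightweight per-service checkers. The store layout and the checker interface below are invented for illustration; they are not the Apache Ranger or Apache Sentry plugin APIs.

      # One in-memory "policy store" shared by per-service checkers, illustrating
      # centralized administration with decentralized (plugin-style) enforcement.
      POLICY_STORE = {
          "hive":  [{"user": "alice", "resource": "sales_db.customers", "ops": {"select"}}],
          "hbase": [{"user": "bob",   "resource": "events_table",       "ops": {"get", "put"}}],
      }

      class ServiceChecker:
          """Per-service enforcement point that consults the shared store."""

          def __init__(self, service):
              self.service = service

          def allowed(self, user, resource, op):
              for rule in POLICY_STORE.get(self.service, []):
                  if rule["user"] == user and rule["resource"] == resource and op in rule["ops"]:
                      return True
              return False  # default deny

      if __name__ == "__main__":
          hive = ServiceChecker("hive")
          print(hive.allowed("alice", "sales_db.customers", "select"))  # True
          print(hive.allowed("alice", "sales_db.customers", "drop"))    # False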