MapReduce

Shivnath Babu - One of the best experts on this subject based on the ideXlab platform.

  • MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish
    Very Large Data Bases, 2011
    Co-Authors: Herodotos Herodotou, Fei Dong, Shivnath Babu
    Abstract:

    MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. Starfish is a self-tuning system for big data analytics that includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. This demonstration will present the profiling, what-if analysis, and cost-based optimization of MapReduce programs in Starfish. We will show how (non-expert) users can employ the Starfish Visualizer to (a) get a deep understanding of a MapReduce program’s behavior during execution, (b) ask hypothetical questions on how the program’s behavior will change when parameter settings, cluster resources, or input data properties change, and (c) ultimately optimize the program.
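
    To make the "black-box" point concrete, below is the standard Hadoop WordCount program (Java, Hadoop 2.x API; illustrative, not code from the paper). To the MapReduce system, the map and reduce bodies are opaque user code, which is why Starfish's Profiler observes their behavior at runtime rather than analyzing their source.

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Standard Hadoop WordCount: the framework sees only opaque map/reduce
    // functions, so a cost-based optimizer cannot inspect their logic directly.
    public class WordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1) for each token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum)); // total count per word
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
    ```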

  • Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
    PVLDB: Proceedings of the VLDB Endowment, 2011
    Co-Authors: Herodotos Herodotou, Shivnath Babu
    Abstract:

    MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. We introduce, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. We focus on the optimization opportunities presented by the large space of configuration parameters for these programs. We also introduce a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. All components have been prototyped for the popular Hadoop MapReduce system. The effectiveness of each component is demonstrated through a comprehensive evaluation using representative MapReduce programs from various application domains.
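
    The abstract above centers on the large space of configuration parameters. As a hedged illustration, the snippet below sets a handful of real Hadoop 2.x configuration keys; the values are arbitrary, not recommendations. Settings like these form the search space the Cost-based Optimizer explores and the What-if Engine prices without actually running the job.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ConfiguredJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A few of the many knobs that span the optimizer's search space.
        // Values are illustrative only.
        conf.setInt("mapreduce.job.reduces", 32);                 // number of reduce tasks
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer (MB)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f); // buffer fill ratio that triggers a spill
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
        Job job = Job.getInstance(conf, "parameterized job");
        // ... set mapper/reducer classes and input/output paths as usual.
      }
    }
    ```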

Tienhsiung Weng - One of the best experts on this subject based on the ideXlab platform.

  • Scaling up MapReduce-based Big Data Processing on Multi-GPU Systems
    Cluster Computing, 2015
    Co-Authors: Hai Jiang, Yi Chen, Zhi Qiao, Tienhsiung Weng
    Abstract:

    MapReduce is a popular data-parallel processing model that, together with recent advances in computing technology, has been widely exploited for large-scale data analysis. The high demand for MapReduce has stimulated the investigation of MapReduce implementations with different architectural models and computing paradigms, such as multi-core clusters, Clouds, Cubieboards, and GPUs. In particular, current GPU-based MapReduce approaches mainly focus on single-GPU algorithms and cannot handle large data sets, due to the limited GPU memory capacity. Building on the previous multi-GPU MapReduce version, MGMR, this paper proposes an upgraded version, MGMR++, to eliminate the GPU memory limitation, and a pipelined version, PMGMR, to handle the Big Data challenge through both CPU memory and hard disks. MGMR++ extends MGMR with flexible C++ templates and CPU memory utilization, while PMGMR fine-tunes performance through the latest GPU features, such as streams and Hyper-Q, as well as hard disk utilization. Compared to MGMR (Jiang et al., Cluster Computing 2013), the proposed schemes achieve about a 2.5-fold performance improvement, increase system scalability, and allow programmers to write straightforward MapReduce code for Big Data.
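
    A minimal sketch of the pipelining idea in a generic out-of-core setting (plain Java, not the papers' CUDA code, and all names below are hypothetical): process a file larger than "device" memory in fixed-size chunks, overlapping the read of chunk N+1 with the processing of chunk N, much as PMGMR streams data through CPU memory and disk using GPU streams.

    ```java
    import java.io.*;
    import java.util.concurrent.*;

    public class PipelinedChunks {
        static final int CHUNK = 64 << 20; // 64 MB per chunk (illustrative)

        public static void main(String[] args) throws Exception {
            ExecutorService compute = Executors.newSingleThreadExecutor();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])))) {
                byte[] current = readChunk(in);
                Future<Long> pending = null;
                long total = 0;
                while (current != null) {
                    if (pending != null) total += pending.get(); // drain previous chunk
                    final byte[] chunk = current;
                    pending = compute.submit(() -> process(chunk)); // "device" work
                    current = readChunk(in); // prefetch next chunk while processing
                }
                if (pending != null) total += pending.get();
                System.out.println("result: " + total);
            } finally {
                compute.shutdown();
            }
        }

        static byte[] readChunk(DataInputStream in) throws IOException {
            byte[] buf = new byte[CHUNK];
            int n = in.read(buf, 0, CHUNK);
            if (n <= 0) return null; // end of file
            return n == CHUNK ? buf : java.util.Arrays.copyOf(buf, n);
        }

        static long process(byte[] chunk) { // stand-in for a map/reduce kernel
            long sum = 0;
            for (byte b : chunk) sum += b & 0xff;
            return sum;
        }
    }
    ```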

Willy Zwaenepoel - One of the best experts on this subject based on the ideXlab platform.

  • HadoopToSQL: A MapReduce Query Optimizer
    European Conference on Computer Systems, 2010
    Co-Authors: Ming-Yee Iu, Willy Zwaenepoel
    Abstract:

    MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data. HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation, and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries. The static analysis uses symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features. We demonstrate the performance of MapReduce queries, when optimized by HadoopToSQL, by both single-node and cluster experiments. HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.
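
    A sketch of the pattern HadoopToSQL targets, using a hypothetical comma-separated orders schema and threshold (not an example from the paper): a map function whose only effect is a filter. Symbolic execution can derive the precondition amount > 100 and replace the full scan with an equivalent indexed SQL query.

    ```java
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FilterJob {
      // Hypothetical input: lines of "id,amount". Because the mapper's only
      // effect is a filter, HadoopToSQL-style analysis could derive the
      // precondition "amount > 100" and issue the equivalent SQL instead:
      //   SELECT id, amount FROM orders WHERE amount > 100;
      public static class FilterMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split(",");
          double amount = Double.parseDouble(fields[1]);
          if (amount > 100) { // the precondition the static analysis recovers
            ctx.write(new Text(fields[0]), new DoubleWritable(amount));
          }
        }
      }
    }
    ```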

Randy H. Katz - One of the best experts on this subject based on the ideXlab platform.

  • Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads
    arXiv: Databases, 2012
    Co-Authors: Yanpei Chen, Sara Alspaugh, Randy H. Katz
    Abstract:

    Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.

  • Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
    European Conference on Computer Systems, 2012
    Co-Authors: Yanpei Chen, Sara Alspaugh, Dhruba Borthakur, Randy H. Katz
    Abstract:

    MapReduce workloads have evolved to include increasing amounts of time-sensitive, interactive data analysis; we refer to such workloads as MapReduce with Interactive Analysis (MIA). Such workloads run on large clusters, whose size and cost make energy efficiency a critical concern. Prior works on MapReduce energy efficiency have not yet considered this workload class. Increasing hardware utilization helps improve efficiency, but is challenging to achieve for MIA workloads. These concerns lead us to develop BEEMR (Berkeley Energy Efficient MapReduce), an energy-efficient MapReduce workload manager motivated by empirical analysis of real-life MIA traces at Facebook. The key insight is that although MIA clusters host huge data volumes, the interactive jobs operate on a small fraction of the data, and thus can be served by a small pool of dedicated machines; the less time-sensitive jobs can run on the rest of the cluster in a batch fashion. BEEMR achieves 40-50% energy savings under tight design constraints, and represents a first step towards improving energy efficiency for an increasingly important class of datacenter workloads.
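
    A toy illustration of BEEMR's key insight, not its actual implementation (all names below are hypothetical): interactive jobs go straight to a small, always-on pool, while less time-sensitive jobs queue up and run together in periodic batches, letting the rest of the cluster idle in a low-power state in between.

    ```java
    import java.util.Queue;
    import java.util.concurrent.*;

    public class SplitClusterManager {
        // Small, always-on pool for the interactive jobs that touch
        // only a fraction of the data.
        private final ExecutorService interactivePool = Executors.newFixedThreadPool(2);
        // Less time-sensitive jobs wait here until the next batch window.
        private final Queue<Runnable> batchQueue = new ConcurrentLinkedQueue<>();

        void submit(Runnable job, boolean interactive) {
            if (interactive) interactivePool.execute(job); // runs immediately
            else batchQueue.add(job);                      // deferred to next batch
        }

        // Called periodically: "wake" the batch machines, drain the queue,
        // then let them return to a low-power state.
        void runBatch() throws InterruptedException {
            ExecutorService batchPool = Executors.newFixedThreadPool(8);
            Runnable job;
            while ((job = batchQueue.poll()) != null) batchPool.execute(job);
            batchPool.shutdown();
            batchPool.awaitTermination(1, TimeUnit.HOURS);
            // Batch machines could now power down until the next window.
        }
    }
    ```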

  • The Case for Evaluating MapReduce Performance Using Workload Suites
    Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2011
    Co-Authors: Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy H. Katz
    Abstract:

    MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites will allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.
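
    A minimal sketch of what executing a synthesized workload could look like, assuming a trace record of per-job inter-arrival gap, input size, and shuffle ratio (all names below are hypothetical; this is not the paper's framework): the driver replays the sampled jobs against a live cluster while preserving the arrival pattern.

    ```java
    import java.util.List;

    public class WorkloadReplayer {
        // One synthetic job sampled from production traces: how long after the
        // previous job it arrives, and the "shape" of its data movement.
        record TraceJob(long gapMillis, long inputBytes, double shuffleRatio) {}

        // Abstraction over whatever cluster the workload suite runs against.
        interface Cluster { void submit(long inputBytes, double shuffleRatio); }

        static void replay(List<TraceJob> workload, Cluster cluster)
                throws InterruptedException {
            for (TraceJob j : workload) {
                Thread.sleep(j.gapMillis);                    // reproduce arrival pattern
                cluster.submit(j.inputBytes, j.shuffleRatio); // job with sampled shape
            }
        }
    }
    ```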