MapReduce

Shivnath Babu - One of the best experts on this subject based on the ideXlab platform.

  • MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish
    Very Large Data Bases, 2011
    Co-Authors: Herodotos Herodotou, Fei Dong, Shivnath Babu
    Abstract:

    MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. Starfish is a self-tuning system for big data analytics that includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. This demonstration will present the profiling, what-if analysis, and cost-based optimization of MapReduce programs in Starfish. We will show how (non-expert) users can employ the Starfish Visualizer to (a) get a deep understanding of a MapReduce program’s behavior during execution, (b) ask hypothetical questions on how the program’s behavior will change when parameter settings, cluster resources, or input data properties change, and (c) ultimately optimize the program.
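
    To make the "black-box" point concrete, below is the standard Hadoop WordCount program (Java, Hadoop 2.x API; illustrative, not code from the paper). To the MapReduce system, the map and reduce bodies are opaque user code, which is why Starfish's Profiler observes their behavior at runtime rather than analyzing their source.

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Standard Hadoop WordCount: the framework sees only opaque map/reduce
    // functions, so a cost-based optimizer cannot inspect their logic directly.
    public class WordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1) for each token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum)); // total count per word
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
    ```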

  • Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
    PVLDB: Proceedings of the VLDB Endowment, 2011
    Co-Authors: Herodotos Herodotou, Shivnath Babu
    Abstract:

    MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. We introduce, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. We focus on the optimization opportunities presented by the large space of configuration parameters for these programs. We also introduce a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. All components have been prototyped for the popular Hadoop MapReduce system. The effectiveness of each component is demonstrated through a comprehensive evaluation using representative MapReduce programs from various application domains.
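
    The abstract above centers on the large space of configuration parameters. As a hedged illustration, the snippet below sets a handful of real Hadoop 2.x configuration keys; the values are arbitrary, not recommendations. Settings like these form the search space the Cost-based Optimizer explores and the What-if Engine prices without actually running the job.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ConfiguredJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A few of the many knobs that span the optimizer's search space.
        // Values are illustrative only.
        conf.setInt("mapreduce.job.reduces", 32);                 // number of reduce tasks
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer (MB)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f); // buffer fill ratio that triggers a spill
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
        Job job = Job.getInstance(conf, "parameterized job");
        // ... set mapper/reducer classes and input/output paths as usual.
      }
    }
    ```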

Tienhsiung Weng - One of the best experts on this subject based on the ideXlab platform.

  • Scaling up MapReduce-based Big Data Processing on Multi-GPU Systems
    Cluster Computing, 2015
    Co-Authors: Hai Jiang, Yi Chen, Zhi Qiao, Tienhsiung Weng
    Abstract:

    MapReduce is a popular data-parallel processing model that, together with recent advances in computing technology, has been widely exploited for large-scale data analysis. The high demand for MapReduce has stimulated the investigation of MapReduce implementations with different architectural models and computing paradigms, such as multi-core clusters, Clouds, Cubieboards, and GPUs. In particular, current GPU-based MapReduce approaches mainly focus on single-GPU algorithms and cannot handle large data sets, due to the limited GPU memory capacity. Building on the previous multi-GPU MapReduce version, MGMR, this paper proposes an upgraded version, MGMR++, to eliminate the GPU memory limitation, and a pipelined version, PMGMR, to handle the Big Data challenge through both CPU memory and hard disks. MGMR++ extends MGMR with flexible C++ templates and CPU memory utilization, while PMGMR fine-tunes performance through the latest GPU features, such as streams and Hyper-Q, as well as hard disk utilization. Compared to MGMR (Jiang et al., Cluster Computing 2013), the proposed schemes achieve about a 2.5-fold performance improvement, increase system scalability, and allow programmers to write straightforward MapReduce code for Big Data.
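
    A minimal sketch of the pipelining idea in a generic out-of-core setting (plain Java, not the papers' CUDA code, and all names below are hypothetical): process a file larger than "device" memory in fixed-size chunks, overlapping the read of chunk N+1 with the processing of chunk N, much as PMGMR streams data through CPU memory and disk using GPU streams.

    ```java
    import java.io.*;
    import java.util.concurrent.*;

    public class PipelinedChunks {
        static final int CHUNK = 64 << 20; // 64 MB per chunk (illustrative)

        public static void main(String[] args) throws Exception {
            ExecutorService compute = Executors.newSingleThreadExecutor();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])))) {
                byte[] current = readChunk(in);
                Future<Long> pending = null;
                long total = 0;
                while (current != null) {
                    if (pending != null) total += pending.get(); // drain previous chunk
                    final byte[] chunk = current;
                    pending = compute.submit(() -> process(chunk)); // "device" work
                    current = readChunk(in); // prefetch next chunk while processing
                }
                if (pending != null) total += pending.get();
                System.out.println("result: " + total);
            } finally {
                compute.shutdown();
            }
        }

        static byte[] readChunk(DataInputStream in) throws IOException {
            byte[] buf = new byte[CHUNK];
            int n = in.read(buf, 0, CHUNK);
            if (n <= 0) return null; // end of file
            return n == CHUNK ? buf : java.util.Arrays.copyOf(buf, n);
        }

        static long process(byte[] chunk) { // stand-in for a map/reduce kernel
            long sum = 0;
            for (byte b : chunk) sum += b & 0xff;
            return sum;
        }
    }
    ```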

Willy Zwaenepoel - One of the best experts on this subject based on the ideXlab platform.

  • HadoopToSQL: A MapReduce Query Optimizer
    European Conference on Computer Systems, 2010
    Co-Authors: Ming-Yee Iu, Willy Zwaenepoel
    Abstract:

    MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data. HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation, and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries. The static analysis uses symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features. We demonstrate the performance of MapReduce queries, when optimized by HadoopToSQL, by both single-node and cluster experiments. HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.
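
    A sketch of the pattern HadoopToSQL targets, using a hypothetical comma-separated orders schema and threshold (not an example from the paper): a map function whose only effect is a filter. Symbolic execution can derive the precondition amount > 100 and replace the full scan with an equivalent indexed SQL query.

    ```java
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FilterJob {
      // Hypothetical input: lines of "id,amount". Because the mapper's only
      // effect is a filter, HadoopToSQL-style analysis could derive the
      // precondition "amount > 100" and issue the equivalent SQL instead:
      //   SELECT id, amount FROM orders WHERE amount > 100;
      public static class FilterMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split(",");
          double amount = Double.parseDouble(fields[1]);
          if (amount > 100) { // the precondition the static analysis recovers
            ctx.write(new Text(fields[0]), new DoubleWritable(amount));
          }
        }
      }
    }
    ```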

Randy H. Katz - One of the best experts on this subject based on the ideXlab platform.

  • Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads
    arXiv: Databases, 2012
    Co-Authors: Yanpei Chen, Sara Alspaugh, Randy H. Katz
    Abstract:

    Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.

  • Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
    European Conference on Computer Systems, 2012
    Co-Authors: Yanpei Chen, Sara Alspaugh, Dhruba Borthakur, Randy H. Katz
    Abstract:

    MapReduce workloads have evolved to include increasing amounts of time-sensitive, interactive data analysis; we refer to such workloads as MapReduce with Interactive Analysis (MIA). Such workloads run on large clusters, whose size and cost make energy efficiency a critical concern. Prior works on MapReduce energy efficiency have not yet considered this workload class. Increasing hardware utilization helps improve efficiency, but is challenging to achieve for MIA workloads. These concerns lead us to develop BEEMR (Berkeley Energy Efficient MapReduce), an energy-efficient MapReduce workload manager motivated by empirical analysis of real-life MIA traces at Facebook. The key insight is that although MIA clusters host huge data volumes, the interactive jobs operate on a small fraction of the data, and thus can be served by a small pool of dedicated machines; the less time-sensitive jobs can run on the rest of the cluster in a batch fashion. BEEMR achieves 40-50% energy savings under tight design constraints, and represents a first step towards improving energy efficiency for an increasingly important class of datacenter workloads.
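
    A toy illustration of BEEMR's key insight, not its actual implementation (all names below are hypothetical): interactive jobs go straight to a small, always-on pool, while less time-sensitive jobs queue up and run together in periodic batches, letting the rest of the cluster idle in a low-power state in between.

    ```java
    import java.util.Queue;
    import java.util.concurrent.*;

    public class SplitClusterManager {
        // Small, always-on pool for the interactive jobs that touch
        // only a fraction of the data.
        private final ExecutorService interactivePool = Executors.newFixedThreadPool(2);
        // Less time-sensitive jobs wait here until the next batch window.
        private final Queue<Runnable> batchQueue = new ConcurrentLinkedQueue<>();

        void submit(Runnable job, boolean interactive) {
            if (interactive) interactivePool.execute(job); // runs immediately
            else batchQueue.add(job);                      // deferred to next batch
        }

        // Called periodically: "wake" the batch machines, drain the queue,
        // then let them return to a low-power state.
        void runBatch() throws InterruptedException {
            ExecutorService batchPool = Executors.newFixedThreadPool(8);
            Runnable job;
            while ((job = batchQueue.poll()) != null) batchPool.execute(job);
            batchPool.shutdown();
            batchPool.awaitTermination(1, TimeUnit.HOURS);
            // Batch machines could now power down until the next window.
        }
    }
    ```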

  • The Case for Evaluating MapReduce Performance Using Workload Suites
    Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2011
    Co-Authors: Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy H. Katz
    Abstract:

    MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites will allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.
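
    A minimal sketch of what executing a synthesized workload could look like, assuming a trace record of per-job inter-arrival gap, input size, and shuffle ratio (all names below are hypothetical; this is not the paper's framework): the driver replays the sampled jobs against a live cluster while preserving the arrival pattern.

    ```java
    import java.util.List;

    public class WorkloadReplayer {
        // One synthetic job sampled from production traces: how long after the
        // previous job it arrives, and the "shape" of its data movement.
        record TraceJob(long gapMillis, long inputBytes, double shuffleRatio) {}

        // Abstraction over whatever cluster the workload suite runs against.
        interface Cluster { void submit(long inputBytes, double shuffleRatio); }

        static void replay(List<TraceJob> workload, Cluster cluster)
                throws InterruptedException {
            for (TraceJob j : workload) {
                Thread.sleep(j.gapMillis);                    // reproduce arrival pattern
                cluster.submit(j.inputBytes, j.shuffleRatio); // job with sampled shape
            }
        }
    }
    ```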