Map-Reduce

The experts below are selected from a list of 6,379,626 experts worldwide, ranked by the ideXlab platform.

Jeffrey D. Ullman - One of the best experts on this subject based on the ideXlab platform.

  • Upper and lower bounds on the cost of a Map-Reduce computation
    Very Large Data Bases, 2013
    Co-Authors: Anish Das Sarma, Foto N. Afrati, Semih Salihoglu, Jeffrey D. Ullman
    Abstract:

    In this paper we study the tradeoff between parallelism and communication cost in a Map-Reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of Map-Reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance d, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding pairs of strings at Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round Map-Reduce algorithms. We are also able to explore two-round Map-Reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.
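
    To make the distance-1 case concrete, the sketch below simulates the one-round splitting algorithm on a single machine (hypothetical Python with illustrative names, not the authors' code). Each b-bit string is keyed by its first half and by its second half, so two strings that differ in exactly one bit agree on at least one half and meet in some reducer; each input is replicated exactly twice.

      from collections import defaultdict
      from itertools import combinations

      def hamming(s, t):
          # Number of positions where two equal-length strings differ.
          return sum(a != b for a, b in zip(s, t))

      def pairs_at_distance_1(strings):
          half = len(strings[0]) // 2
          reducers = defaultdict(list)
          # Map: emit each string under two keys, one per half.
          for s in strings:
              reducers[("L", s[:half])].append(s)
              reducers[("R", s[half:])].append(s)
          # Reduce: compare only strings that share a half.
          found = set()
          for group in reducers.values():
              for s, t in combinations(sorted(set(group)), 2):
                  if hamming(s, t) == 1:
                      found.add((s, t))
          return found

      print(sorted(pairs_at_distance_1(["0000", "0001", "0101", "1101"])))
      # -> [('0000', '0001'), ('0001', '0101'), ('0101', '1101')]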

  • ICDE - Enumerating subgraph instances using Map-Reduce
    2013 IEEE 29th International Conference on Data Engineering (ICDE), 2013
    Co-Authors: Foto N. Afrati, Dimitris Fotakis, Jeffrey D. Ullman
    Abstract:

    The theme of this paper is how to find all instances of a given “sample” graph in a larger “data graph,” using a single round of Map-Reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single Map-Reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be “convertible,” in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.
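
    As a concrete illustration of the triangle case, the sketch below simulates the one-round partition scheme on a single machine (hypothetical Python with illustrative names, not the paper's code). Vertices are hashed into b buckets, one reducer is created per unordered triple of buckets, each edge is replicated to every reducer whose triple covers both endpoints, and a triangle is reported only by the reducer whose triple matches the triangle's own buckets, so nothing is counted twice.

      from collections import defaultdict
      from itertools import combinations, combinations_with_replacement

      def triangles_one_round(edges, b=3):
          h = lambda v: hash(v) % b  # bucket a vertex
          reducers = defaultdict(set)
          # Map: send edge (u, v) to every reducer whose bucket triple
          # contains both h(u) and h(v).
          for u, v in edges:
              if u == v:
                  continue  # assume a simple graph
              for triple in combinations_with_replacement(range(b), 3):
                  if h(u) in triple and h(v) in triple:
                      reducers[triple].add(frozenset((u, v)))
          # Reduce: enumerate local triangles; report one only if this
          # reducer's triple equals the triangle's own bucket triple.
          result = set()
          for triple, local in reducers.items():
              adj = defaultdict(set)
              for e in local:
                  x, y = tuple(e)
                  adj[x].add(y)
                  adj[y].add(x)
              for x in adj:
                  for y, z in combinations(sorted(adj[x]), 2):
                      if z in adj[y] and tuple(sorted(map(h, (x, y, z)))) == triple:
                          result.add(frozenset((x, y, z)))
          return result

      print(triangles_one_round([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")], b=2))
      # -> the single triangle on vertices a, b, c

    Each edge lands in roughly b of the ~b^3/6 reducers, which is the tradeoff described above: a larger b extracts more parallelism at the cost of more communication.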

  • Enumerating Subgraph Instances Using Map-Reduce
    arXiv: Distributed Parallel and Cluster Computing, 2012
    Co-Authors: Foto N. Afrati, Dimitris Fotakis, Jeffrey D. Ullman
    Abstract:

    The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of Map-Reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single Map-Reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

  • EDBT - Map-Reduce extensions and recursive queries
    Proceedings of the 14th International Conference on Extending Database Technology - EDBT ICDT '11, 2011
    Co-Authors: Foto N. Afrati, Vinayak Borkar, Michael J. Carey, Neoklis Polyzotis, Jeffrey D. Ullman
    Abstract:

    We survey the recent wave of extensions to the popular Map-Reduce systems, including those that have begun to address the implementation of recursive queries using the same computing environment as Map-Reduce. A central problem is that recursive tasks cannot deliver their output only at the end, which makes recovery from failures much more complicated than in Map-Reduce and its nonrecursive extensions. We propose several algorithmic ideas for efficient implementation of recursions in the Map-Reduce environment and discuss several alternatives for supporting recovery from failures without restarting the entire job.
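
    To see why recursion strains this model, consider transitive closure, the textbook recursive query. The single-machine sketch below (hypothetical Python, not from the paper) runs semi-naive iteration; on a cluster each while-iteration is a separate Map-Reduce round whose intermediate state must survive failures, which is the recovery problem discussed above.

      def transitive_closure(edges):
          # Semi-naive evaluation: each iteration corresponds to one
          # Map-Reduce round joining newly found paths with the edges.
          reach = set(edges)
          frontier = set(edges)
          while frontier:
              new = {(a, d) for (a, b) in frontier
                            for (c, d) in edges if b == c} - reach
              reach |= new
              frontier = new  # only new facts feed the next round
          return reach

      print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
      # -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]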

  • Optimizing Multiway Joins in a Map-Reduce Environment
    IEEE Transactions on Knowledge and Data Engineering, 2011
    Co-Authors: Foto N. Afrati, Jeffrey D. Ullman
    Abstract:

    Implementations of Map-Reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the Map-Reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using Map-Reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
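
    The share mechanics can be made concrete with the symmetric three-way join R(A,B) ⋈ S(B,C) ⋈ T(A,C). The sketch below (hypothetical Python with illustrative names, not the paper's code) shows only the mapper side: with k = 8 Reduce processes and shares a = b = c = k^(1/3) = 2, a tuple hashes the attributes it has and is replicated across both buckets of the one attribute missing from its schema.

      from itertools import product

      def shares_mapper(rel, t, shares):
          schema = {"R": ("A", "B"), "S": ("B", "C"), "T": ("A", "C")}[rel]
          values = dict(zip(schema, t))
          axes = []
          for attr in ("A", "B", "C"):
              if attr in values:
                  # Attribute in the schema: hash to a single bucket.
                  axes.append([hash(values[attr]) % shares[attr]])
              else:
                  # Attribute missing: replicate across all its buckets.
                  axes.append(range(shares[attr]))
          # One Reduce process per point of the a x b x c grid.
          for reducer_id in product(*axes):
              yield reducer_id, (rel, t)

      shares = {"A": 2, "B": 2, "C": 2}  # k = 2 * 2 * 2 = 8 reducers
      print(list(shares_mapper("R", ("x", "y"), shares)))
      # -> two (reducer_id, tuple) pairs: replication 2, as predicted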

Echeverría Gemma - One of the best experts on this subject based on the ideXlab platform.

  • Optimal Handling and Postharvest Strategies to Reduce Losses of ‘Cuello Dama Negro’ Dark Figs (Ficus Carica L.)
    Informa UK Limited, 2021
    Co-Authors: Cantín, Celia M., Giné-bordonaba Jordi, Echeverría Gemma
    Abstract:

    The optimal postharvest handling to reduce postharvest decay and maintain the quality of ‘Cuello Dama Negro’ fresh dark figs grown in Spain has been studied. Different storage temperatures (0°C and 4°C), relative humidities (RH, 75% to 95%) and cooling strategies (delayed and intermittent cooling) were tested. Moreover, different postharvest strategies, such as 1-MCP (10 ppm), two different types of passive modified atmosphere packaging (Xtend® and LifePack MAP), and SO2-generating pads (UVASYS, Grapetek (Pty) Ltd.), were also tested. Storage at 0°C and 95% RH, together with MAP, effectively decreased postharvest rots and therefore increased the market life of ‘Cuello Dama Negro’ fresh figs without altering fruit quality or consumer liking. No improvement in the shelf life of the fruit was observed with the application of 1-MCP. The use of SO2-generating pads reduced decay but detrimentally affected fruit quality by inducing skin bleaching. Low temperature from harvest to consumption is crucial for maintaining the quality of fresh figs. In addition, EMAP is a low-cost technology able to reduce decay and maintain the quality of fresh figs for up to 2 weeks.

Lei Wang - One of the best experts on this subject based on the ideXlab platform.

  • PDCAT - Dacoop: Accelerating Data-Iterative Applications on Map/Reduce Cluster
    2011 12th International Conference on Parallel and Distributed Computing Applications and Technologies, 2011
    Co-Authors: Yi Liang, Lei Wang
    Abstract:

    Map/reduce is a popular parallel processing framework for massive-scale data-intensive computing. A data-iterative application is composed of a series of map/reduce jobs and needs to repeatedly process some data files across these jobs. Existing implementations of the map/reduce framework focus on performing data processing in a single pass with one map/reduce job and do not directly support data-iterative applications, particularly in terms of explicitly specifying the data that is repeatedly processed across jobs. In this paper, we propose an extended version of the Hadoop map/reduce framework called Dacoop. Dacoop extends the Map/Reduce programming interface to specify the repeatedly processed data, introduces a shared-memory-based data cache mechanism to cache that data from its first access, and adopts caching-aware task scheduling so that the cached data can be shared among the map/reduce jobs of a data-iterative application. We evaluate Dacoop on two typical data-iterative applications, k-means clustering and domain rule reasoning on the semantic web, with real and synthetic datasets. Experimental results show that data-iterative applications gain better performance on Dacoop than on Hadoop, reducing the turnaround time of a data-iterative application by up to 15.1%.
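
    To see what "data-iterative" means in practice, the sketch below (hypothetical Python, not Dacoop's code) runs k-means as a chain of map/reduce rounds. The point set is the repeatedly processed input: stock Hadoop re-reads it from disk every round, which is precisely the cost a shared-memory cache like Dacoop's avoids after the first access.

      from collections import defaultdict

      def kmeans_rounds(points, centroids, rounds=5):
          dist = lambda p, c: sum((a - b) ** 2 for a, b in zip(p, c))
          for _ in range(rounds):
              # Map: key each point by the index of its nearest centroid.
              groups = defaultdict(list)
              for p in points:  # the repeatedly processed data
                  nearest = min(range(len(centroids)),
                                key=lambda j: dist(p, centroids[j]))
                  groups[nearest].append(p)
              # Reduce: recompute each centroid as the mean of its group
              # (assumes every centroid keeps at least one point).
              centroids = [tuple(sum(xs) / len(groups[i]) for xs in zip(*groups[i]))
                           for i in sorted(groups)]
          return centroids

      pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
      print(kmeans_rounds(pts, [(0.0, 1.0), (4.0, 5.0)]))
      # -> approximately [(0.05, 0.1), (5.1, 4.9)]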

Foto N. Afrati - One of the best experts on this subject based on the ideXlab platform.

  • Upper and lower bounds on the cost of a Map-Reduce computation
    Very Large Data Bases, 2013
    Co-Authors: Anish Das Sarma, Foto N. Afrati, Semih Salihoglu, Jeffrey D. Ullman
    Abstract:

    In this paper we study the tradeoff between parallelism and communication cost in a Map-Reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of Map-Reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance d, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding pairs of strings at Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round Map-Reduce algorithms. We are also able to explore two-round Map-Reduce algorithms for matrix multiplication and show that these never have more communication, for a given reducer size, than the best one-round algorithm, and often have significantly less.

  • ICDE - Enumerating subgraph instances using Map-Reduce
    2013 IEEE 29th International Conference on Data Engineering (ICDE), 2013
    Co-Authors: Foto N. Afrati, Dimitris Fotakis, Jeffrey D. Ullman
    Abstract:

    The theme of this paper is how to find all instances of a given “sample” graph in a larger “data graph,” using a single round of Map-Reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single Map-Reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be “convertible,” in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

  • Enumerating Subgraph Instances Using Map-Reduce
    arXiv: Distributed Parallel and Cluster Computing, 2012
    Co-Authors: Foto N. Afrati, Dimitris Fotakis, Jeffrey D. Ullman
    Abstract:

    The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of Map-Reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single Map-Reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

  • EDBT - Map-Reduce extensions and recursive queries
    Proceedings of the 14th International Conference on Extending Database Technology - EDBT ICDT '11, 2011
    Co-Authors: Foto N. Afrati, Vinayak Borkar, Michael J. Carey, Neoklis Polyzotis, Jeffrey D. Ullman
    Abstract:

    We survey the recent wave of extensions to the popular Map-Reduce systems, including those that have begun to address the implementation of recursive queries using the same computing environment as Map-Reduce. A central problem is that recursive tasks cannot deliver their output only at the end, which makes recovery from failures much more complicated than in Map-Reduce and its nonrecursive extensions. We propose several algorithmic ideas for efficient implementation of recursions in the Map-Reduce environment and discuss several alternatives for supporting recovery from failures without restarting the entire job.

  • Optimizing Multiway Joins in a Map-Reduce Environment
    IEEE Transactions on Knowledge and Data Engineering, 2011
    Co-Authors: Foto N. Afrati, Jeffrey D. Ullman
    Abstract:

    Implementations of Map-Reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the Map-Reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using Map-Reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network.

Cantín, Celia M. - One of the best experts on this subject based on the ideXlab platform.

  • Optimal Handling and Postharvest Strategies to Reduce Losses of ‘Cuello Dama Negro’ Dark Figs (Ficus Carica L.)
    Informa UK Limited, 2021
    Co-Authors: Cantín, Celia M., Giné-bordonaba Jordi, Echeverría Gemma
    Abstract:

    The optimal postharvest handling to reduce postharvest decay and maintain the quality of ‘Cuello Dama Negro’ fresh dark figs grown in Spain has been studied. Different storage temperatures (0°C and 4°C), relative humidities (RH, 75% to 95%) and cooling strategies (delayed and intermittent cooling) were tested. Moreover, different postharvest strategies, such as 1-MCP (10 ppm), two different types of passive modified atmosphere packaging (Xtend® and LifePack MAP), and SO2-generating pads (UVASYS, Grapetek (Pty) Ltd.), were also tested. Storage at 0°C and 95% RH, together with MAP, effectively decreased postharvest rots and therefore increased the market life of ‘Cuello Dama Negro’ fresh figs without altering fruit quality or consumer liking. No improvement in the shelf life of the fruit was observed with the application of 1-MCP. The use of SO2-generating pads reduced decay but detrimentally affected fruit quality by inducing skin bleaching. Low temperature from harvest to consumption is crucial for maintaining the quality of fresh figs. In addition, EMAP is a low-cost technology able to reduce decay and maintain the quality of fresh figs for up to 2 weeks.
