Data-Intensive Applications

The experts below are selected from a list of 315 experts worldwide, ranked by the ideXlab platform.

Gaëtan Heidsieck - One of the best experts on this subject based on the ideXlab platform.

  • Distributed Management of Scientific Workflows for High-Throughput Plant Phenotyping
    2020
    Co-Authors: Gaëtan Heidsieck
    Abstract:

    In many scientific domains, such as bioscience, complex numerical experiments typically require many processing or analysis steps over huge datasets. They can be represented as scientific workflows, which ease the modeling, management, and execution of computational activities linked by data dependencies. As the size of the data processed and the complexity of the computation keep increasing, these workflows become data-intensive. To execute such workflows within a reasonable timeframe, they need to be deployed in a high-performance distributed computing environment, such as the cloud.

    Plant phenotyping aims at capturing plant characteristics, such as morphological, topological, and phenological features. High-throughput phenotyping (HTP) platforms have emerged to speed up phenotyping data acquisition in controlled conditions (e.g., greenhouses) or in the field. Such platforms generate terabytes of data used in plant breeding and plant biology to test novel mechanisms. These datasets are stored at different geo-distributed sites (data centers). Scientists can use a Scientific Workflow Management System (SWMS) to manage workflow execution over a multisite cloud.

    In bioscience, it is common for workflow users to reuse workflows or data generated by other users. Reusing and re-purposing workflows allows users to develop new analyses faster. Furthermore, a user may need to execute a workflow many times with different sets of parameters and input data to analyze the impact of some experimental step, represented as a workflow fragment, i.e., a subset of the workflow's activities and dependencies. In both cases, some fragments of the workflow may be executed many times, which can be highly resource-consuming and unnecessarily long. Re-execution can be avoided by storing the intermediate results of these fragments and reusing them in later executions.

    In this thesis, we propose an adaptive caching solution for the efficient execution of data-intensive workflows in monosite and multisite clouds. By adapting to variations in task execution times, our solution can maximize the reuse of intermediate data produced by workflows from multiple users. It is based on a new SWMS architecture that automatically manages the storage and reuse of intermediate data. Cache management is involved in two main steps: workflow preprocessing, which removes all fragments of the workflow that do not need to be executed; and cache provisioning, which decides at runtime which intermediate data should be cached. We propose an adaptive cache provisioning algorithm that deals with variations in task execution times and data sizes. We evaluated our solution by implementing it in OpenAlea and performing extensive experiments on real data with a complex data-intensive application in plant phenotyping.

    Our main contributions are: i) an SWMS architecture that handles caching and cache-aware scheduling when executing workflows in both monosite and multisite clouds; ii) a cost model that includes both financial and time costs for workflow execution and cache management; iii) two cache-aware scheduling algorithms, one for monosite and one for multisite clouds; and iv) an experimental validation on a data-intensive plant phenotyping application.
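
    The cache provisioning decision at the heart of this work can be illustrated with a small cost-based sketch. The names, constants, and the simple expected-cost rule below are illustrative assumptions rather than the thesis's actual algorithm or cost model: the idea is to cache an intermediate dataset only when the expected cost of recomputing it exceeds the cost of keeping it.

    ```python
    # Minimal sketch of cost-based cache provisioning for workflow fragments.
    # All names and constants are illustrative, not the thesis's API.
    from dataclasses import dataclass

    @dataclass
    class Fragment:
        exec_time_s: float      # observed time to (re)compute the fragment
        output_size_gb: float   # size of its intermediate output
        reuse_prob: float       # estimated probability a later run reuses it

    def should_cache(f: Fragment,
                     compute_cost_per_s: float = 0.0001,  # $/CPU-second (assumed)
                     storage_cost_per_gb: float = 0.02    # $/GB-month (assumed)
                     ) -> bool:
        """Cache iff expected recomputation cost exceeds storage cost."""
        recompute_cost = f.reuse_prob * f.exec_time_s * compute_cost_per_s
        storage_cost = f.output_size_gb * storage_cost_per_gb
        return recompute_cost > storage_cost

    # Example: a 2-hour fragment with a 10 GB output and an 80% chance of reuse.
    frag = Fragment(exec_time_s=7200, output_size_gb=10, reuse_prob=0.8)
    print(should_cache(frag))  # True: recomputing ($0.576) beats storing ($0.20)
    ```

    An adaptive variant, as described in the abstract, would keep updating exec_time_s and output_size_gb from observed runs instead of treating them as fixed.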

  • Distributed Caching of Scientific Workflows in Multisite Cloud
    2020
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are performed using scientific workflows, which are becoming more and more data-intensive. We consider the efficient execution of such workflows in the cloud, leveraging the heterogeneous resources available at multiple cloud sites (geo-distributed data centers). Since it is common for workflow users to reuse code or data from other workflows, a promising approach for efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. In this paper, we propose a solution for the distributed caching of scientific workflows in a multisite cloud. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation on a three-site cloud with a data-intensive application in plant phenotyping shows that our solution can yield major performance gains, reducing total execution time by up to 42% when 60% of the input data is the same for each new execution.
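
    A minimal sketch of the intuition behind cache-aware multisite scheduling, under assumed names and a simple transfer-minimizing rule (the paper's actual algorithms are more involved): send each task to the site where the data it would need to fetch is smallest, counting inputs already cached at that site as free.

    ```python
    # Illustrative cache-aware site selection (hypothetical names, not
    # OpenAlea's API): pick the site minimizing the data to be transferred.

    def pick_site(task_inputs: dict[str, float],   # input id -> size in GB
                  site_cache: dict[str, set[str]]  # site -> ids cached there
                  ) -> str:
        def transfer_gb(site: str) -> float:
            return sum(size for d, size in task_inputs.items()
                       if d not in site_cache[site])
        return min(site_cache, key=transfer_gb)

    sites = {"site-A": {"raw", "step1"}, "site-B": {"raw"}, "site-C": set()}
    inputs = {"raw": 50.0, "step1": 8.0}
    print(pick_site(inputs, sites))  # site-A: both inputs already cached there
    ```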

  • Efficient Execution of Scientific Workflows in the Cloud Through Adaptive Caching
    Transactions on Large-Scale Data- and Knowledge-Centered Systems, 2020
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are now carried out using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to variations in task execution times and output data size. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to a factor of 3.5 with 6 workflow re-executions.

  • Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud
    2019
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are now carried out using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to variations in task execution times and output data size. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to 120.16% with 6 workflow re-executions.

Honghao Gao - One of the best experts on this subject based on the ideXlab platform.

  • Data-Intensive Application Deployment at Edge: A Deep Reinforcement Learning Approach
    2019 IEEE International Conference on Web Services (ICWS), 2019
    Co-Authors: Yishan Chen, Shuiguang Deng, Hailiang Zhao, Honghao Gao
    Abstract:

    Mobile Edge Computing (MEC) has developed into a key component of the future mobile broadband network thanks to its low latency. In MEC, mobile devices can access data-intensive applications deployed at the edge, supported by the service and computing resources available on edge servers. However, deployment is difficult to manage when data transmission, user mobility, and load-balancing conditions change constantly among mobile devices, edge servers, and the cloud. In this paper, we propose an approach for formulating a Data-Intensive Application Edge Deployment Policy (DAEDP) that maximizes the latency reduction for mobile devices while minimizing the monetary cost for Application Service Providers (ASPs). The deployment problem is modelled as a Markov decision process, and a deep reinforcement learning strategy is proposed to derive the optimal policy by maximizing the long-term discounted reward. Extensive experiments are conducted to evaluate DAEDP. The results show that DAEDP outperforms four baseline approaches.
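
    A minimal sketch of the MDP framing might look as follows; the state encoding, action set, and reward shape are assumptions for illustration, and a tabular Q-learning update stands in for the paper's deep RL agent (same Bellman target, a lookup table instead of a neural network).

    ```python
    # Sketch of an MDP framing for edge deployment (names, states, and
    # reward shape are illustrative, not the paper's exact formulation).
    from collections import defaultdict

    ACTIONS = ["edge-0", "edge-1", "edge-2", "cloud"]  # where to deploy

    def reward(latency_saved_ms: float, deploy_cost: float,
               alpha: float = 1.0, beta: float = 0.5) -> float:
        # Trade latency reduction for users against monetary cost for the ASP.
        return alpha * latency_saved_ms - beta * deploy_cost

    Q = defaultdict(float)  # (state, action) -> estimated long-term reward

    def q_update(state, action, r, next_state, lr=0.1, gamma=0.95):
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += lr * (r + gamma * best_next - Q[(state, action)])

    # One step: in state "high-load", deploying on edge-1 saved 30 ms at cost 10.
    q_update("high-load", "edge-1", reward(30.0, 10.0), "normal-load")
    ```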

Haiyang Wang - One of the best experts on this subject based on the ideXlab platform.

  • A Workflow-Oriented Cloud Computing Framework and Programming Model for Data-Intensive Applications
    Computer Supported Cooperative Work in Design, 2011
    Co-Authors: Jinshan Pang, Lizhen Cui, Yongqing Zheng, Haiyang Wang
    Abstract:

    To support workflow-oriented applications on multiple data centers, this paper describes a workflow-oriented cloud computing framework, called WfOC. WfOC can run workflow jobs composed of multiple user-defined task functions extracted via Java annotations. The framework includes a workflow-oriented cloud computing programming language, task extraction and composition, task and data source registration, task-function mappers/reducers, and other components, and it enables users to focus on workflow definition and task logic without worrying about the distribution of data and target execution systems. It offers a mechanism for building workflow-oriented data-intensive applications with multiple heterogeneous Java runtime environments as the underlying computation platform. A case study of a social security application over multiple databases shows that the framework can streamline complex computational workflows.
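
    As a rough analogue of WfOC's annotation-based task extraction, here is a Python sketch in which decorators play the role of Java annotations; all names and the dispatch logic are hypothetical, not the framework's API.

    ```python
    # Python analogue of annotation-driven task extraction and composition
    # (hypothetical names; the actual framework is Java-based).
    TASK_REGISTRY: dict[str, dict] = {}

    def task(name: str, source: str):
        """Register a function as a workflow task bound to a data source."""
        def wrap(fn):
            TASK_REGISTRY[name] = {"fn": fn, "source": source}
            return fn
        return wrap

    @task(name="extract_claims", source="db://social-security/claims")
    def extract_claims(since: str):
        # The user focuses on task logic; placing the task near its data
        # source is the framework's job, not the user's.
        print(f"extracting claims since {since}")

    def run_workflow(steps: list[str], args: dict[str, dict]):
        """Naive sequential composition; the real framework would map each
        task to the data center holding its registered source."""
        for step in steps:
            entry = TASK_REGISTRY[step]
            print(f"dispatching {step} near {entry['source']}")
            entry["fn"](**args.get(step, {}))

    run_workflow(["extract_claims"], {"extract_claims": {"since": "2011-01"}})
    ```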

Jacek Kitowski - One of the best experts on this subject based on the ideXlab platform.

  • A Toolkit for Storage QoS Provisioning for Data-Intensive Applications
    Computer Science, 2012
    Co-Authors: Renata Slota, Dariusz Król, Kornel Skałkowski, Michal Orzechowski, Darin Nikolow, Bartosz Kryza, Michał Wrzeszcz, Jacek Kitowski
    Abstract:

    This paper describes a programming toolkit developed in the PL-Grid project, named QStorMan, which supports storage QoS provisioning for data-intensive applications in distributed environments. QStorMan exploits knowledge-oriented methods for matching storage resources to the non-functional requirements defined for a data-intensive application. To support various usage scenarios, QStorMan provides two interfaces: a programming library and a web portal. These allow the requirements to be defined either directly in an application's source code or through an intuitive graphical interface. The first way provides finer granularity, e.g., each portion of data processed by an application can define a different set of requirements. The second is aimed at supporting legacy applications whose source code cannot be modified. The toolkit has been evaluated using synthetic benchmarks and the production infrastructure of PL-Grid, in particular its storage infrastructure, which utilizes the Lustre file system.
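
    A minimal sketch of the requirement-to-resource matching idea, with hypothetical resource descriptions and a simple selection rule (QStorMan's real interface is a library and portal backed by knowledge-oriented methods):

    ```python
    # Illustrative matching of non-functional storage requirements to
    # resources (hypothetical names; not QStorMan's actual library API).
    RESOURCES = [
        {"name": "lustre-fast", "throughput_mb_s": 900, "free_tb": 2.0},
        {"name": "lustre-bulk", "throughput_mb_s": 250, "free_tb": 40.0},
    ]

    def match_storage(min_throughput_mb_s: float, needed_tb: float) -> str:
        """Pick the slowest resource that still satisfies the requirements,
        leaving faster storage available for more demanding requests."""
        ok = [r for r in RESOURCES
              if r["throughput_mb_s"] >= min_throughput_mb_s
              and r["free_tb"] >= needed_tb]
        if not ok:
            raise LookupError("no storage resource meets the requirements")
        return min(ok, key=lambda r: r["throughput_mb_s"])["name"]

    # Each portion of data can state different requirements, as in the toolkit:
    print(match_storage(min_throughput_mb_s=200, needed_tb=1.0))  # lustre-bulk
    ```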

Patrick Valduriez - One of the best experts on this subject based on the ideXlab platform.

  • Distributed Caching of Scientific Workflows in Multisite Cloud
    2020
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are performed using scientific workflows, which are becoming more and more data-intensive. We consider the efficient execution of such workflows in the cloud, leveraging the heterogeneous resources available at multiple cloud sites (geo-distributed data centers). Since it is common for workflow users to reuse code or data from other workflows, a promising approach for efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. In this paper, we propose a solution for the distributed caching of scientific workflows in a multisite cloud. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation on a three-site cloud with a data-intensive application in plant phenotyping shows that our solution can yield major performance gains, reducing total execution time by up to 42% when 60% of the input data is the same for each new execution.

  • Efficient Execution of Scientific Workflows in the Cloud Through Adaptive Caching
    Transactions on Large-Scale Data- and Knowledge-Centered Systems, 2020
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are now carried out using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to variations in task execution times and output data size. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to a factor of 3.5 with 6 workflow re-executions.

  • Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud
    2019
    Co-Authors: Gaëtan Heidsieck, Daniel De Oliveira, Esther Pacitti, Christophe Pradal, Francois Tardieu, Patrick Valduriez
    Abstract:

    Many scientific experiments are now carried out using scientific workflows, which are becoming more and more data-intensive and complex. We consider the efficient execution of such workflows in the cloud. Since it is common for workflow users to reuse other workflows or data generated by other workflows, a promising approach for efficient workflow execution is to cache intermediate data and exploit it to avoid task re-execution. In this paper, we propose an adaptive caching solution for data-intensive workflows in the cloud. Our solution is based on a new scientific workflow management architecture that automatically manages the storage and reuse of intermediate data and adapts to variations in task execution times and output data size. We evaluated our solution by implementing it in the OpenAlea system and performing extensive experiments on real data with a data-intensive application in plant phenotyping. The results show that adaptive caching can yield major performance gains, e.g., up to 120.16% with 6 workflow re-executions.