Task-Level Parallelism

The experts below are selected from a list of 49,320 experts worldwide, ranked by the ideXlab platform.

Mateo Valero - One of the best experts on this subject based on the ideXlab platform.

  • Picos: A hardware runtime architecture support for OmpSs
    Future Generation Computer Systems, 2015
    Co-Authors: Rosa M Badia, Fahimeh Yazdanpanah, Carlos Alvarez, Daniel Jimenez-Gonzalez, Mateo Valero
    Abstract:

    OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs to exploit heterogeneity and task parallelism based on runtime data-dependency analysis, dataflow scheduling and out-of-order task execution; it has greatly influenced version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library, Nanos++. Therefore, although powerful and easy to use, the performance benefits of exploiting fine-grained (pico) task parallelism are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support for the OmpSs programming model. Picos is a novel hardware dataflow-based task scheduler that dynamically analyzes inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos hardware design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the runtime performance scalability of Picos and Nanos++ with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using aggressive parallel strategies with a large number of fine-granularity tasks.
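
    The core runtime mechanism the abstract describes (tracking inter-task data dependencies and releasing tasks dataflow-style as their inputs become ready) can be sketched in a few lines. The following is an illustrative toy model only, not the Nanos++ or Picos implementation:

```python
from collections import defaultdict

class TaskGraph:
    """Toy model of runtime inter-task dependency analysis
    (illustrative sketch, not the Nanos++/Picos design)."""

    def __init__(self):
        self.tasks = []                # (func, ins, outs)
        self.deps = defaultdict(set)   # task index -> producer task indices
        self.last_writer = {}          # datum name -> last task that wrote it

    def add_task(self, func, ins=(), outs=()):
        i = len(self.tasks)
        self.tasks.append((func, tuple(ins), tuple(outs)))
        for d in ins:                  # read-after-write dependency on last writer
            if d in self.last_writer:
                self.deps[i].add(self.last_writer[d])
        for d in outs:                 # this task becomes the datum's last writer
            self.last_writer[d] = i
        return i

    def run(self):
        """Execute tasks dataflow-style; returns the execution order."""
        done, order = set(), []
        pending = set(range(len(self.tasks)))
        while pending:
            ready = sorted(i for i in pending if self.deps[i] <= done)
            assert ready, "cyclic dependencies"
            for i in ready:            # a real runtime runs these concurrently
                self.tasks[i][0]()
                done.add(i)
                order.append(i)
            pending -= set(ready)
        return order
```

    A task writing `x` followed by two tasks reading `x` yields one read-after-write edge each, and both readers become ready at the same time. WAR/WAW hazards, which real runtimes resolve by renaming or serialization, are omitted here for brevity.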

  • Available task-level parallelism on the Cell BE
    IEEE International Conference on High Performance Computing Data and Analytics, 2009
    Co-Authors: Alejandro Rico, Alex Ramirez, Mateo Valero
    Abstract:

    There is a clear industrial trend towards chip multiprocessors (CMPs) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models in which the programmer identifies parallel tasks and the runtime manages the inter-task dependencies have been identified as suitable for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to higher numbers of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task-management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task-management performance by 50%. We also identify memory latency as a fundamental aspect of performance, while the working set is not that large. We expect a significant performance impact if task management were to run using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.

Rosa M Badia - One of the best experts on this subject based on the ideXlab platform.

  • Picos: A hardware runtime architecture support for OmpSs
    Future Generation Computer Systems, 2015
    Co-Authors: Rosa M Badia, Fahimeh Yazdanpanah, Carlos Alvarez, Daniel Jimenez-Gonzalez, Mateo Valero

  • A dependency-aware task-based programming environment for multi-core architectures
    International Conference on Cluster Computing, 2008
    Co-Authors: Josep M Perez, Rosa M Badia, Jesus Labarta
    Abstract:

    Parallel programming on SMP and multi-core architectures is hard. In this paper we present a programming model for those environments, based on automatic function-level parallelism, that strives to be easy, flexible, portable, and performant. Its main trait is its ability to exploit task-level parallelism by analyzing task dependencies at run time. We present the programming environment in the context of algorithms from several domains and pinpoint its benefits compared to other approaches. We discuss its execution model and its scheduler. Finally, we analyze its performance and demonstrate that it offers reasonable performance without tuning, and that it can rival highly tuned libraries with minimal tuning effort.
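
    As an illustration of the annotation style such a model enables, here is a hypothetical Python analogue: functions are marked as tasks with named inputs and outputs, calls are deferred, and a barrier executes them in an order that respects the data dependencies. The decorator and function names are invented for this sketch; they are not the paper's actual API.

```python
_pending = []   # submitted task invocations awaiting a barrier

def task(ins=(), outs=()):
    """Hypothetical annotation: declare a function's data inputs/outputs."""
    def wrap(func):
        def submit(*args, **kw):
            _pending.append((func, args, kw, tuple(ins), tuple(outs)))
        return submit
    return wrap

def wait_all(preexisting=()):
    """Barrier: run submitted tasks respecting read-after-write order."""
    produced = set(preexisting)
    remaining, order = list(_pending), []
    _pending.clear()
    while remaining:
        runnable = [t for t in remaining
                    if all(d in produced for d in t[3])]
        assert runnable, "cyclic or unsatisfiable dependencies"
        for func, args, kw, ins, outs in runnable:
            func(*args, **kw)
            produced.update(outs)
            order.append(func.__name__)
        remaining = [t for t in remaining if t not in runnable]
    return order

# A sequential-looking program annotated with tasks:
data = {}

@task(outs=("a",))
def produce():
    data["a"] = 2

@task(ins=("a",), outs=("b",))
def consume():
    data["b"] = data["a"] * 2

consume()    # submitted first, but depends on produce()
produce()
order = wait_all()   # runs produce before consume
```

    Note that the submission order does not matter: the barrier reorders execution from the declared in/out sets, which is exactly what frees the programmer from placing synchronization by hand.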

Pieter Jonker - One of the best experts on this subject based on the ideXlab platform.

  • Designing area- and performance-constrained SIMD VLIW image processing architectures
    Advanced Concepts for Intelligent Vision Systems, 2005
    Co-Authors: Hamed Fatemi, Twan Basten, Henk Corporaal, Richard P Kleihorst, Pieter Jonker
    Abstract:

    Image processing is widely used in many applications, including medical imaging, industrial manufacturing and security systems. In these applications the images are often very large, the processing time must be short and real-time constraints must be met. Therefore, during the last decades there has been an increasing demand to exploit parallelism in applications. Parallelism can be explored along three axes: data-level parallelism (DLP), instruction-level parallelism (ILP) and task-level parallelism (TLP). This paper explores the limitations and bottlenecks of increasing support for parallelism along the DLP and ILP axes, in isolation and in combination. To scrutinize the effect of DLP and ILP in our architecture template, an area model based on the number of ALUs (ILP) and the number of processing elements (DLP) in the template is defined, as well as a performance model. Based on these models and the template, a set of kernels from image processing applications has been studied to find Pareto-optimal architectures in terms of area and number of cycles via multi-objective optimization.
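
    The exploration loop can be illustrated with toy stand-ins for the paper's models: an area model that grows with the number of processing elements (DLP) and ALUs per PE (ILP), a cycle-count model in which ILP saturates, and a Pareto filter over (area, cycles) pairs. The constants and model forms below are invented for illustration and are not the paper's calibrated models.

```python
def area(n_pe, n_alu, pe_base=2.0, alu_cost=1.0):
    # toy area model: each PE replicates a base datapath plus its ALUs
    return n_pe * (pe_base + alu_cost * n_alu)

def cycles(n_pe, n_alu, work=4096, max_ilp=4):
    # toy performance model: DLP scales with PEs, ILP saturates at max_ilp
    return work / (n_pe * min(n_alu, max_ilp))

def pareto(points):
    """Keep (area, cycles, cfg) points not dominated in both objectives."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and
                       (q[0] < p[0] or q[1] < p[1]) for q in points)]

# enumerate a (n_pe, n_alu) design grid and filter it
designs = [(area(p, a), cycles(p, a), (p, a))
           for p in (1, 2, 4, 8) for a in (1, 2, 4, 8)]
front = pareto(designs)
```

    With these toy models, adding ALUs beyond the ILP saturation point only costs area, so such configurations drop out of the front, mirroring the kind of bottleneck the paper analyzes.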

A Paschalis - One of the best experts on this subject based on the ideXlab platform.

  • A 3.3 Gbps CCSDS-123.0-B-1 multispectral/hyperspectral image compression hardware accelerator on a space-grade SRAM FPGA
    IEEE Transactions on Emerging Topics in Computing, 2021
    Co-Authors: Antonis Tsigkanos, N Kranitis, George Theodorou, A Paschalis
    Abstract:

    The explosive growth of data volume from next-generation high-resolution, high-speed hyperspectral remote sensing systems will compete with the limited on-board storage resources and bandwidth available for transmitting data to ground stations, making hyperspectral image compression a mission-critical and challenging on-board payload data processing task. The Consultative Committee for Space Data Systems (CCSDS) has issued recommended standard CCSDS-123.0-B-1 for lossless multispectral and hyperspectral image compression. In this paper, a hardware accelerator with very high data-rate performance is presented, implementing the CCSDS-123.0-B-1 algorithm as an IP core targeting a space-grade FPGA. For the first time, the introduced architecture, based on the principles of C-slow retiming, exploits the inherent task-level parallelism of the algorithm under BIP ordering and implements a reconfigurable fine-grained pipeline in critical feedback loops, achieving high throughput. The CCSDS-123.0-B-1 IP core exceeds current state-of-the-art data-rate performance with a maximum throughput of 213 MSamples/s (3.3 Gbps at 16 bits), using 11 percent of the LUTs and 27 percent of the BRAMs of the Virtex-5QV FPGA for a typical hyperspectral image and leveraging the full throughput of a single SpaceFibre lane. To the best of our knowledge, it is the fastest implementation of CCSDS-123.0-B-1 targeting a space-grade FPGA to date.
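
    The task-level parallelism exploited here follows from the sample ordering itself: in band-interleaved-by-pixel (BIP) order the spectral band index varies fastest, so each band's prediction state is revisited only once every NZ samples, and that slack is what a C-slowed pipeline fills with work from other bands. A minimal sketch of the ordering (illustrative only, not the IP core):

```python
def bip_order(ny, nx, nz):
    """Yield (z, y, x) sample coordinates in band-interleaved-by-pixel
    (BIP) order: the spectral band index z varies fastest."""
    for y in range(ny):
        for x in range(nx):
            for z in range(nz):
                yield (z, y, x)

samples = list(bip_order(ny=2, nx=2, nz=4))
# Consecutive samples belong to different bands, so each band's
# prediction feedback loop has nz samples of slack between updates:
band0_positions = [i for i, (z, _, _) in enumerate(samples) if z == 0]
```

    In a tiny 2x2x4 cube, band 0 is touched at positions 0, 4, 8, 12: four pipeline slots separate successive updates of the same per-band state, which is exactly the headroom C-slow retiming converts into pipeline stages.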

Peter Marwedel - One of the best experts on this subject based on the ideXlab platform.

  • Automatic extraction of task-level parallelism for heterogeneous MPSoCs
    International Conference on Parallel Processing, 2013
    Co-Authors: Daniel Cordes, Olaf Neugebauer, Michael Engel, Peter Marwedel
    Abstract:

    Heterogeneous multi-core platforms are increasingly attractive for embedded applications due to their adaptability and efficiency. This proliferation of heterogeneity demands new approaches for extracting thread-level parallelism from sequential applications that are efficient at runtime. We present, to the best of our knowledge, the first Integer Linear Programming (ILP)-based parallelization approach for heterogeneous multi-core platforms. Using hierarchical task graphs and high-level timing models, our approach balances the extracted tasks while considering performance differences between cores. As a result, we obtain considerable speedups at runtime, significantly outperforming tools for homogeneous systems. We evaluate our approach by parallelizing standard benchmarks from various application domains.
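
    The paper solves the task-to-core mapping with an ILP, which is not reproduced here; as a lightweight stand-in that illustrates the balancing objective, a greedy earliest-finish-time heuristic assigns each task to whichever heterogeneous core completes it soonest, given per-core speed factors:

```python
def map_tasks(task_costs, core_speeds):
    """Greedy earliest-finish-time mapping onto heterogeneous cores
    (a heuristic stand-in for the paper's ILP formulation).
    task_costs: baseline work per task; core_speeds: relative core speeds."""
    finish = [0.0] * len(core_speeds)   # current finish time per core
    mapping = []
    for cost in task_costs:
        # pick the core on which this task would finish earliest
        best = min(range(len(core_speeds)),
                   key=lambda c: finish[c] + cost / core_speeds[c])
        finish[best] += cost / core_speeds[best]
        mapping.append(best)
    return mapping, max(finish)
```

    A core twice as fast absorbs roughly twice the work, which is the balance the ILP achieves optimally; unlike the ILP, the greedy pass cannot backtrack on a bad early assignment.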

  • Multi-objective aware extraction of task-level parallelism using genetic algorithms
    Design Automation and Test in Europe, 2012
    Co-Authors: Daniel Cordes, Peter Marwedel
    Abstract:

    A large amount of research work has been done in the area of automatic parallelization over the decades, resulting in a huge number of tools intended to relieve the designer of the burden of manually parallelizing an application. Unfortunately, most of these tools optimize only the execution time by splitting applications into concurrently executed tasks. In the domain of embedded devices, however, it is not sufficient to look only at this criterion. Since most of these devices are constraint-driven regarding execution time, energy consumption, heat dissipation and other objectives, a good trade-off has to be found to efficiently map applications to multiprocessor system-on-chip (MPSoC) devices. Therefore, we developed a fully automated multi-objective aware parallelization framework that optimizes different objectives at the same time. The tool returns a Pareto-optimal front of solutions for the parallelized application to the designer, so that the solution with the best trade-off can be chosen.