The experts below are selected from a list of 15,516 experts worldwide ranked by the ideXlab platform.
Vivek Sarkar - One of the best experts on this subject based on the ideXlab platform.
- OpenSHMEM - Integrating Asynchronous Task Parallelism with OpenSHMEM
  OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments, 2016
  Co-Authors: Max Grossman, Vivek Kumar, Zoran Budimlic, Vivek Sarkar
  Abstract: Partitioned Global Address Space (PGAS) programming models combine shared and distributed memory features and provide a foundation for high-productivity parallel programming using lightweight one-sided communications. The OpenSHMEM programming interface has recently begun gaining popularity as a lightweight library-based approach for developing PGAS applications, in part through its use of a symmetric heap to realize more efficient implementations of global pointers than in other PGAS systems. However, current approaches to hybrid inter-node and intra-node parallel programming in OpenSHMEM rely on multithreaded programming models (e.g., pthreads, OpenMP) that harness intra-node parallelism but are opaque to the OpenSHMEM runtime. This OpenSHMEM+X approach can encounter performance challenges such as bottlenecks on shared resources, long pause times due to load imbalances, and poor data locality. Furthermore, OpenSHMEM+X requires the expertise of hero-level programmers, compared to the use of just OpenSHMEM. All of these are hard challenges to mitigate with incremental changes, and the situation will worsen as computing nodes increase their use of accelerators and heterogeneous memories.
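The key contrast in the abstract is between threads that are opaque to the communication runtime and tasks the runtime can see and schedule. The sketch below is a minimal intra-node task runtime of our own construction (it is not the paper's AsyncSHMEM API): worker threads drain one shared work queue, so every unit of work passes through the runtime rather than being hidden inside pthreads or OpenMP regions.

```python
# Conceptual sketch (not the paper's API): a tiny intra-node task runtime.
# Because all tasks flow through one queue the runtime owns, it could, in
# principle, coordinate them with communication, unlike opaque threads.
import threading
import queue

class TaskRuntime:
    def __init__(self, n_workers=4):
        self.tasks = queue.Queue()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(n_workers)]
        for w in self.workers:
            w.start()

    def _worker(self):
        while True:
            task = self.tasks.get()
            if task is None:        # shutdown sentinel
                break
            task()                  # run the asynchronous task
            self.tasks.task_done()

    def async_task(self, fn, *args):
        """Spawn fn(*args) as an asynchronous task."""
        self.tasks.put(lambda: fn(*args))

    def shutdown(self):
        self.tasks.join()           # wait for all submitted tasks
        for _ in self.workers:
            self.tasks.put(None)
        for w in self.workers:
            w.join()
```

A caller spawns tasks with `async_task` and calls `shutdown()` to wait for completion; the worker count, not the task count, bounds the number of OS threads.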
- SCnC: Efficient Unification of Streaming with Dynamic Task Parallelism
  International Journal of Parallel Programming, 2016
  Co-Authors: Dragoş Sbîrlea, Jun Shirako, Ryan Newton, Vivek Sarkar
  Abstract: Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelization. To take full advantage of the paradigm, programmers are typically required to learn a new language and re-implement their applications. This work shows that it is possible to exploit streaming as a safe and automatic optimization of a more general dataflow-based model, one in which computation kernels are written in standard, general-purpose languages and organized as a coordination graph. We propose streaming concurrent collections (SCnC), a streaming system that can efficiently run a subset of programs supported by concurrent collections (CnC). CnC is a general-purpose parallel programming paradigm that integrates task parallelism and dataflow computing. The proposed streaming support allows application developers to reason about their program as a general dataflow graph, while benefiting from the performance and tight memory footprint of stream parallelism when their program satisfies streaming constraints. In this paper, we formally define the application requirements for using SCnC and outline a static decision procedure for identifying and processing eligible SCnC subgraphs. We present initial results showing that transitioning from general CnC to SCnC leads to a throughput increase of up to 40× for certain benchmarks, and also enables programs with large data sizes to execute in available memory in cases where CnC execution may run out of memory.
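The "tight memory footprint" claim comes from streaming execution over bounded channels: memory is proportional to channel capacity, not data size. The following sketch (our own construction, not SCnC's actual interface) wires general-purpose kernels into a pipeline of bounded FIFO channels:

```python
# Conceptual sketch of streaming execution (not SCnC's API): kernels written
# as ordinary Python functions are connected by bounded FIFO channels, so
# only `capacity` items per channel are ever buffered at once.
import threading
import queue

DONE = object()  # end-of-stream marker

def kernel(fn, inp, out):
    """Apply fn to every item on the input channel, forwarding downstream."""
    while (item := inp.get()) is not DONE:
        out.put(fn(item))
    out.put(DONE)

def run_pipeline(items, stages, capacity=4):
    chans = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]

    def feed():
        for x in items:
            chans[0].put(x)
        chans[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=kernel, args=(fn, chans[i], chans[i + 1]))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    out = []
    while (item := chans[-1].get()) is not DONE:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Because every channel is bounded, an arbitrarily long input stream flows through in constant memory, which is the property SCnC exploits for eligible subgraphs.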
- Elastic Tasks: Unifying Task Parallelism and SPMD Parallelism with an Adaptive Runtime
  European Conference on Parallel Processing, 2015
  Co-Authors: Alina Sbirlea, Kunal Agrawal, Vivek Sarkar
  Abstract: In this paper, we introduce elastic tasks, a new high-level parallel programming primitive that can be used to unify task parallelism and SPMD parallelism in a common adaptive scheduling framework. Elastic tasks are internally parallel tasks and can run on a single worker or expand to take over multiple workers. An elastic task can be an ordinary task or an SPMD region that must be executed by one or more workers simultaneously, in a tightly coupled manner.
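The defining property of an elastic task is that its result does not depend on how many workers the scheduler grants it. A minimal sketch of that contract (names and structure are ours, not the paper's API):

```python
# Illustrative sketch (our own names, not the paper's runtime): an elastic
# task partitions its work across however many workers the scheduler grants,
# from w = 1 (ordinary task) up to many (SPMD-style region).
import threading

def run_elastic(work_items, granted_workers, body):
    """Execute body(item) over work_items on `granted_workers` threads."""
    results = []
    lock = threading.Lock()

    def worker(chunk):
        local = [body(x) for x in chunk]  # each worker handles its slice
        with lock:
            results.extend(local)

    # Strided partition: worker i takes items i, i+w, i+2w, ...
    chunks = [work_items[i::granted_workers] for i in range(granted_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The result set is identical whether the scheduler grants one worker or several; only the execution time changes, which is what lets a runtime choose the width adaptively.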
Eduard Ayguade - One of the best experts on this subject based on the ideXlab platform.
- PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite
  ACM Transactions on Architecture and Code Optimization, 2016
  Co-Authors: Dimitrios Chasapis, Eduard Ayguade, Marc Casas, Miquel Moreto, Raul Vidal, Jesús Labarta, Mateo Valero
  Abstract: In this work, we show how parallel applications can be implemented efficiently using task parallelism, and we evaluate the benefits of this paradigm with respect to other approaches. We use the PARSEC benchmark suite as our test bed, which includes applications representative of a wide range of domains, from HPC to desktop and server applications. We adopt different parallelization techniques, tailored to the needs of each application, to fully exploit the task-based model. Our evaluation shows that task parallelism achieves better performance than thread-based parallelization models such as Pthreads. Our experimental results show scalability improvements of up to 42% on a 16-core system and code size reductions of up to 81%. These reductions are achieved by removing application-specific schedulers or thread-pooling systems from the source code and transferring those responsibilities to the runtime system software.
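The code-size reductions come from deleting hand-rolled thread pools and schedulers and letting a tasking runtime own them. A small illustration of that division of labor (the kernel below is a toy stand-in, not a PARSEC workload):

```python
# Sketch of the paper's code-size argument: the application declares
# independent units of work; the runtime (here Python's executor, standing
# in for a tasking runtime) owns threads, queues, and scheduling.
from concurrent.futures import ThreadPoolExecutor

def halve_row(row):
    """Toy per-row kernel standing in for an application computation."""
    return [v // 2 for v in row]

def task_parallel_apply(image, max_workers=4):
    # No application-specific worker loop, condition variables, or thread
    # pool management: the runtime schedules one task per row.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(halve_row, image))
```

A Pthreads-style version of the same loop would need explicit thread creation, work partitioning, and joining; here all of that lives in the runtime.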
- LCPC - Unrolling Loops Containing Task Parallelism
  Languages and Compilers for Parallel Computing, 2010
  Co-Authors: Roger Ferrer, Alejandro Duran, Xavier Martorell, Eduard Ayguade
  Abstract: Classic loop unrolling increases the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when the loop contains parallelism inside, most compilers will ignore it or perform a naive transformation. We propose extending the semantics of the loop unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation tries to aggregate the multiple tasks that appear after a classic unrolling phase, to reduce the overhead per iteration. We present an implementation of such extended loop unrolling for OpenMP tasks with two phases: a classical unroll followed by a task aggregation phase. Our aggregation technique covers the special cases where task parallelism appears inside branches or where the loop is uncountable. Our experimental results show that using this extended unroll allows loops with fine-grained tasks to reduce the overheads associated with task creation and to scale much better.
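The effect of the two-phase transformation (unroll, then aggregate) can be sketched directly: instead of spawning one task per iteration, spawn one task per group of U iterations, plus a remainder task. This toy version is ours, not the paper's compiler output:

```python
# Sketch of unrolling + task aggregation (our toy model, not the paper's
# OpenMP implementation): the aggregated version creates one task per
# `unroll` iterations instead of one per iteration.
def spawn_tasks_naive(n, body, spawn):
    for i in range(n):
        spawn(lambda i=i: body(i))      # one task per iteration

def spawn_tasks_aggregated(n, body, spawn, unroll=4):
    i = 0
    while i + unroll <= n:              # unrolled portion
        lo = i                          # bind loop variable per task
        spawn(lambda lo=lo: [body(j) for j in range(lo, lo + unroll)])
        i += unroll
    if i < n:                           # remainder loop
        spawn(lambda lo=i: [body(j) for j in range(lo, n)])
```

With n = 10 and unroll = 4, the naive version creates 10 tasks while the aggregated one creates 3 (covering 4 + 4 + 2 iterations), cutting per-task creation overhead for fine-grained bodies.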
- Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP
  International Conference on Parallel Processing, 2009
  Co-Authors: Alejandro Duran, Roger Ferrer, Xavier Martorell, Xavier Teruel, Eduard Ayguade
  Abstract: Traditional parallel applications have exploited regular parallelism, based on parallel loops; only a few applications exploit sections parallelism. With the release of the new OpenMP specification (3.0), this programming model supports tasking. Parallel tasks allow the exploitation of irregular parallelism, but there is a lack of benchmarks exploiting tasks in OpenMP. With current (and projected) multicore architectures, which offer many more alternatives for executing parallel applications than traditional SMP machines, this kind of parallelism is increasingly important, and so is the need for a set of benchmarks to evaluate it. In this paper, we motivate the need for such a benchmark suite for irregular and/or recursive task parallelism. We present our proposal, the Barcelona OpenMP Tasks Suite (BOTS), a set of applications exploiting regular and irregular parallelism based on tasks. We present an overall evaluation of the BOTS benchmarks on an Altix system and discuss some of the experiments that can be done with the different compilation and runtime alternatives of the benchmarks.
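The recursive, irregular parallelism BOTS targets is typified by its Fibonacci benchmark: each call spawns a task for one branch and uses a sequential cutoff below which task creation is not worth its overhead. A hedged sketch of that pattern in Python (not BOTS code, which is C with OpenMP task pragmas):

```python
# Sketch in the spirit of the BOTS fib benchmark (our construction): one
# recursive branch is spawned as a task, the other is computed inline, and
# a cutoff switches to serial execution for small subproblems.
from concurrent.futures import ThreadPoolExecutor

def fib_seq(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_task(n, pool, cutoff=10):
    if n < cutoff:
        return fib_seq(n)                # serial below the cutoff
    child = pool.submit(fib_task, n - 1, pool, cutoff)  # spawn one branch
    other = fib_task(n - 2, pool, cutoff)               # compute the other
    return child.result() + other                       # join

with ThreadPoolExecutor(max_workers=8) as pool:
    result = fib_task(12, pool)
```

The cutoff is the same tuning knob BOTS exposes for its recursive benchmarks: too low and task-creation overhead dominates; too high and parallelism disappears.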
Pedro Palomo - One of the best experts on this subject based on the ideXlab platform.
- A Process Network Model for Reactive Streaming Software with Deterministic Task Parallelism
  Fundamental Approaches to Software Engineering, 2018
  Co-Authors: Fotios Gioulekas, Peter Poplavko, Panagiotis Katsaros, Saddek Bensalem, Pedro Palomo
  Abstract: A formal semantics is introduced for a process network model that combines streaming and reactive control processing with task parallelism properties suited to exploiting multi-cores. Applications that react to environment stimuli are implemented by communicating sporadic and periodic tasks, programmed independently of an execution platform. Two functionally equivalent semantics are defined, one for sequential execution and one for real-time execution. The former ensures functional determinism by implying precedence constraints between jobs (task executions); hence, the program outputs are independent of the task scheduling. The latter specifies concurrent execution on a real-time platform, guaranteeing all of the model's constraints; it has been implemented in an executable formal specification language. The model's implementation runs on multi-core embedded systems and supports integration of run-time managers for shared HW/SW resources (e.g., for controlling QoS, resource interference, or power consumption). Finally, a model transformation approach has been developed, which allowed a real spacecraft on-board application to be ported and statically scheduled on an industrial multi-core platform.
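The determinism claim rests on precedence constraints between jobs: any schedule that respects writer-before-reader edges yields the same outputs. A small sketch of that idea (our own toy encoding, not the paper's formal semantics):

```python
# Conceptual sketch (ours, not the paper's model): each job reads and writes
# named variables, and precedence constraints force writers before readers.
# Any order respecting them yields identical outputs.
def run_schedule(order, jobs, env):
    """Execute jobs in the given order; returns the final environment."""
    env = dict(env)
    for name in order:
        reads, writes, fn = jobs[name]
        env[writes] = fn(*(env[r] for r in reads))
    return env

jobs = {
    "sample": ((), "x", lambda: 7),            # sporadic input job
    "filter": (("x",), "y", lambda x: x + 1),  # must run after sample
    "log":    (("x",), "z", lambda x: 2 * x),  # must also run after sample
}

# Two different schedules, both respecting sample -> {filter, log}:
env1 = run_schedule(["sample", "filter", "log"], jobs, {})
env2 = run_schedule(["sample", "log", "filter"], jobs, {})
```

Here `env1 == env2` regardless of how the scheduler orders the independent jobs, which is the functional-determinism property the sequential semantics guarantees.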
Panagiotis Katsaros - One of the best experts on this subject based on the ideXlab platform.
- A Process Network Model for Reactive Streaming Software with Deterministic Task Parallelism
  Fundamental Approaches to Software Engineering, 2018
  Co-Authors: Fotios Gioulekas, Peter Poplavko, Panagiotis Katsaros, Saddek Bensalem, Pedro Palomo
  Abstract: A formal semantics is introduced for a process network model that combines streaming and reactive control processing with task parallelism properties suited to exploiting multi-cores. Applications that react to environment stimuli are implemented by communicating sporadic and periodic tasks, programmed independently of an execution platform. Two functionally equivalent semantics are defined, one for sequential execution and one for real-time execution. The former ensures functional determinism by implying precedence constraints between jobs (task executions); hence, the program outputs are independent of the task scheduling. The latter specifies concurrent execution on a real-time platform, guaranteeing all of the model's constraints; it has been implemented in an executable formal specification language. The model's implementation runs on multi-core embedded systems and supports integration of run-time managers for shared HW/SW resources (e.g., for controlling QoS, resource interference, or power consumption). Finally, a model transformation approach has been developed, which allowed a real spacecraft on-board application to be ported and statically scheduled on an industrial multi-core platform.
P. Sadayappan - One of the best experts on this subject based on the ideXlab platform.
- Scioto: A Framework for Global-View Task Parallelism
  37th International Conference on Parallel Processing, 2008
  Co-Authors: James Dinan, Sriram Krishnamoorthy, Brian D. Larkins, Jarek Nieplocha, P. Sadayappan
  Abstract: We introduce Scioto (shared collections of task objects), a lightweight framework for providing task management on distributed-memory machines under one-sided and global-view parallel programming models. Scioto provides locality-aware dynamic load balancing and interoperates with MPI, ARMCI, and Global Arrays. Additionally, Scioto's task model and programming interface are compatible with many other existing parallel models, including UPC, SHMEM, and CAF. Through task parallelism, the Scioto framework provides a solution for overcoming irregularity, load imbalance, and heterogeneity, as well as for dynamically mapping computation onto emerging architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the unbalanced tree search (UTS) benchmark and two quantum chemistry codes: the closed-shell self-consistent field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.
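The core abstraction is a shared collection of task objects that workers drain, where a running task may add new tasks; this is what balances irregular workloads like UTS. A single-node sketch of that pattern (our own construction; Scioto itself is a C library for distributed memory):

```python
# Sketch of a shared task collection (not Scioto's C API): workers pull task
# objects from one shared collection, and tasks may insert new tasks, so
# irregular work spreads across workers dynamically.
import threading
import queue

def run_collection(initial, task_fn, n_workers=4):
    """task_fn(item) -> (result, [child items]); returns all results."""
    tasks = queue.Queue()
    results = []
    rlock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:               # shutdown sentinel
                tasks.task_done()
                return
            result, children = task_fn(item)
            with rlock:
                results.append(result)
            for c in children:
                tasks.put(c)               # tasks may create tasks
            tasks.task_done()

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for t in initial:
        tasks.put(t)
    tasks.join()                           # waits for spawned children too
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return results
```

Running this on an unbalanced-tree expansion (each task spawns its node's children) gives UTS-like behavior: no worker knows the tree shape in advance, yet the collection keeps all workers fed. The real framework adds distributed queues and locality-aware stealing on top of this idea.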