Parallel Programming Model

The experts below are selected from a list of 33,399 experts worldwide ranked by the ideXlab platform.

Wen-mei W. Hwu - One of the best experts on this subject based on the ideXlab platform.

  • DySel: Lightweight Dynamic Selection for Kernel-based Data-Parallel Programming Model
    Architectural Support for Programming Languages and Operating Systems, 2016
    Co-Authors: Li-Wen Chang, Hee-Seok Kim, Wen-mei W. Hwu
    Abstract:

    The rising pressure to simultaneously improve performance and reduce power is driving more diversity into all aspects of computing devices. An algorithm that is well matched to the target hardware can run multiple times faster and more energy-efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers have been faced with the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system for automating such determination for kernel-based data-parallel programming models such as OpenCL, CUDA, OpenACC, and C++ AMP. These programming models cover many applications that demand high performance in mobile, cloud, and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. The test deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% overhead in the worst observed case when compared to an oracle. We show four major use cases where DySel provides significantly more consistent performance without tedious effort from the developer.
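    The micro-profiling idea above can be illustrated with a short sketch. This is a hypothetical Python simplification, not DySel's actual runtime (which targets OpenCL/CUDA-style kernels); `dysel_run` and the two candidate kernels are invented names:

    ```python
    import time

    def dysel_run(candidates, data, sample_size=64):
        """Pick the fastest candidate by timing each one on a small slice
        of the real input, then finish the remaining work with the winner."""
        results, timings = [], {}
        pos = 0
        for kernel in candidates:                  # micro-profiling phase
            chunk = data[pos:pos + sample_size]
            start = time.perf_counter()
            results.extend(kernel(chunk))          # test runs contribute to output
            timings[kernel] = time.perf_counter() - start
            pos += sample_size
        winner = min(timings, key=timings.get)     # best kernel for this device/data
        results.extend(winner(data[pos:]))         # production phase
        return results

    # Two hypothetical candidate implementations of the same operation.
    def kernel_a(xs): return [x * x for x in xs]
    def kernel_b(xs): return [x ** 2 for x in xs]

    out = dysel_run([kernel_a, kernel_b], list(range(1000)))
    ```

    Note how the profiled slices end up in `results`: as in the paper's micro-profiling, the test deployments are not wasted work but part of the final answer.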

  • MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
    Languages and Compilers for Parallel Computing, 2008
    Co-Authors: John A. Stratton, Sam S. Stone, Wen-mei W. Hwu
    Abstract:

    CUDA is a data-parallel programming model that supports several key abstractions - thread blocks, hierarchical memory, and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared-memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data into replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
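    The core MCUDA transformation - SPMD threads become explicit loops, with loop fission standing in for a barrier - can be sketched as follows. This is an illustrative Python analogue, not the actual C source-to-source compiler, and the kernel shown is invented:

    ```python
    # A CUDA-style SPMD kernel body with a barrier between two stages:
    #     shared[tid] = inp[tid] * 2                       (stage 1)
    #     __syncthreads()                                  (barrier)
    #     out[tid] = shared[tid] + shared[(tid+1) % n]     (stage 2)
    # MCUDA-style loop fission turns the barrier into two thread loops.

    def kernel_serialized(inp, block_size):
        shared = [0] * block_size
        out = [0] * block_size
        for tid in range(block_size):        # loop replacing the stage-1 threads
            shared[tid] = inp[tid] * 2
        # The barrier disappears: the loop boundary guarantees every
        # "thread" finished stage 1 before any enters stage 2.
        for tid in range(block_size):        # loop replacing the stage-2 threads
            out[tid] = shared[tid] + shared[(tid + 1) % block_size]
        return out
    ```

    The two loops preserve the kernel's semantics exactly, which is why the real compiler can then parallelize or vectorize them for CPU cores.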

  • Implicitly Parallel Programming Models for Thousand-core Microprocessors
    Design Automation Conference, 2007
    Co-Authors: Wen-mei W. Hwu, Sam S. Stone, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel Mahesri, Stephanie C. Tsao
    Abstract:

    This paper argues for an implicitly parallel programming model for many-core microprocessors and provides initial technical approaches towards this goal. In an implicitly parallel programming model, programmers maximize algorithm-level parallelism, express their parallel algorithms by asserting high-level properties on top of a traditional sequential programming language, and rely on parallelizing compilers and hardware support to perform parallel execution under the hood. In such a model, compilers and related tools require much more advanced program analysis capabilities and programmer assertions than are currently available, so that a comprehensive understanding of the input program's concurrency can be derived. This understanding is then used to drive automatic or interactive parallel code generation tools for a diverse set of parallel hardware organizations. The chip-level architecture and hardware should maintain parallel execution state in such a way that a strictly sequential execution state can always be derived for the purpose of verifying and debugging the program. We argue that implicitly parallel programming models are critical for addressing the software development crisis and software scalability challenges of many-core microprocessors.
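    The division of labor the paper argues for - the programmer asserts a high-level property, and the system decides whether to run in parallel - can be caricatured in a few lines. This is a hypothetical sketch; `parallel_if_independent` is an invented stand-in for the proposed compiler/runtime machinery:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def parallel_if_independent(f, xs, independent=False):
        """Hypothetical runtime hook: the programmer asserts that the
        iterations are independent; only then may the system parallelize."""
        if independent:
            with ThreadPoolExecutor() as pool:
                return list(pool.map(f, xs))
        return [f(x) for x in xs]    # sequential semantics always recoverable

    squares = parallel_if_independent(lambda x: x * x, range(8), independent=True)
    ```

    Both branches compute the same result, mirroring the paper's requirement that a strictly sequential execution can always be derived for verification and debugging.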

Thierry Van Cutsem - One of the best experts on this subject based on the ideXlab platform.

  • Power System Dynamic Simulations Using a Parallel Two-level Schur-complement Decomposition
    IEEE Transactions on Power Systems, 2016
    Co-Authors: Petros Aristidou, Simon Lebeau, Thierry Van Cutsem
    Abstract:

    As the need for faster power system dynamic simulations increases, it is essential to develop new algorithms that exploit parallel computing to accelerate those simulations. This paper proposes a parallel algorithm based on a two-level, Schur-complement-based domain decomposition method. The two-level partitioning provides high parallelization potential (coarse- and fine-grained). In addition, due to the Schur-complement approach used to update the sub-domain interface variables, the algorithm exhibits a high global convergence rate. Finally, it provides significant numerical and computational acceleration. The algorithm is implemented using the shared-memory parallel programming model, targeting inexpensive multi-core machines. Its performance is reported on a real system as well as on a large test system combining transmission and distribution networks.
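    The Schur-complement elimination at the heart of such decompositions can be shown on a toy star-shaped system with scalar (1x1) blocks. This is a generic textbook sketch, not the paper's implementation: each subdomain i satisfies A_i x_i + B_i x_g = f_i, and the interface equation is sum_i C_i x_i + D x_g = fg.

    ```python
    def schur_solve(subdomains, D, fg):
        """Solve a star-shaped block system by eliminating each subdomain's
        interior unknown x_i, solving the reduced (Schur complement) system
        for the interface unknown x_g, then back-substituting."""
        S, rhs = D, fg
        for A, B, C, f in subdomains:        # each term is independent -> parallel
            S -= C * B / A                   # accumulate the Schur complement
            rhs -= C * f / A
        xg = rhs / S                         # the small global interface solve
        xs = [(f - B * xg) / A for A, B, C, f in subdomains]   # back-substitution
        return xs, xg

    # Two subdomains (A, B, C, f) coupled through one interface variable.
    xs, xg = schur_solve([(2, 1, 1, 4), (4, 2, 2, 8)], D=5, fg=9)
    ```

    The per-subdomain terms in both loops are independent of one another, which is exactly what makes the method amenable to the coarse-grained parallelism the paper exploits.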

  • Dynamic Simulation of Large-scale Power Systems Using a Parallel Schur-complement-based Decomposition Method
    IEEE Transactions on Parallel and Distributed Systems, 2014
    Co-Authors: Petros Aristidou, Davide Fabozzi, Thierry Van Cutsem
    Abstract:

    Power system dynamic simulations are crucial for the operation of electric power systems, as they provide important information on the dynamic evolution of the system after a disturbance. This paper proposes a robust, accurate, and efficient parallel algorithm based on the Schur complement domain decomposition method. The algorithm provides numerical and computational acceleration of the procedure. A parallel implementation of the proposed algorithm, based on the shared-memory parallel programming model, is presented. The implementation is general, portable, and scalable on inexpensive, shared-memory, multi-core machines. Two realistic test systems, one medium-scale and one large-scale, are used for performance evaluation of the proposed method.
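    Because the subdomain computations share no state, the shared-memory model maps naturally onto a thread pool. A minimal hypothetical sketch - the real algorithm integrates DAE subsystems, not the scalar placeholder used here:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def solve_subdomain(sub):
        """Stand-in for one subdomain's local solve step."""
        A, f = sub
        return f / A

    def parallel_step(subdomains):
        # The subdomain solves are independent, so the shared-memory model
        # lets a thread pool run them concurrently on a multi-core machine.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(solve_subdomain, subdomains))

    locals_ = parallel_step([(2.0, 4.0), (4.0, 8.0), (5.0, 10.0)])
    ```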

  • A Schur Complement Method for DAE Systems in Power System Dynamic Simulations
    2014
    Co-Authors: Petros Aristidou, Davide Fabozzi, Thierry Van Cutsem
    Abstract:

    This paper proposes a Schur-complement-based domain decomposition method to accelerate the time-domain simulation of large, nonlinear, and stiff differential-algebraic equation (DAE) systems stemming from power system dynamic studies. The proposed algorithm employs a star-shaped decomposition scheme and exploits the locality and sparsity of the system. The simulation is accelerated by the use of quasi-Newton schemes and parallel programming techniques. The proposed algorithm is implemented using the shared-memory parallel programming model and tested on a large-scale, realistic power system model, showing significant speedup.
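    One of the accelerations mentioned, quasi-Newton iteration, amounts to reusing a Jacobian evaluated once instead of rebuilding it every step. A generic scalar sketch of that trade-off, not the paper's solver:

    ```python
    def quasi_newton(g, dg0, x0, tol=1e-10, max_iter=100):
        """Solve g(x) = 0 reusing the derivative dg0 evaluated once at x0,
        trading faster (quadratic) convergence for a cheaper iteration:
        no derivative re-evaluation or refactorization per step."""
        x = x0
        for _ in range(max_iter):
            fx = g(x)
            if abs(fx) < tol:
                break
            x -= fx / dg0        # same frozen "Jacobian" every step
        return x

    # Example: x^2 - 2 = 0, with the derivative frozen at x0 = 2 (dg0 = 4).
    root = quasi_newton(lambda x: x * x - 2, 4.0, 2.0)
    ```

    The frozen derivative only slows convergence from quadratic to linear; for large DAE systems the savings from skipping Jacobian factorizations usually dominate, which is the rationale for using such schemes in the simulation.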

Wolfgang De Meuter - One of the best experts on this subject based on the ideXlab platform.

  • Partitioned Global Address Space Languages
    ACM Computing Surveys, 2015
    Co-Authors: Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, Wolfgang De Meuter
    Abstract:

    The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of nonuniform communication costs. Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions, and finally, how data is accessed across partitions. Our taxonomy reveals that today's PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.
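    The two defining PGAS ingredients - one global index space, plus awareness that remote accesses cost more - can be mimicked in a few lines. A hypothetical sketch; `PGASArray` and its cost counters are invented for illustration:

    ```python
    class PGASArray:
        """Minimal PGAS-style sketch: one global index space, data block-
        partitioned across 'places'; every access works, but remote
        accesses are counted as expensive, mirroring the model's
        awareness of nonuniform communication costs."""
        def __init__(self, data, num_places, my_place):
            self.data = list(data)
            self.chunk = (len(self.data) + num_places - 1) // num_places
            self.my_place = my_place
            self.local_reads = 0
            self.remote_reads = 0

        def owner(self, i):
            return i // self.chunk          # block distribution of the indices

        def read(self, i):                  # global address space: any index works
            if self.owner(i) == self.my_place:
                self.local_reads += 1       # cheap: same partition
            else:
                self.remote_reads += 1      # costly: would be communication
            return self.data[i]

    a = PGASArray(range(8), num_places=2, my_place=0)
    total = sum(a.read(i) for i in range(8))
    ```

    This illustrates the survey's "two-level" cost model (local vs. remote); the richer, multi-level cost models it identifies as an open challenge would need more than one remote tier.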

Bryan Ford - One of the best experts on this subject based on the ideXlab platform.

  • Efficient System-Enforced Deterministic Parallelism
    Communications of the ACM, 2012
    Co-Authors: Amittai Aviram, Shu-Chun Weng, Bryan Ford
    Abstract:

    Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur high costs, allow misbehaved software to defeat repeatability, and transform time-dependent races into input- or path-dependent races without eliminating them. We introduce a new parallel programming model addressing these issues, and use Determinator, a proof-of-concept OS, to demonstrate the model's practicality. Determinator's microkernel application programming interface (API) provides only "shared-nothing" address spaces and deterministic interprocess communication primitives to make execution of all unprivileged code, well-behaved or not, precisely repeatable. Atop this microkernel, Determinator's user-level runtime offers a private workspace model for both thread-level and process-level parallel programming. This model avoids the introduction of read/write data races and converts write/write races into reliably detected conflicts. Coarse-grained parallel benchmarks perform and scale comparably to non-deterministic systems, both on multicore PCs and across nodes in a distributed cluster.
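    The private workspace model can be sketched sequentially: each task works on its own copy of the shared state, and merging at the join point turns write/write races into detected conflicts. A hypothetical simplification, not Determinator's microkernel mechanism:

    ```python
    def run_deterministic(workspace, tasks):
        """Private-workspace sketch (a sequential stand-in for real threads):
        each task mutates its own copy of the shared state, changes merge
        only at the join point, and two tasks writing the same variable
        surface as a detected conflict rather than a silent race."""
        snapshots = [dict(workspace) for _ in tasks]     # one private copy each
        for task, snap in zip(tasks, snapshots):
            task(snap)                                   # runs on its copy only
        merged, written = dict(workspace), set()
        for snap in snapshots:
            for key, val in snap.items():
                if workspace.get(key) != val:            # this task wrote `key`
                    if key in written:
                        raise RuntimeError(f"write/write conflict on {key!r}")
                    written.add(key)
                    merged[key] = val
        return merged

    state = run_deterministic({"x": 0, "y": 0},
                              [lambda w: w.__setitem__("x", 1),
                               lambda w: w.__setitem__("y", 2)])
    ```

    Read/write races cannot arise because every task reads only its snapshot; write/write races become the raised conflict, matching the behavior the abstract describes.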

  • Efficient System-Enforced Deterministic Parallelism
    2010
    Co-Authors: Amittai Aviram, Shu-Chun Weng, Bryan Ford
    Abstract:

    Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur high costs, allow misbehaved software to defeat repeatability, and transform time-dependent races into input- or path-dependent races without eliminating them. We introduce a new parallel programming model addressing these issues, and use Determinator, a proof-of-concept OS, to demonstrate the model's practicality. Determinator's microkernel API provides only "shared-nothing" address spaces and deterministic interprocess communication primitives to make execution of all unprivileged code, well-behaved or not, precisely repeatable. Atop this microkernel, Determinator's user-level runtime adapts optimistic replication techniques to offer a private workspace model for both thread-level and process-level parallel programming. This model avoids the introduction of read/write data races and converts write/write races into reliably detected conflicts. Coarse-grained parallel benchmarks perform and scale comparably to nondeterministic systems, both on multicore PCs and across nodes in a distributed cluster.

  • Deterministic Consistency: A Programming Model for Shared-Memory Parallelism
    arXiv: Operating Systems, 2010
    Co-Authors: Amittai Aviram, Bryan Ford
    Abstract:

    The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose deterministic consistency (DC), a parallel programming model as easy to understand as the "parallel assignment" construct in sequential languages such as Perl and JavaScript, where concurrent threads always read their inputs before writing shared outputs. DC supports common data- and task-parallel synchronization abstractions such as fork/join and barriers, as well as non-hierarchical structures such as producer/consumer pipelines and futures. A preliminary prototype suggests that software-only implementations of DC can run applications written for popular parallel environments such as OpenMP with low (< 10%) overhead for some applications.
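    The "parallel assignment" analogy can be made concrete: all right-hand sides read a snapshot of the old state, and the writes commit afterwards, so thread ordering cannot affect the result. A hypothetical sketch, not the DC prototype:

    ```python
    def parallel_assign(env, updates):
        """Deterministic-consistency flavor of 'parallel assignment': every
        right-hand side reads the old values (a snapshot), and all writes
        commit at once, so the outcome never depends on evaluation order."""
        snapshot = dict(env)                   # threads read their inputs...
        new_values = {var: f(snapshot) for var, f in updates.items()}
        env.update(new_values)                 # ...before shared outputs change
        return env

    # Classic example: swapping without a temporary, as in (x, y) = (y, x).
    env = parallel_assign({"x": 1, "y": 2},
                          {"x": lambda s: s["y"], "y": lambda s: s["x"]})
    ```

    Evaluating the updates in any order, or concurrently, gives the same swapped result, which is the determinism property the model is built around.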
