Dynamic Parallelism

The Experts below are selected from a list of 12,237 Experts worldwide, ranked by the ideXlab platform

Michela Becchi - One of the best experts on this subject based on the ideXlab platform.

  • Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Hancheng Wu, Da Li, Michela Becchi
    Abstract:

    GPUs have been widely used to accelerate computations exhibiting simple patterns of Parallelism -- such as flat or two-level Parallelism -- and a degree of Parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of Parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested Parallelism at runtime. However, the effective use of DP must still be understood: a naïve use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naïve use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
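
    The consolidation idea above can be illustrated with a minimal Python cost model (the names, the fixed launch-overhead constant, and the per-item costs are all illustrative assumptions, not the paper's compiler transformations):

```python
LAUNCH_OVERHEAD = 10  # assumed fixed cost per kernel launch (arbitrary units)

def naive_dp_cost(child_workloads):
    """Naive DP: each parent thread launches its own small child kernel."""
    return sum(LAUNCH_OVERHEAD + w for w in child_workloads)

def consolidated_cost(child_workloads):
    """Consolidated: all child work is buffered into one aggregated kernel."""
    return LAUNCH_OVERHEAD + sum(child_workloads)

workloads = [3, 1, 4, 1, 5]  # irregular per-parent work sizes
assert consolidated_cost(workloads) < naive_dp_cost(workloads)
```

    Aggregating the children pays the per-launch overhead once for the whole batch instead of once per tiny child kernel, which is the intuition behind the large gains reported over naive DP-based code.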

  • Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs
    Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    Graphics Processing Units (GPUs) have been used in general purpose computing for several years. The newly introduced Dynamic Parallelism feature of Nvidia's Kepler GPUs allows launching kernels from the GPU directly. However, the naive use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code transformation techniques for applications with irregular nested loops.
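
    One way to picture consolidation for irregular nested loops is to flatten the variable-length inner loops into a single worklist that one nested kernel can process evenly; this host-side Python sketch (hypothetical names, not the authors' implementation) shows the idea:

```python
def flatten_irregular_nested(rows):
    # rows: outer-loop items, each with a different amount of inner work
    # (e.g. adjacency lists). Flattening turns many tiny nested launches
    # into one large, evenly distributable worklist of (outer, inner) pairs.
    return [(i, j) for i, row in enumerate(rows) for j in range(len(row))]

worklist = flatten_irregular_nested([[10], [20, 21, 22], []])
assert worklist == [(0, 0), (1, 0), (1, 1), (1, 2)]
```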

  • Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
    2015 44th International Conference on Parallel Processing, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    The effective deployment of applications exhibiting irregular nested Parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested Parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on Dynamic Parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested Parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested Parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
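
    For the recursive pattern, one common template replaces recursion with a level-by-level frontier, so that each level maps to one parallel step; a small Python sketch (illustrative only, not the paper's GPU templates):

```python
def frontier_traversal(children, root):
    # Process the tree one frontier (level) at a time; on a GPU each
    # frontier would be handled by one kernel or one nested launch,
    # instead of one launch per recursive call.
    order, frontier = [], [root]
    while frontier:
        order.extend(frontier)
        frontier = [c for node in frontier for c in children.get(node, [])]
    return order

tree = {"a": ["b", "c"], "b": ["d"]}
assert frontier_traversal(tree, "a") == ["a", "b", "c", "d"]
```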

Hancheng Wu - One of the best experts on this subject based on the ideXlab platform.

  • Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Hancheng Wu, Da Li, Michela Becchi
    Abstract:

    GPUs have been widely used to accelerate computations exhibiting simple patterns of Parallelism -- such as flat or two-level Parallelism -- and a degree of Parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of Parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested Parallelism at runtime. However, the effective use of DP must still be understood: a naïve use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naïve use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.

  • Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs
    Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    Graphics Processing Units (GPUs) have been used in general purpose computing for several years. The newly introduced Dynamic Parallelism feature of Nvidia's Kepler GPUs allows launching kernels from the GPU directly. However, the naive use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code transformation techniques for applications with irregular nested loops.

  • Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
    2015 44th International Conference on Parallel Processing, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    The effective deployment of applications exhibiting irregular nested Parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested Parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on Dynamic Parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested Parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested Parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.

Da Li - One of the best experts on this subject based on the ideXlab platform.

  • Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Hancheng Wu, Da Li, Michela Becchi
    Abstract:

    GPUs have been widely used to accelerate computations exhibiting simple patterns of Parallelism -- such as flat or two-level Parallelism -- and a degree of Parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of Parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested Parallelism at runtime. However, the effective use of DP must still be understood: a naïve use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naïve use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.

  • Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs
    Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    Graphics Processing Units (GPUs) have been used in general purpose computing for several years. The newly introduced Dynamic Parallelism feature of Nvidia's Kepler GPUs allows launching kernels from the GPU directly. However, the naive use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code transformation techniques for applications with irregular nested loops.

  • Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
    2015 44th International Conference on Parallel Processing, 2015
    Co-Authors: Da Li, Hancheng Wu, Michela Becchi
    Abstract:

    The effective deployment of applications exhibiting irregular nested Parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested Parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on Dynamic Parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested Parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested Parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.

Sudhakar Yalamanchili - One of the best experts on this subject based on the ideXlab platform.

  • LaPerm: Locality-Aware Scheduler for Dynamic Parallelism on GPUs
    International Symposium on Computer Architecture, 2016
    Co-Authors: Jin Wang, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili
    Abstract:

    Recent developments in GPU execution models and architectures have introduced Dynamic Parallelism to facilitate the execution of irregular applications where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during Dynamic nested kernel and thread block launches cannot be fully leveraged using the current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when Dynamic Parallelism is introduced, since they are oblivious to the hierarchical reference locality. We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of the child TBs, ii) bind them to the streaming multiprocessors (SMXs) occupied by their parent TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications executed on a cycle-level simulator employing Dynamic Parallelism demonstrate that LaPerm achieves an average 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
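
    The three scheduling decisions can be sketched as a toy scheduler in Python (the data layout and least-loaded tie-breaking are illustrative assumptions, not LaPerm's microarchitecture):

```python
def schedule(tbs, num_smx):
    # tbs: dicts with 'id', 'is_child', and 'parent_smx' (None for parent TBs).
    # i) child TBs are scheduled first, ii) children are bound to the SMX of
    # their parent, iii) other TBs go to the least-loaded SMX for balance.
    assignment, loads = {}, [0] * num_smx
    for tb in sorted(tbs, key=lambda t: not t["is_child"]):  # children first
        if tb["is_child"] and tb["parent_smx"] is not None:
            smx = tb["parent_smx"]         # exploit parent-child locality
        else:
            smx = loads.index(min(loads))  # workload balance
        assignment[tb["id"]] = smx
        loads[smx] += 1
    return assignment

tbs = [{"id": "p0", "is_child": False, "parent_smx": None},
       {"id": "c0", "is_child": True, "parent_smx": 1},
       {"id": "c1", "is_child": True, "parent_smx": 1}]
assert schedule(tbs, 2) == {"c0": 1, "c1": 1, "p0": 0}
```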

  • Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
    International Symposium on Computer Architecture, 2015
    Co-Authors: Jin Wang, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili
    Abstract:

    GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of effectively harnessing the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the bulk synchronous parallel model underlying the current GPU execution model by supporting Dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute Dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute Dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
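
    The contrast between launching a kernel and launching a thread block can be modeled in a few lines of Python (a host-side analogy with invented names, not the DTBL microarchitecture):

```python
class RunningKernel:
    """Models an already-running kernel that can accept extra thread blocks."""
    def __init__(self, block_fn):
        self.block_fn = block_fn
        self.pending = []

    def launch_block(self, args):
        # DTBL-style: enqueue a lightweight block onto this kernel instead of
        # paying the full cost of a nested device-side kernel launch.
        self.pending.append(args)

    def drain(self):
        # Execute all dynamically spawned blocks that accumulated so far.
        results = [self.block_fn(a) for a in self.pending]
        self.pending.clear()
        return results

k = RunningKernel(lambda x: x * 2)
k.launch_block(3)
k.launch_block(5)
assert k.drain() == [6, 10]
```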

  • Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications
    IEEE International Symposium on Workload Characterization, 2014
    Co-Authors: Jin Wang, Sudhakar Yalamanchili
    Abstract:

    In this study, we seek to characterize Dynamically formed Parallelism and evaluate implementations designed to exploit it using CUDA Dynamic Parallelism (CDP), an execution model where parallel workloads are launched Dynamically from within kernels when pockets of structured Parallelism are detected. We characterize and evaluate such implementations by analyzing their impact on control and memory behavior, as measured on commodity hardware. In particular, the study targets a comprehensive understanding of the overhead of current CDP support in GPUs in terms of kernel launch, memory footprint, and algorithmic overhead. Experiments show that while a CDP implementation can generate a potential 1.13x-2.73x speedup over non-CDP implementations, the non-trivial overhead results in an average overall slowdown of 1.21x.

Hannu Tenhunen - One of the best experts on this subject based on the ideXlab platform.

  • Architecture and Implementation of Dynamic Parallelism, Voltage, and Frequency Scaling (PVFS) on CGRAs
    ACM Journal on Emerging Technologies in Computing Systems, 2015
    Co-Authors: Syed M A H Jafri, Ahmed Hemani, Kolin Paul, Juha Plosila, Ozan Ozbag, Nasim Farahini, Hannu Tenhunen
    Abstract:

    In the era of platforms hosting multiple applications with arbitrary performance requirements, providing a worst-case platform-wide voltage/frequency operating point is neither optimal nor desirable. As a solution to this problem, designs commonly employ Dynamic voltage and frequency scaling (DVFS). DVFS promises significant energy and power reductions by providing each application with the operating point (and hence the performance) tailored to its needs. To further enhance the optimization potential, recent works interleave Dynamic Parallelism with conventional DVFS. The induced Parallelism results in performance gains that allow an application to lower its operating point even further (thereby saving energy and power). However, the existing works employ costly dedicated hardware (for synchronization) and rely solely on greedy algorithms to make Parallelism decisions. To efficiently integrate Parallelism with DVFS, compared to the state of the art, we exploit reconfiguration (to reduce DVFS synchronization overheads) and enhance the intelligence of the greedy algorithm (to make optimal Parallelism decisions). Specifically, our solution relies on Dynamically reconfigurable isolation cells and an autonomous Parallelism, voltage, and frequency selection algorithm. The Dynamically reconfigurable isolation cells reduce the area overheads of DVFS circuitry by configuring the existing resources to provide synchronization. The autonomous Parallelism, voltage, and frequency selection algorithm ensures high power efficiency by combining Parallelism with DVFS: it selects the Parallelism, voltage, and frequency trio that consumes minimum power while meeting the deadlines on available resources. Synthesis and simulation results using various applications/algorithms (WLAN, MPEG4, FFT, FIR, matrix multiplication) show that our solution promises significant reductions in area and power consumption (23% and 51%, respectively) compared to the state of the art.
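
    The selection step can be sketched as a search over candidate trios under a simple dynamic-power model (power proportional to V^2·f per active resource; the cost model, names, and numbers are illustrative assumptions, not the paper's algorithm):

```python
def select_pvf(work, deadline, configs, free_resources):
    # configs: (parallelism, voltage, frequency) trios. Pick the trio with
    # minimum modeled power that still meets the deadline on the resources
    # currently available.
    best, best_power = None, None
    for p, v, f in configs:
        if p > free_resources:
            continue                  # not enough resources to parallelize
        if work / (p * f) > deadline:
            continue                  # too slow at this operating point
        power = p * v * v * f         # dynamic power ~ V^2 * f per resource
        if best_power is None or power < best_power:
            best, best_power = (p, v, f), power
    return best

configs = [(1, 1.0, 2.0), (2, 0.8, 1.0), (4, 0.8, 1.0)]
# Parallelism lets the 2-wide version meet the deadline at a lower operating point.
assert select_pvf(4, 3, configs, free_resources=2) == (2, 0.8, 1.0)
```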

  • TransPar: Transformation-Based Dynamic Parallelism for Low-Power CGRAs
    Field-Programmable Logic and Applications, 2014
    Co-Authors: Syed M A H Jafri, Ahmed Hemani, Kolin Paul, Juha Plosila, Guillermo Serrano, Masoud Daneshtalab, Naeem Abbas, Hannu Tenhunen
    Abstract:

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime Parallelism to reduce energy consumption (by lowering voltage/frequency). To implement runtime Parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degrees of Parallelism) and select the optimal version at runtime. However, the compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation-based Dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series of transformations to generate the bitstream for the parallel version. In addition, it allows an application to be displaced and/or rotated so that it can be parallelized in resource-constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications) compared to state-of-the-art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory-based Parallelism techniques. Gate-level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalties.
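
    The core trick of storing one mapping and deriving parallel copies by transformation can be sketched in Python (coordinates and the transformation set are illustrative; TransPar operates on CGRA bitstreams):

```python
def displace(mapping, dx, dy):
    # Shift a stored placement to a free region of the fabric.
    return [(x + dx, y + dy) for x, y in mapping]

def parallelize(mapping, copies, stride):
    # Generate 'copies' parallel instances from the single stored
    # implementation, instead of storing one bitstream per degree
    # of parallelism.
    return [displace(mapping, i * stride, 0) for i in range(copies)]

base = [(0, 0), (1, 0)]
assert parallelize(base, 2, 3) == [[(0, 0), (1, 0)], [(3, 0), (4, 0)]]
```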

  • Energy-Aware Task Parallelism for Efficient Dynamic Voltage and Frequency Scaling in CGRAs
    International Conference on Embedded Computer Systems: Architectures Modeling and Simulation, 2013
    Co-Authors: Syed M A H Jafri, Muhammad Adeel Tajammul, Ahmed Hemani, Kolin Paul, Juha Plosila, Hannu Tenhunen
    Abstract:

    Today, coarse grained reconfigurable architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Each application is itself composed of multiple tasks, spatially mapped to different parts of the platform. Providing the worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, Dynamic voltage and frequency scaling (DVFS) is a frequently used technique. DVFS scales the voltage and/or frequency of the device based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining it with Dynamic Parallelism. The proposed methods exploit the speedup induced by Parallelism to allow more aggressive frequency and voltage scaling. However, these techniques employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available. It is therefore likely to parallelize a task even when this offers no speedup to the application, undermining the effectiveness of Parallelism. As a solution to this problem, we present energy-aware task Parallelism. Our solution relies on resource allocation graphs and an autonomous Parallelism, voltage, and frequency selection algorithm. Using the resource allocation graph as a guide, the algorithm parallelizes a task only if its parallel version reduces overall application execution time. Simulation results, using representative applications (MPEG4, WLAN), show that our solution promises better resource utilization compared to the greedy algorithm. Synthesis results (using WLAN) confirm significant reductions in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%) compared to the state of the art.
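
    The contrast with the greedy approach comes down to one predicate, namely parallelize only when the parallel version actually shortens execution time; a Python sketch (the speedup/overhead model and all names are illustrative assumptions):

```python
def informed_parallelize(tasks, speedup, overhead, free_pes):
    # tasks: {name: serial_time}. Unlike a greedy scheme, a task is only
    # parallelized if serial_time/speedup + overhead beats its serial time,
    # and only while free processing elements remain.
    chosen = []
    for name, t in sorted(tasks.items(), key=lambda kv: -kv[1]):
        if free_pes <= 0:
            break
        if t / speedup + overhead < t:  # parallel version must actually help
            chosen.append(name)
            free_pes -= 1
    return chosen

# With overhead 3, the short task (time 2) gains nothing from parallelism.
assert informed_parallelize({"fft": 10, "fir": 2},
                            speedup=2, overhead=3, free_pes=2) == ["fft"]
```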
