Warp Divergence

The Experts below are selected from a list of 363 Experts worldwide ranked by ideXlab platform

R Govindarajan - One of the best experts on this subject based on the ideXlab platform.

  • Taming Warp Divergence
    Symposium on Code Generation and Optimization, 2017
    Co-Authors: Jayvant Anantpur, R Govindarajan
    Abstract:

    Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming Multiprocessors (SMs), which consist of a few tens of SIMD cores. A kernel is launched on the GPU with an execution configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads, called Warps. For various reasons, such as differing amounts of work and memory access latencies, the Warps of a TB may finish kernel execution at different points in time, causing the faster Warps to wait for their slower sibling Warps. This reduces the utilization of SM resources and hence the performance of the GPU. We propose a simple and elegant technique to eliminate the waiting time of Warps at the end of kernel execution and improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual Warps, and enables Warps finishing earlier to execute the kernel again for another logical (user-specified) thread block, without waiting for their sibling Warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual Warps. Further, this technique enables us to design a Warp scheduling algorithm that is aware of the progress made by the virtual thread blocks and virtual Warps, and uses this knowledge to prioritise Warps effectively. Evaluation on a diverse set of kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean improvement of 1.06x over a baseline architecture that uses the Greedy-Then-Oldest (GTO) Warp scheduler and 1.09x over the Loose Round Robin (LRR) Warp scheduler.
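
    The persistent-threads idea summarised above can be sketched as a CUDA kernel in which each Warp claims logical work items from a global counter instead of being tied to one hardware thread block. This is only an illustrative reconstruction under stated assumptions: the counter nextLogicalWarp, the mapping of work items to virtual block and Warp indices, and the restriction to kernels without __syncthreads() are choices made for this example, not the authors' actual source-to-source transformation, and the progress-aware Warp scheduler from the paper is a hardware mechanism that cannot be shown in source code.

        // Persistent-threads sketch: physical Warps keep claiming logical
        // (virtual) Warp-sized work items until the whole grid's work is done,
        // so an early-finishing Warp never idles waiting for its siblings.
        // Assumptions for this example: the kernel body uses no __syncthreads(),
        // blockDim.x is a multiple of 32, and nextLogicalWarp is zeroed on the
        // host (e.g. via cudaMemcpyToSymbol) before the launch.
        __device__ unsigned int nextLogicalWarp;   // global work counter (assumed)

        __global__ void persistentKernel(const float *in, float *out,
                                         int logicalBlocks, int warpsPerBlock)
        {
            const int lane      = threadIdx.x & 31;
            const int totalWork = logicalBlocks * warpsPerBlock;

            for (;;) {
                // Lane 0 of each Warp claims the next logical Warp's worth of work.
                unsigned int work = 0;
                if (lane == 0) work = atomicAdd(&nextLogicalWarp, 1u);
                work = __shfl_sync(0xffffffffu, work, 0);   // broadcast to the Warp
                if (work >= (unsigned int)totalWork) break;

                int virtualBlock = work / warpsPerBlock;    // logical thread block id
                int virtualWarp  = work % warpsPerBlock;    // logical Warp within it
                int gid = (virtualBlock * warpsPerBlock + virtualWarp) * 32 + lane;

                // Original kernel body, rewritten in terms of the virtual ids.
                out[gid] = 2.0f * in[gid];
            }
        }

    A launch of this kernel would use only as many physical thread blocks as the SMs can keep resident; every physical Warp then keeps working for as long as logical work remains.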

Beomseok Nam - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Massive Parallelism for Indexing Multi-dimensional Datasets on the GPU
    IEEE Transactions on Parallel and Distributed Systems (DOI: 10.1109/TPDS.2014.2347041), 2016
    Co-Authors: Jinwoong Kim, Won-ki Jeong, Beomseok Nam
    Abstract:

    Inherently multi-dimensional n-ary indexing structures such as R-trees are not well suited for the GPU because of their irregular memory access patterns and recursive back-tracking function calls. It has been known that traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism and to maximize the utilization of GPU processing units. Moreover, recursive tree search algorithms often fail with large indexes because of the GPU's tiny runtime stack size. In this paper, we propose a novel parallel tree traversal algorithm, Massively Parallel Restart Scanning (MPRS), for multi-dimensional range queries that avoids recursion and irregular memory access. The proposed MPRS algorithm traverses hierarchical tree structures with mostly contiguous memory access patterns and without recursion, which offers more chances to optimize the parallel SIMD algorithm. We implemented the proposed MPRS range query processing algorithm on n-ary bounding volume hierarchies, including R-trees, and evaluated its performance using real scientific datasets on an NVIDIA Tesla M2090 GPU. Our experiments show that the braided-parallel MPRS range query algorithm achieves at least 80% SIMD efficiency, while a task-parallel tree traversal algorithm shows only 9-15% SIMD efficiency. Moreover, the braided-parallel MPRS algorithm accesses 7-20 times less global memory than a task-parallel parent-link algorithm, by virtue of minimal Warp Divergence.
    Index Terms: parallel multi-dimensional indexing; multi-dimensional range query; GPGPU.
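
    The non-recursive, restart-style traversal described above can be illustrated with a simple stackless range-query routine. This is a generic sketch of restart scanning rather than the authors' MPRS algorithm: the Node layout, the per-subtree maxLeafId field used to skip already-covered subtrees, and the one-query-per-thread mapping are assumptions made for this example.

        // Stackless "restart from the root" range query over an R-tree-like
        // structure stored in an array of nodes. Leaves are numbered left to
        // right; maxLeafId of a node is the largest leaf id in its subtree, so
        // a subtree whose maxLeafId is <= lastLeaf is already dealt with.
        struct Node {
            float minX, minY, maxX, maxY;  // minimum bounding rectangle (MBR)
            int   firstChild;              // index of first child, or -1 for a leaf
            int   childCount;              // number of children (0 for a leaf)
            int   maxLeafId;               // largest leaf id in this subtree
            int   payload;                 // object id stored in a leaf
        };

        __device__ bool overlaps(const Node &n, float qx0, float qy0,
                                 float qx1, float qy1)
        {
            return n.minX <= qx1 && n.maxX >= qx0 && n.minY <= qy1 && n.maxY >= qy0;
        }

        // Returns the number of matching leaves; their payloads go to `results`.
        __device__ int rangeQueryRestart(const Node *nodes, int root,
                                         float qx0, float qy0, float qx1, float qy1,
                                         int *results, int maxResults)
        {
            int found = 0;
            int lastLeaf = -1;                        // last leaf id accounted for
            for (;;) {
                int cur = root;
                while (nodes[cur].firstChild >= 0) {  // descend without a stack
                    int next = -1;
                    for (int c = 0; c < nodes[cur].childCount; ++c) {
                        int child = nodes[cur].firstChild + c;
                        if (nodes[child].maxLeafId <= lastLeaf) continue;  // covered
                        if (overlaps(nodes[child], qx0, qy0, qx1, qy1)) { next = child; break; }
                        lastLeaf = nodes[child].maxLeafId;  // prune this subtree
                    }
                    if (next < 0) break;              // nothing left below cur
                    cur = next;
                }
                if (nodes[cur].firstChild < 0) {      // reached a leaf
                    if (nodes[cur].maxLeafId > lastLeaf &&
                        overlaps(nodes[cur], qx0, qy0, qx1, qy1)) {
                        if (found < maxResults) results[found] = nodes[cur].payload;
                        ++found;
                    }
                    lastLeaf = nodes[cur].maxLeafId;
                    if (cur == root) break;           // single-leaf tree
                } else if (cur == root) {
                    break;                            // whole tree covered: done
                }                                     // otherwise restart from root
            }
            return found;
        }

    Because the routine never recurses, it sidesteps the tiny per-thread runtime stack mentioned in the abstract, and when leaves are laid out contiguously the left-to-right leaf visits keep memory accesses mostly contiguous.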

  • ICPP - Parallel Tree Traversal for Nearest Neighbor Query on the GPU
    2016 45th International Conference on Parallel Processing (ICPP), 2016
    Co-Authors: Moohyeon Nam, Jinwoong Kim, Beomseok Nam
    Abstract:

    The similarity search problem is found in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing, to name a few. Recently, several studies have been performed to accelerate k-nearest neighbor (kNN) queries using GPUs, but most of these works develop brute-force exhaustive scanning algorithms leveraging a large number of GPU cores, and none of the prior works employs GPUs for an n-ary tree-structured index. It is known that multi-dimensional hierarchical indexing trees such as R-trees are inherently not well suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism, since GPUs are tailored for deterministic memory accesses. In this work, we develop a data-parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU; this algorithm traverses a multi-dimensional tree-structured index while avoiding Warp Divergence problems. In order to take advantage of accessing contiguous memory blocks, the proposed PSB algorithm performs linear scanning of sibling leaf nodes, which increases the chance to optimize the parallel SIMD algorithm. We evaluate the performance of the PSB algorithm against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that the PSB algorithm is faster than the branch-and-bound algorithm by a large margin.
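
    The "linear scanning of sibling leaf nodes" that PSB relies on can be pictured with a Warp-cooperative scan of consecutive leaf entries. The sketch below handles only the nearest-neighbor case (k = 1) and only the leaf-level scan; the LeafEntry layout, the Warp-per-query mapping, and the helper names are assumptions made for this example, and the backtracking over internal nodes that gives PSB its name is not shown.

        // One Warp scans `count` consecutive leaf entries for the query point
        // (qx, qy). Adjacent lanes read adjacent entries (coalesced accesses)
        // and every lane follows the same control flow, so Warp Divergence is
        // minimal. Assumes full Warps (blockDim.x a multiple of 32).
        struct LeafEntry { float x, y; int id; };

        __device__ float sqDist(float ax, float ay, float bx, float by)
        {
            float dx = ax - bx, dy = ay - by;
            return dx * dx + dy * dy;
        }

        __device__ void warpNearestNeighbor(const LeafEntry *leaves, int count,
                                            float qx, float qy,
                                            int *bestId, float *bestDist)
        {
            const int lane = threadIdx.x & 31;
            float myBest = 3.4e38f;                   // "infinity"
            int   myId   = -1;

            for (int i = lane; i < count; i += 32) {
                float d = sqDist(qx, qy, leaves[i].x, leaves[i].y);
                if (d < myBest) { myBest = d; myId = leaves[i].id; }
            }
            // Warp-wide reduction of (distance, id) pairs down to lane 0.
            for (int offset = 16; offset > 0; offset >>= 1) {
                float otherD  = __shfl_down_sync(0xffffffffu, myBest, offset);
                int   otherId = __shfl_down_sync(0xffffffffu, myId,   offset);
                if (otherD < myBest) { myBest = otherD; myId = otherId; }
            }
            if (lane == 0) { *bestDist = myBest; *bestId = myId; }
        }

    A full kNN variant would keep the k current best candidates and use the k-th distance as the bound that decides which sibling leaves still need to be scanned.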

  • Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU
    IEEE Transactions on Parallel and Distributed Systems, 2015
    Co-Authors: Jinwoong Kim, Won-ki Jeong, Beomseok Nam
    Abstract:

    Inherently multi-dimensional n-ary indexing structures such as R-trees are not well suited for the GPU because of their irregular memory access patterns and recursive back-tracking function calls. It has been known that traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism and to maximize the utilization of GPU processing units. Moreover, recursive tree search algorithms often fail with large indexes because of the GPU's tiny runtime stack size. In this paper, we propose a novel parallel tree traversal algorithm, massively parallel restart scanning (MPRS), for multi-dimensional range queries that avoids recursion and irregular memory access. The proposed MPRS algorithm traverses hierarchical tree structures with mostly contiguous memory access patterns and without recursion, which offers more chances to optimize the parallel SIMD algorithm. We implemented the proposed MPRS range query processing algorithm on n-ary bounding volume hierarchies, including R-trees, and evaluated its performance using real scientific datasets on an NVIDIA Tesla M2090 GPU. Our experiments show that the braided-parallel, SIMD-friendly MPRS range query algorithm achieves at least 80 percent Warp execution efficiency, while a task-parallel tree traversal algorithm shows only 9-15 percent efficiency. Moreover, the braided-parallel MPRS algorithm accesses 7-20 times less global memory than a task-parallel parent-link algorithm, by virtue of minimal Warp Divergence.
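
    The "braided parallel" phrasing above refers to combining task parallelism across queries with data parallelism inside each query. A minimal sketch of that mapping follows: one query per Warp, with the 32 lanes testing up to 32 child MBRs of a node at once. The Box type, the ballot-based child mask, and the kernel signature are assumptions for this example, not details taken from the paper.

        // Braided-parallel mapping: each Warp owns one range query; within the
        // Warp, the lanes cooperatively test the children of a tree node, so
        // every lane executes the same instructions and the Warp stays converged.
        // Assumes blockDim.x is a multiple of 32 and at most 32 children per node.
        struct Box { float x0, y0, x1, y1; };

        __device__ bool boxOverlap(const Box &a, const Box &b)
        {
            return a.x0 <= b.x1 && a.x1 >= b.x0 && a.y0 <= b.y1 && a.y1 >= b.y0;
        }

        // Each lane tests one child; the ballot gives the whole Warp a bitmask
        // of overlapping children, so all lanes agree on what to visit next.
        __device__ unsigned int warpTestChildren(const Box *children, int childCount,
                                                 Box query)
        {
            const int lane = threadIdx.x & 31;
            bool hit = (lane < childCount) && boxOverlap(children[lane], query);
            return __ballot_sync(0xffffffffu, hit);
        }

        __global__ void braidedRootFilter(const Box *queries, int numQueries,
                                          const Box *rootChildren, int rootChildCount,
                                          unsigned int *hitMasks)
        {
            int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
            if (warpId >= numQueries) return;         // whole Warp exits together
            unsigned int mask = warpTestChildren(rootChildren, rootChildCount,
                                                 queries[warpId]);
            if ((threadIdx.x & 31) == 0) hitMasks[warpId] = mask;
        }

    The per-thread ("task parallel") alternative, in which each thread traverses the tree for its own query, is what drives Warp execution efficiency down to the 9-15 percent range reported above, since neighbouring threads quickly diverge onto different tree paths.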

Henk Corporaal - One of the best experts on this subject based on the ideXlab platform.

  • A detailed GPU cache model based on reuse distance theory
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014
    Co-Authors: Cedric Nugteren, Gert-jan Van Den Braak, Henk Corporaal, Henri E. Bal
    Abstract:

    As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, Warps, thread blocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) Warp Divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is both faster and more accurate than the GPGPU-Sim simulator.
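
    To make the modelling idea concrete, the core of reuse distance theory can be shown with a small host-side C++ routine over a single address stream. This is a toy sketch only: the paper's model additionally interleaves the address streams of threads, Warps and thread blocks and models latencies, associativity, miss-status holding registers and Warp Divergence, none of which appears here, and the example addresses and 128-byte line size are invented for illustration.

        // Reuse distance of an access = number of distinct cache lines touched
        // since the previous access to the same line (infinite on first use).
        #include <cstdio>
        #include <cstdint>
        #include <vector>

        static std::vector<long> reuseDistances(const std::vector<uint64_t> &lines)
        {
            std::vector<uint64_t> lru;     // most recently used line at the back
            std::vector<long> dist;
            dist.reserve(lines.size());
            for (uint64_t line : lines) {
                long d = -1;               // -1 = first use (infinite distance)
                for (size_t i = 0; i < lru.size(); ++i) {
                    if (lru[i] == line) {
                        d = (long)(lru.size() - 1 - i);  // distinct lines since last use
                        lru.erase(lru.begin() + i);
                        break;
                    }
                }
                lru.push_back(line);
                dist.push_back(d);
            }
            return dist;
        }

        // A fully associative LRU cache with `numLines` lines misses exactly on
        // accesses whose reuse distance is infinite or at least numLines.
        static double missRate(const std::vector<long> &dist, long numLines)
        {
            size_t misses = 0;
            for (long d : dist)
                if (d < 0 || d >= numLines) ++misses;
            return dist.empty() ? 0.0 : (double)misses / dist.size();
        }

        int main()
        {
            // Hypothetical byte addresses; 128-byte cache lines assumed.
            std::vector<uint64_t> addrs = {0, 4, 128, 256, 0, 512, 128, 640, 0};
            std::vector<uint64_t> lines;
            for (uint64_t a : addrs) lines.push_back(a / 128);
            std::vector<long> d = reuseDistances(lines);
            std::printf("miss rate with 4 lines: %.2f\n", missRate(d, 4));
            return 0;
        }

    Modelling set associativity, memory-level parallelism and the GPU-specific thread hierarchy on top of this basic histogram is what the rest of the paper adds.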

Cedric Nugteren - One of the best experts on this subject based on the ideXlab platform.

  • A detailed GPU cache model based on reuse distance theory
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014
    Co-Authors: Cedric Nugteren, Gert-jan Van Den Braak, Henk Corporaal, Henri E. Bal
    Abstract:

    As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, Warps, thread blocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) Warp Divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is both faster and more accurate than the GPGPU-Sim simulator.

Jayvant Anantpur - One of the best experts on this subject based on the ideXlab platform.

  • Taming Warp Divergence
    Symposium on Code Generation and Optimization, 2017
    Co-Authors: Jayvant Anantpur, R Govindarajan
    Abstract:

    Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming Multiprocessors (SMs), which consist of a few tens of SIMD cores. A kernel is launched on the GPU with an execution configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads, called Warps. For various reasons, such as differing amounts of work and memory access latencies, the Warps of a TB may finish kernel execution at different points in time, causing the faster Warps to wait for their slower sibling Warps. This reduces the utilization of SM resources and hence the performance of the GPU. We propose a simple and elegant technique to eliminate the waiting time of Warps at the end of kernel execution and improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual Warps, and enables Warps finishing earlier to execute the kernel again for another logical (user-specified) thread block, without waiting for their sibling Warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual Warps. Further, this technique enables us to design a Warp scheduling algorithm that is aware of the progress made by the virtual thread blocks and virtual Warps, and uses this knowledge to prioritise Warps effectively. Evaluation on a diverse set of kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean improvement of 1.06x over a baseline architecture that uses the Greedy-Then-Oldest (GTO) Warp scheduler and 1.09x over the Loose Round Robin (LRR) Warp scheduler.
