Warp Divergence

The Experts below are selected from a list of 363 Experts worldwide ranked by ideXlab platform

R Govindarajan - One of the best experts on this subject based on the ideXlab platform.

  • Taming Warp Divergence
    Symposium on Code Generation and Optimization, 2017
    Co-Authors: Jayvant Anantpur, R Govindarajan
    Abstract:

    Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming Multiprocessors (SMs), which consist of a few tens of SIMD cores. A kernel is launched on the GPU with an execution configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads, called Warps. For various reasons, such as differing amounts of work and memory access latencies, the Warps of a TB may finish kernel execution at different points in time, causing the faster Warps to wait for their slower sibling Warps. This reduces the utilization of SM resources and hence the performance of the GPU. We propose a simple and elegant technique to eliminate the waiting time of Warps at the end of kernel execution and improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual Warps, and enables Warps finishing earlier to execute the kernel again for another logical (user-specified) thread block, without waiting for their sibling Warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual Warps. Further, this technique enables us to design a Warp scheduling algorithm that is aware of the progress made by the virtual thread blocks and virtual Warps, and uses this knowledge to prioritise Warps effectively. Evaluation on a diverse set of kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean improvement of 1.06x over a baseline architecture that uses the Greedy-Then-Oldest (GTO) Warp scheduler and 1.09x over the Loose Round Robin (LRR) Warp scheduler.
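
    The persistent-threads idea summarised above can be sketched as a CUDA kernel in which each Warp claims logical work items from a global counter instead of being tied to one hardware thread block. This is only an illustrative reconstruction under stated assumptions: the counter nextLogicalWarp, the mapping of work items to virtual block and Warp indices, and the restriction to kernels without __syncthreads() are choices made for this example, not the authors' actual source-to-source transformation, and the progress-aware Warp scheduler from the paper is a hardware mechanism that cannot be shown in source code.

        // Persistent-threads sketch: physical Warps keep claiming logical
        // (virtual) Warp-sized work items until the whole grid's work is done,
        // so an early-finishing Warp never idles waiting for its siblings.
        // Assumptions for this example: the kernel body uses no __syncthreads(),
        // blockDim.x is a multiple of 32, and nextLogicalWarp is zeroed on the
        // host (e.g. via cudaMemcpyToSymbol) before the launch.
        __device__ unsigned int nextLogicalWarp;   // global work counter (assumed)

        __global__ void persistentKernel(const float *in, float *out,
                                         int logicalBlocks, int warpsPerBlock)
        {
            const int lane      = threadIdx.x & 31;
            const int totalWork = logicalBlocks * warpsPerBlock;

            for (;;) {
                // Lane 0 of each Warp claims the next logical Warp's worth of work.
                unsigned int work = 0;
                if (lane == 0) work = atomicAdd(&nextLogicalWarp, 1u);
                work = __shfl_sync(0xffffffffu, work, 0);   // broadcast to the Warp
                if (work >= (unsigned int)totalWork) break;

                int virtualBlock = work / warpsPerBlock;    // logical thread block id
                int virtualWarp  = work % warpsPerBlock;    // logical Warp within it
                int gid = (virtualBlock * warpsPerBlock + virtualWarp) * 32 + lane;

                // Original kernel body, rewritten in terms of the virtual ids.
                out[gid] = 2.0f * in[gid];
            }
        }

    A launch of this kernel would use only as many physical thread blocks as the SMs can keep resident; every physical Warp then keeps working for as long as logical work remains.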

Beomseok Nam - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Massive Parallelism for Indexing Multi-dimensional Datasets on the GPU
    IEEE Transactions on Parallel and Distributed Systems (DOI: 10.1109/TPDS.2014.2347041), 2016
    Co-Authors: Jinwoong Kim, Won-ki Jeong, Beomseok Nam
    Abstract:

    Inherently multi-dimensional n-ary indexing structures such as R-trees are not well suited for the GPU because of their irregular memory access patterns and recursive back-tracking function calls. It has been known that traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism and to maximize the utilization of GPU processing units. Moreover, recursive tree search algorithms often fail with large indexes because of the GPU's tiny runtime stack size. In this paper, we propose a novel parallel tree traversal algorithm, Massively Parallel Restart Scanning (MPRS), for multi-dimensional range queries that avoids recursion and irregular memory access. The proposed MPRS algorithm traverses hierarchical tree structures with mostly contiguous memory access patterns and without recursion, which offers more chances to optimize the parallel SIMD algorithm. We implemented the proposed MPRS range query processing algorithm on n-ary bounding volume hierarchies, including R-trees, and evaluated its performance using real scientific datasets on an NVIDIA Tesla M2090 GPU. Our experiments show that the braided-parallel MPRS range query algorithm achieves at least 80% SIMD efficiency, while a task-parallel tree traversal algorithm shows only 9-15% SIMD efficiency. Moreover, the braided-parallel MPRS algorithm accesses 7-20 times less global memory than a task-parallel parent-link algorithm, by virtue of minimal Warp Divergence.
    Index Terms: parallel multi-dimensional indexing; multi-dimensional range query; GPGPU.
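
    The non-recursive, restart-style traversal described above can be illustrated with a simple stackless range-query routine. This is a generic sketch of restart scanning rather than the authors' MPRS algorithm: the Node layout, the per-subtree maxLeafId field used to skip already-covered subtrees, and the one-query-per-thread mapping are assumptions made for this example.

        // Stackless "restart from the root" range query over an R-tree-like
        // structure stored in an array of nodes. Leaves are numbered left to
        // right; maxLeafId of a node is the largest leaf id in its subtree, so
        // a subtree whose maxLeafId is <= lastLeaf is already dealt with.
        struct Node {
            float minX, minY, maxX, maxY;  // minimum bounding rectangle (MBR)
            int   firstChild;              // index of first child, or -1 for a leaf
            int   childCount;              // number of children (0 for a leaf)
            int   maxLeafId;               // largest leaf id in this subtree
            int   payload;                 // object id stored in a leaf
        };

        __device__ bool overlaps(const Node &n, float qx0, float qy0,
                                 float qx1, float qy1)
        {
            return n.minX <= qx1 && n.maxX >= qx0 && n.minY <= qy1 && n.maxY >= qy0;
        }

        // Returns the number of matching leaves; their payloads go to `results`.
        __device__ int rangeQueryRestart(const Node *nodes, int root,
                                         float qx0, float qy0, float qx1, float qy1,
                                         int *results, int maxResults)
        {
            int found = 0;
            int lastLeaf = -1;                        // last leaf id accounted for
            for (;;) {
                int cur = root;
                while (nodes[cur].firstChild >= 0) {  // descend without a stack
                    int next = -1;
                    for (int c = 0; c < nodes[cur].childCount; ++c) {
                        int child = nodes[cur].firstChild + c;
                        if (nodes[child].maxLeafId <= lastLeaf) continue;  // covered
                        if (overlaps(nodes[child], qx0, qy0, qx1, qy1)) { next = child; break; }
                        lastLeaf = nodes[child].maxLeafId;  // prune this subtree
                    }
                    if (next < 0) break;              // nothing left below cur
                    cur = next;
                }
                if (nodes[cur].firstChild < 0) {      // reached a leaf
                    if (nodes[cur].maxLeafId > lastLeaf &&
                        overlaps(nodes[cur], qx0, qy0, qx1, qy1)) {
                        if (found < maxResults) results[found] = nodes[cur].payload;
                        ++found;
                    }
                    lastLeaf = nodes[cur].maxLeafId;
                    if (cur == root) break;           // single-leaf tree
                } else if (cur == root) {
                    break;                            // whole tree covered: done
                }                                     // otherwise restart from root
            }
            return found;
        }

    Because the routine never recurses, it sidesteps the tiny per-thread runtime stack mentioned in the abstract, and when leaves are laid out contiguously the left-to-right leaf visits keep memory accesses mostly contiguous.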

  • ICPP - Parallel Tree Traversal for Nearest Neighbor Query on the GPU
    2016 45th International Conference on Parallel Processing (ICPP), 2016
    Co-Authors: Moohyeon Nam, Jinwoong Kim, Beomseok Nam
    Abstract:

    The similarity search problem is found in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing, to name a few. Recently, several studies have been performed to accelerate k-nearest neighbor (kNN) queries using GPUs, but most of these works develop brute-force exhaustive scanning algorithms leveraging a large number of GPU cores, and none of the prior works employs GPUs for an n-ary tree-structured index. It is known that multi-dimensional hierarchical indexing trees such as R-trees are inherently not well suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism, since GPUs are tailored for deterministic memory accesses. In this work, we develop a data-parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU; this algorithm traverses a multi-dimensional tree-structured index while avoiding Warp Divergence problems. In order to take advantage of accessing contiguous memory blocks, the proposed PSB algorithm performs linear scanning of sibling leaf nodes, which increases the chance to optimize the parallel SIMD algorithm. We evaluate the performance of the PSB algorithm against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that the PSB algorithm is faster than the branch-and-bound algorithm by a large margin.
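
    The "linear scanning of sibling leaf nodes" that PSB relies on can be pictured with a Warp-cooperative scan of consecutive leaf entries. The sketch below handles only the nearest-neighbor case (k = 1) and only the leaf-level scan; the LeafEntry layout, the Warp-per-query mapping, and the helper names are assumptions made for this example, and the backtracking over internal nodes that gives PSB its name is not shown.

        // One Warp scans `count` consecutive leaf entries for the query point
        // (qx, qy). Adjacent lanes read adjacent entries (coalesced accesses)
        // and every lane follows the same control flow, so Warp Divergence is
        // minimal. Assumes full Warps (blockDim.x a multiple of 32).
        struct LeafEntry { float x, y; int id; };

        __device__ float sqDist(float ax, float ay, float bx, float by)
        {
            float dx = ax - bx, dy = ay - by;
            return dx * dx + dy * dy;
        }

        __device__ void warpNearestNeighbor(const LeafEntry *leaves, int count,
                                            float qx, float qy,
                                            int *bestId, float *bestDist)
        {
            const int lane = threadIdx.x & 31;
            float myBest = 3.4e38f;                   // "infinity"
            int   myId   = -1;

            for (int i = lane; i < count; i += 32) {
                float d = sqDist(qx, qy, leaves[i].x, leaves[i].y);
                if (d < myBest) { myBest = d; myId = leaves[i].id; }
            }
            // Warp-wide reduction of (distance, id) pairs down to lane 0.
            for (int offset = 16; offset > 0; offset >>= 1) {
                float otherD  = __shfl_down_sync(0xffffffffu, myBest, offset);
                int   otherId = __shfl_down_sync(0xffffffffu, myId,   offset);
                if (otherD < myBest) { myBest = otherD; myId = otherId; }
            }
            if (lane == 0) { *bestDist = myBest; *bestId = myId; }
        }

    A full kNN variant would keep the k current best candidates and use the k-th distance as the bound that decides which sibling leaves still need to be scanned.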

  • Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU
    IEEE Transactions on Parallel and Distributed Systems, 2015
    Co-Authors: Jinwoong Kim, Won-ki Jeong, Beomseok Nam
    Abstract:

    Inherently multi-dimensional n-ary indexing structures such as R-trees are not well suited for the GPU because of their irregular memory access patterns and recursive back-tracking function calls. It has been known that traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism and to maximize the utilization of GPU processing units. Moreover, recursive tree search algorithms often fail with large indexes because of the GPU's tiny runtime stack size. In this paper, we propose a novel parallel tree traversal algorithm, massively parallel restart scanning (MPRS), for multi-dimensional range queries that avoids recursion and irregular memory access. The proposed MPRS algorithm traverses hierarchical tree structures with mostly contiguous memory access patterns and without recursion, which offers more chances to optimize the parallel SIMD algorithm. We implemented the proposed MPRS range query processing algorithm on n-ary bounding volume hierarchies, including R-trees, and evaluated its performance using real scientific datasets on an NVIDIA Tesla M2090 GPU. Our experiments show that the braided-parallel, SIMD-friendly MPRS range query algorithm achieves at least 80 percent Warp execution efficiency, while a task-parallel tree traversal algorithm shows only 9-15 percent efficiency. Moreover, the braided-parallel MPRS algorithm accesses 7-20 times less global memory than a task-parallel parent-link algorithm, by virtue of minimal Warp Divergence.
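
    The "braided parallel" phrasing above refers to combining task parallelism across queries with data parallelism inside each query. A minimal sketch of that mapping follows: one query per Warp, with the 32 lanes testing up to 32 child MBRs of a node at once. The Box type, the ballot-based child mask, and the kernel signature are assumptions for this example, not details taken from the paper.

        // Braided-parallel mapping: each Warp owns one range query; within the
        // Warp, the lanes cooperatively test the children of a tree node, so
        // every lane executes the same instructions and the Warp stays converged.
        // Assumes blockDim.x is a multiple of 32 and at most 32 children per node.
        struct Box { float x0, y0, x1, y1; };

        __device__ bool boxOverlap(const Box &a, const Box &b)
        {
            return a.x0 <= b.x1 && a.x1 >= b.x0 && a.y0 <= b.y1 && a.y1 >= b.y0;
        }

        // Each lane tests one child; the ballot gives the whole Warp a bitmask
        // of overlapping children, so all lanes agree on what to visit next.
        __device__ unsigned int warpTestChildren(const Box *children, int childCount,
                                                 Box query)
        {
            const int lane = threadIdx.x & 31;
            bool hit = (lane < childCount) && boxOverlap(children[lane], query);
            return __ballot_sync(0xffffffffu, hit);
        }

        __global__ void braidedRootFilter(const Box *queries, int numQueries,
                                          const Box *rootChildren, int rootChildCount,
                                          unsigned int *hitMasks)
        {
            int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
            if (warpId >= numQueries) return;         // whole Warp exits together
            unsigned int mask = warpTestChildren(rootChildren, rootChildCount,
                                                 queries[warpId]);
            if ((threadIdx.x & 31) == 0) hitMasks[warpId] = mask;
        }

    The per-thread ("task parallel") alternative, in which each thread traverses the tree for its own query, is what drives Warp execution efficiency down to the 9-15 percent range reported above, since neighbouring threads quickly diverge onto different tree paths.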

Henk Corporaal - One of the best experts on this subject based on the ideXlab platform.

  • A detailed GPU cache model based on reuse distance theory
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014
    Co-Authors: Cedric Nugteren, Gert-jan Van Den Braak, Henk Corporaal, Henri E. Bal
    Abstract:

    As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, Warps, thread blocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) Warp Divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is both faster and more accurate than the GPGPU-Sim simulator.
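
    To make the modelling idea concrete, the core of reuse distance theory can be shown with a small host-side C++ routine over a single address stream. This is a toy sketch only: the paper's model additionally interleaves the address streams of threads, Warps and thread blocks and models latencies, associativity, miss-status holding registers and Warp Divergence, none of which appears here, and the example addresses and 128-byte line size are invented for illustration.

        // Reuse distance of an access = number of distinct cache lines touched
        // since the previous access to the same line (infinite on first use).
        #include <cstdio>
        #include <cstdint>
        #include <vector>

        static std::vector<long> reuseDistances(const std::vector<uint64_t> &lines)
        {
            std::vector<uint64_t> lru;     // most recently used line at the back
            std::vector<long> dist;
            dist.reserve(lines.size());
            for (uint64_t line : lines) {
                long d = -1;               // -1 = first use (infinite distance)
                for (size_t i = 0; i < lru.size(); ++i) {
                    if (lru[i] == line) {
                        d = (long)(lru.size() - 1 - i);  // distinct lines since last use
                        lru.erase(lru.begin() + i);
                        break;
                    }
                }
                lru.push_back(line);
                dist.push_back(d);
            }
            return dist;
        }

        // A fully associative LRU cache with `numLines` lines misses exactly on
        // accesses whose reuse distance is infinite or at least numLines.
        static double missRate(const std::vector<long> &dist, long numLines)
        {
            size_t misses = 0;
            for (long d : dist)
                if (d < 0 || d >= numLines) ++misses;
            return dist.empty() ? 0.0 : (double)misses / dist.size();
        }

        int main()
        {
            // Hypothetical byte addresses; 128-byte cache lines assumed.
            std::vector<uint64_t> addrs = {0, 4, 128, 256, 0, 512, 128, 640, 0};
            std::vector<uint64_t> lines;
            for (uint64_t a : addrs) lines.push_back(a / 128);
            std::vector<long> d = reuseDistances(lines);
            std::printf("miss rate with 4 lines: %.2f\n", missRate(d, 4));
            return 0;
        }

    Modelling set associativity, memory-level parallelism and the GPU-specific thread hierarchy on top of this basic histogram is what the rest of the paper adds.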

Cedric Nugteren - One of the best experts on this subject based on the ideXlab platform.

  • A detailed GPU cache model based on reuse distance theory
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014
    Co-Authors: Cedric Nugteren, Gert-jan Van Den Braak, Henk Corporaal, Henri E. Bal
    Abstract:

    As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, Warps, thread blocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) Warp Divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is both faster and more accurate than the GPGPU-Sim simulator.

Jayvant Anantpur - One of the best experts on this subject based on the ideXlab platform.

  • Taming Warp Divergence
    Symposium on Code Generation and Optimization, 2017
    Co-Authors: Jayvant Anantpur, R Govindarajan
    Abstract:

    Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming Multiprocessors (SMs), which consist of a few tens of SIMD cores. A kernel is launched on the GPU with an execution configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads, called Warps. For various reasons, such as differing amounts of work and memory access latencies, the Warps of a TB may finish kernel execution at different points in time, causing the faster Warps to wait for their slower sibling Warps. This reduces the utilization of SM resources and hence the performance of the GPU. We propose a simple and elegant technique to eliminate the waiting time of Warps at the end of kernel execution and improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual Warps, and enables Warps finishing earlier to execute the kernel again for another logical (user-specified) thread block, without waiting for their sibling Warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual Warps. Further, this technique enables us to design a Warp scheduling algorithm that is aware of the progress made by the virtual thread blocks and virtual Warps, and uses this knowledge to prioritise Warps effectively. Evaluation on a diverse set of kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean improvement of 1.06x over a baseline architecture that uses the Greedy-Then-Oldest (GTO) Warp scheduler and 1.09x over the Loose Round Robin (LRR) Warp scheduler.
