Thread-Level Parallelism

The experts below are selected from a list of 2,793 experts worldwide, ranked by the ideXlab platform.

Hazrina Hassan - One of the best experts on this subject based on the ideXlab platform.

  • Thread-Level Parallelism & Shared-Memory Pool Techniques for Authorization of Credit Card System
    International Symposium on Communications and Information Technologies, 2008
    Co-Authors: Mohd Hairul Nizam Md Nasir, Siti Hafizah Ab. Hamid, Hazrina Hassan
    Abstract:

    Credit cards are now used by millions of people around the world as a form of payment. This paper presents an architectural framework and prototype of a credit card authorization system that uses thread-level parallelism and shared-memory pool techniques to support dynamic tuning of the thread pool size at runtime. Current credit card authorization systems are typically single-threaded, so the authentication process takes longer to respond and can handle only a limited number of simultaneous transactions; as a result, the performance of the authorization system suffers during peak hours. With the thread-level parallelism technique, commonly known as multi-threading, each worker thread is assigned several child threads that perform online fraud validation concurrently, depending on the number of cryptographic elements present in the transaction message, while the worker thread itself performs card restriction validation based on the card information stored in the card's shared memory pool.
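
To make the worker/child-thread pattern described above concrete, here is a minimal host-side sketch of a thread pool whose size can be tuned at runtime. It is an illustration of the general technique, not the authors' implementation; `Transaction`, `validate_fraud_element`, and `validate_card_restrictions` are hypothetical stand-ins.

```cuda
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Transaction {
    std::string card_id;
    std::vector<std::string> crypto_elements;  // one child thread per element
};

// Hypothetical stubs for the two validation steps named in the abstract.
void validate_fraud_element(const std::string& /*elem*/) {}
void validate_card_restrictions(const std::string& /*card_id*/) {}

class AuthorizationPool {
public:
    explicit AuthorizationPool(std::size_t n) { resize(n); }

    // Dynamic tuning: grow the pool at runtime (shrinking omitted for brevity).
    void resize(std::size_t n) {
        std::lock_guard<std::mutex> lk(m_);
        while (workers_.size() < n)
            workers_.emplace_back([this] { worker_loop(); });
    }

    void submit(Transaction t) {
        {
            std::lock_guard<std::mutex> lk(m_);
            queue_.push(std::move(t));
        }
        cv_.notify_one();
    }

    ~AuthorizationPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

private:
    void worker_loop() {
        for (;;) {
            Transaction t;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                t = std::move(queue_.front());
                queue_.pop();
            }
            // Child threads validate each cryptographic element concurrently...
            std::vector<std::thread> children;
            for (const auto& elem : t.crypto_elements)
                children.emplace_back([&elem] { validate_fraud_element(elem); });
            // ...while the worker itself checks card restrictions against the
            // shared card-data pool.
            validate_card_restrictions(t.card_id);
            for (auto& c : children) c.join();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Transaction> queue_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```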

Arun Kejariwal - One of the best experts on this subject based on the ideXlab platform.

  • Parallelization spectroscopy: analysis of thread-level parallelism in HPC programs
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009
    Co-Authors: Arun Kejariwal, Calin Cascaval
    Abstract:

    In this paper, we present a method, parallelization spectroscopy, for analyzing the thread-level parallelism available in production High Performance Computing (HPC) codes. We survey a number of techniques that are commonly used for parallelization and classify all the loops in the presented case study using a sensitivity metric: how likely a particular technique is to succeed in parallelizing the loop.
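
As a rough illustration of the sensitivity metric (my reading of the abstract, not the authors' tooling), each loop could carry a per-technique score estimating how likely that technique is to parallelize it, and the classification step simply picks the most promising technique:

```cuda
#include <map>
#include <string>

// Hypothetical technique set; the paper surveys its own list.
enum class Technique { Doall, Reduction, Privatization, Speculation };

struct LoopProfile {
    std::string loop_id;
    // Sensitivity: estimated probability that a technique parallelizes the loop.
    std::map<Technique, double> sensitivity;
};

// Classify a loop by the technique most likely to succeed.
Technique classify(const LoopProfile& lp) {
    Technique best = Technique::Doall;
    double best_score = -1.0;
    for (const auto& kv : lp.sensitivity)
        if (kv.second > best_score) { best_score = kv.second; best = kv.first; }
    return best;
}
```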

  • On the evaluation and extraction of thread-level parallelism in ordinary programs
    2008
    Co-Authors: Alexandru Nicolau, Arun Kejariwal
    Abstract:

    The need for high performance, coupled with the increasing design complexity of modern processors and with power and thermal constraints, has led to the development of multi-core systems; examples include IBM/Toshiba's Cell processor and Intel's Core 2 Duo processor. One way to exploit the hardware parallelism of such systems is thread-level program parallelization. Although a large amount of work has been done in the context of multithreading, the lack of detailed application characterization on real machines makes it difficult to assess the relevance and importance of the problems addressed in prior work and the practicality of the solutions proposed. To alleviate this limitation, we performed a thorough analysis of ordinary programs, as represented by the industry-standard SPEC benchmarks, on both IA-32 and IA-64 architectures to identify real performance bottlenecks. Based on the above, and given that loops account for a large percentage of the total execution time in ordinary programs, we propose techniques for extracting thread-level parallelism (TLP) from both DOALL and non-DOALL types of loops. Extraction of TLP from DOALL loops entails efficient partitioning and mapping of a DOALL loop so as to achieve load balance between the different processors. In this regard, we present a general approach for partitioning nested DOALL loops, both perfect and non-perfect, with conditionals, and with rectangular and non-rectangular iteration geometries, where the expressions in a conditional are affine functions of the outer loop indices. Non-DOALL loops can be parallelized either speculatively (TLS) or via explicit synchronization. Although TLS enables parallel execution of program regions that are difficult to analyze at compile time, its efficacy is limited by a wide variety of factors, such as the high misspeculation penalty and the need for additional hardware. This necessitates an evaluation of the performance potential of TLS. Using the Intel Fortran/C++ compiler, we show that the speedup achievable via TLS at the loop level is minimal in ordinary programs. Therefore, we adopted explicit synchronization as the way to parallelize non-DOALL loops and propose lightweight lock-free synchronization techniques for extracting TLP from them. We show that the proposed techniques achieve better performance than the state of the art on real machines.
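
As a sketch of the two loop classes discussed above (an illustration of the general techniques, not the dissertation's algorithms), the first function below partitions a DOALL loop into independent chunks, and the second parallelizes a non-DOALL loop with a cross-iteration dependence using lock-free post/wait flags instead of locks:

```cuda
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// DOALL: iterations are independent, so each thread takes a contiguous chunk.
void doall(std::vector<double>& a, unsigned nthreads) {
    const std::size_t n = a.size();
    const std::size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::thread> ts;
    for (unsigned t = 0; t < nthreads; ++t)
        ts.emplace_back([&, t] {
            const std::size_t hi = std::min(n, (t + 1) * chunk);
            for (std::size_t i = t * chunk; i < hi; ++i)
                a[i] *= 2.0;  // independent loop body
        });
    for (auto& th : ts) th.join();
}

// Non-DOALL: a[i] depends on a[i-1]. Iterations are mapped cyclically, and
// each thread "posts" a lock-free flag when its iteration completes; the
// consumer of the dependence spins on that flag instead of taking a lock.
void doacross(std::vector<double>& a, unsigned nthreads) {
    const std::size_t n = a.size();
    std::vector<std::atomic<bool>> done(n);
    for (auto& f : done) f.store(false, std::memory_order_relaxed);
    std::vector<std::thread> ts;
    for (unsigned t = 0; t < nthreads; ++t)
        ts.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads) {
                if (i > 0) {  // wait for the producing iteration
                    while (!done[i - 1].load(std::memory_order_acquire)) {}
                    a[i] += a[i - 1];  // the serial dependence
                }
                done[i].store(true, std::memory_order_release);
            }
        });
    for (auto& th : ts) th.join();
}
```

The acquire/release pair makes the flag a one-way synchronization: no mutex is acquired on the critical path, which is the "lightweight lock-free" flavor of synchronization the abstract refers to.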

  • On the performance potential of different types of speculative thread-level parallelism: The DL version of this paper includes corrections that were not made available in the printed proceedings
    International Conference on Supercomputing, 2006
    Co-Authors: Arun Kejariwal, Xinmin Tian, Wei Li, Milind Girkar, Sergey Kozhukhov, Hideki Saito, Utpal Banerjee, Alexandru Nicolau, Alexander V Veidenbaum, Constantine D Polychronopoulos
    Abstract:

    Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS helps to achieve higher levels of parallelism, the unique performance potential of TLS, i.e., the performance gain that can be achieved only through speculation, has not received much attention. In this paper, we evaluate this aspect by separating the speedup achievable via true TLP (thread-level parallelism) from that achievable via TLS for the SPEC CPU2000 benchmarks. Further, we dissect the performance potential of each type of speculation: control speculation, data dependence speculation, and data value speculation. To the best of our knowledge, this is the first dissection study of its kind. Assuming an oracle TLS mechanism, which corresponds to perfect speculation and zero threading overhead, whereby the execution time of a candidate program region (for speculative execution) can be reduced to zero, our study shows that, at the loop level, the upper bound on the speedup achievable via TLS across SPEC CPU2000 is 39.16% (standard deviation = 31.23) as an arithmetic mean and 18.18% as a geometric mean.
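
The oracle model lends itself to a small worked example (my reconstruction of the model, not the paper's data or code): if the speculative region covers a fraction f of a benchmark's execution time and the oracle reduces it to zero, the speedup is 1/(1-f), and per-benchmark gains can then be aggregated by arithmetic and geometric means:

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical coverage fractions of speculative regions; the paper's
    // real inputs come from SPEC CPU2000 measurements.
    const std::vector<double> coverage = {0.10, 0.25, 0.40, 0.05};
    double sum_pct = 0.0, sum_log = 0.0;
    for (double f : coverage) {
        const double gain_pct = (1.0 / (1.0 - f) - 1.0) * 100.0;  // % speedup
        sum_pct += gain_pct;
        sum_log += std::log1p(gain_pct / 100.0);
    }
    const double n = static_cast<double>(coverage.size());
    std::printf("arithmetic mean speedup: %.2f%%\n", sum_pct / n);
    std::printf("geometric mean speedup:  %.2f%%\n",
                (std::exp(sum_log / n) - 1.0) * 100.0);
    return 0;
}
```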

David Gregg - One of the best experts on this subject based on the ideXlab platform.

D.l. Heine - One of the best experts on this subject based on the ideXlab platform.

  • In search of speculative thread-level parallelism
    2003
    Co-Authors: J.t. Oplinger, D.l. Heine, M.s. Lam
    Abstract:

    This paper focuses on the problem of how to find and effectively exploit speculative thread-level parallelism. Our studies show that speculating only on loops does not yield sufficient parallelism. We propose the use of speculative procedure execution as a means to increase the available parallelism. An additional technique, data value prediction, has the potential to greatly improve the performance of speculative execution. In particular, return value prediction improves the success of procedural speculation, and stride value prediction improves the success of loop speculation.

  • In search of speculative thread-level parallelism
    International Conference on Parallel Architectures and Compilation Techniques, 1999
    Co-Authors: J.t. Oplinger, D.l. Heine
    Abstract:

    This paper focuses on the problem of how to find and effectively exploit speculative thread-level parallelism. Our studies show that speculating only on loops does not yield sufficient parallelism. We propose the use of speculative procedure execution as a means to increase the available parallelism. An additional technique, data value prediction, has the potential to greatly improve the performance of speculative execution. In particular, return value prediction improves the success of procedural speculation, and stride value prediction improves the success of loop speculation.
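
The stride value predictor mentioned in both versions of this abstract is a classic mechanism; a minimal software model of it (an illustration, not the authors' hardware design) predicts the next value as the last value plus the last observed stride:

```cuda
#include <cstdint>
#include <unordered_map>

struct StrideEntry {
    int64_t last = 0;    // last value observed for this instruction
    int64_t stride = 0;  // difference between the last two values
    bool seen = false;
};

class StridePredictor {
public:
    // Prediction for the next dynamic instance of the instruction at `pc`
    // (e.g., a loop induction update or a procedure's return value).
    int64_t predict(uint64_t pc) {
        auto& e = table_[pc];
        return e.last + e.stride;
    }

    // Train the predictor with the value that actually occurred.
    void update(uint64_t pc, int64_t actual) {
        auto& e = table_[pc];
        if (e.seen) e.stride = actual - e.last;  // learn the stride
        e.last = actual;
        e.seen = true;
    }

private:
    std::unordered_map<uint64_t, StrideEntry> table_;  // indexed by PC
};
```

For loop speculation, a correct stride prediction lets a speculative thread start the next iteration with the predicted induction value instead of waiting for the previous iteration to produce it.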

Huiyang Zhou - One of the best experts on this subject based on the ideXlab platform.

  • GPU performance vs. thread-level parallelism: scalability analysis and a novel way to improve TLP
    ACM Transactions on Architecture and Code Optimization, 2018
    Co-Authors: Michael Mantor, Huiyang Zhou
    Abstract:

    Graphics Processing Units (GPUs) leverage massive thread-level parallelism (TLP) to achieve high computation throughput and hide long memory latency. However, recent studies have shown that GPU performance does not scale with GPU occupancy or the degree of TLP that a GPU supports, especially for memory-intensive workloads. The current understanding attributes this to L1 D-cache contention or off-chip memory bandwidth. In this article, we perform a novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between the L1 D-caches and the L2 partitions. We show that the interconnect bandwidth is a critical bound on GPU performance scalability. For applications that do not saturate the throughput of any particular resource, performance scales well with increased TLP. To improve TLP for such applications efficiently, we propose a fast context switching approach. When a warp/thread block (TB) is stalled by a long-latency operation, the context of the warp/TB is spilled to spare on-chip resources so that a new warp/TB can be launched. The switched-out warp/TB is switched back in when another warp/TB completes or is switched out. With this fine-grained fast context switching, higher TLP can be supported without increasing the sizes of critical resources such as the register file. Our experiments show that performance can be improved by up to 47%, and by a geometric mean of 22%, for a set of applications with unsaturated throughput utilization. Compared to the state-of-the-art TLP improvement scheme, our proposed scheme achieves 12% higher performance on average and 16% higher for the unsaturated benchmarks.
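
A schematic of the switching policy as I read it from the abstract (an illustration of the scheduling decisions only, not the paper's hardware design): stalled warp/TB contexts are spilled to spare on-chip storage so fresh ones can launch, and are restored when another warp/TB completes or is switched out.

```cuda
#include <deque>

struct TBContext {
    int id;  // plus registers, PC, barrier state, etc. in real hardware
};

class FastSwitchScheduler {
public:
    // A running TB stalls on a long-latency operation: spill its context
    // to spare on-chip resources and raise TLP by launching a fresh TB.
    void on_stall(const TBContext& ctx) {
        spilled_.push_back(ctx);
        launch_new_tb();
    }

    // A TB completes (or is switched out): bring a spilled context back.
    void on_complete() {
        if (!spilled_.empty()) {
            TBContext back = spilled_.front();
            spilled_.pop_front();
            resume(back);  // restore registers and continue
        } else {
            launch_new_tb();
        }
    }

private:
    void launch_new_tb() { /* dispatch a not-yet-started TB, if any */ }
    void resume(const TBContext& /*ctx*/) { /* restore spilled state */ }

    std::deque<TBContext> spilled_;  // stalled contexts parked on-chip
};
```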

  • CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications
    Journal of Computer Science and Technology, 2015
    Co-Authors: Yi Yang, Chao Li, Huiyang Zhou
    Abstract:

    Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops and highlight that these benchmarks do not have a very high loop count or degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
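
A hand-written CUDA sketch of the transformation described above (the real framework generates such code from pragmas; the team size NP and the loop body here are hypothetical): the kernel launches the maximum thread count up front, lets one master lane per logical thread run the sequential section, then activates all NP lanes for the nested parallel loop and its reduction.

```cuda
#define NP 32  // lanes reserved per logical thread; a full warp here so
               // warp shuffles can broadcast and reduce across the team

__global__ void kernel_np(const float* in, float* out, int n) {
    // Each group of NP consecutive threads forms one "logical" thread.
    const int logical = (blockIdx.x * blockDim.x + threadIdx.x) / NP;
    const int lane = threadIdx.x % NP;
    if (logical >= n) return;  // whole warps exit together when NP == 32

    // Sequential section: only the master lane executes it.
    float x = 0.0f;
    if (lane == 0) x = in[logical] * 2.0f;

    // Broadcast the sequential result to the team (warp-sized teams only;
    // larger teams would stage through shared memory instead).
    x = __shfl_sync(0xffffffffu, x, 0);

    // Nested parallel loop: all NP lanes active, iterations distributed
    // cyclically across the team. The body is a stand-in.
    float acc = 0.0f;
    for (int i = lane; i < 1024; i += NP)
        acc += x * 0.001f * static_cast<float>(i);

    // Reduction primitive across the team.
    for (int off = NP / 2; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);

    if (lane == 0) out[logical] = acc;
}

// Launch with NP physical threads per logical thread, e.g.:
//   kernel_np<<<(n * NP + 255) / 256, 256>>>(d_in, d_out, n);
```

Because every thread exists from kernel launch, switching between the sequential section and the nested loop is just control flow, which avoids the kernel-launch and global-memory-communication overheads of dynamic parallelism that the abstract criticizes.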

  • CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014
    Co-Authors: Yi Yang, Huiyang Zhou
    Abstract:

    Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops and highlight that these benchmarks do not have a very high loop count or degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.18 times on average.
