Thread-Level Parallelism

The experts below are selected from a list of 2,793 experts worldwide, ranked by the ideXlab platform.

Hazrina Hassan - One of the best experts on this subject based on the ideXlab platform.

  • Thread-Level Parallelism & Shared-Memory Pool Techniques for Authorization of Credit Card System
    International Symposium on Communications and Information Technologies, 2008
    Co-Authors: Mohd Hairul Nizam Md Nasir, Siti Hafizah Ab. Hamid, Hazrina Hassan
    Abstract:

    Credit cards are now used by millions of people around the world as a form of payment. This paper presents an architectural framework and prototype of a credit card authorization system that uses thread-level parallelism and shared-memory pool techniques to support dynamic tuning of the thread pool size at runtime. Current credit card authorization systems are typically single-threaded, so the authentication process takes longer to respond and can handle only a limited number of simultaneous transactions; as a result, the performance of the authorization system suffers during peak hours. With the thread-level parallelism technique, commonly known as multi-threading, each worker thread is assigned several child threads that perform online fraud validation concurrently, depending on the number of cryptographic elements present in the transaction message, while the worker thread itself performs card restriction validation based on the card information stored in the card's shared memory pool.
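
To make the worker/child-thread pattern described above concrete, here is a minimal host-side sketch of a thread pool whose size can be tuned at runtime. It is an illustration of the general technique, not the authors' implementation; `Transaction`, `validate_fraud_element`, and `validate_card_restrictions` are hypothetical stand-ins.

```cuda
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Transaction {
    std::string card_id;
    std::vector<std::string> crypto_elements;  // one child thread per element
};

// Hypothetical stubs for the two validation steps named in the abstract.
void validate_fraud_element(const std::string& /*elem*/) {}
void validate_card_restrictions(const std::string& /*card_id*/) {}

class AuthorizationPool {
public:
    explicit AuthorizationPool(std::size_t n) { resize(n); }

    // Dynamic tuning: grow the pool at runtime (shrinking omitted for brevity).
    void resize(std::size_t n) {
        std::lock_guard<std::mutex> lk(m_);
        while (workers_.size() < n)
            workers_.emplace_back([this] { worker_loop(); });
    }

    void submit(Transaction t) {
        {
            std::lock_guard<std::mutex> lk(m_);
            queue_.push(std::move(t));
        }
        cv_.notify_one();
    }

    ~AuthorizationPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

private:
    void worker_loop() {
        for (;;) {
            Transaction t;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                t = std::move(queue_.front());
                queue_.pop();
            }
            // Child threads validate each cryptographic element concurrently...
            std::vector<std::thread> children;
            for (const auto& elem : t.crypto_elements)
                children.emplace_back([&elem] { validate_fraud_element(elem); });
            // ...while the worker itself checks card restrictions against the
            // shared card-data pool.
            validate_card_restrictions(t.card_id);
            for (auto& c : children) c.join();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Transaction> queue_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```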

Arun Kejariwal - One of the best experts on this subject based on the ideXlab platform.

  • Parallelization spectroscopy: analysis of thread-level parallelism in HPC programs
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009
    Co-Authors: Arun Kejariwal, Calin Cascaval
    Abstract:

    In this paper, we present a method, parallelization spectroscopy, for analyzing the thread-level parallelism available in production High Performance Computing (HPC) codes. We survey a number of techniques that are commonly used for parallelization and classify all the loops in the presented case study using a sensitivity metric: how likely a particular technique is to succeed in parallelizing the loop.
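
As a rough illustration of the sensitivity metric (my reading of the abstract, not the authors' tooling), each loop could carry a per-technique score estimating how likely that technique is to parallelize it, and the classification step simply picks the most promising technique:

```cuda
#include <map>
#include <string>

// Hypothetical technique set; the paper surveys its own list.
enum class Technique { Doall, Reduction, Privatization, Speculation };

struct LoopProfile {
    std::string loop_id;
    // Sensitivity: estimated probability that a technique parallelizes the loop.
    std::map<Technique, double> sensitivity;
};

// Classify a loop by the technique most likely to succeed.
Technique classify(const LoopProfile& lp) {
    Technique best = Technique::Doall;
    double best_score = -1.0;
    for (const auto& kv : lp.sensitivity)
        if (kv.second > best_score) { best_score = kv.second; best = kv.first; }
    return best;
}
```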

  • On the evaluation and extraction of thread-level parallelism in ordinary programs
    2008
    Co-Authors: Alexandru Nicolau, Arun Kejariwal
    Abstract:

    The need for high performance, coupled with the increasing design complexity of modern processors and with power and thermal constraints, has led to the development of multi-core systems; examples include IBM/Toshiba's Cell processor and Intel's Core 2 Duo processor. One way to exploit the hardware parallelism of such systems is thread-level program parallelization. Although a large amount of work has been done in the context of multithreading, the lack of detailed application characterization on real machines makes it difficult to assess the relevance and importance of the problems addressed in prior work and the practicality of the solutions proposed. To alleviate this limitation, we performed a thorough analysis of ordinary programs, as represented by the industry-standard SPEC benchmarks, on both IA-32 and IA-64 architectures to identify real performance bottlenecks. Based on the above, and given that loops account for a large percentage of the total execution time in ordinary programs, we propose techniques for extracting thread-level parallelism (TLP) from both DOALL and non-DOALL types of loops. Extraction of TLP from DOALL loops entails efficient partitioning and mapping of a DOALL loop so as to achieve load balance between the different processors. In this regard, we present a general approach for partitioning nested DOALL loops, both perfect and non-perfect, with conditionals, and with rectangular and non-rectangular iteration geometries, where the expressions in a conditional are affine functions of the outer loop indices. Non-DOALL loops can be parallelized either speculatively (TLS) or via explicit synchronization. Although TLS enables parallel execution of program regions that are difficult to analyze at compile time, its efficacy is limited by a wide variety of factors, such as the high misspeculation penalty and the need for additional hardware. This necessitates an evaluation of the performance potential of TLS. Using the Intel Fortran/C++ compiler, we show that the speedup achievable via TLS at the loop level is minimal in ordinary programs. Therefore, we adopted explicit synchronization as the way to parallelize non-DOALL loops and propose lightweight lock-free synchronization techniques for extracting TLP from them. We show that the proposed techniques achieve better performance than the state of the art on real machines.
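
As a sketch of the two loop classes discussed above (an illustration of the general techniques, not the dissertation's algorithms), the first function below partitions a DOALL loop into independent chunks, and the second parallelizes a non-DOALL loop with a cross-iteration dependence using lock-free post/wait flags instead of locks:

```cuda
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// DOALL: iterations are independent, so each thread takes a contiguous chunk.
void doall(std::vector<double>& a, unsigned nthreads) {
    const std::size_t n = a.size();
    const std::size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::thread> ts;
    for (unsigned t = 0; t < nthreads; ++t)
        ts.emplace_back([&, t] {
            const std::size_t hi = std::min(n, (t + 1) * chunk);
            for (std::size_t i = t * chunk; i < hi; ++i)
                a[i] *= 2.0;  // independent loop body
        });
    for (auto& th : ts) th.join();
}

// Non-DOALL: a[i] depends on a[i-1]. Iterations are mapped cyclically, and
// each thread "posts" a lock-free flag when its iteration completes; the
// consumer of the dependence spins on that flag instead of taking a lock.
void doacross(std::vector<double>& a, unsigned nthreads) {
    const std::size_t n = a.size();
    std::vector<std::atomic<bool>> done(n);
    for (auto& f : done) f.store(false, std::memory_order_relaxed);
    std::vector<std::thread> ts;
    for (unsigned t = 0; t < nthreads; ++t)
        ts.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads) {
                if (i > 0) {  // wait for the producing iteration
                    while (!done[i - 1].load(std::memory_order_acquire)) {}
                    a[i] += a[i - 1];  // the serial dependence
                }
                done[i].store(true, std::memory_order_release);
            }
        });
    for (auto& th : ts) th.join();
}
```

The acquire/release pair makes the flag a one-way synchronization: no mutex is acquired on the critical path, which is the "lightweight lock-free" flavor of synchronization the abstract refers to.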

  • On the performance potential of different types of speculative thread-level parallelism: The DL version of this paper includes corrections that were not made available in the printed proceedings
    International Conference on Supercomputing, 2006
    Co-Authors: Arun Kejariwal, Xinmin Tian, Wei Li, Milind Girkar, Sergey Kozhukhov, Hideki Saito, Utpal Banerjee, Alexandru Nicolau, Alexander V Veidenbaum, Constantine D Polychronopoulos
    Abstract:

    Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS helps to achieve higher levels of parallelism, the unique performance potential of TLS, i.e., the performance gain that can be achieved only through speculation, has not received much attention. In this paper, we evaluate this aspect by separating the speedup achievable via true TLP (thread-level parallelism) from that achievable via TLS for the SPEC CPU2000 benchmarks. Further, we dissect the performance potential of each type of speculation: control speculation, data dependence speculation, and data value speculation. To the best of our knowledge, this is the first dissection study of its kind. Assuming an oracle TLS mechanism, which corresponds to perfect speculation and zero threading overhead, whereby the execution time of a candidate program region (for speculative execution) can be reduced to zero, our study shows that, at the loop level, the upper bound on the speedup achievable via TLS across SPEC CPU2000 is 39.16% (standard deviation = 31.23) as an arithmetic mean and 18.18% as a geometric mean.
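
The oracle model lends itself to a small worked example (my reconstruction of the model, not the paper's data or code): if the speculative region covers a fraction f of a benchmark's execution time and the oracle reduces it to zero, the speedup is 1/(1-f), and per-benchmark gains can then be aggregated by arithmetic and geometric means:

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical coverage fractions of speculative regions; the paper's
    // real inputs come from SPEC CPU2000 measurements.
    const std::vector<double> coverage = {0.10, 0.25, 0.40, 0.05};
    double sum_pct = 0.0, sum_log = 0.0;
    for (double f : coverage) {
        const double gain_pct = (1.0 / (1.0 - f) - 1.0) * 100.0;  // % speedup
        sum_pct += gain_pct;
        sum_log += std::log1p(gain_pct / 100.0);
    }
    const double n = static_cast<double>(coverage.size());
    std::printf("arithmetic mean speedup: %.2f%%\n", sum_pct / n);
    std::printf("geometric mean speedup:  %.2f%%\n",
                (std::exp(sum_log / n) - 1.0) * 100.0);
    return 0;
}
```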

David Gregg - One of the best experts on this subject based on the ideXlab platform.

D.l. Heine - One of the best experts on this subject based on the ideXlab platform.

  • In search of speculative thread-level parallelism
    2003
    Co-Authors: J.t. Oplinger, D.l. Heine, M.s. Lam
    Abstract:

    This paper focuses on the problem of how to find and effectively exploit speculative thread-level parallelism. Our studies show that speculating only on loops does not yield sufficient parallelism. We propose the use of speculative procedure execution as a means to increase the available parallelism. An additional technique, data value prediction, has the potential to greatly improve the performance of speculative execution. In particular, return value prediction improves the success of procedural speculation, and stride value prediction improves the success of loop speculation.

  • In search of speculative thread-level parallelism
    International Conference on Parallel Architectures and Compilation Techniques, 1999
    Co-Authors: J.t. Oplinger, D.l. Heine
    Abstract:

    This paper focuses on the problem of how to find and effectively exploit speculative thread-level parallelism. Our studies show that speculating only on loops does not yield sufficient parallelism. We propose the use of speculative procedure execution as a means to increase the available parallelism. An additional technique, data value prediction, has the potential to greatly improve the performance of speculative execution. In particular, return value prediction improves the success of procedural speculation, and stride value prediction improves the success of loop speculation.
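
The stride value predictor mentioned in both versions of this abstract is a classic mechanism; a minimal software model of it (an illustration, not the authors' hardware design) predicts the next value as the last value plus the last observed stride:

```cuda
#include <cstdint>
#include <unordered_map>

struct StrideEntry {
    int64_t last = 0;    // last value observed for this instruction
    int64_t stride = 0;  // difference between the last two values
    bool seen = false;
};

class StridePredictor {
public:
    // Prediction for the next dynamic instance of the instruction at `pc`
    // (e.g., a loop induction update or a procedure's return value).
    int64_t predict(uint64_t pc) {
        auto& e = table_[pc];
        return e.last + e.stride;
    }

    // Train the predictor with the value that actually occurred.
    void update(uint64_t pc, int64_t actual) {
        auto& e = table_[pc];
        if (e.seen) e.stride = actual - e.last;  // learn the stride
        e.last = actual;
        e.seen = true;
    }

private:
    std::unordered_map<uint64_t, StrideEntry> table_;  // indexed by PC
};
```

For loop speculation, a correct stride prediction lets a speculative thread start the next iteration with the predicted induction value instead of waiting for the previous iteration to produce it.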

Huiyang Zhou - One of the best experts on this subject based on the ideXlab platform.

  • GPU performance vs. thread-level parallelism: scalability analysis and a novel way to improve TLP
    ACM Transactions on Architecture and Code Optimization, 2018
    Co-Authors: Michael Mantor, Huiyang Zhou
    Abstract:

    Graphics Processing Units (GPUs) leverage massive thread-level parallelism (TLP) to achieve high computation throughput and hide long memory latency. However, recent studies have shown that GPU performance does not scale with GPU occupancy or the degree of TLP that a GPU supports, especially for memory-intensive workloads. The current understanding attributes this to L1 D-cache contention or off-chip memory bandwidth. In this article, we perform a novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between the L1 D-caches and the L2 partitions. We show that the interconnect bandwidth is a critical bound on GPU performance scalability. For applications that do not saturate the throughput of any particular resource, performance scales well with increased TLP. To improve TLP for such applications efficiently, we propose a fast context switching approach. When a warp/thread block (TB) is stalled by a long-latency operation, the context of the warp/TB is spilled to spare on-chip resources so that a new warp/TB can be launched. The switched-out warp/TB is switched back in when another warp/TB completes or is switched out. With this fine-grained fast context switching, higher TLP can be supported without increasing the sizes of critical resources such as the register file. Our experiments show that performance can be improved by up to 47%, and by a geometric mean of 22%, for a set of applications with unsaturated throughput utilization. Compared to the state-of-the-art TLP improvement scheme, our proposed scheme achieves 12% higher performance on average and 16% higher for the unsaturated benchmarks.
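
A schematic of the switching policy as I read it from the abstract (an illustration of the scheduling decisions only, not the paper's hardware design): stalled warp/TB contexts are spilled to spare on-chip storage so fresh ones can launch, and are restored when another warp/TB completes or is switched out.

```cuda
#include <deque>

struct TBContext {
    int id;  // plus registers, PC, barrier state, etc. in real hardware
};

class FastSwitchScheduler {
public:
    // A running TB stalls on a long-latency operation: spill its context
    // to spare on-chip resources and raise TLP by launching a fresh TB.
    void on_stall(const TBContext& ctx) {
        spilled_.push_back(ctx);
        launch_new_tb();
    }

    // A TB completes (or is switched out): bring a spilled context back.
    void on_complete() {
        if (!spilled_.empty()) {
            TBContext back = spilled_.front();
            spilled_.pop_front();
            resume(back);  // restore registers and continue
        } else {
            launch_new_tb();
        }
    }

private:
    void launch_new_tb() { /* dispatch a not-yet-started TB, if any */ }
    void resume(const TBContext& /*ctx*/) { /* restore spilled state */ }

    std::deque<TBContext> spilled_;  // stalled contexts parked on-chip
};
```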

  • CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications
    Journal of Computer Science and Technology, 2015
    Co-Authors: Yi Yang, Chao Li, Huiyang Zhou
    Abstract:

    Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops and highlight that these benchmarks do not have a very high loop count or degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
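
A hand-written CUDA sketch of the transformation described above (the real framework generates such code from pragmas; the team size NP and the loop body here are hypothetical): the kernel launches the maximum thread count up front, lets one master lane per logical thread run the sequential section, then activates all NP lanes for the nested parallel loop and its reduction.

```cuda
#define NP 32  // lanes reserved per logical thread; a full warp here so
               // warp shuffles can broadcast and reduce across the team

__global__ void kernel_np(const float* in, float* out, int n) {
    // Each group of NP consecutive threads forms one "logical" thread.
    const int logical = (blockIdx.x * blockDim.x + threadIdx.x) / NP;
    const int lane = threadIdx.x % NP;
    if (logical >= n) return;  // whole warps exit together when NP == 32

    // Sequential section: only the master lane executes it.
    float x = 0.0f;
    if (lane == 0) x = in[logical] * 2.0f;

    // Broadcast the sequential result to the team (warp-sized teams only;
    // larger teams would stage through shared memory instead).
    x = __shfl_sync(0xffffffffu, x, 0);

    // Nested parallel loop: all NP lanes active, iterations distributed
    // cyclically across the team. The body is a stand-in.
    float acc = 0.0f;
    for (int i = lane; i < 1024; i += NP)
        acc += x * 0.001f * static_cast<float>(i);

    // Reduction primitive across the team.
    for (int off = NP / 2; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);

    if (lane == 0) out[logical] = acc;
}

// Launch with NP physical threads per logical thread, e.g.:
//   kernel_np<<<(n * NP + 255) / 256, 256>>>(d_in, d_out, n);
```

Because every thread exists from kernel launch, switching between the sequential section and the nested loop is just control flow, which avoids the kernel-launch and global-memory-communication overheads of dynamic parallelism that the abstract criticizes.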

  • CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014
    Co-Authors: Yi Yang, Huiyang Zhou
    Abstract:

    Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops and highlight that these benchmarks do not have a very high loop count or degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.18 times on average.
