Thread Parallelism

The experts below are selected from a list of 8,439 experts worldwide, ranked by the ideXlab platform.

Kunle Olukotun - One of the best experts on this subject based on the ideXlab platform.

  • Exposing speculative thread parallelism in SPEC2000
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005
    Co-Authors: Manohar K Prabhu, Kunle Olukotun
    Abstract:

    As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating-point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design. For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences into a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.
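
    A minimal sketch, in generic C, of the kind of source-level transformation the abstract alludes to. A loop-carried accumulation is a common source of TLS violations, so it is split away from the expensive per-iteration work; all names here (N, data, process) are illustrative, not taken from the paper.

      #include <stdio.h>

      #define N 8

      /* Stand-in for an expensive, element-wise computation. */
      static double process(double x) { return x * x + 1.0; }

      int main(void) {
          double data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

          /* Before: every iteration updates `total`, so under TLS the
             thread speculating on iteration i+1 is squashed whenever
             iteration i commits a new value of `total`. */
          double total = 0.0;
          for (int i = 0; i < N; i++)
              total += process(data[i]);

          /* After: each iteration writes a private slot, which removes
             the cross-iteration dependence from the expensive work;
             the cheap reduction runs sequentially afterwards. */
          double partial[N], total2 = 0.0;
          for (int i = 0; i < N; i++)
              partial[i] = process(data[i]);
          for (int i = 0; i < N; i++)
              total2 += partial[i];

          printf("%f %f\n", total, total2);
          return 0;
      }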

  • ASPLOS - Data speculation support for a chip multiprocessor
    Proceedings of the eighth international conference on Architectural support for programming languages and operating systems - ASPLOS-VIII, 1998
    Co-Authors: Lance Hammond, Mark Willey, Kunle Olukotun
    Abstract:

    Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP. This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware-intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.
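
    To make the granularity argument concrete, the sketch below contrasts a loop body that is too small to amortize per-thread speculation handlers with one that is large enough; the loops and the cycle estimates in the comments are assumptions for exposition, not Hydra measurements.

      #include <math.h>
      #include <stdio.h>

      #define N 10000

      int main(void) {
          static double a[N], b[N];
          for (int i = 0; i < N; i++)
              b[i] = (double)i;

          /* Too fine-grained: roughly a cycle or two of work per
             iteration, so software handlers costing tens of cycles per
             speculative thread would overwhelm any gain. */
          for (int i = 0; i < N; i++)
              a[i] = b[i] + 1.0;

          /* Medium-grained: tens to hundreds of cycles per iteration,
             enough to amortize the handler overhead; this is the
             regime in which the paper reports speedups. */
          for (int i = 0; i < N; i++)
              a[i] = sqrt(b[i]) * sin(b[i]) + cos(b[i]);

          printf("%f\n", a[N - 1]); /* keep both loops live */
          return 0;
      }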

M. Karnstedt - One of the best experts on this subject based on the ideXlab platform.

Hideki Saito - One of the best experts on this subject based on the ideXlab platform.

  • Function/Kernel Vectorization via Loop Vectorizer
    2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2018
    Co-Authors: Matt Masten, Eric Garcia, Evgeniy Tyurin, Konstantina Mitropoulou, Hideki Saito
    Abstract:

    Currently, there are three vectorizers in the LLVM trunk: the Loop Vectorizer, the SLP Vectorizer, and the Load-Store Vectorizer. There is a need for vectorizing functions/kernels: 1) Function calls are an integral part of programming real-world application code, and we cannot always rely on fully inlining them. When a function call is made from a vectorized context such as a vectorized loop or a vectorized function, if there is no vectorized callee available, the call has to be made to a scalar callee, one vector element at a time. At the programming-model level, OpenMP declare simd is a standardized syntax to address this problem, and LLVM needs a vectorizer that properly vectorizes OpenMP declare simd functions. 2) In GPGPU programming models such as OpenCL, work-item (thread) parallelism is not expressed with a loop; it is implicit in the execution of the kernels. In order to exploit SIMD parallelism at this top (thread) level, we need to start by vectorizing the kernel. One obvious way to vectorize functions/kernels is to add a fourth vectorizer that specifically deals with function vectorization. In this paper, we argue that such a naive approach leads to suboptimal performance and/or a higher maintenance burden. Instead, we present a technique that takes advantage of the current functionality and future improvements of the Loop Vectorizer in order to vectorize functions and kernels.
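
    A short example of the OpenMP declare simd syntax the abstract refers to; the function and loop are illustrative, not from the paper. With OpenMP enabled, the pragma directs the compiler to emit vector variants of the function alongside the scalar one, so a vectorized caller need not fall back to one scalar call per vector lane; without OpenMP the pragmas are ignored and the code still runs, just scalar.

      #include <stdio.h>

      /* Ask the compiler to generate SIMD variants of this function in
         addition to the usual scalar version. */
      #pragma omp declare simd
      float scale_add(float x, float y) { return 2.0f * x + y; }

      int main(void) {
          float a[1024], b[1024], c[1024];
          for (int i = 0; i < 1024; i++) {
              a[i] = (float)i;
              b[i] = 1.0f;
          }

          /* When this loop is vectorized, the call can target a vector
             variant of scale_add, one call per vector of lanes instead
             of one scalar call per element. */
          #pragma omp simd
          for (int i = 0; i < 1024; i++)
              c[i] = scale_add(a[i], b[i]);

          printf("%f\n", c[1023]);
          return 0;
      }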

  • Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
    2013 IEEE International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013
    Co-Authors: Xinmin Tian, Serguei V. Preis, Eric N. Garcia, Sergey S. Kozhukhov, Matt Masten, Aleksei G. Cherkasov, Hideki Saito, Nikolay Panchenko
    Abstract:

    The Intel® Xeon Phi coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, an innovative processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting the SIMD vector units is one of the most important aspects of achieving high performance for application code running on Intel® Xeon Phi coprocessors. In this paper, we present several practical SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel® MIC-specific alignment optimization, and 2-D vectorization of small matrix transpose/multiplication, implemented in the Intel® C/C++ and Fortran production compilers for Intel® Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct a performance study of our SIMD vectorization techniques. The performance results show that we achieved up to a 12.5x performance gain on the Intel® Xeon Phi coprocessor.
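
    A sketch of two of the listed techniques on a toy kernel, assuming the Intel compiler: the #pragma vector aligned alignment hint, and a trip count that is not a multiple of the 16-float MIC vector width, which forces the less-than-full-vector (remainder) handling the abstract mentions. The kernel is illustrative, not one of the paper's workloads.

      #include <stdio.h>

      /* 1000 is not a multiple of 16 floats (one 512-bit MIC vector),
         so the vectorizer must handle a less-than-full-vector remainder
         of 8 elements at the end of the loop. */
      #define N 1000

      static void saxpy(float *restrict x, float *restrict y, float a) {
          /* Intel compiler hint: the arrays are suitably aligned, so
             aligned vector loads/stores can be used without peeling. */
          #pragma vector aligned
          for (int i = 0; i < N; i++)
              y[i] = a * x[i] + y[i];
      }

      int main(void) {
          /* 64-byte alignment matches the 512-bit vector registers. */
          static float x[N] __attribute__((aligned(64)));
          static float y[N] __attribute__((aligned(64)));
          for (int i = 0; i < N; i++) {
              x[i] = (float)i;
              y[i] = 1.0f;
          }
          saxpy(x, y, 2.0f);
          printf("%f\n", y[N - 1]);
          return 0;
      }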

  • CODES+ISSS - Challenges in exploitation of loop parallelism in embedded applications
    Proceedings of the 4th international conference on Hardware software codesign and system synthesis - CODES+ISSS '06, 2006
    Co-Authors: Arun Kejariwal, Xinmin Tian, Alexander V. Veidenbaum, Alexandru Nicolau, Milind Girkar, Hideki Saito
    Abstract:

    Embedded processors have been increasingly exploiting hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. How this hardware parallelism can be exploited by applications is directly related to the amount of parallelism inherent in a target application. In this paper we evaluate the performance potential of different types of parallelism, viz., true thread-level parallelism, speculative thread-level parallelism, and vector parallelism, when executing loops. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0, and MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of the different types of thread parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include the parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites that are more suitable for parallel compilation and execution.
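
    Of the three categories, true thread-level parallelism and vector parallelism have direct source-level expressions in OpenMP, sketched below on an illustrative loop; speculative thread-level parallelism has no portable syntax and is what a TLS system applies when iteration independence cannot be proven statically.

      #include <stdio.h>

      #define N 4096

      int main(void) {
          static float a[N], b[N];
          for (int i = 0; i < N; i++)
              b[i] = (float)i;

          /* True thread-level parallelism: the iterations are provably
             independent, so they can be split across threads. */
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              a[i] = 2.0f * b[i];

          /* Vector parallelism: the same independence exploited within
             a single thread across SIMD lanes. */
          #pragma omp simd
          for (int i = 0; i < N; i++)
              a[i] = a[i] + 1.0f;

          printf("%f\n", a[N - 1]);
          return 0;
      }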

Nikolay Panchenko - One of the best experts on this subject based on the ideXlab platform.

  • Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
    2013 IEEE International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013
    Co-Authors: Xinmin Tian, Serguei V. Preis, Eric N. Garcia, Sergey S. Kozhukhov, Matt Masten, Aleksei G. Cherkasov, Hideki Saito, Nikolay Panchenko
    Abstract:

    The Intel® Xeon Phi coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, an innovative processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting the SIMD vector units is one of the most important aspects of achieving high performance for application code running on Intel® Xeon Phi coprocessors. In this paper, we present several practical SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel® MIC-specific alignment optimization, and 2-D vectorization of small matrix transpose/multiplication, implemented in the Intel® C/C++ and Fortran production compilers for Intel® Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct a performance study of our SIMD vectorization techniques. The performance results show that we achieved up to a 12.5x performance gain on the Intel® Xeon Phi coprocessor.

H. Blaar - One of the best experts on this subject based on the ideXlab platform.