Thread Parallelism

The experts below are selected from a list of 8,439 experts worldwide, ranked by the ideXlab platform.

Kunle Olukotun - One of the best experts on this subject based on the ideXlab platform.

  • Exposing speculative thread parallelism in SPEC2000
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005
    Co-Authors: Manohar K Prabhu, Kunle Olukotun
    Abstract:

    As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating-point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design. For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences into a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.
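
    A minimal sketch, in generic C, of the kind of source-level transformation the abstract alludes to. A loop-carried accumulation is a common source of TLS violations, so it is split away from the expensive per-iteration work; all names here (N, data, process) are illustrative, not taken from the paper.

      #include <stdio.h>

      #define N 8

      /* Stand-in for an expensive, element-wise computation. */
      static double process(double x) { return x * x + 1.0; }

      int main(void) {
          double data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

          /* Before: every iteration updates `total`, so under TLS the
             thread speculating on iteration i+1 is squashed whenever
             iteration i commits a new value of `total`. */
          double total = 0.0;
          for (int i = 0; i < N; i++)
              total += process(data[i]);

          /* After: each iteration writes a private slot, which removes
             the cross-iteration dependence from the expensive work;
             the cheap reduction runs sequentially afterwards. */
          double partial[N], total2 = 0.0;
          for (int i = 0; i < N; i++)
              partial[i] = process(data[i]);
          for (int i = 0; i < N; i++)
              total2 += partial[i];

          printf("%f %f\n", total, total2);
          return 0;
      }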

  • ASPLOS - Data speculation support for a chip multiprocessor
    Proceedings of the eighth international conference on Architectural support for programming languages and operating systems - ASPLOS-VIII, 1998
    Co-Authors: Lance Hammond, Mark Willey, Kunle Olukotun
    Abstract:

    Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP. This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware-intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.
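
    To make the granularity argument concrete, the sketch below contrasts a loop body that is too small to amortize per-thread speculation handlers with one that is large enough; the loops and the cycle estimates in the comments are assumptions for exposition, not Hydra measurements.

      #include <math.h>
      #include <stdio.h>

      #define N 10000

      int main(void) {
          static double a[N], b[N];
          for (int i = 0; i < N; i++)
              b[i] = (double)i;

          /* Too fine-grained: roughly a cycle or two of work per
             iteration, so software handlers costing tens of cycles per
             speculative thread would overwhelm any gain. */
          for (int i = 0; i < N; i++)
              a[i] = b[i] + 1.0;

          /* Medium-grained: tens to hundreds of cycles per iteration,
             enough to amortize the handler overhead; this is the
             regime in which the paper reports speedups. */
          for (int i = 0; i < N; i++)
              a[i] = sqrt(b[i]) * sin(b[i]) + cos(b[i]);

          printf("%f\n", a[N - 1]); /* keep both loops live */
          return 0;
      }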

M. Karnstedt - One of the best experts on this subject based on the ideXlab platform.

Hideki Saito - One of the best experts on this subject based on the ideXlab platform.

  • Function/Kernel Vectorization via Loop Vectorizer
    2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2018
    Co-Authors: Matt Masten, Eric Garcia, Evgeniy Tyurin, Konstantina Mitropoulou, Hideki Saito
    Abstract:

    Currently, there are three vectorizers in the LLVM trunk: the Loop Vectorizer, the SLP Vectorizer, and the Load-Store Vectorizer. There is a need for vectorizing functions/kernels: 1) Function calls are an integral part of programming real-world application code, and we cannot always rely on fully inlining them. When a function call is made from a vectorized context such as a vectorized loop or a vectorized function, if there is no vectorized callee available, the call has to be made to a scalar callee, one vector element at a time. At the programming-model level, OpenMP declare simd is a standardized syntax to address this problem, and LLVM needs a vectorizer that properly vectorizes OpenMP declare simd functions. 2) In GPGPU programming models such as OpenCL, work-item (thread) parallelism is not expressed with a loop; it is implicit in the execution of the kernels. In order to exploit SIMD parallelism at this top (thread) level, we need to start by vectorizing the kernel. One obvious way to vectorize functions/kernels is to add a fourth vectorizer that specifically deals with function vectorization. In this paper, we argue that such a naive approach leads to suboptimal performance and/or a higher maintenance burden. Instead, we present a technique that takes advantage of the current functionality and future improvements of the Loop Vectorizer in order to vectorize functions and kernels.
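
    A short example of the OpenMP declare simd syntax the abstract refers to; the function and loop are illustrative, not from the paper. With OpenMP enabled, the pragma directs the compiler to emit vector variants of the function alongside the scalar one, so a vectorized caller need not fall back to one scalar call per vector lane; without OpenMP the pragmas are ignored and the code still runs, just scalar.

      #include <stdio.h>

      /* Ask the compiler to generate SIMD variants of this function in
         addition to the usual scalar version. */
      #pragma omp declare simd
      float scale_add(float x, float y) { return 2.0f * x + y; }

      int main(void) {
          float a[1024], b[1024], c[1024];
          for (int i = 0; i < 1024; i++) {
              a[i] = (float)i;
              b[i] = 1.0f;
          }

          /* When this loop is vectorized, the call can target a vector
             variant of scale_add, one call per vector of lanes instead
             of one scalar call per element. */
          #pragma omp simd
          for (int i = 0; i < 1024; i++)
              c[i] = scale_add(a[i], b[i]);

          printf("%f\n", c[1023]);
          return 0;
      }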

  • Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
    2013 IEEE International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013
    Co-Authors: Xinmin Tian, Serguei V. Preis, Eric N. Garcia, Sergey S. Kozhukhov, Matt Masten, Aleksei G. Cherkasov, Hideki Saito, Nikolay Panchenko
    Abstract:

    The Intel® Xeon Phi coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, an innovative processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting the SIMD vector units is one of the most important aspects of achieving high performance for application code running on Intel® Xeon Phi coprocessors. In this paper, we present several practical SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel® MIC-specific alignment optimization, and 2-D vectorization of small matrix transpose/multiplication, implemented in the Intel® C/C++ and Fortran production compilers for Intel® Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct a performance study of our SIMD vectorization techniques. The performance results show that we achieved up to a 12.5x performance gain on the Intel® Xeon Phi coprocessor.
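
    A sketch of two of the listed techniques on a toy kernel, assuming the Intel compiler: the #pragma vector aligned alignment hint, and a trip count that is not a multiple of the 16-float MIC vector width, which forces the less-than-full-vector (remainder) handling the abstract mentions. The kernel is illustrative, not one of the paper's workloads.

      #include <stdio.h>

      /* 1000 is not a multiple of 16 floats (one 512-bit MIC vector),
         so the vectorizer must handle a less-than-full-vector remainder
         of 8 elements at the end of the loop. */
      #define N 1000

      static void saxpy(float *restrict x, float *restrict y, float a) {
          /* Intel compiler hint: the arrays are suitably aligned, so
             aligned vector loads/stores can be used without peeling. */
          #pragma vector aligned
          for (int i = 0; i < N; i++)
              y[i] = a * x[i] + y[i];
      }

      int main(void) {
          /* 64-byte alignment matches the 512-bit vector registers. */
          static float x[N] __attribute__((aligned(64)));
          static float y[N] __attribute__((aligned(64)));
          for (int i = 0; i < N; i++) {
              x[i] = (float)i;
              y[i] = 1.0f;
          }
          saxpy(x, y, 2.0f);
          printf("%f\n", y[N - 1]);
          return 0;
      }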

  • CODES+ISSS - Challenges in exploitation of loop parallelism in embedded applications
    Proceedings of the 4th international conference on Hardware software codesign and system synthesis - CODES+ISSS '06, 2006
    Co-Authors: Arun Kejariwal, Xinmin Tian, Alexander V. Veidenbaum, Alexandru Nicolau, Milind Girkar, Hideki Saito
    Abstract:

    Embedded processors have been increasingly exploiting hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. How this hardware parallelism can be exploited by applications is directly related to the amount of parallelism inherent in a target application. In this paper we evaluate the performance potential of different types of parallelism, viz., true thread-level parallelism, speculative thread-level parallelism, and vector parallelism, when executing loops. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0, and MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of the different types of thread parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include the parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites that are more suitable for parallel compilation and execution.
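
    Of the three categories, true thread-level parallelism and vector parallelism have direct source-level expressions in OpenMP, sketched below on an illustrative loop; speculative thread-level parallelism has no portable syntax and is what a TLS system applies when iteration independence cannot be proven statically.

      #include <stdio.h>

      #define N 4096

      int main(void) {
          static float a[N], b[N];
          for (int i = 0; i < N; i++)
              b[i] = (float)i;

          /* True thread-level parallelism: the iterations are provably
             independent, so they can be split across threads. */
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              a[i] = 2.0f * b[i];

          /* Vector parallelism: the same independence exploited within
             a single thread across SIMD lanes. */
          #pragma omp simd
          for (int i = 0; i < N; i++)
              a[i] = a[i] + 1.0f;

          printf("%f\n", a[N - 1]);
          return 0;
      }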

Nikolay Panchenko - One of the best experts on this subject based on the ideXlab platform.

  • Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
    2013 IEEE International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013
    Co-Authors: Xinmin Tian, Serguei V. Preis, Eric N. Garcia, Sergey S. Kozhukhov, Matt Masten, Aleksei G. Cherkasov, Hideki Saito, Nikolay Panchenko
    Abstract:

    The Intel® Xeon Phi coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, an innovative processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting the SIMD vector units is one of the most important aspects of achieving high performance for application code running on Intel® Xeon Phi coprocessors. In this paper, we present several practical SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel® MIC-specific alignment optimization, and 2-D vectorization of small matrix transpose/multiplication, implemented in the Intel® C/C++ and Fortran production compilers for Intel® Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct a performance study of our SIMD vectorization techniques. The performance results show that we achieved up to a 12.5x performance gain on the Intel® Xeon Phi coprocessor.

H. Blaar - One of the best experts on this subject based on the ideXlab platform.