Parallel Architectures

14,000,000 Leading Edge Experts on the ideXlab platform

The experts below are selected from a list of 63,240 experts worldwide, ranked by the ideXlab platform.

David Kaeli - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
    IEEE Transactions on Parallel and Distributed Systems, 2011
    Co-Authors: Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli
    Abstract:

    The introduction of general-purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
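The layout issue described in the abstract above can be made concrete with a toy example. The sketch below is plain Python, not the authors' GPU implementation; the `aos_to_soa` helper and the particle records are invented for illustration. It shows an array-of-structures (AoS) to structure-of-arrays (SoA) transformation, the kind of data reorganization that turns strided per-field accesses into contiguous ones that neighboring GPU threads can fetch in a single coalesced transaction.

```python
# Hypothetical illustration: AoS -> SoA layout transformation.
# In AoS, the fields of one record are adjacent, so threads that each read
# the same field touch memory with a stride equal to the record size.
# In SoA, all values of one field are contiguous, so those reads coalesce.

def aos_to_soa(records, fields):
    """Turn a list of dicts (AoS) into a dict of lists (SoA)."""
    return {f: [r[f] for r in records] for f in fields}

# AoS: each "particle" interleaves x, y, z in memory.
particles = [{"x": 1.0, "y": 2.0, "z": 3.0},
             {"x": 4.0, "y": 5.0, "z": 6.0}]

soa = aos_to_soa(particles, ["x", "y", "z"])
print(soa["x"])  # all x components now sit contiguously: [1.0, 4.0]
```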

  • Data Transformations Enabling Loop Vectorization on Multithreaded Data Parallel Architectures
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010
    Co-Authors: Byunghyun Jang, Dana Schaa, Rodrigo Dominguez, Perhaad Mistry, David Kaeli
    Abstract:

    Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high-performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4×) by applying vectorization using our data transformation approach.
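One way to picture the model described above (a hedged sketch; the function names and the transpose-only transformation are illustrative, not the paper's formulation): treat a loop's memory access as an affine function of the loop index, then pick a data transformation under which that access becomes unit-stride and thus amenable to SIMD loads.

```python
# Illustrative sketch: an affine access pattern a*i + b, and a transpose
# that converts a strided column walk into a unit-stride walk.

def access_indices(a, b, n):
    """Affine access pattern: element a*i + b for i = 0..n-1."""
    return [a * i + b for i in range(n)]

def transpose(flat, rows, cols):
    """Re-lay a row-major (rows x cols) matrix as column-major."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]

rows, cols = 3, 4
data = list(range(rows * cols))            # row-major 3x4 matrix

# Walking column 0 in the original layout has stride cols (= 4).
strided = access_indices(cols, 0, rows)    # indices [0, 4, 8]

# After the data transformation, the same logical column is unit-stride.
t = transpose(data, rows, cols)
unit = access_indices(1, 0, rows)          # indices [0, 1, 2]
assert [t[i] for i in unit] == [data[i] for i in strided]
```

The one-time cost of the transpose corresponds to the transformation overhead the abstract says is amortized as the data set grows.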

Byunghyun Jang - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
    IEEE Transactions on Parallel and Distributed Systems, 2011
    Co-Authors: Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli

  • Data Transformations Enabling Loop Vectorization on Multithreaded Data Parallel Architectures
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010
    Co-Authors: Byunghyun Jang, Dana Schaa, Rodrigo Dominguez, Perhaad Mistry, David Kaeli

Dana Schaa - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
    IEEE Transactions on Parallel and Distributed Systems, 2011
    Co-Authors: Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli

  • Data Transformations Enabling Loop Vectorization on Multithreaded Data Parallel Architectures
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010
    Co-Authors: Byunghyun Jang, Dana Schaa, Rodrigo Dominguez, Perhaad Mistry, David Kaeli

Perhaad Mistry - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
    IEEE Transactions on Parallel and Distributed Systems, 2011
    Co-Authors: Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli

  • Data Transformations Enabling Loop Vectorization on Multithreaded Data Parallel Architectures
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010
    Co-Authors: Byunghyun Jang, Dana Schaa, Rodrigo Dominguez, Perhaad Mistry, David Kaeli

P Sadayappan - One of the best experts on this subject based on the ideXlab platform.

  • Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008
    Co-Authors: Muthu Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J Ramanujam, Atanas Rountev, P Sadayappan
    Abstract:

    Several parallel architectures, such as GPUs and the Cell processor, have fast explicitly managed on-chip memories in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the architecture and manage the fast memories to maximize performance. In this paper, we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
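The buffering-plus-tiling scheme described above can be sketched in miniature (plain Python, not the paper's compiler-generated code; `tiled_matmul` and its tile-copy step are invented for illustration). Each tile of the operands is first copied into a small local buffer, standing in for a transfer from slow off-chip memory into fast explicitly managed on-chip memory, and the inner loops then compute only out of those buffers.

```python
# Hypothetical sketch: tiled matrix multiply with explicit "local buffers".

def tiled_matmul(A, B, n, T):
    """Multiply two n x n matrices (lists of lists) using T x T tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                # "DMA" step: stage the current tiles into local buffers,
                # mimicking a copy from off-chip to on-chip memory.
                a_buf = [row[kk:kk + T] for row in A[ii:ii + T]]
                b_buf = [row[jj:jj + T] for row in B[kk:kk + T]]
                # Compute entirely out of the local buffers.
                for i in range(len(a_buf)):
                    for j in range(len(b_buf[0])):
                        C[ii + i][jj + j] += sum(
                            a_buf[i][k] * b_buf[k][j]
                            for k in range(len(b_buf)))
    return C
```

The tile size T plays the role the paper studies: it must be small enough that the staged buffers fit in the fast memory, yet large enough to amortize the cost of each copy.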