Multiplication Kernel

The Experts below are selected from a list of 3147 Experts worldwide, ranked by the ideXlab platform.

Jack Dongarra - One of the best experts on this subject based on the ideXlab platform.

  • variable size batched gauss jordan elimination for block jacobi preconditioning on graphics processors
    Parallel Computing, 2018
    Co-Authors: Jack Dongarra, Hartwig Anzt, Goran Flegar, Enrique S Quintanaorti
    Abstract:

    In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion Kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector Multiplication Kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our Kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
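
    The central building block of the preconditioner application is easy to state in scalar code. Below is a minimal sequential C sketch (illustrative only, not the authors' GPU kernels) of applying a block-Jacobi preconditioner from precomputed inverses of variable-size diagonal blocks: each block contributes one small dense matrix-vector product, which is the operation the variable-size batched Multiplication Kernel performs in parallel on the GPU.

      /* Minimal CPU sketch of block-Jacobi preconditioner application, illustrating
       * the variable-size batched matrix-vector products described above.
       * Names and data layout are illustrative, not the authors' implementation. */
      #include <stdio.h>

      /* Compute y = D_inv * x, where D_inv stores the precomputed inverses of the
       * diagonal blocks back to back in row-major order.  block_size[b] is the
       * dimension of block b; block_off[b] is its offset in the global vectors. */
      static void block_jacobi_apply(int nblocks, const int *block_size,
                                     const int *block_off, const double *D_inv,
                                     const double *x, double *y)
      {
          const double *blk = D_inv;
          for (int b = 0; b < nblocks; ++b) {
              int n = block_size[b], o = block_off[b];
              for (int i = 0; i < n; ++i) {          /* small dense matrix-vector product */
                  double s = 0.0;
                  for (int j = 0; j < n; ++j)
                      s += blk[i * n + j] * x[o + j];
                  y[o + i] = s;
              }
              blk += n * n;                          /* advance to next variable-size block */
          }
      }

      int main(void)
      {
          /* Two blocks: a 1x1 block holding [2]^-1 = [0.5] and a 2x2 identity. */
          int size[] = {1, 2}, off[] = {0, 1};
          double D_inv[] = {0.5,  1.0, 0.0,  0.0, 1.0};
          double x[] = {4.0, 2.0, 3.0}, y[3];
          block_jacobi_apply(2, size, off, D_inv, x, y);
          printf("%g %g %g\n", y[0], y[1], y[2]);    /* expect 2 2 3 */
          return 0;
      }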

  • IPDPS Workshops - Search Space Generation and Pruning System for Autotuners
    2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016
    Co-Authors: Piotr Luszczek, Jakub Kurzak, Mark Gates, Anthony Danalis, Jack Dongarra
    Abstract:

    This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to standard C code for fast and, as necessary, multithreaded evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by a user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix Multiplication Kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves the application of 10 complex pruning constraints. The speed of evaluation is compared against generators created using an imperative programming style in various scripting and compiled languages.
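
    The declarative notation itself is not reproduced in the abstract, but the generated evaluator it refers to boils down to nested loops over candidate parameter values with pruning constraints checked as early as possible. The C sketch below uses hypothetical tuning parameters and limits for a GEMM-like GPU kernel to illustrate the idea; it is not the authors' generator output.

      /* Illustrative sketch of a generated C evaluator for an autotuning search
       * space: nested loops over candidate parameter values with pruning
       * constraints applied before a point is accepted.  All parameter names,
       * ranges and limits below are hypothetical examples. */
      #include <stdio.h>

      #define MAX_THREADS_PER_BLOCK 1024
      #define MAX_SHARED_BYTES      49152

      int main(void)
      {
          long visited = 0, valid = 0;
          for (int tile_m = 16; tile_m <= 128; tile_m *= 2)
          for (int tile_n = 16; tile_n <= 128; tile_n *= 2)
          for (int tile_k = 8;  tile_k <= 32;  tile_k *= 2)
          for (int threads_x = 8; threads_x <= 32; threads_x *= 2)
          for (int threads_y = 8; threads_y <= 32; threads_y *= 2) {
              ++visited;
              /* Pruning constraints, checked as early as possible. */
              if (threads_x * threads_y > MAX_THREADS_PER_BLOCK) continue;
              if (tile_m % threads_y != 0 || tile_n % threads_x != 0) continue;
              /* Shared memory needed for double-precision A and B tiles. */
              long shmem = (long)(tile_m + tile_n) * tile_k * sizeof(double);
              if (shmem > MAX_SHARED_BYTES) continue;
              ++valid;
          }
          printf("visited %ld points, %ld survive pruning\n", visited, valid);
          return 0;
      }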

  • Numerical Computations with GPUs - Accelerating Numerical Dense Linear Algebra Calculations with GPUs
    Numerical Computations with GPUs, 2014
    Co-Authors: Jack Dongarra, Jakub Kurzak, Piotr Luszczek, Mark Gates, Azzam Haidar, Stanimire Tomov, Ichitaro Yamazaki
    Abstract:

    This chapter presents the current best design and implementation practices for the acceleration of dense linear algebra (DLA) on GPUs. Examples are given with fundamental algorithms, from the matrix–matrix Multiplication Kernel written in CUDA to higher-level algorithms for solving linear systems, eigenvalue problems, and SVD problems. The implementations are available through the MAGMA library, a redesign of the popular LAPACK library for GPUs. To generate the extreme level of parallelism needed for the efficient use of GPUs, algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of these tasks is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators using either static scheduling or a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows the exploration of the unique strengths of the various hardware components.
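
    The chapter's running example is a CUDA matrix–matrix Multiplication Kernel; as a language-neutral illustration of the underlying blocked computation, the following plain C sketch performs a tiled C += A*B. The tile size and loop structure are illustrative only; MAGMA's actual GPU kernels add register blocking, shared-memory staging, and other optimizations.

      /* Plain C sketch of a tiled matrix-matrix multiplication, C += A*B, for
       * row-major square matrices.  Tile size and loop order are illustrative. */
      #include <stdio.h>

      #define TILE 32

      static void gemm_tiled(int n, const double *A, const double *B, double *C)
      {
          for (int ii = 0; ii < n; ii += TILE)
          for (int kk = 0; kk < n; kk += TILE)
          for (int jj = 0; jj < n; jj += TILE) {
              int imax = ii + TILE < n ? ii + TILE : n;
              int kmax = kk + TILE < n ? kk + TILE : n;
              int jmax = jj + TILE < n ? jj + TILE : n;
              for (int i = ii; i < imax; ++i)
                  for (int k = kk; k < kmax; ++k) {
                      double a = A[i * n + k];
                      for (int j = jj; j < jmax; ++j)   /* innermost loop streams rows of C and B */
                          C[i * n + j] += a * B[k * n + j];
                  }
          }
      }

      int main(void)
      {
          double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};
          gemm_tiled(2, A, B, C);
          printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);   /* expect 19 22 43 50 */
          return 0;
      }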

  • Optimizing matrix Multiplication for a short-vector SIMD architecture - CELL processor
    Parallel Computing, 2009
    Co-Authors: Jakub Kurzak, Wesley Alvaro, Jack Dongarra
    Abstract:

    Matrix Multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix Multiplication operation is essential. The crucial component is the matrix Multiplication Kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix Multiplication Kernels are presented implementing the C = C − A×B^T operation and the C = C − A×B operation for matrices of size 64×64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.
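
    For reference, the operation the hand-tuned SPE kernel computes can be stated as a scalar C routine. The sketch below is only the mathematical specification of C = C − A×B^T for 64×64 single-precision matrices, not the short-vector SIMD implementation described in the paper.

      /* Scalar C reference for the operation the SPE kernel computes:
       * C = C - A * B^T for 64x64 single-precision, row-major matrices. */
      #include <stdio.h>

      #define N 64

      static void sgemm_nt_sub(const float A[N][N], const float B[N][N], float C[N][N])
      {
          for (int i = 0; i < N; ++i)
              for (int j = 0; j < N; ++j) {
                  float s = 0.0f;
                  for (int k = 0; k < N; ++k)
                      s += A[i][k] * B[j][k];   /* row j of B is column j of B^T */
                  C[i][j] -= s;
              }
      }

      int main(void)
      {
          static float A[N][N], B[N][N], C[N][N];                           /* zero-initialized */
          for (int i = 0; i < N; ++i) { A[i][i] = 1.0f; B[i][i] = 2.0f; }   /* A = I, B = 2I */
          sgemm_nt_sub(A, B, C);
          printf("C[0][0] = %g (expect -2)\n", C[0][0]);
          return 0;
      }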

Nectarios Koziris - One of the best experts on this subject based on the ideXlab platform.

  • improving the performance of the symmetric sparse matrix vector Multiplication in multicore
    International Parallel and Distributed Processing Symposium, 2013
    Co-Authors: Theodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, Vasileios Karakasis, Nectarios Koziris
    Abstract:

    Symmetric sparse matrices often arise in the solution of sparse linear systems. Exploiting the non-zero element symmetry in order to reduce the overall matrix size is very tempting when optimizing the symmetric Sparse Matrix-Vector Multiplication Kernel (SpMxV) for multicore architectures. Despite being very beneficial for single-threaded execution, storing only one triangular part of a symmetric sparse matrix complicates the multithreaded SpMxV version, since it introduces an undesirable dependency on the output vector elements. The most common approach for overcoming this problem is to use local, per-thread vectors, which are reduced into the output vector at the end of the computation. However, this reduction leads to considerable memory traffic, limiting the scalability of the symmetric SpMxV. In this paper, we take a two-step approach to optimizing the symmetric SpMxV Kernel. First, we introduce the CSX-Sym variant of the highly compressed CSX format, which exploits the non-zero element symmetry to compress the input matrix further. Second, we minimize the memory traffic produced by the local-vector reduction phase by implementing a non-zero indexing compression scheme that minimizes the local data to be reduced. Our indexing scheme allowed the symmetric SpMxV to scale and provided a more than 2x performance improvement over the baseline CSR implementation and an 83.9% improvement over the typical symmetric SpMxV Kernel. The CSX-Sym variant further increased the symmetric SpMxV performance by 43.4%. Finally, we evaluate the effect of our optimizations in the context of the CG iterative method, where we achieve a 77.8% acceleration of the overall solver.
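
    The baseline scheme the abstract refers to, multithreaded symmetric SpMxV with per-thread local vectors reduced into the output at the end, can be sketched as follows in C with OpenMP (compile with OpenMP support, e.g. -fopenmp). The CSX-Sym format and the non-zero indexing compression of the paper are not shown; this is only the baseline whose reduction traffic those techniques attack.

      /* Multithreaded SpMxV for a symmetric matrix stored as its lower triangle
       * (including the diagonal) in CSR, using per-thread local copies of y that
       * are reduced at the end.  Data layout and names are illustrative. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <omp.h>

      static void spmv_sym(int n, const int *rowptr, const int *colind,
                           const double *val, const double *x, double *y)
      {
          int nthreads = omp_get_max_threads();
          double *local = calloc((size_t)nthreads * n, sizeof *local);

          #pragma omp parallel
          {
              double *ly = local + (size_t)omp_get_thread_num() * n;
              #pragma omp for schedule(static)
              for (int i = 0; i < n; ++i) {
                  for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
                      int j = colind[k];
                      double a = val[k];
                      ly[i] += a * x[j];
                      if (j != i)               /* mirror the off-diagonal entry */
                          ly[j] += a * x[i];    /* may touch rows owned by other threads */
                  }
              }
          }
          /* Reduction of the per-thread vectors into y: the memory-traffic
           * bottleneck the paper's indexing compression scheme targets. */
          memset(y, 0, (size_t)n * sizeof *y);
          for (int t = 0; t < nthreads; ++t)
              for (int i = 0; i < n; ++i)
                  y[i] += local[(size_t)t * n + i];
          free(local);
      }

      int main(void)
      {
          /* Symmetric 3x3 matrix [2 1 0; 1 3 4; 0 4 5], lower triangle stored. */
          int rowptr[] = {0, 1, 3, 5}, colind[] = {0, 0, 1, 1, 2};
          double val[] = {2, 1, 3, 4, 5}, x[] = {1, 1, 1}, y[3];
          spmv_sym(3, rowptr, colind, val, x, y);
          printf("%g %g %g\n", y[0], y[1], y[2]);   /* expect 3 8 9 */
          return 0;
      }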

  • ICPP - Improving the Performance of Multithreaded Sparse Matrix-Vector Multiplication Using Index and Value Compression
    2008 37th International Conference on Parallel Processing, 2008
    Co-Authors: Kornilios Kourtis, Georgios Goumas, Nectarios Koziris
    Abstract:

    The sparse matrix-vector Multiplication Kernel exhibits limited potential for taking advantage of modern shared memory architectures due to its large memory bandwidth requirements. To decrease memory contention and improve the performance of the Kernel we propose two compression schemes. The first, called CSR-DU, targets the reduction of the matrix structural data by applying coarse grain delta encoding for the column indices. The second scheme, called CSR-VI, targets the reduction of the numerical values using indirect indexing and can only be applied to matrices which contain a small number of unique values. Evaluation of both methods on a rich matrix set showed that they can significantly improve the performance of the multithreaded version of the Kernel and achieve good scalability for large matrices.
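
    The CSR-VI idea, replacing the per-nonzero value array with a small table of distinct values plus narrow indices into it, can be illustrated with the following C sketch. The struct layout and the 8-bit index width are illustrative choices, not the paper's exact format.

      /* Sketch of value-indexed CSR (CSR-VI-like): numerical values are stored
       * once in a table of distinct values, and each nonzero carries only a small
       * index into that table, shrinking the data streamed per nonzero. */
      #include <stdio.h>
      #include <stdint.h>

      typedef struct {
          int            n;        /* number of rows                 */
          const int     *rowptr;   /* n+1 entries                    */
          const int     *colind;   /* column index per nonzero       */
          const uint8_t *vidx;     /* per-nonzero index into uniq[]  */
          const double  *uniq;     /* table of distinct values       */
      } csr_vi;

      static void spmv_csr_vi(const csr_vi *A, const double *x, double *y)
      {
          for (int i = 0; i < A->n; ++i) {
              double s = 0.0;
              for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
                  s += A->uniq[A->vidx[k]] * x[A->colind[k]];   /* indirect value load */
              y[i] = s;
          }
      }

      int main(void)
      {
          /* 2x2 matrix [1 5; 5 1] whose nonzeros take only two distinct values. */
          int rowptr[] = {0, 2, 4}, colind[] = {0, 1, 0, 1};
          uint8_t vidx[] = {0, 1, 1, 0};
          double uniq[] = {1.0, 5.0}, x[] = {1.0, 2.0}, y[2];
          csr_vi A = {2, rowptr, colind, vidx, uniq};
          spmv_csr_vi(&A, x, y);
          printf("%g %g\n", y[0], y[1]);    /* expect 11 7 */
          return 0;
      }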

  • Conf. Computing Frontiers - Optimizing sparse matrix-vector Multiplication using index and value compression
    Proceedings of the 2008 conference on Computing frontiers - CF '08, 2008
    Co-Authors: Kornilios Kourtis, Georgios Goumas, Nectarios Koziris
    Abstract:

    Previous research work has identified memory bandwidth as the main bottleneck of the ubiquitous Sparse Matrix-Vector Multiplication Kernel. To attack this problem, we aim at reducing the overall data volume of the algorithm. Typical sparse matrix representation schemes store only the non-zero elements of the matrix and employ additional indexing information to properly iterate over these elements. In this paper, we propose two distinct compression methods targeting index data and numerical values, respectively. We perform a set of experiments on a large real-world matrix set and demonstrate that the index compression method can be applied successfully to a wide range of matrices. Moreover, the value compression method is able to achieve impressive speedups in a more limited yet important class of sparse matrices that contain a small number of distinct values.
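
    Index compression by delta encoding can be illustrated as follows: the first column index of each row is stored in full and subsequent indices as byte-sized deltas, reconstructed on the fly during the Multiplication. This simplified C sketch assumes all in-row deltas fit in 8 bits; the actual CSR-DU format uses variable-size delta units with headers so it handles arbitrary matrices.

      /* Simplified sketch of index compression for CSR SpMxV: column indices are
       * stored as 8-bit deltas from their in-row predecessor.  Layout and names
       * are illustrative, not the paper's format. */
      #include <stdio.h>
      #include <stdint.h>

      typedef struct {
          int            n;
          const int     *rowptr;    /* n+1 entries                                   */
          const int     *firstcol;  /* full first column index of each row           */
          const uint8_t *delta;     /* delta[k] = colind[k] - colind[k-1] within row */
          const double  *val;
      } csr_delta;

      static void spmv_csr_delta(const csr_delta *A, const double *x, double *y)
      {
          for (int i = 0; i < A->n; ++i) {
              double s = 0.0;
              int col = 0;
              for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k) {
                  col = (k == A->rowptr[i]) ? A->firstcol[i] : col + A->delta[k];
                  s += A->val[k] * x[col];      /* column index reconstructed on the fly */
              }
              y[i] = s;
          }
      }

      int main(void)
      {
          /* 2x3 matrix [1 0 2; 0 3 4]: row 0 has columns {0,2}, row 1 has {1,2}. */
          int rowptr[] = {0, 2, 4}, firstcol[] = {0, 1};
          uint8_t delta[] = {0, 2, 0, 1};       /* entries at first-in-row positions unused */
          double val[] = {1, 2, 3, 4}, x[] = {1, 1, 1}, y[2];
          csr_delta A = {2, rowptr, firstcol, delta, val};
          spmv_csr_delta(&A, x, y);
          printf("%g %g\n", y[0], y[1]);        /* expect 3 7 */
          return 0;
      }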

Keisuke Katsushima - One of the best experts on this subject based on the ideXlab platform.

  • IPDPS - A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-Evolution Problems on Low-Order Unstructured Finite Elements
    2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018
    Co-Authors: Tsuyoshi Ichimura, Kohei Fujita, Masashi Horikoshi, Larry Meadows, Kengo Nakajima, Takuma Yamaguchi, Kentaro Koyama, Hikaru Inoue, Akira Naruse, Keisuke Katsushima
    Abstract:

    Many supercomputers are shifting to architectures with low B (byte/s; memory transfer capability) per F (FLOPS capability) ratios. However, utilizing increased F is difficult for applications that inherently require large B. Targeting an implicit unstructured low-order finite-element analysis solver, which typically requires large B, we have developed a concentrated computation algorithm that yields significant performance improvements on low B/F supercomputers. 35.7% peak performance was achieved for a sparse matrix-vector Multiplication Kernel, and 15.6% peak performance was achieved for the whole solver on the second generation Xeon Phi-based Oakforest-PACS. This is 5.02 times faster than (and 6.90 times the peak performance of) the state-of-the-art solver (the SC14 Gordon Bell finalist solver). On Oakforest-PACS, the proposed solver was approximately 2.42 times faster than the state-of-the-art solver running on the K computer. The proposed approach has implications for systems and applications and is expected to have significant impact on various fields that use finite-element methods for nonlinear time evolution problems.
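
    A back-of-the-envelope calculation shows why a kernel such as SpMxV is sensitive to the machine's B/F ratio: a CSR SpMxV streams roughly 12 bytes per nonzero (an 8-byte value and a 4-byte column index, ignoring vector traffic) for 2 flops, about 6 bytes per flop, so performance is capped by memory bandwidth long before the FLOPS peak. The machine numbers in the C snippet below are illustrative placeholders, not the systems measured in the paper.

      /* Roofline-style estimate of the bandwidth bound on CSR SpMxV.
       * The peak and bandwidth figures are hypothetical placeholders. */
      #include <stdio.h>

      int main(void)
      {
          double peak_gflops   = 3000.0;          /* hypothetical peak, GFLOP/s          */
          double bandwidth_gbs = 400.0;           /* hypothetical memory bandwidth, GB/s */
          double bytes_per_flop = 12.0 / 2.0;     /* CSR SpMxV data requirement          */

          double bound = bandwidth_gbs / bytes_per_flop;          /* bandwidth limit, GFLOP/s */
          double attainable = bound < peak_gflops ? bound : peak_gflops;
          printf("bandwidth-bound limit: %.1f GFLOP/s = %.1f%% of peak\n",
                 attainable, 100.0 * attainable / peak_gflops);
          return 0;
      }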

Jakub Kurzak - One of the best experts on this subject based on the ideXlab platform.

  • IPDPS Workshops - Search Space Generation and Pruning System for Autotuners
    2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016
    Co-Authors: Piotr Luszczek, Jakub Kurzak, Mark Gates, Anthony Danalis, Jack Dongarra
    Abstract:

    This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to standard C code for fast and, as necessary, multithreaded evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by a user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix Multiplication Kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves the application of 10 complex pruning constraints. The speed of evaluation is compared against generators created using an imperative programming style in various scripting and compiled languages.

  • Numerical Computations with GPUs - Accelerating Numerical Dense Linear Algebra Calculations with GPUs
    Numerical Computations with GPUs, 2014
    Co-Authors: Jack Dongarra, Jakub Kurzak, Piotr Luszczek, Mark Gates, Azzam Haidar, Stanimire Tomov, Ichitaro Yamazaki
    Abstract:

    This chapter presents the current best design and implementation practices for the acceleration of dense linear algebra (DLA) on GPUs. Examples are given with fundamental algorithms, from the matrix–matrix Multiplication Kernel written in CUDA to higher-level algorithms for solving linear systems, eigenvalue problems, and SVD problems. The implementations are available through the MAGMA library, a redesign of the popular LAPACK library for GPUs. To generate the extreme level of parallelism needed for the efficient use of GPUs, algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of these tasks is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators using either static scheduling or a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows the exploration of the unique strengths of the various hardware components.

  • Optimizing matrix Multiplication for a short-vector SIMD architecture - CELL processor
    Parallel Computing, 2009
    Co-Authors: Jakub Kurzak, Wesley Alvaro, Jack Dongarra
    Abstract:

    Matrix Multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix Multiplication operation is essential. The crucial component is the matrix Multiplication Kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix Multiplication Kernels are presented implementing the C = C − A×B^T operation and the C = C − A×B operation for matrices of size 64×64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.

  • fast and small short vector simd matrix Multiplication Kernels for the synergistic processing element of the cell processor
    International Conference on Computational Science, 2008
    Co-Authors: Wesley Alvaro, Jakub Kurzak, Jack Dongarra
    Abstract:

    Matrix Multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix Multiplication operation is essential. The crucial component is the matrix Multiplication Kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix Multiplication Kernels are presented implementing the C = C − A×B^T operation and the C = C − A×B operation for matrices of size 64×64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.

Wesley Alvaro - One of the best experts on this subject based on the ideXlab platform.

  • Optimizing matrix Multiplication for a short-vector SIMD architecture - CELL processor
    Parallel Computing, 2009
    Co-Authors: Jakub Kurzak, Wesley Alvaro, Jack Dongarra
    Abstract:

    Matrix Multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix Multiplication operation is essential. The crucial component is the matrix Multiplication Kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix Multiplication Kernels are presented implementing the C = C − A×B^T operation and the C = C − A×B operation for matrices of size 64×64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.

  • ICCS (1) - Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor
    Computational Science – ICCS 2008, 2008
    Co-Authors: Wesley Alvaro, Jakub Kurzak, Jack Dongarra
    Abstract:

    Matrix Multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix Multiplication operation is essential. The crucial component is the matrix Multiplication Kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix Multiplication Kernels are presented implementing the C = C − A×B^T operation and the C = C − A×B operation for matrices of size 64×64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.