Partial Pivoting


The Experts below are selected from a list of 1494 Experts worldwide ranked by the ideXlab platform.

Sivan Toledo - One of the best experts on this subject based on the ideXlab platform.

  • THE SNAP-BACK Pivoting METHOD FOR SYMMETRIC BANDED INDEFINITE MATRICES ∗
    2013
    Co-Authors: Dror Irony, Sivan Toledo
    Abstract:

    The four existing stable factorization methods for symmetric indefinite matrices suffer serious defects when applied to banded matrices. Partial Pivoting (row or column exchanges) maintains a band structure in the reduced matrix and the factors, but destroys symmetry completely once an off-diagonal pivot is used. Two-by-two block Pivoting and Gaussian reduction to tridiagonal (Aasen's algorithm) maintain symmetry at all times, but quickly destroy the band structure in the reduced matrices. Orthogonal reductions to tridiagonal maintain both symmetry and the band structure, but are too expensive for linear-equation solvers. We propose a new Pivoting method, which we call snap-back Pivoting. When applied to banded symmetric matrices, it maintains the band structure (like Partial Pivoting does), it keeps the reduced matrix symmetric (like 2-by-2 Pivoting and reductions to tridiagonal), and it is fast. Snap-back Pivoting reduces the matrix to a diagonal form using a sequence of elementary elimination steps, most of which are applied symmetrically from the left and from the right (but some are applied unsymmetrically). In snap-back Pivoting, if the next diagonal element is too small, the next Pivoting step might be unsymmetric, leading to asymmetry in the next row and column of the factors. But the reduced matrix snaps back to symmetry once the next step is completed. Key words. element growth, symmetric-indefinite matrices, Pivoting, banded matrices, matrix factorizations.
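    The contrast between one-sided and two-sided exchanges that motivates snap-back Pivoting can be seen concretely. The sketch below is a hypothetical Python illustration (not the snap-back algorithm itself): exchanging two rows alone destroys symmetry, while exchanging the same rows and columns together preserves it.

```python
# Illustration only: one-sided vs. two-sided (symmetric) pivoting exchanges.
def is_symmetric(a):
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))

def swap_rows(a, i, j):
    """Row exchange only, as in Partial Pivoting."""
    b = [row[:] for row in a]
    b[i], b[j] = b[j], b[i]
    return b

def swap_sym(a, i, j):
    """Exchange rows i and j AND the matching columns (two-sided)."""
    b = swap_rows(a, i, j)
    for row in b:
        row[i], row[j] = row[j], row[i]
    return b

A = [[1.0, 2.0, 3.0],
     [2.0, 5.0, 6.0],
     [3.0, 6.0, 9.0]]
print(is_symmetric(swap_rows(A, 0, 1)))  # False: row exchange breaks symmetry
print(is_symmetric(swap_sym(A, 0, 1)))   # True: symmetric exchange keeps it
```

A symmetric (two-sided) exchange, however, can only move diagonal entries to the pivot position, which is why purely symmetric pivoting strategies need 2-by-2 blocks or extra machinery such as snap-back steps.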

  • THE GROWTH-FACTOR BOUND FOR THE BUNCH-KAUFMAN FACTORIZATION IS TIGHT
    2013
    Co-Authors: Alex Druinsky, Sivan Toledo
    Abstract:

    We show that the growth factor bound in the Bunch-Kaufman factorization method is essentially tight. The method factors a symmetric matrix A into A = P T LDL T P where P is a permutation matrix, L is lower triangular, and D is block diagonal with 1-by-1 and 2-by-2 diagonal blocks. The method uses one of several Partial Pivoting rules that ensure bounded but possibly exponential growth in the elements of the reduced matrix and the factor D (growth in L is not bounded). We show that the exponential bound is essentially tight, thereby solving a question that has been open since 1977.
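    The Bunch-Kaufman worst-case construction itself is not reproduced in the abstract, but the flavor of a tight exponential growth bound can be seen in the classical Partial Pivoting analogue: Wilkinson's matrix attains the GEPP growth-factor bound of 2^(n-1). A small illustrative Python sketch:

```python
# Growth factor of Gaussian elimination with partial pivoting (GEPP):
# ratio of the largest element seen during elimination to the largest
# element of the original matrix. Wilkinson's matrix (1 on the diagonal
# and in the last column, -1 below the diagonal) attains 2^(n-1).
def growth_factor_gepp(a):
    n = len(a)
    a = [row[:] for row in a]
    max0 = max(abs(x) for row in a for x in row)
    max_elt = max0
    for k in range(n - 1):
        # partial pivoting: bring the largest entry of column k into row k
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
        max_elt = max(max_elt, max(abs(x) for row in a for x in row))
    return max_elt / max0

def wilkinson(n):
    return [[1.0 if i == j or j == n - 1 else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]

print(growth_factor_gepp(wilkinson(6)))  # 32.0, i.e. 2**(6-1)
```

Each elimination step doubles the last column, so the growth is exactly 2^(n-1); the paper's contribution is the much harder analogous tightness result for the Bunch-Kaufman rules.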

  • communication efficient gaussian elimination with Partial Pivoting using a shape morphing data layout
    ACM Symposium on Parallel Algorithms and Architectures, 2013
    Co-Authors: Grey Ballard, James Demmel, Benjamin Lipshitz, Oded Schwartz, Sivan Toledo
    Abstract:

    High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via Partial Pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: Partial Pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with Partial Pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.

  • Communication efficient Gaussian elimination with Partial Pivoting using a shape morphing data layout
    2013
    Co-Authors: Grey Ballard, Sivan Toledo, James Demmel, Oded Schwartz, Benjamin Lipshitz
    Abstract:

    High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via Partial Pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: Partial Pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with Partial Pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.
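    The layout tension described above can be made concrete with a toy conversion between a column-major buffer and a tile (block) layout. The sketch below is a hypothetical illustration of the morphing idea, not the paper's cache-oblivious procedure; it assumes for brevity that the tile size b divides the matrix dimension n.

```python
# Same n-by-n matrix, two storage layouts: one flat column-major buffer
# (entry (i, j) lives at index j*n + i) versus a dictionary of b-by-b
# column-major tiles keyed by (tile_row, tile_col).
def colmajor_to_tiles(buf, n, b):
    tiles = {}
    for ti in range(n // b):
        for tj in range(n // b):
            tiles[(ti, tj)] = [buf[(tj * b + j) * n + (ti * b + i)]
                               for j in range(b) for i in range(b)]
    return tiles

def tiles_to_colmajor(tiles, n, b):
    buf = [0.0] * (n * n)
    for (ti, tj), tile in tiles.items():
        for j in range(b):
            for i in range(b):
                buf[(tj * b + j) * n + (ti * b + i)] = tile[j * b + i]
    return buf

n, b = 4, 2
colmajor = [float(k) for k in range(n * n)]
roundtrip = tiles_to_colmajor(colmajor_to_tiles(colmajor, n, b), n, b)
print(roundtrip == colmajor)  # True
```

In the column-major form a pivot search scans a contiguous column; in the tile form each b-by-b block is contiguous, which favors blocked updates. Shape morphing, as the abstract describes it, switches between such layouts mid-algorithm rather than committing to one.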

  • parallel unsymmetric pattern multifrontal sparse lu with column preordering
    ACM Transactions on Mathematical Software, 2008
    Co-Authors: Haim Avron, Gil Shklarski, Sivan Toledo
    Abstract:

    We present a new parallel sparse LU factorization algorithm and code. The algorithm uses a column-preordering Partial-Pivoting unsymmetric-pattern multifrontal approach. Our baseline sequential algorithm is based on UMFPACK 4, but is somewhat simpler and is often somewhat faster than UMFPACK version 4.0. Our parallel algorithm is designed for shared-memory machines with a small or moderate number of processors (we tested it on up to 32 processors). We experimentally compare our algorithm with SuperLU_MT, an existing shared-memory sparse LU factorization code with Partial Pivoting. SuperLU_MT scales better than our new algorithm, but our algorithm is more reliable and is usually faster. More specifically, on matrices that are costly to factor, our algorithm is usually faster on up to 4 processors, and is usually faster on 8 and 16 processors as well. We were not able to run SuperLU_MT on 32 processors. The main contribution of this article is showing that the column-preordering Partial-Pivoting unsymmetric-pattern multifrontal approach, developed as a sequential algorithm by Davis in several recent versions of UMFPACK, can be effectively parallelized.

Tao Yang - One of the best experts on this subject based on the ideXlab platform.

  • S+: Efficient 2D sparse LU factorization on parallel machines
    2009
    Co-Authors: Kai Shen, Tao Yang, Xiangmin Jiao
    Abstract:

    Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with Partial Pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes.

  • S+: Efficient 2D sparse LU factorization on parallel machines
    2001
    Co-Authors: Kai Shen, Tao Yang, Xiangmin Jiao
    Abstract:

    Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with Partial Pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes. Key words. Gaussian elimination with Partial Pivoting, LU factorization, sparse matrices, elimination forests, supernode amalgamation and partitioning, asynchronous computation scheduling AMS subject classifications. 65F50, 65F05 PII. S089547989833738
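    For readers unfamiliar with elimination forests, the classical symmetric special case, the elimination tree, can be computed directly from the sparsity pattern. The sketch below is a generic textbook construction (Liu's algorithm with path compression), not the S+ code; `pattern[j]` is assumed to hold the column indices i < j of the nonzeros in row j of a symmetric matrix.

```python
# Elimination tree of a symmetric sparse pattern: parent[k] is the first
# row > k whose elimination depends on k (-1 for roots). Disconnected
# patterns yield a forest rather than a single tree.
def etree(pattern, n):
    parent = [-1] * n
    ancestor = [-1] * n  # virtual forest with path compression
    for j in range(n):
        for i in sorted(pattern.get(j, ())):
            if i >= j:
                continue
            r = i
            # climb from i toward the current root, compressing the path
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

# Tridiagonal pattern: each row j has one nonzero to its left, at j-1.
print(etree({1: {0}, 2: {1}, 3: {2}}, 4))  # [1, 2, 3, -1]
# Arrow pattern: only the last row is full -> a flat tree rooted at 3.
print(etree({3: {0, 1, 2}}, 4))            # [3, 3, 3, -1]
```

The papers above use the unsymmetric generalization of this structure to decide supernode amalgamation and scheduling order; the dependency idea is the same.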

  • Efficient Sparse LU Factorization with Lazy Space Allocation
    1999
    Co-Authors: Bin Jiang, Kai Shen, Steven Richman, Tao Yang
    Abstract:

    Static symbolic factorization coupled with 2D supernode partitioning and asynchronous computation scheduling is a viable approach for sparse LU with dynamic Partial Pivoting. Our previous implementation, called S+, uses those techniques and achieves high gigaflop rates on distributed memory machines. This paper studies the space requirement of this approach and proposes an optimization strategy called lazy space allocation which acquires memory on-the-fly only when it is necessary. This strategy can effectively control memory usage, especially when static symbolic factorization overestimates fill-ins excessively. Our experiments show that the improved S+ code, which combines this strategy with elimination-forest guided partitioning and scheduling, has sequential time and space cost competitive with SuperLU, is space scalable for solving problems of large sizes on multiple processors, and can deliver up to 10 GFLOPS on 128 Cray 450MHz T3E nodes.

  • efficient sparse lu factorization with Partial Pivoting on distributed memory architectures
    IEEE Transactions on Parallel and Distributed Systems, 1998
    Co-Authors: Xiangmin Jiao, Tao Yang
    Abstract:

    A sparse LU factorization based on Gaussian elimination with Partial Pivoting (GEPP) is important to many scientific applications, but it is still an open problem to develop a high performance GEPP code on distributed memory machines. The main difficulty is that Partial Pivoting operations dynamically change computation and nonzero fill-in structures during the elimination process. This paper presents an approach called S* for parallelizing this problem on distributed memory machines. The S* approach adopts static symbolic factorization to avoid run-time control overhead, incorporates 2D L/U supernode partitioning and amalgamation strategies to improve caching performance, and exploits irregular task parallelism embedded in sparse LU using asynchronous computation scheduling. The paper discusses and compares the algorithms using 1D and 2D data mapping schemes, and presents experimental studies on Cray-T3D and T3E. The performance results for a set of nonsymmetric benchmark matrices are very encouraging, and S* has achieved up to 6.878 GFLOPS on 128 T3E nodes. To the best of our knowledge, this is the highest performance ever achieved for this challenging problem and the previous record was 2.583 GFLOPS on shared memory machines.

  • Elimination Forest Guided 2D Sparse LU Factorization
    1998
    Co-Authors: Kai Shen, Xiangmin Jiao, Tao Yang
    Abstract:

    Sparse LU factorization with Partial Pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04 GFLOPS on 128 Cray T3E 450MHz nodes, which is the hi..

Xiangmin Jiao - One of the best experts on this subject based on the ideXlab platform.

  • S+: Efficient 2D sparse LU factorization on parallel machines
    2009
    Co-Authors: Kai Shen, Tao Yang, Xiangmin Jiao
    Abstract:

    Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with Partial Pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes.

  • S+: Efficient 2D sparse LU factorization on parallel machines
    2001
    Co-Authors: Kai Shen, Tao Yang, Xiangmin Jiao
    Abstract:

    Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with Partial Pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes. Key words. Gaussian elimination with Partial Pivoting, LU factorization, sparse matrices, elimination forests, supernode amalgamation and partitioning, asynchronous computation scheduling AMS subject classifications. 65F50, 65F05 PII. S089547989833738

  • efficient sparse lu factorization with Partial Pivoting on distributed memory architectures
    IEEE Transactions on Parallel and Distributed Systems, 1998
    Co-Authors: Xiangmin Jiao, Tao Yang
    Abstract:

    A sparse LU factorization based on Gaussian elimination with Partial Pivoting (GEPP) is important to many scientific applications, but it is still an open problem to develop a high performance GEPP code on distributed memory machines. The main difficulty is that Partial Pivoting operations dynamically change computation and nonzero fill-in structures during the elimination process. This paper presents an approach called S* for parallelizing this problem on distributed memory machines. The S* approach adopts static symbolic factorization to avoid run-time control overhead, incorporates 2D L/U supernode partitioning and amalgamation strategies to improve caching performance, and exploits irregular task parallelism embedded in sparse LU using asynchronous computation scheduling. The paper discusses and compares the algorithms using 1D and 2D data mapping schemes, and presents experimental studies on Cray-T3D and T3E. The performance results for a set of nonsymmetric benchmark matrices are very encouraging, and S* has achieved up to 6.878 GFLOPS on 128 T3E nodes. To the best of our knowledge, this is the highest performance ever achieved for this challenging problem and the previous record was 2.583 GFLOPS on shared memory machines.

  • Elimination Forest Guided 2D Sparse LU Factorization
    1998
    Co-Authors: Kai Shen, Xiangmin Jiao, Tao Yang
    Abstract:

    Sparse LU factorization with Partial Pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04 GFLOPS on 128 Cray T3E 450MHz nodes, which is the hi..

Jakub Kurzak - One of the best experts on this subject based on the ideXlab platform.

  • linear systems solvers for distributed memory machines with gpu accelerators
    European Conference on Parallel Processing, 2019
    Co-Authors: Jakub Kurzak, Asim Yarkhan, Mark Gates, Ichitaro Yamazaki, Ali Charara, J.j. Dongarra
    Abstract:

    This work presents two implementations of linear solvers for distributed-memory machines with GPU accelerators—one based on the Cholesky factorization and one based on the LU factorization with Partial Pivoting. The routines are developed as part of the Software for Linear Algebra Targeting Exascale (SLATE) package, which represents a sharp departure from the traditional conventions established by legacy packages, such as LAPACK and ScaLAPACK. The article lays out the principles of the new approach, discusses the implementation details, and presents the performance results.

  • A survey of recent developments in parallel implementations of Gaussian elimination (Wiley Online Library, DOI: 10.1002/cpe.3306)
    2015
    Co-Authors: Simplice Donfack, Jakub Kurzak, Jack Dongarra, Piotr Luszczek, Mathieu Faverge, Mark Gates, Ichitaro Yamazaki
    Abstract:

    Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared-memory architectures. Five different flavors are investigated. Three of them are based on different strategies for Pivoting: Partial Pivoting, incremental Pivoting, and tournament Pivoting. The fourth one replaces Pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without Pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented.
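    Iterative refinement, mentioned above as the accuracy-recovery technique, is a short loop around any backward-stable solver. The following is a generic textbook sketch in pure Python, not the paper's implementation; for brevity it refactors the matrix on every correction step and uses a single precision level, whereas real codes reuse the LU factors and often compute residuals in higher precision.

```python
# Solve a x = b by Gaussian elimination with partial pivoting on an
# augmented matrix, then improve x with a few refinement steps.
def lu_solve(a, bvec):
    n = len(a)
    a = [row[:] + [bvec[i]] for i, row in enumerate(a)]  # augment with b
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))  # pivot row
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n + 1):
                a[i][j] -= m * a[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x

def refine(a, bvec, steps=3):
    x = lu_solve(a, bvec)
    for _ in range(steps):
        # residual r = b - a x, then correct x by the solution of a d = r
        n = len(x)
        r = [bvec[i] - sum(a[i][j] * x[j] for j in range(n))
             for i in range(n)]
        d = lu_solve(a, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(refine(A, b))  # close to [1/11, 7/11]
```

The appeal for the surveyed no-pivoting and randomized variants is exactly this loop: a cheaper, less stable factorization can still deliver an accurate solution after a few refinement sweeps, as long as the residual shrinks.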

  • LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System - eScholarship
    2014
    Co-Authors: Jakub Kurzak
    Abstract:

    LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System (LAPACK Working Note 266). Authors: Jakub Kurzak, Piotr Luszczek, Mathieu Faverge, Jack Dongarra (University of Tennessee; Oak Ridge National Laboratory; University of Manchester). Abstract: LU factorization with Partial Pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs. Introduction: This paper presents an implementation of the canonical formulation of the LU factorization, which relies on Partial (row) Pivoting for numerical stability. It is equivalent to the DGETRF function in the LAPACK numerical library. Since the algorithm is coded in double precision, it can serve as the basis for an implementation of the High Performance LINPACK benchmark (HPL) [1]. The target platform is a hybrid, multi-CPU, multi-GPU shared memory system. Background: The LAPACK block LU factorization is the main point of reference here, and LAPACK naming convention is followed. The LU factorization of a matrix M has the form M = PLU, where L is a unit lower triangular matrix and U is an upper triangular matrix.
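    The M = PLU factorization named above can be sketched in a few lines of pure Python, in the spirit of (but far simpler than) LAPACK's blocked DGETRF; the function name `getrf` and the permutation-as-list convention here are illustrative choices, not LAPACK's interface.

```python
# Unblocked LU with partial (row) pivoting. Returns perm, L, U such that
# row perm[i] of M equals row i of L @ U (i.e. P M = L U for the
# permutation that maps row perm[i] to position i).
def getrf(m):
    n = len(m)
    a = [row[:] for row in m]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))  # pivot search
        a[k], a[p] = a[p], a[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]               # multiplier, stored in L's slot
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # trailing update
    L = [[a[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return perm, L, U

M = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
perm, L, U = getrf(M)
LU = [[sum(L[i][k] * U[k][j] for k in range(3)) for j in range(3)]
      for i in range(3)]
print(all(abs(LU[i][j] - M[perm[i]][j]) < 1e-12
          for i in range(3) for j in range(3)))  # True
```

The panel factorization that the surrounding papers struggle to parallelize is essentially the k-loop above restricted to a narrow block of columns; the memory-bound pivot search down a column is what makes it hard to accelerate.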

  • On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
    2013
    Co-Authors: Simplice Donfack, Jakub Kurzak, Jack Dongarra, Mark Gates, Ichitaro Yamazaki, Piotr Luszczek
    Abstract:

    Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for Pivoting: Partial Pivoting, incremental Pivoting, and tournament Pivoting. The fourth one replaces Pivoting with the Random Butterfly Transformation, and finally, an implementation without Pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy are analyzed.

  • lu factorization with Partial Pivoting for a multicore system with accelerators
    IEEE Transactions on Parallel and Distributed Systems, 2013
    Co-Authors: Jakub Kurzak, Piotr Luszczek, Mathieu Faverge
    Abstract:

    LU factorization with Partial Pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of Partial Pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.

Mathieu Faverge - One of the best experts on this subject based on the ideXlab platform.

  • A survey of recent developments in parallel implementations of Gaussian elimination (Wiley Online Library, DOI: 10.1002/cpe.3306)
    2015
    Co-Authors: Simplice Donfack, Jakub Kurzak, Jack Dongarra, Piotr Luszczek, Mathieu Faverge, Mark Gates, Ichitaro Yamazaki
    Abstract:

    Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared-memory architectures. Five different flavors are investigated. Three of them are based on different strategies for Pivoting: Partial Pivoting, incremental Pivoting, and tournament Pivoting. The fourth one replaces Pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without Pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented.

  • achieving numerical accuracy and high performance using recursive tile lu factorization with Partial Pivoting
    Concurrency and Computation: Practice and Experience, 2014
    Co-Authors: Mathieu Faverge, Hatem Ltaief, Piotr Luszczek
    Abstract:

    The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with Partial Pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.

  • lu factorization with Partial Pivoting for a multicore system with accelerators
    IEEE Transactions on Parallel and Distributed Systems, 2013
    Co-Authors: Jakub Kurzak, Piotr Luszczek, Mathieu Faverge
    Abstract:

    LU factorization with Partial Pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of Partial Pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.

  • On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
    2013
    Co-Authors: Simplice Donfack, Jakub Kurzak, Jack Dongarra, Piotr Luszczek, Mathieu Faverge, Mark Gates, Ichitaro Yamazaki
    Abstract:

    Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for Pivoting: Partial Pivoting, incremental Pivoting, and tournament Pivoting. The fourth one replaces Pivoting with the Random Butterfly Transformation, and finally, an implementation without Pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy are analyzed.

  • LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
    2012
    Co-Authors: Jakub Kurzak, Piotr Luszczek, Mathieu Faverge
    Abstract:

    LU factorization with Partial Pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The optimizations include lookahead, dynamic task scheduling, fine-grain parallelism for memory-bound operations, autotuning, and a data layout geared towards complex memory hierarchies. Performance in excess of one teraflop/s is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.