LU Factorization

The experts below are selected from a list of 5,412 experts worldwide, ranked by the ideXlab platform.

Enrique Quintana-Ortí - One of the best experts on this subject based on the ideXlab platform.

  • Hierarchical approach for deriving a reproducible unblocked LU factorization
    International Journal of High Performance Computing Applications, 2019
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a high-performance and stable algorithm for the (blocked) LU factorization.
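
    The abstract layers reproducibility on top of the classical unblocked algorithm. As a point of reference, here is a minimal NumPy sketch (my own, not the authors' GPU code) of that unblocked, right-looking LU with partial pivoting, phrased as the Level-1/2 operations that the reproducible kernels above would replace with correctly rounded equivalents:

    ```python
    import numpy as np

    def lu_unblocked(A):
        """LU with partial pivoting on a copy of A; returns (LU, piv)."""
        A = A.copy()
        n = A.shape[0]
        piv = np.arange(n)
        for k in range(n - 1):
            # Pivot search: row of the largest |entry| in column k (IxAMAX-like).
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p], :] = A[[p, k], :]
                piv[[k, p]] = piv[[p, k]]
            A[k+1:, k] /= A[k, k]                              # Level-1: xSCAL
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # Level-2: xGER
        return A, piv

    A = np.random.rand(5, 5)
    LU, piv = lu_unblocked(A)
    L = np.tril(LU, -1) + np.eye(5)
    U = np.triu(LU)
    assert np.allclose(L @ U, A[piv])  # P A = L U
    ```

    In the paper's setting, each of these NumPy operations would be served by a reproducible, correctly rounded GPU kernel, so that the factors come out bitwise identical across runs.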

  • Towards Reproducible Blocked LU Factorization
    2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant; the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.
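
    The two layers can be made concrete in a short sketch (a schematic of the standard blocked algorithm, not the paper's code): the panel is factored by the unblocked Level-1/2 layer, and the trailing matrix is updated with Level-3 operations (TRSM and GEMM). Pivoting is omitted so the layering stays visible, and the test matrix is made diagonally dominant so that this is safe.

    ```python
    import numpy as np
    from scipy.linalg import solve_triangular

    def lu_blocked(A, nb=64):
        """Blocked right-looking LU without pivoting (requires safe pivots)."""
        A = A.copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            panel = A[k:, k:e]                  # tall panel, factored unblocked
            for j in range(e - k):
                panel[j+1:, j] /= panel[j, j]
                panel[j+1:, j+1:] -= np.outer(panel[j+1:, j], panel[j, j+1:])
            if e < n:
                # Level-3 layer: TRSM on the block row, GEMM on the trailing matrix.
                L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
                A[k:e, e:] = solve_triangular(L11, A[k:e, e:],
                                              lower=True, unit_diagonal=True)
                A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    n = 256
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant
    LU = lu_blocked(A)
    assert np.allclose((np.tril(LU, -1) + np.eye(n)) @ np.triu(LU), A)
    ```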

  • IPDPS Workshops - Towards Reproducible Blocked LU Factorization
    2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant of the factorization, while the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization
    2016
    Co-Authors: Roman Iakymchuk, Erwin Laure, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization on GPUs
    2016
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we provide Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via inexpensive iterative refinement. Following a bottom-up approach, we finally construct a reproducible implementation of the LU factorization for GPUs, which can easily accommodate partial pivoting for stability and eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.
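
    The refinement idea for the triangular solve can be sketched in a few lines (an illustration of the generic technique, not the papers' kernel; in the papers the residual is accumulated with correctly rounded arithmetic, whereas plain float64 stands in here):

    ```python
    import numpy as np
    from scipy.linalg import solve_triangular

    def trsv_refined(L, b, iters=2):
        """Triangular solve followed by a few residual-correction steps."""
        x = solve_triangular(L, b, lower=True)
        for _ in range(iters):
            r = b - L @ x                      # residual; extended precision in the papers
            x += solve_triangular(L, r, lower=True)
        return x

    n = 200
    L = np.tril(np.random.rand(n, n)) + n * np.eye(n)
    b = np.random.rand(n)
    assert np.allclose(L @ trsv_refined(L, b), b)
    ```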

Jack Dongarra - One of the best experts on this subject based on the ideXlab platform.

  • Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
    Concurrency and Computation: Practice and Experience, 2013
    Co-Authors: Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek
    Abstract:

    The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering, and is characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and the widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster than the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
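
    The recursive panel formulation at the heart of this approach splits a tall panel in half, factors the left half, applies it to the right half, and recurses. A minimal sequential sketch (my own; the paper's version runs this fine-grained and in parallel under QUARK, with pivoting across the whole panel):

    ```python
    import numpy as np
    from scipy.linalg import solve_triangular

    def lu_panel_recursive(P):
        """In-place recursive LU of a tall panel P (no pivoting in this sketch)."""
        m, n = P.shape
        if n == 1:
            P[1:, 0] /= P[0, 0]                # base case: scale one column
            return
        h = n // 2
        lu_panel_recursive(P[:, :h])           # factor the left half
        L11 = np.tril(P[:h, :h], -1) + np.eye(h)
        P[:h, h:] = solve_triangular(L11, P[:h, h:], lower=True, unit_diagonal=True)
        P[h:, h:] -= P[h:, :h] @ P[:h, h:]     # update, then factor the right half
        lu_panel_recursive(P[h:, h:])

    m, n = 300, 32
    P0 = np.random.rand(m, n)
    P0[:n] += n * np.eye(n)                    # safe pivots without pivoting
    P = P0.copy()
    lu_panel_recursive(P)
    Lp = np.tril(P, -1)
    Lp[:n] += np.eye(n)
    assert np.allclose(Lp @ np.triu(P[:n]), P0)
    ```

    The recursion replaces most of the panel's Level-1/2 work with small TRSM/GEMM calls, which is where the fine-grained parallelism comes from.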

  • Hybrid LU Factorization on multi-GPU multi-core heterogeneous platforms
    2012
    Co-Authors: Piotr Luszczek, Jack Dongarra
    Abstract:

    LU factorization is an important step in solving systems of linear equations; it is the most computationally intensive step compared with the subsequent backward substitution. Hence, solving a linear system fast requires performing the LU factorization fast. Accelerator-based approaches to linear algebra have been steadily gaining attention in recent years. GPUs appear to be the most prominent in many respects and are nowadays widely used accelerators. They are among the fastest hardware for math operations on large data sets that feature high data parallelism. In this work, we designed and implemented a hybrid LU factorization on a multi-core and multi-GPU heterogeneous platform.

  • Multi-GPU implementation of LU factorization
    International Conference on Computational Science, 2012
    Co-Authors: Yulu Jia, Piotr Luszczek, Jack Dongarra
    Abstract:

    LU factorization is the most computationally intensive step in solving systems of linear equations. By first obtaining the LU factorization of the coefficient matrix, we may then readily solve the system using backward substitution. The computational cost of LU factorization in terms of floating-point operations is cubic. There have been various efforts to improve the performance of LU factorization. We propose a multi-core multi-GPU hybrid LU factorization algorithm that leverages the strengths of both multiple CPUs and multiple GPUs. Our algorithm uses some of the CPU cores for panel factorization, and the rest of the CPU cores together with all the available GPUs for trailing submatrix updates. Our algorithm employs both dynamic scheduling and static scheduling. Experiments show that our approach reaches 1134 Gflop/s with 4 Fermi GPU boards combined with a total of 48 AMD CPU cores. This is the first time such a level of performance has been reported in a shared-memory environment. Execution traces show that our code also achieves good load balance and high system utilization.
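
    A hedged sketch of the kind of work split the abstract describes (the function and its round-robin policy are my illustration, not the authors' scheduler): CPU cores own the panel, while trailing-matrix tile columns are assigned statically to GPUs, e.g. block-cyclically.

    ```python
    def assign_tile_columns(n_tile_cols, n_gpus):
        """Round-robin (1-D block-cyclic) mapping of tile columns to GPUs."""
        return {j: j % n_gpus for j in range(n_tile_cols)}

    # At step k of the factorization: a pool of CPU cores factors panel k while
    # GPU g updates the tile columns {j > k : owner[j] == g} it owns.
    owner = assign_tile_columns(8, 4)
    print(owner)   # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
    ```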

  • ICCS - Multi-GPU Implementation of LU Factorization
    Procedia Computer Science, 2012
    Co-Authors: Yulu Jia, Piotr Luszczek, Jack Dongarra
    Abstract:

    LU factorization is the most computationally intensive step in solving systems of linear equations. By first obtaining the LU factorization of the coefficient matrix, we may then readily solve the system using backward substitution. The computational cost of LU factorization in terms of floating-point operations is cubic. There have been various efforts to improve the performance of LU factorization. We propose a multi-core multi-GPU hybrid LU factorization algorithm that leverages the strengths of both multiple CPUs and multiple GPUs. Our algorithm uses some of the CPU cores for panel factorization, and the rest of the CPU cores together with all the available GPUs for trailing submatrix updates. Our algorithm employs both dynamic scheduling and static scheduling. Experiments show that our approach reaches 1134 Gflop/s with 4 Fermi GPU boards combined with a total of 48 AMD CPU cores. This is the first time such a level of performance has been reported in a shared-memory environment. Execution traces show that our code also achieves good load balance and high system utilization.

  • Exploiting fine-grain parallelism in recursive LU factorization
    Parallel Computing, 2011
    Co-Authors: Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek
    Abstract:

    The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is characteristic of many dense linear algebra computations. It has even become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. In this context, the challenge in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance and maintaining the accuracy of the numerical algorithm. This paper proposes a novel approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. We present a new approach to the LU factorization of (narrow and tall) panel submatrices that uses a parallel fine-grained recursive formulation of the factorization and is based on conflict-free partitioning of the data and lockless synchronization mechanisms. As a result, our implementation lets the overall computation flow naturally without contention. Our recursive panel factorization provides the necessary performance increase for the inherently problematic portion of the LU factorization of square matrices. The reason is that even though the panel …

Stef Graillat - One of the best experts on this subject based on the ideXlab platform.

  • Hierarchical approach for deriving a reproducible unblocked LU factorization
    International Journal of High Performance Computing Applications, 2019
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a high-performance and stable algorithm for the (blocked) LU factorization.

  • Towards Reproducible Blocked LU Factorization
    2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant; the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.

  • IPDPS Workshops - Towards Reproducible Blocked LU Factorization
    2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant of the factorization, while the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization
    2016
    Co-Authors: Roman Iakymchuk, Erwin Laure, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization on GPUs
    2016
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we provide Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via inexpensive iterative refinement. Following a bottom-up approach, we finally construct a reproducible implementation of the LU factorization for GPUs, which can easily accommodate partial pivoting for stability and eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.

Huazhong Yang - One of the best experts on this subject based on the ideXlab platform.

  • GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling
    IEEE Transactions on Parallel and Distributed Systems, 2015
    Co-Authors: Xiaoming Chen, Ling Ren, Yu Wang, Huazhong Yang
    Abstract:

    The sparse matrix solver based on LU factorization is a serious bottleneck in Simulation Program with Integrated Circuit Emphasis (SPICE)-based circuit simulators. State-of-the-art graphics processing units (GPUs) have numerous cores sharing the same memory, provide attractive memory bandwidth and compute capability, and support massive thread-level parallelism, so GPUs can potentially accelerate the sparse solver in circuit simulators. In this paper, an efficient GPU-based sparse solver for circuit problems is proposed. We develop a hybrid parallel LU factorization approach combining task-level and data-level parallelism on GPUs. Work partitioning, the number of active thread groups, and memory access patterns are optimized based on the GPU architecture. Experiments show that the proposed LU factorization approach on an NVIDIA GTX580 attains an average speedup of 7.02× (geometric mean) compared with sequential PARDISO, and 1.55× compared with 16-threaded PARDISO. We also investigate bottlenecks of the proposed approach with a parametric performance model. The performance of the sparse LU factorization on GPUs is constrained by the global memory bandwidth, so performance can be further improved by future GPUs with larger memory bandwidth.
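
    The performance-model conclusion has the familiar roofline shape: a bandwidth-bound kernel cannot exceed bandwidth times arithmetic intensity. A back-of-the-envelope sketch (my own arithmetic with assumed, roughly GTX580-class numbers, not taken from the paper):

    ```python
    def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
        """Roofline bound: compute-limited or memory-limited, whichever is lower."""
        return min(peak_gflops, bandwidth_gbs * flops_per_byte)

    # Sparse LU has low arithmetic intensity, so the memory term dominates:
    print(attainable_gflops(peak_gflops=1581.0,    # ~GTX580 single-precision peak (assumed)
                            bandwidth_gbs=192.0,   # ~GTX580 global memory bandwidth (assumed)
                            flops_per_byte=0.25))  # assumed low intensity -> 48.0
    ```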

  • Sparse LU factorization for parallel circuit simulation on GPU
    Design Automation Conference, 2012
    Co-Authors: Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang
    Abstract:

    The sparse solver has become the bottleneck of SPICE simulators. There has been little work on GPU-based sparse solvers because of the high data dependency. The strong data dependency means that parallel sparse LU factorization runs efficiently only on shared-memory computing devices, but the number of CPU cores sharing the same memory is often limited. State-of-the-art graphics processing units (GPUs) naturally have numerous cores sharing the device memory, and provide a possible solution to the problem. In this paper, we propose a GPU-based sparse LU solver for circuit simulation. We optimize the work partitioning, the number of active thread groups, and the memory access pattern based on the GPU architecture. On matrices whose factorization involves many floating-point operations, our GPU-based sparse LU factorization achieves a 7.90× speedup over a single-core CPU and a 1.49× speedup over an eight-core CPU. We also analyze the scalability of parallel sparse LU factorization and investigate the CPU and GPU specifications that most influence performance.
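
    The task-level parallelism in sparse LU is commonly exposed by levelizing the column dependency graph: columns whose dependencies all sit in earlier levels can be factored concurrently. A generic sketch (standard technique, not the paper's code; `deps` maps each column to the columns it depends on):

    ```python
    def levelize(deps):
        """Assign each column the earliest level compatible with its dependencies."""
        level = {}
        for j in sorted(deps):   # in sparse LU, column j depends only on columns < j
            level[j] = 1 + max((level[d] for d in deps[j]), default=0)
        return level

    # Toy DAG: columns 0-2 are independent; 3 needs 0 and 1; 4 needs 2 and 3.
    print(levelize({0: set(), 1: set(), 2: set(), 3: {0, 1}, 4: {2, 3}}))
    # -> {0: 1, 1: 1, 2: 1, 3: 2, 4: 3}; levels are processed one after another
    ```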

  • An adaptive LU factorization algorithm for parallel circuit simulation
    Asia and South Pacific Design Automation Conference, 2012
    Co-Authors: Xiaoming Chen, Yu Wang, Huazhong Yang
    Abstract:

    The sparse matrix solver has become the bottleneck in SPICE simulators. It is difficult to parallelize the solver because of the high data dependency during the numerical LU factorization. This paper proposes a parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs to accelerate circuit simulation. Since not every matrix is suitable for the parallel algorithm, a predictive method is proposed to decide whether a matrix should use the parallel or the sequential algorithm. Experimental results on 35 circuit matrices reveal that, on the matrices suitable for the parallel algorithm, the developed algorithm achieves speedups of 2.11×–8.38× (geometric average) over KLU with 1–8 threads. Our solver can be downloaded from http://nicslu.weebly.com.
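
    The predictive choice between the parallel and the sequential algorithm can be pictured as a simple threshold test (a hypothetical heuristic for illustration only; NICSLU's actual predictor and its inputs are not reproduced here):

    ```python
    def choose_algorithm(est_factor_flops, nnz_in_factors, threshold=50.0):
        """Hypothetical heuristic: pick the parallel path only if there is enough
        work per nonzero to amortize synchronization overhead."""
        return "parallel" if est_factor_flops / nnz_in_factors > threshold else "sequential"

    print(choose_algorithm(est_factor_flops=5e8, nnz_in_factors=2e6))  # -> parallel
    ```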

  • DAC - Sparse LU Factorization for parallel circuit simulation on GPU
    Proceedings of the 49th Annual Design Automation Conference on - DAC '12, 2012
    Co-Authors: Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang
    Abstract:

    The sparse solver has become the bottleneck of SPICE simulators. There has been little work on GPU-based sparse solvers because of the high data dependency. The strong data dependency means that parallel sparse LU factorization runs efficiently only on shared-memory computing devices, but the number of CPU cores sharing the same memory is often limited. State-of-the-art graphics processing units (GPUs) naturally have numerous cores sharing the device memory, and provide a possible solution to the problem. In this paper, we propose a GPU-based sparse LU solver for circuit simulation. We optimize the work partitioning, the number of active thread groups, and the memory access pattern based on the GPU architecture. On matrices whose factorization involves many floating-point operations, our GPU-based sparse LU factorization achieves a 7.90× speedup over a single-core CPU and a 1.49× speedup over an eight-core CPU. We also analyze the scalability of parallel sparse LU factorization and investigate the CPU and GPU specifications that most influence performance.

  • ASP-DAC - An adaptive LU factorization algorithm for parallel circuit simulation
    17th Asia and South Pacific Design Automation Conference, 2012
    Co-Authors: Xiaoming Chen, Yu Wang, Huazhong Yang
    Abstract:

    The sparse matrix solver has become the bottleneck in SPICE simulators. It is difficult to parallelize the solver because of the high data dependency during the numerical LU factorization. This paper proposes a parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs to accelerate circuit simulation. Since not every matrix is suitable for the parallel algorithm, a predictive method is proposed to decide whether a matrix should use the parallel or the sequential algorithm. Experimental results on 35 circuit matrices reveal that, on the matrices suitable for the parallel algorithm, the developed algorithm achieves speedups of 2.11×–8.38× (geometric average) over KLU with 1–8 threads. Our solver can be downloaded from http://nicslu.weebly.com.

Roman Iakymchuk - One of the best experts on this subject based on the ideXlab platform.

  • Hierarchical approach for deriving a reproducible unblocked LU factorization
    International Journal of High Performance Computing Applications, 2019
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a high-performance and stable algorithm for the (blocked) LU factorization.

  • Towards Reproducible Blocked LU Factorization
    2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant; the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.

  • IPDPS Workshops - Towards Reproducible Blocked LU Factorization
    2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017
    Co-Authors: Roman Iakymchuk, Enrique Quintana-Ortí, Erwin Laure, Stef Graillat
    Abstract:

    In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs, which arises from cancellations and rounding errors in floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried out within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant of the factorization, while the latter is correspondingly built upon the Level-1/2 BLAS kernels. In addition, we strengthen the numerical stability of the blocked LU factorization via partial row pivoting. We therefore propose a double-layer bottom-up approach for ensuring reproducibility of the blocked LU factorization and provide experimental results for its underlying blocks.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization
    2016
    Co-Authors: Roman Iakymchuk, Erwin Laure, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization for GPUs, which accommodates partial pivoting for stability and can eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.

  • Hierarchical Approach for Deriving a Reproducible LU Factorization on GPUs
    2016
    Co-Authors: Roman Iakymchuk, Stef Graillat, David Defour, Enrique Quintana-Ortí
    Abstract:

    We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we provide Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we outline a strategy to enhance the accuracy of the triangular solve via inexpensive iterative refinement. Following a bottom-up approach, we finally construct a reproducible implementation of the LU factorization for GPUs, which can easily accommodate partial pivoting for stability and eventually be integrated into a (blocked) high-performance and stable algorithm for the LU factorization.