QR Factorization

The Experts below are selected from a list of 4512 Experts worldwide ranked by ideXlab platform

Zizhong Chen - One of the best experts on this subject based on the ideXlab platform.

  • suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework
    2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016
    Co-Authors: Weijian Zheng, Fengguang Song, Lan Lin, Zizhong Chen
    Abstract:

    The scope of this paper is to design and implement a scalable QR Factorization solver that can deliver the fastest performance for tall and skinny matrices and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR Factorization (suCAQR), introduces a simplified and tuning-less way to realize the communication-avoiding QR Factorization algorithm for matrices of any shape. The software design includes a mixed usage of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, and a dynamic dataflow implementation. Compared with existing communication-avoiding QR Factorization implementations, suCAQR has the benefits of being simpler, more general, and more efficient. By balancing the degree of parallelism and the proportion of faster computational kernels, it is able to achieve scalable performance on clusters of multicore nodes. The software essentially combines the strengths of both the synchronization-reducing approach and the communication-avoiding approach to achieve high performance. In experiments using 1,024 CPU cores, suCAQR is faster than DPLASMA by up to 30%, and faster than ScaLAPACK by up to 30 times.

George A Constantinides - One of the best experts on this subject based on the ideXlab platform.

  • Enhancing performance of Tall-Skinny QR Factorization using FPGAs
    22nd International Conference on Field Programmable Logic and Applications (FPL), 2012
    Co-Authors: Abid Rafique, Nachiket Kapre, George A Constantinides
    Abstract:

    Communication-avoiding linear algebra algorithms with low communication latency and high memory bandwidth requirements, like Tall-Skinny QR Factorization (TSQR), are highly appropriate for acceleration using FPGAs. TSQR parallelizes QR Factorization of tall-skinny matrices in a divide-and-conquer fashion by decomposing them into sub-matrices, performing local QR Factorizations and then merging the intermediate results. As TSQR is a dense linear algebra problem, one would expect GPUs to show better performance. However, GPU performance is limited by memory bandwidth in the local QR Factorizations and by global communication latency in the merge stage. We exploit the shape of the matrix and propose an FPGA-based custom architecture which avoids these bottlenecks by using high-bandwidth on-chip memories for local QR Factorizations and by performing the merge stage entirely on-chip to reduce communication latency. We achieve a peak double-precision floating-point performance of 129 GFLOPs on Virtex-6 SX475T. A quantitative comparison of our proposed design with recent QR Factorization on FPGAs and GPU shows up to 7.7× and 12.7× speedup, respectively. Additionally, we show even higher performance over optimized linear algebra libraries like Intel MKL for multi-cores, CULA for GPUs and MAGMA for hybrid systems.
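
The divide-and-conquer structure described in this abstract (split into row blocks, factor each block locally, then merge the intermediate R factors pairwise) can be sketched in plain Python. This is only an illustrative CPU sketch, not the FPGA architecture from the paper: the function names `householder_r` and `tsqr_r`, the `block_rows` parameter, and the choice of Householder reflections for the local factorizations are all assumptions made for the example, and only the R factor is computed.

```python
import math


def householder_r(a):
    """Return the n x n upper-triangular R factor of an m x n matrix (m >= n)."""
    a = [row[:] for row in a]  # work on a copy
    m, n = len(a), len(a[0])
    for k in range(n):
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # column already zero at and below the diagonal
        alpha = -math.copysign(norm, a[k][k])
        # Householder vector v = x - alpha * e1
        v = [a[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(x * x for x in v)
        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns
        for j in range(k, n):
            dot = sum(v[i] * a[k + i][j] for i in range(len(v)))
            scale = 2.0 * dot / vtv
            for i in range(len(v)):
                a[k + i][j] -= scale * v[i]
    return [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(a, block_rows):
    """R factor of a tall-skinny matrix via local QRs and a pairwise merge tree."""
    n = len(a[0])
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    if any(len(b) < n for b in blocks):
        raise ValueError("every row block needs at least as many rows as columns")
    # Local stage: independent QR of each row block
    rs = [householder_r(b) for b in blocks]
    # Merge stage: stack two R factors and factor the 2n x n result
    while len(rs) > 1:
        merged = [householder_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:
            merged.append(rs[-1])  # odd one out moves up unchanged
        rs = merged
    return rs[0]
```

For a full-rank matrix, the R factor from `tsqr_r` agrees with the one from `householder_r` on the whole matrix up to the sign of each row, since the R factor of a QR Factorization is unique only up to such signs.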

Weijian Zheng - One of the best experts on this subject based on the ideXlab platform.

  • suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework
    2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016
    Co-Authors: Weijian Zheng, Fengguang Song, Lan Lin, Zizhong Chen
    Abstract:

    The scope of this paper is to design and implement a scalable QR Factorization solver that can deliver the fastest performance for tall and skinny matrices and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR Factorization (suCAQR), introduces a simplified and tuning-less way to realize the communication-avoiding QR Factorization algorithm for matrices of any shape. The software design includes a mixed usage of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, and a dynamic dataflow implementation. Compared with existing communication-avoiding QR Factorization implementations, suCAQR has the benefits of being simpler, more general, and more efficient. By balancing the degree of parallelism and the proportion of faster computational kernels, it is able to achieve scalable performance on clusters of multicore nodes. The software essentially combines the strengths of both the synchronization-reducing approach and the communication-avoiding approach to achieve high performance. In experiments using 1,024 CPU cores, suCAQR is faster than DPLASMA by up to 30%, and faster than ScaLAPACK by up to 30 times.

Jack Dongarra - One of the best experts on this subject based on the ideXlab platform.

  • Soft error resilient QR Factorization for hybrid system with GPGPU
    Journal of Computational Science, 2013
    Co-Authors: Piotr Luszczek, Jack Dongarra, Stanimire Tomov
    Abstract:

    General purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR Factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR Factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
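
Contribution (2) relies on a standard building block: a sequence of Givens rotations that zeroes the single subdiagonal of an upper Hessenberg matrix, leaving an upper triangular one. The following is a minimal sequential sketch of that reduction in plain Python, not the optimized GPGPU utilities from the paper; the function names `givens` and `hessenberg_to_triangular` are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def hessenberg_to_triangular(h):
    """Zero the subdiagonal of an upper Hessenberg matrix with Givens rotations."""
    h = [row[:] for row in h]  # work on a copy
    nrows, ncols = len(h), len(h[0])
    for k in range(min(nrows - 1, ncols)):
        c, s = givens(h[k][k], h[k + 1][k])
        # Rotate rows k and k+1; columns left of k are already zero in both rows
        for j in range(k, ncols):
            top, bot = h[k][j], h[k + 1][j]
            h[k][j] = c * top + s * bot
            h[k + 1][j] = -s * top + c * bot
        h[k + 1][k] = 0.0  # clear rounding residue in the eliminated entry
    return h
```

Each rotation touches only two adjacent rows, so an n x n Hessenberg matrix needs n - 1 rotations, which is much cheaper than a full QR Factorization of the same matrix.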

  • Hierarchical QR Factorization algorithms for multi-core clusters
    2013
    Co-Authors: Jack Dongarra, Julien Langou, Thomas Herault, Mathieu Faverge, Mathias Jacquelin, Yves Robert
    Abstract:

    This paper describes a new QR Factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR Factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low-level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (ii) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR Factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
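
To make the notion of a reduction tree concrete, the sketch below lists, round by round, which tile of a panel eliminates which other tile for three tree shapes: a flat tree, a binary tree, and a simple two-level tree (intra-node flat reductions followed by an inter-node binary tree). It is a deliberately simplified illustration and not the four-level scheme of the paper; the function names, the `(pivot, eliminated)` pair representation, and the contiguous assignment of `tiles_per_node` tiles to each node are assumptions made for the example.

```python
def flat_tree(p):
    """Tile 0 eliminates tiles 1..p-1 one after another (p - 1 sequential rounds)."""
    return [[(0, i)] for i in range(1, p)]


def binary_tree(p):
    """Pairwise eliminations; pairs within one round can run in parallel."""
    rounds = []
    stride = 1
    while stride < p:
        rounds.append([(i, i + stride) for i in range(0, p - stride, 2 * stride)])
        stride *= 2
    return rounds


def hierarchical_tree(p, tiles_per_node):
    """Two-level tree: flat reductions inside each node, then a binary tree
    over the first tile ("leader") of each node."""
    leaders = list(range(0, p, tiles_per_node))
    rounds = []
    # Intra-node level: every node reduces its own tiles, all nodes in parallel
    for offset in range(1, tiles_per_node):
        pairs = [(lead, lead + offset) for lead in leaders if lead + offset < p]
        if pairs:
            rounds.append(pairs)
    # Inter-node level: only the leaders take part in cross-node eliminations
    for pairs in binary_tree(len(leaders)):
        rounds.append([(leaders[a], leaders[b]) for a, b in pairs])
    return rounds
```

For example, `hierarchical_tree(8, 4)` confines all but the last elimination to a single node, so only the final pair `(0, 4)` would require a message between nodes under the assumed tile placement.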

  • Hierarchical QR Factorization algorithms for multi-core cluster systems
    International Parallel and Distributed Processing Symposium, 2012
    Co-Authors: Jack Dongarra, Julien Langou, Thomas Herault, Mathieu Faverge, Yves Robert
    Abstract:

    This paper describes a new QR Factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR Factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of multicores, in order to minimize the number of inter-processor communications (aka "communication-avoiding" algorithm), it is natural to consider two-level hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low level" for decoupled highly parallel inter-node reductions, (2) "coupling level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-cluster and intra-cluster. Numerical experiments on a cluster of multicore nodes (1) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (2) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR Factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.

  • Hierarchical QR Factorization algorithms for multi-core cluster systems
    2012
    Co-Authors: Jack Dongarra, Julien Langou, Thomas Herault, Mathieu Faverge, Yves Robert
    Abstract:

    This paper describes a new QR Factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR Factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of multicores, in order to minimize the number of inter-processor communications (aka "communication-avoiding" algorithm), it is natural to consider two-level hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low level" for decoupled highly parallel inter-node reductions, (2) "coupling level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-cluster and intra-cluster. Numerical experiments on a cluster of multicore nodes (1) confirm that each of the four levels of our hierarchical tree contributes to build up performance and (2) build insights on how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR Factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.

  • Soft error resilient QR Factorization for hybrid system with GPGPU
    Proceedings of the second workshop on Scalable algorithms for large-scale systems, 2011
    Co-Authors: Piotr Luszczek, Stanimire Tomov, Jack Dongarra
    Abstract:

    General purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR Factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR Factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.

Lan Lin - One of the best experts on this subject based on the ideXlab platform.

  • suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework
    2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016
    Co-Authors: Weijian Zheng, Fengguang Song, Lan Lin, Zizhong Chen
    Abstract:

    The scope of this paper is to design and implement a scalable QR Factorization solver that can deliver the fastest performance for tall and skinny matrices and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR Factorization (suCAQR), introduces a simplified and tuning-less way to realize the communication-avoiding QR Factorization algorithm for matrices of any shape. The software design includes a mixed usage of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, and a dynamic dataflow implementation. Compared with existing communication-avoiding QR Factorization implementations, suCAQR has the benefits of being simpler, more general, and more efficient. By balancing the degree of parallelism and the proportion of faster computational kernels, it is able to achieve scalable performance on clusters of multicore nodes. The software essentially combines the strengths of both the synchronization-reducing approach and the communication-avoiding approach to achieve high performance. In experiments using 1,024 CPU cores, suCAQR is faster than DPLASMA by up to 30%, and faster than ScaLAPACK by up to 30 times.