Global Communication

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 219336 Experts worldwide ranked by ideXlab platform

Wim Vanroose - One of the best experts on this subject based on the ideXlab platform.

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    IEEE Transactions on Parallel and Distributed Systems, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose
    Abstract:

    Pipelined Krylov subspace methods (also referred to as Communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method, p($\ell$)-CG, outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping Global Communication with essential computations like the matrix-vector product, thus “hiding” Global Communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that construct the Krylov subspace basis. The multi-term recurrence relation for the basis vector is replaced by $\ell$ three-term recurrences, improving stability without increasing the overall computational cost of the algorithm. The proposed modification ensures that the pipelined Conjugate Gradient method is able to attain a highly accurate solution independently of the pipeline length. Numerical experiments demonstrate a combination of excellent parallel performance and improved maximal attainable accuracy for the new pipelined Conjugate Gradient algorithm. This work thus resolves one of the major practical restrictions for the useability of pipelined Krylov subspace methods.

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    arXiv: Numerical Analysis, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose
    Abstract:

    Pipelined Krylov subspace methods (also referred to as Communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices the pipelined Conjugate Gradient method outperforms its classic Conjugate Gradient counterpart on large scale distributed memory hardware by overlapping Global Communication with essential computations like the matrix-vector product, thus hiding Global Communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite precision recurrence relations that construct the Krylov subspace basis. The multi-term recurrence relation for the basis vector is replaced by two-term recurrences, improving stability without increasing the overall computational cost of the algorithm. The proposed modification ensures that the pipelined Conjugate Gradient method is able to attain a highly accurate solution independently of the pipeline length. Numerical experiments demonstrate a combination of excellent parallel performance and improved maximal attainable accuracy for the new pipelined Conjugate Gradient algorithm. This work thus resolves one of the major practical restrictions for the useability of pipelined Krylov subspace methods.

  • Hiding Global Communication latency and increasing the arithmetic intensity in extreme-scale Krylov solvers
    2013
    Co-Authors: Wim Vanroose, Pieter Ghysels, Karl Meerbergen, Dirk Roose
    Abstract:

    For many HPC codes the major cost lies in the solution of large sparse linear systems [1]. For many problems, the methods of choice for such systems are preconditioned Krylov solvers. However, Krylov solvers are hard to scale to a large number of cores due to two main bottlenecks: the inter-node latency and the on-node bandwidth. In this position paper, we review recently proposed techniques to overcome each of these bottlenecks and we put forward possible ways to achieve preconditioned Krylov solvers that efficiently use all the resources on many-core chips and are extremely scalable on massively parallel machines. A key ingredient in our approach is the use of stencil compilers. Future supercomputers will have a large number of nodes, each being a many-core processor. In addition, the cores will feature vector processing units (VPU) with very long vectors. New algorithms and software should exploit these three levels of parallelism. On massively parallel machines, Global Communication should be avoided as much as possible. Global Communication is very expensive due to the large latency on the wire, unless it can be overlapped with calculations. In Krylov solvers, there are usually at least two such Global Communication phases per iteration, used for orthogonalization and normalization of the Krylov base vectors. In the standard formulation of most Krylov methods, there is no possibility to overlap this Communication with local work, which leads to a bulk synchronous execution pattern that leaves many resources idle. Recently, pipelined Krylov methods [6, 7] reorganized the algorithms with only one Global reduction per iteration. The reduction’s latency can be overlapped with other work such as the (preconditioned) sparse matrix-vector product ((P)SpMV). While the reduction takes place in the background, new Krylov base vectors can be computed using the (P)SpMV. Only when enough (P)SpMVs have been computed to completely hide the Global Communication latency is an orthogonalization and normalization step performed. This deferred orthogonalization obviously changes the numerical properties of the Krylov algorithm. However, since only very few (P)SpMVs are required to completely hide the Global latency, numerical stability is only mildly affected. This can be remediated by introducing shifts in the (P)SpMV that prevent the base vectors from aligning to the dominant eigenvector and result in an improved Krylov basis. Pipelined methods lift the main bottleneck for scaling Krylov solvers to extreme numbers of cores and the resulting solver scales as well as the (P)SpMV. For many applications, good scalability for the sparse matrix-vector product (SpMV) can be achieved even for 100k cores, if the problem is partitioned such that there is only local Communication. For the preconditioner, there is typically a trade-off between parallelism and efficiency. We expect pipelined methods to be better suited for cheap preconditioners with a high degree of parallelism. Although the (P)SpMV may scale well on a distributed memory system, the on-node performance may still be poor. Within a node, the different threads have to share the available...

  • hiding Global Communication latency in the gmres algorithm on massively parallel machines
    SIAM Journal on Scientific Computing, 2013
    Co-Authors: Pieter Ghysels, Karl Meerbergen, Thomas J Ashby, Wim Vanroose
    Abstract:

    In the generalized minimal residual method (GMRES), the Global all-to-all Communication required in each iteration for orthogonalization and normalization of the Krylov base vectors is becoming a performance bottleneck on massively parallel machines. Long latencies, system noise, and load imbalance cause these Global reductions to become very costly Global synchronizations. In this work, we propose the use of nonblocking or asynchronous Global reductions to hide these Global Communication latencies by overlapping them with other Communications and calculations. A pipelined variation of GMRES is presented in which the result of a Global reduction is used only one or more iterations after the Communication phase has started. This way, Global synchronization is relaxed and scalability is much improved at the expense of some extra computations. The numerical instabilities that inevitably arise due to the typical monomial basis by powering the matrix are reduced and often annihilated by using Newton or Chebysh...
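
The abstracts above refer to the Global reductions (inner products) that a classic Krylov iteration performs in every step. As a rough illustration only, the following pure-Python sketch of textbook Conjugate Gradient counts those reductions. It is not the pipelined method from the papers listed here, and nothing in it is taken from the authors' code. The names `dot`, `matvec`, `conjugate_gradient` and the `stats` counter are hypothetical helpers. The assumption is that, on a distributed memory machine, each call to `dot` would correspond to one global all-reduce.

```python
def dot(u, v, stats):
    """Inner product; on distributed memory this would be one global reduction."""
    stats["reductions"] += 1
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product (stands in for a sparse SpMV)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive definite A, starting from x = 0."""
    stats = {"reductions": 0, "iterations": 0}
    x = [0.0] * len(b)
    r = list(b)            # residual b - A*x with x = 0
    p = list(r)            # first search direction
    rr = dot(r, r, stats)  # initial reduction
    while stats["iterations"] < max_iter and rr ** 0.5 > tol:
        Ap = matvec(A, p)
        # Reduction 1: the step length needs (p, A p) before anything else can proceed.
        alpha = rr / dot(p, Ap, stats)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        # Reduction 2: the new residual norm is needed to form the next direction.
        rr_new = dot(r, r, stats)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        stats["iterations"] += 1
    return x, stats


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x, stats = conjugate_gradient(A, b)
    print(x, stats)
```

In this sketch each iteration blocks twice on a reduction result before it can continue, which is the synchronization pattern that the pipelined variants described above aim to overlap with the matrix-vector product.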

Steven M Nowick - One of the best experts on this subject based on the ideXlab platform.

  • error correcting unordered codes and hardware support for robust asynchronous Global Communication
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012
    Co-Authors: M Y Agyekum, Steven M Nowick
    Abstract:

    This paper introduces a new family of error-correction unordered (ECU) codes for Global Communication, called Zero-Sum. They combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes (providing 1-bit error correction or 2-bit detection). Two key features of the codes are that they are systematic, allowing direct extraction of data, and weighted, where the check field is computed as the sum of data index weights. A wide variety of weight assignments is shown to be feasible. Two practical enhancements are also proposed. The Zero-Sum+ code extends error detection to 3-bit errors, or alternatively handles 2-bit detection and 1-bit correction. The Zero-Sum* code supports heuristic 2-bit correction, while still guaranteeing 2-bit detection, under different strategies of weight assignment. Detailed hardware implementations of the supporting components (encoder, completion detection, error corrector) are given, as well as an outline of the system microarchitecture. In comparison to the best alternative systematic ECU code, the basic Zero-Sum code provided better or comparable coding efficiency, with a 5.74%-18.18% reduction in average number of wire transitions for most field sizes. Several Zero-Sum* codes were also evaluated for their 2-bit error correction coverage; initial results are promising, where the best strategy corrected 52.92%-71.16% of all 2-bit errors for most field sizes, with only a moderate decrease in coding efficiency and increase in wire transitions. Technology-mapped pre-layout implementations of the supporting Zero-Sum code hardware were synthesized with the UC Berkeley ABC tool using a 90 nm industrial standard cell library. Results indicate that they have moderate area and delay overheads. In comparison, supporting hardware for the best nonsystematic ECU codes have 3.82-10.44× greater area for larger field sizes.

  • an error correcting unordered code and hardware support for robust asynchronous Global Communication
    Design Automation and Test in Europe, 2010
    Co-Authors: M Y Agyekum, Steven M Nowick
    Abstract:

    A new delay-insensitive data encoding scheme for Global asynchronous Communication is introduced. The goal of this work is to combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes. The proposed error-correcting unordered (ECU) code, called Zero-Sum, can safely accommodate arbitrary skew in arrival times of individual bits in a packet, while simultaneously providing 1-bit correction and 2-bit detection. A systematic code is targeted, where data can be directly extracted from the codewords. A basic method for generating the code is presented, as well as detailed designs for the supporting hardware blocks. An outline of the system micro-architecture and its operating protocol is also given. When compared to the best previous systematic ECU code, the new code provides a 5.74 to 18.18% reduction in transition power for most field sizes, with better or comparable coding efficiency. Pre-layout technology-mapped implementations of the supporting hardware (encoder, completion detector, error-corrector) were synthesized with the UC Berkeley ABC tool using a 90nm industrial standard cell library. Results indicate that they have moderate area and delay overheads, while the best non-systematic ECU codes have 3.82 to 10.44x greater area for larger field sizes.

  • a level encoded transition signaling protocol for high throughput asynchronous Global Communication
    IEEE International Symposium on Asynchronous Circuits and Systems, 2008
    Co-Authors: P B Mcgee, M Y Agyekum, M A Mohamed, Steven M Nowick
    Abstract:

    A new delay-insensitive data encoding scheme for Global Communication, level-encoded transition signaling (LETS), is introduced. LETS is a generalization of level-encoded dual rail (LEDR), an earlier non-return-to-zero encoding scheme where one of two wires changes value per data bit per transaction. In LETS, only one of $N = 2^n$ wires (1-of-$N$) changes value per $n$ data bits per transaction. Compared to most common return-to-zero encoding schemes, LETS has potential power and throughput advantages, since fewer rails switch and no return-to-zero phase is required. Compared to existing non-return-to-zero schemes (i.e., LEDR), higher-dimension LETS codes have a potential power advantage, with significantly reduced switching activity per data bit. Two alternative 1-of-4 LETS codes are proposed, and efficient hardware for completion detection and conversion to return-to-zero protocols is introduced. Finally, a general theoretical framework is presented which characterizes the properties of arbitrary 1-of-N LETS codes, as well as a simple procedure to generate such codes.

  • efficient asynchronous protocol converters for two phase delay insensitive Global Communication
    IEEE International Symposium on Asynchronous Circuits and Systems, 2007
    Co-Authors: A Mitra, W F Mclaughlin, Steven M Nowick
    Abstract:

    As system-level interconnect incurs increasing penalties in latency, round-trip cycle time and power, and as timing-variability becomes an increasing design challenge, there is renewed interest in using two-phase delay-insensitive protocols for Global system-level Communication. However, in practice, when designing asynchronous systems, it is extremely inefficient to build local computation nodes with two-phase logic, hence four-phase computation blocks are typically used. This paper proposes a new architecture, and circuit-level implementations, for a family of asynchronous protocol converters, which efficiently convert between two- and four-phase protocols, thus facilitating system design with robust Global two-phase protocols and local four-phase protocols. The main focus is on a level-encoded dual-rail (LEDR) two-phase protocol for Global Communication, and a four-phase return-to-zero (RZ) protocol for asynchronous computation blocks. However, with small modifications, the converters are extended to handle other common four-phase protocols, such as 1-of-4 and single-rail bundled data. The converters are highly robust, with almost entirely quasi delay-insensitive implementations, yet exhibit high performance and modest area overhead. Initial post-layout simulations in a 0.18 micron TSMC process are provided, both assuming a small computation block (8×8 combinational multiplier) as well as an empty computation block (FIFO stage).
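
The LEDR property quoted in the abstracts above (one of two wires changes value per data bit per transaction) can be sketched in a few lines. This is a minimal illustration, assuming the commonly described LEDR formulation in which a data rail carries the bit value and a parity rail is chosen so that the XOR of the two rails gives the alternating phase. The rail names, the idle state `(0, 0)` and the functions `ledr_encode`, `ledr_decode` and `wires_changed` are assumptions made for the example and are not taken from the papers or their hardware designs.

```python
def ledr_encode(bits):
    """Return one (data_rail, parity_rail) wire state per transmitted bit.

    Assumed convention: phase = data_rail XOR parity_rail, the idle state
    (0, 0) has phase 0, and the phase alternates on every transaction.
    """
    states = []
    phase = 0
    for bit in bits:
        phase ^= 1                         # next transaction uses the opposite phase
        states.append((bit, bit ^ phase))  # parity rail encodes the phase
    return states


def ledr_decode(states):
    """The data rail directly carries the transmitted value."""
    return [data for data, _parity in states]


def wires_changed(prev, curr):
    """Number of rails that toggled between two consecutive wire states."""
    return sum(a != b for a, b in zip(prev, curr))


if __name__ == "__main__":
    bits = [1, 1, 0, 1, 0, 0]
    prev = (0, 0)  # assumed idle state before the first transaction
    for state in ledr_encode(bits):
        print(state, "rails toggled:", wires_changed(prev, state))
        prev = state
```

Under this convention the data rail toggles when the new bit differs from the previous one and the parity rail toggles otherwise, so no return-to-zero phase is needed between transactions.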

Siegfried Cools - One of the best experts on this subject based on the ideXlab platform.

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    IEEE Transactions on Parallel and Distributed Systems, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    arXiv: Numerical Analysis, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose

Jeffrey Cornelis - One of the best experts on this subject based on the ideXlab platform.

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    IEEE Transactions on Parallel and Distributed Systems, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose

  • numerically stable recurrence relations for the Communication hiding pipelined conjugate gradient method
    arXiv: Numerical Analysis, 2019
    Co-Authors: Siegfried Cools, Jeffrey Cornelis, Wim Vanroose

M Y Agyekum - One of the best experts on this subject based on the ideXlab platform.

  • error correcting unordered codes and hardware support for robust asynchronous Global Communication
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012
    Co-Authors: M Y Agyekum, Steven M Nowick

  • an error correcting unordered code and hardware support for robust asynchronous Global Communication
    Design Automation and Test in Europe, 2010
    Co-Authors: M Y Agyekum, Steven M Nowick

  • a level encoded transition signaling protocol for high throughput asynchronous Global Communication
    IEEE International Symposium on Asynchronous Circuits and Systems, 2008
    Co-Authors: P B Mcgee, M Y Agyekum, M A Mohamed, Steven M Nowick