Barrier Synchronization

The Experts below are selected from a list of 2361 Experts worldwide ranked by ideXlab platform

Dhabaleswar K Panda - One of the best experts on this subject based on the ideXlab platform.

  • A reliable hardware Barrier Synchronization scheme
    International Parallel Processing Symposium, 1997
    Co-Authors: Rajeev Sivaram, Craig B Stunkel, Dhabaleswar K Panda
    Abstract:

    Barrier Synchronization is a crucial operation for parallel systems. Many schemes have been proposed in the literature to achieve fast Barrier Synchronization through software, hardware, or a combination of these mechanisms. However, few of these schemes emphasize fault-tolerant Barrier operations. In this paper, we describe inexpensive support that can be added to network switches for achieving reliable hardware-based Barrier Synchronization while recovering from lost or corrupted messages. Necessary modifications to the switch architecture and the associated fault-tolerant message-passing protocols are presented. The protocols are optimized for the no-fault case while providing means to detect the failure of any step of the operation and to recover from it. The proposed scheme shows significant potential for use in parallel systems, especially the emerging systems based on networks of workstations.

  • Fast Barrier Synchronization in wormhole k-ary n-cube networks with multidestination worms
    High-Performance Computer Architecture, 1995
    Co-Authors: Dhabaleswar K Panda
    Abstract:

    This paper presents a new approach to implement fast Barrier Synchronization in wormhole k-ary n-cubes. The novelty lies in using multidestination messages instead of the traditional single destination messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of Barrier Synchronization, respectively. Algorithms for complete and arbitrary set Barrier Synchronization are presented using these new worms. It is shown that complete Barrier Synchronization in a k-ary n-cube system with e-cube routing can be implemented with \(2n\) communication start-ups as compared to the \(2n \log_2 k\) start-ups needed with unicast-based message passing. For arbitrary set Barrier, an interesting trend is observed where the Synchronization cost keeps on reducing beyond a certain number of participating nodes.

  • Barrier Synchronization in distributed memory multiprocessors using rendezvous primitives
    International Parallel Processing Symposium, 1993
    Co-Authors: S K S Gupta, Dhabaleswar K Panda
    Abstract:

    This paper deals with Barrier Synchronization in wormhole routed distributed-memory multiprocessors. New rendezvous and multirendezvous Synchronization primitives are proposed to implement a Barrier between two and multiple processors, respectively. These primitives reduce the number of communication steps required to implement a Barrier, thus significantly reducing the Synchronization overhead for networks with high communication start-up cost. Two algorithms for Barrier Synchronization on k-ary n-cube networks are presented. The rendezvous primitive allows one to synchronize all processors in \(n \log_2 k\) steps. The multirendezvous primitive allows one to synchronize an arbitrary subset of processors in an optimal number of communication steps depending on the ratio of the communication start-up (\(t_s\)) to the link-propagation (\(t_p\)) cost.
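
One way to see where a step count of n log2 k can come from is a pairwise exchange along each dimension in turn. The sketch below is an assumption made for illustration, not the algorithm from the paper: it assumes k is a power of two, pairs nodes by flipping one bit of one coordinate per step, and treats each rendezvous as merging what both partners know. The function name is hypothetical.

```python
from itertools import product


def rendezvous_barrier_steps(k: int, n: int) -> int:
    """Simulate pairwise rendezvous along each dimension of a k-ary n-cube.

    Assumes k is a power of two. Returns the number of steps after which
    every node has (transitively) heard from every other node.
    """
    assert k > 0 and k & (k - 1) == 0, "sketch assumes k is a power of two"
    nodes = list(product(range(k), repeat=n))
    known = {node: {node} for node in nodes}  # arrivals each node knows about
    steps = 0
    for dim in range(n):
        bit = 1
        while bit < k:
            updated = {}
            for node in nodes:
                partner = list(node)
                partner[dim] ^= bit  # partner differs in one bit of one coordinate
                # A rendezvous merges the knowledge of both partners.
                updated[node] = known[node] | known[tuple(partner)]
            known = updated
            steps += 1
            bit <<= 1
    # Every node must now know that all k**n nodes have arrived.
    assert all(len(seen) == len(nodes) for seen in known.values())
    return steps


print(rendezvous_barrier_steps(4, 2))  # 2 dimensions * log2(4) = 4 steps
```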

Mikel Lujan - One of the best experts on this subject based on the ideXlab platform.

  • Effective Barrier Synchronization on Intel Xeon Phi Coprocessor
    European Conference on Parallel Processing, 2015
    Co-Authors: Andrey Rodchenko, Andy Nisbet, Antoniu Pop, Mikel Lujan
    Abstract:

    Barriers are a fundamental Synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art Barrier Synchronization algorithms differ in tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid Barrier implementation that exploits the topology, the memory hierarchy and streaming stores of the Xeon Phi architecture to achieve a 3× lower overhead than the Intel OpenMP Barrier implementation (ICC 14.0.0), thus outperforming, to the best of our knowledge, all other implementations. We evaluate it on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC Barrier OpenMP microbenchmark. The optimized Barriers presented in the paper are available at https://github.com/arodchen/cBarriers, released as free software.
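
The abstract does not name the five algorithms it evaluates. As a point of reference only, here is a minimal sketch of a centralized sense-reversing barrier, a textbook shared-memory design; whether it is among the evaluated algorithms is an assumption. The class and function names are hypothetical, and the sketch blocks on a condition variable instead of spinning on a flag as a low-level implementation would.

```python
import threading


class SenseReversingBarrier:
    """Centralized barrier: one shared counter plus a sense flag."""

    def __init__(self, parties: int) -> None:
        self.parties = parties
        self.count = parties      # threads still expected in this episode
        self.sense = False        # flipped once per completed episode
        self.cond = threading.Condition()

    def wait(self) -> None:
        with self.cond:
            my_sense = not self.sense      # value the flag takes on release
            self.count -= 1
            if self.count == 0:
                # Last arrival: reset the counter and flip the sense.
                self.count = self.parties
                self.sense = my_sense
                self.cond.notify_all()
            else:
                while self.sense != my_sense:
                    self.cond.wait()


def run_demo(parties: int = 4, phases: int = 3):
    """Run several phases separated by the barrier and return the event log."""
    barrier = SenseReversingBarrier(parties)
    log = []

    def worker(tid: int) -> None:
        for phase in range(phases):
            log.append((phase, tid))   # "work" for this phase
            barrier.wait()             # nobody starts phase+1 early

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_demo()
print([phase for phase, _ in log])
```

Because the sense flag alternates between episodes, the same object can be reused for consecutive barriers without reinitialization, which is why the logged phase numbers never decrease.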

Chungta King - One of the best experts on this subject based on the ideXlab platform.

  • Designing tree-based Barrier Synchronization on 2D mesh networks
    IEEE Transactions on Parallel and Distributed Systems, 1998
    Co-Authors: Jenqshyan Yang, Chungta King
    Abstract:

    In this paper, we consider a tree-based routing scheme for supporting Barrier Synchronization on scalable parallel computers with a 2D mesh network. Based on the characteristics of a standard programming interface, the scheme builds a collective Synchronization (CS) tree among the participating nodes using a distributed algorithm. When the routers are set up properly with the CS tree information, Barrier Synchronization can be accomplished very efficiently by passing simple messages. Performance evaluations show that our proposed method performs better than previous path-based approaches and is less sensitive to variations in group size and startup delay. However, our scheme has the extra overhead of building the CS tree. Thus, it is more suitable for parallel iterative computations in which the same Barrier is invoked repetitively.

  • Efficient Barrier Synchronization in wormhole-routed mesh networks supporting turn model
    Parallel Computing, 1998
    Co-Authors: Kuo-pao Fan, Chungta King
    Abstract:

    Barrier is an important Synchronization operation. On scalable parallel computers, it is often implemented as a collective communication. A typical Barrier Synchronization operation consists of a reduction operation followed by a distribution operation. In this paper, we introduce a systematic way of generating efficient algorithms to perform Barrier Synchronization in mesh networks. The scheme works with any base routing algorithm that is derivable from the turn model [C.J. Glass, L.M. Ni, in: Proc. Intl. Symp. Computer Architecture, pp. 278–297]. It extends the turn grouping method proposed in [K.P. Fan, C.T. King, Turn grouping for supporting efficient multicast in wormhole mesh networks, in: Proc. 6th Symp. on the Frontiers of Massively Parallel Computing (Frontiers '96), October 1996] with two new algorithms, Tail_to_Central and Central_to_Tail. These two algorithms schedule the transmissions of Synchronization messages in the reduction and distribution phases, respectively. Performance of the proposed method is evaluated using four typical turn-model based algorithms. The simulation results show that our approach can take advantage of the adaptivity of the base routing algorithms and outperforms methods proposed previously.
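
The "reduction followed by distribution" structure mentioned in this abstract can be sketched with a plain binary tree over node indices. This is a generic illustration under that assumption, not the Tail_to_Central or Central_to_Tail schedules, and it ignores mesh routing entirely; the function name is hypothetical.

```python
def tree_barrier_messages(num_nodes: int):
    """Return (phase, src, dst) messages for a binary-tree barrier.

    Node 0 is the root; node i has parent (i - 1) // 2.
    """
    messages = []
    # Reduction: each node reports to its parent. Visiting indices in
    # descending order guarantees children report before their parent does.
    for node in range(num_nodes - 1, 0, -1):
        messages.append(("reduce", node, (node - 1) // 2))
    # Distribution: each parent releases its children. Ascending order
    # guarantees a node is released before it releases its own children.
    for node in range(1, num_nodes):
        messages.append(("distribute", (node - 1) // 2, node))
    return messages


for message in tree_barrier_messages(7):
    print(message)
```

With p participating nodes the sketch sends p - 1 messages in each phase, so 2(p - 1) in total.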

  • ICDCS - Hardware supports for efficient Barrier Synchronization on 2-D mesh networks
    Proceedings of 16th International Conference on Distributed Computing Systems, 1996
    Co-Authors: Jeng-shyan Yang, Chungta King
    Abstract:

    In this paper, we consider a hardware scheme for supporting Barrier Synchronization on scalable systems with a 2D mesh network. Our design takes into account the program execution path in such systems, from programming interfaces down to routers. The hardware router design will be based on the MPI-1 standard. A distributed algorithm is proposed to construct a collective Synchronization tree (CS tree) from the nodes participating in the Barrier. Based upon the CS tree, the status registers in the routers are set up and Synchronization messages are transmitted along the paths set by the status registers. Performance evaluations show that our proposed method has better performance for Barrier Synchronization and is less sensitive to variations in group size and startup delay than previous approaches. However, our scheme has the extra overhead of building the CS tree. Thus it is more suitable for parallel iterative computations, in which the same Barrier is invoked repetitively.

Jean-philippe Diguet - One of the best experts on this subject based on the ideXlab platform.

  • ASP-DAC - Broadcast Mechanism Based on Hybrid Wireless/Wired NoC for Efficient Barrier Synchronization in Parallel Computing
    2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020
    Co-Authors: Hemanta Kumar Mondal, Navonil Chatterjee, Rodrigo Cataldo, Jean-philippe Diguet
    Abstract:

    Parallel computing is essential to achieve the performance potential of manycore architectures, since it exploits the parallelism provided by the hardware. Parallel applications will inevitably have to synchronize their execution: for instance, through broadcast operations for Barrier Synchronization. Conventional network-on-chip architectures for broadcast operations limit the performance, as the Synchronization is significantly affected by critical path communications that increase the network latency and degrade the performance drastically. A Wireless network-on-chip offers a promising solution to reduce the critical path communication bottlenecks of such conventional architectures by providing hardware broadcast support. We propose efficient Barrier Synchronization support using hybrid wireless/wired NoC to reduce the cost of broadcast operations. The proposed architecture reduces the Barrier Synchronization cost by up to 42.79% in terms of network latency and saves up to 42.65% of communication energy consumption for a subset of applications from the PARSEC benchmark.

  • SoCC - Broadcast- and Power-Aware Wireless NoC for Barrier Synchronization in Parallel Computing
    2018 31st IEEE International System-on-Chip Conference (SOCC), 2018
    Co-Authors: Hemanta Kumar Mondal, Rodrigo Cataldo, Kevin Martin, Sujay Deb, Cesar Marcon, Jean-philippe Diguet
    Abstract:

    Efficient Synchronization is one of the basic requirements of effective parallel computing. A key operation of the POSIX Thread standard (PThread) is Barrier Synchronization, where multiple threads block on a user-specified point of execution until all of them have reached it. Conventional architectures for broadcast operations limit the achievable performance benefits, as Synchronization is significantly affected by critical path communications. This increases the network latency and degrades the performance dramatically. A Wireless Network-on-Chip (WiNoC) offers a promising solution to reduce the long distance/critical path communication bottlenecks of conventional architectures by augmenting them with single-hop, long-range wireless links. In this paper, we propose a power-aware, broadcast-enabled WiNoC architecture to reduce the cost of broadcast operations for Barrier-based applications. The proposed architecture reduces the Barrier Synchronization cost by up to 43.97% in terms of network latency under the PARSEC benchmarks. It also saves up to 80.49% of idle-state power consumption in wireless interfaces (WIs) for a 64-core system compared with the conventional WiNoC architecture without incurring significant overhead.
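
The barrier semantics described in this abstract (threads block at a chosen point until all of them have reached it) can be observed with Python's standard `threading.Barrier`, used here as an analogue of a PThread barrier and not as the POSIX API itself. The function and variable names are hypothetical.

```python
import threading


def run_barrier_demo(parties: int = 4):
    """Each thread records how many threads had arrived when it was released."""
    barrier = threading.Barrier(parties)
    lock = threading.Lock()
    arrived = []
    seen_at_release = []

    def worker(tid: int) -> None:
        with lock:
            arrived.append(tid)        # reach the synchronization point
        barrier.wait()                 # block until all parties have arrived
        with lock:
            seen_at_release.append(len(arrived))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_at_release


print(run_barrier_demo())  # every entry equals the number of parties
```

No thread is released before the last one arrives, so every released thread observes the full arrival count.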

Philip K. Mckinley - One of the best experts on this subject based on the ideXlab platform.