The Experts below are selected from a list of 2361 Experts worldwide ranked by ideXlab platform
Dhabaleswar K Panda - One of the best experts on this subject based on the ideXlab platform.
-
A Reliable Hardware Barrier Synchronization Scheme
11th International Parallel Processing Symposium (IPPS), 1997
Co-Authors: Rajeev Sivaram, Craig B Stunkel, Dhabaleswar K Panda
Abstract: Barrier synchronization is a crucial operation for parallel systems. Many schemes have been proposed in the literature to achieve fast barrier synchronization through software, hardware, or a combination of these mechanisms. However, few of these schemes emphasize fault-tolerant barrier operations. In this paper, we describe inexpensive support that can be added to network switches for achieving reliable hardware-based barrier synchronization while recovering from lost or corrupted messages. Necessary modifications to the switch architecture and the associated fault-tolerant message-passing protocols are presented. The protocols are optimized for the no-fault case while providing means to detect the failure of any step of the operation and to recover from it. The proposed scheme shows significant potential for use in parallel systems, especially the emerging systems based on networks of workstations.
-
Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms
High-Performance Computer Architecture (HPCA), 1995
Co-Authors: Dhabaleswar K Panda
Abstract: This paper presents a new approach to implement fast barrier synchronization in wormhole k-ary n-cubes. The novelty lies in using multidestination messages instead of the traditional single-destination messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of barrier synchronization, respectively. Algorithms for complete and arbitrary set barrier synchronization are presented using these new worms. It is shown that complete barrier synchronization in a k-ary n-cube system with e-cube routing can be implemented with \(2n\) communication start-ups, as compared to the \(2n \log_2 k\) start-ups needed with unicast-based message passing. For arbitrary set barriers, an interesting trend is observed where the synchronization cost keeps decreasing beyond a certain number of participating nodes.
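To make the start-up comparison in this abstract concrete, the short Python sketch below evaluates the two quoted counts, 2n for multidestination worms and 2n log2 k for unicast-based message passing. The function names are illustrative and do not come from the paper, and rounding log2 k up when k is not a power of two is an assumption made here.

```python
import math


def multidestination_startups(n: int) -> int:
    """Start-ups quoted in the abstract for multidestination worms: 2n."""
    return 2 * n


def unicast_startups(k: int, n: int) -> int:
    """Start-ups quoted for unicast-based message passing: 2n * log2(k).

    Assumption: log2(k) is rounded up when k is not a power of two.
    """
    return 2 * n * math.ceil(math.log2(k))


# Compare the two counts for a few hypothetical k-ary n-cube sizes.
for k, n in [(4, 2), (8, 3), (16, 2)]:
    print(f"k={k} n={n} multidestination={multidestination_startups(n)} "
          f"unicast={unicast_startups(k, n)}")
```

Under these formulas the gap grows with k only, since both counts scale linearly in the number of dimensions n.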
-
Barrier Synchronization in Distributed Memory Multiprocessors Using Rendezvous Primitives
International Parallel Processing Symposium, 1993
Co-Authors: S K S Gupta, Dhabaleswar K Panda
Abstract: This paper deals with barrier synchronization in wormhole-routed distributed-memory multiprocessors. New rendezvous and multirendezvous synchronization primitives are proposed to implement a barrier between two and multiple processors, respectively. These primitives reduce the number of communication steps required to implement a barrier, thus significantly reducing the synchronization overhead for networks with high communication start-up cost. Two algorithms for barrier synchronization on k-ary n-cube networks are presented. The rendezvous primitive allows one to synchronize all processors in \(n \log_2 k\) steps. The multirendezvous primitive allows one to synchronize an arbitrary subset of processors in an optimal number of communication steps, depending on the ratio of the communication start-up cost (\(t_s\)) to the link-propagation cost (\(t_p\)).
Mikel Lujan - One of the best experts on this subject based on the ideXlab platform.
-
Effective Barrier Synchronization on Intel Xeon Phi Coprocessor
European Conference on Parallel Processing (Euro-Par), Lecture Notes in Computer Science, 2015
Co-Authors: Andrey Rodchenko, Andy Nisbet, Antoniu Pop, Mikel Lujan
Abstract: Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid barrier implementation that exploits the topology, the memory hierarchy and streaming stores of the Xeon Phi architecture to achieve a 3\(\times\) lower overhead than the Intel OpenMP barrier implementation (ICC 14.0.0), thus outperforming, to the best of our knowledge, all other implementations. We evaluate it on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC barrier OpenMP microbenchmark. The optimized barriers presented in the paper are available at https://github.com/arodchen/cBarriers, released as free software.
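As a point of reference for what a shared-memory barrier algorithm looks like, here is a minimal Python sketch of a centralized sense-reversing barrier. This is a textbook design chosen for illustration; the abstract does not say it is one of the five algorithms evaluated, and it is not the paper's hybrid barrier. The class name, `run_demo`, and the use of a condition variable instead of spinning are all assumptions made here.

```python
import threading


class SenseReversingBarrier:
    """Centralized barrier: one shared counter plus a shared 'sense' flag."""

    def __init__(self, parties: int) -> None:
        self.parties = parties
        self.count = parties          # threads still expected in this phase
        self.sense = False            # flipped once per completed phase
        self.cond = threading.Condition()

    def wait(self) -> None:
        with self.cond:
            local_sense = not self.sense   # value the flag takes on release
            self.count -= 1
            if self.count == 0:
                # Last arrival: reset the counter and release everyone.
                self.count = self.parties
                self.sense = local_sense
                self.cond.notify_all()
            else:
                # Earlier arrivals sleep until the flag is flipped.
                while self.sense != local_sense:
                    self.cond.wait()


def run_demo(num_threads: int = 4, phases: int = 3):
    """Run several phases and return the (phase, thread id) arrival log."""
    barrier = SenseReversingBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid: int) -> None:
        for phase in range(phases):
            with log_lock:
                log.append((phase, tid))
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_demo()
print(log)
```

Because every thread logs its phase before calling `wait()`, the phase numbers in the log never decrease, which is the property a barrier is meant to enforce.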
Chungta King - One of the best experts on this subject based on the ideXlab platform.
-
Designing Tree-Based Barrier Synchronization on 2D Mesh Networks
IEEE Transactions on Parallel and Distributed Systems, 1998
Co-Authors: Jenqshyan Yang, Chungta King
Abstract: In this paper, we consider a tree-based routing scheme for supporting barrier synchronization on scalable parallel computers with a 2D mesh network. Based on the characteristics of a standard programming interface, the scheme builds a collective synchronization (CS) tree among the participating nodes using a distributed algorithm. When the routers are set up properly with the CS tree information, barrier synchronization can be accomplished very efficiently by passing simple messages. Performance evaluations show that our proposed method performs better than previous path-based approaches and is less sensitive to variations in group size and startup delay. However, our scheme has the extra overhead of building the CS tree. Thus, it is more suitable for parallel iterative computations in which the same barrier is invoked repetitively.
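The message flow over a synchronization tree can be sketched in a few lines of Python. The abstract does not describe how the CS tree is built, so this sketch assumes the tree already exists and is given as a parent map over hypothetical (x, y) mesh coordinates; `tree_barrier` is an illustrative name, and the simulation only counts the "arrived" messages going up and the "release" messages coming down.

```python
from collections import defaultdict


def tree_barrier(parent):
    """Simulate one barrier on a prebuilt tree.

    parent maps each node to its parent; the root maps to None.
    Returns (number of messages sent, nodes in the order they are released).
    """
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    messages = 0

    def report(node):
        """Upward phase: a node reports once all of its children have."""
        nonlocal messages
        for child in children[node]:
            report(child)
            messages += 1      # child -> node "arrived" message

    report(root)

    # Downward phase: the root releases its children, and so on.
    released = []
    stack = [root]
    while stack:
        node = stack.pop()
        released.append(node)
        for child in children[node]:
            messages += 1      # node -> child "release" message
            stack.append(child)
    return messages, released


# Hypothetical 2x2 mesh with a tree rooted at (0, 0).
mesh_tree = {(0, 0): None, (1, 0): (0, 0), (0, 1): (0, 0), (1, 1): (1, 0)}
print(tree_barrier(mesh_tree))
```

With N participating nodes the sketch sends 2(N - 1) messages per barrier, and the tree itself is reused across invocations, which matches the abstract's point that the setup cost pays off when the same barrier is invoked repeatedly.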
-
Efficient Barrier Synchronization in Wormhole-Routed Mesh Networks Supporting Turn Model
Parallel Computing, 1998
Co-Authors: Kuo-pao Fan, Chungta King
Abstract: Barrier is an important synchronization operation. On scalable parallel computers, it is often implemented as a collective communication. A typical barrier synchronization operation consists of a reduction operation followed by a distribution operation. In this paper, we introduce a systematic way of generating efficient algorithms to perform barrier synchronization in mesh networks. The scheme works with any base routing algorithm that is derivable from the turn model (C.J. Glass, L.M. Ni, in: Proc. Intl. Symp. Computer Architecture, pp. 278–297). It extends the turn grouping method proposed by K.P. Fan and C.T. King (Turn grouping for supporting efficient multicast in wormhole mesh networks, in: Proc. 6th Symp. on the Frontiers of Massively Parallel Computing (Frontiers '96), October 1996) with two new algorithms, Tail_to_Central and Central_to_Tail. These two algorithms schedule the transmissions of synchronization messages in the reduction and distribution phases, respectively. Performance of the proposed method is evaluated using four typical turn-model-based algorithms. The simulation results show that our approach can take advantage of the adaptivity of the base routing algorithms and outperforms previously proposed methods.
-
Hardware Supports for Efficient Barrier Synchronization on 2-D Mesh Networks
Proceedings of 16th International Conference on Distributed Computing Systems (ICDCS), 1996
Co-Authors: Jeng-shyan Yang, Chungta King
Abstract: In this paper, we consider a hardware scheme for supporting barrier synchronization on scalable systems with a 2D mesh network. Our design takes into account the program execution path in such systems, from programming interfaces down to routers. The hardware router design is based on the MPI-1 standard. A distributed algorithm is proposed to construct a collective synchronization tree (CS tree) from the nodes participating in the barrier. Based upon the CS tree, the status registers in the routers are set up, and synchronization messages are transmitted along the paths set by the status registers. Performance evaluations show that our proposed method has better performance for barrier synchronization and is less sensitive to variations in group size and startup delay than previous approaches. However, our scheme has the extra overhead of building the CS tree. Thus, it is more suitable for parallel iterative computations in which the same barrier is invoked repetitively.
Jean-philippe Diguet - One of the best experts on this subject based on the ideXlab platform.
-
Broadcast Mechanism Based on Hybrid Wireless/Wired NoC for Efficient Barrier Synchronization in Parallel Computing
25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020
Co-Authors: Hemanta Kumar Mondal, Navonil Chatterjee, Rodrigo Cataldo, Jean-philippe Diguet
Abstract: Parallel computing is essential to achieve the performance potential of manycore architectures, since it exploits the parallelism provided by the hardware. Parallel applications inevitably have to synchronize their execution, for instance through broadcast operations for barrier synchronization. Conventional network-on-chip architectures limit the performance of broadcast operations, because synchronization is significantly affected by critical path communications that increase the network latency and degrade performance drastically. A wireless network-on-chip offers a promising solution to reduce the critical path communication bottlenecks of such conventional architectures by providing hardware broadcast support. We propose efficient barrier synchronization support using a hybrid wireless/wired NoC to reduce the cost of broadcast operations. The proposed architecture reduces the barrier synchronization cost by up to 42.79% in terms of network latency and saves up to 42.65% of communication energy consumption for a subset of applications from the PARSEC benchmark.
-
Broadcast- and Power-Aware Wireless NoC for Barrier Synchronization in Parallel Computing
31st IEEE International System-on-Chip Conference (SOCC), 2018
Co-Authors: Hemanta Kumar Mondal, Rodrigo Cataldo, Kevin Martin, Sujay Deb, Cesar Marcon, Jean-philippe Diguet
Abstract: Efficient synchronization is one of the basic requirements of effective parallel computing. A key operation of the POSIX Thread standard (PThread) is barrier synchronization, where multiple threads block at a user-specified point of execution until all of them have reached it. Conventional architectures for broadcast operations limit the achievable performance benefits, as synchronization is significantly affected by critical path communications. This increases the network latency and degrades performance dramatically. A Wireless Network-on-Chip (WiNoC) offers a promising solution to reduce the long-distance/critical path communication bottlenecks of conventional architectures by augmenting them with single-hop, long-range wireless links. In this paper, we propose a power-aware, broadcast-enabled WiNoC architecture to reduce the cost of broadcast operations for barrier-based applications. The proposed architecture reduces the barrier synchronization cost by up to 43.97% in terms of network latency under the PARSEC benchmarks. It also saves up to 80.49% of idle-state power consumption in WIs for a 64-core system compared with the conventional WiNoC architecture, without incurring significant overhead.
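The barrier semantics this abstract describes (threads block at a chosen point until all of them have reached it) can be illustrated in software with Python's standard `threading.Barrier`, used here as a rough analogue of a PThread barrier. The sketch shows only the programming-level behaviour and says nothing about the WiNoC hardware; `run_phases` and the event log are hypothetical names introduced for the example.

```python
import threading


def run_phases(num_threads: int = 4):
    """Each thread logs 'before', waits at the barrier, then logs 'after'."""
    barrier = threading.Barrier(num_threads)
    events = []
    lock = threading.Lock()

    def worker(tid: int) -> None:
        with lock:
            events.append(("before", tid))
        # Blocks until num_threads threads have called wait().
        barrier.wait()
        with lock:
            events.append(("after", tid))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_phases()
print(events)
```

No "after" entry can appear before every thread has logged "before", although the order of threads within each group varies from run to run.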
Philip K. Mckinley - One of the best experts on this subject based on the ideXlab platform.
-
Efficient Implementation of Barrier Synchronization in Wormhole-Routed Hypercube Multicomputers
12th International Conference on Distributed Computing Systems (ICDCS), 1992
Co-Authors: Philip K. Mckinley
Abstract: Practical and efficient implementations of barrier synchronization for wormhole-routed hypercube multicomputers are presented. Both broadcast and multicast barrier synchronization are considered. For systems that do not support hardware broadcast or multicast, a software U-cube tree is proposed. This method generalizes to n-dimensional meshes. Performance measurements for several barrier synchronization techniques implemented on a 64-node nCUBE-2 are given.
-
Efficient Implementation of Barrier Synchronization in Wormhole-Routed Hypercube Multicomputers
Journal of Parallel and Distributed Computing, 1992
Co-Authors: Philip K. Mckinley
Abstract: Efficient implementation of barrier synchronization is important to the performance of many parallel algorithms. This paper addresses barrier synchronization in wormhole-routed hypercube multicomputers. A broadcast barrier involves all nodes in a system, whereas the more general multicast barrier involves an arbitrary subset of nodes. Although performance of barrier synchronization can benefit from hardware-supported broadcast and multicast operations, many systems support only single-destination, or unicast, communication in hardware. For such systems, a novel software tree approach, the U-cube tree, is proposed as the basis of barrier synchronization. An important feature of the U-cube tree is that all messages injected into the network are guaranteed to be contention-free. Performance measurements of several barrier synchronization techniques implemented on a 64-node nCUBE-2 are given.