Infiniband

The experts below are selected from a list of 3,894 experts worldwide, as ranked by the ideXlab platform.

Dhabaleswar K. Panda - One of the best experts on this subject based on the ideXlab platform.

  • hpc meets cloud building efficient clouds for hpc big data and deep learning middleware and applications
    Utility and Cloud Computing, 2017
    Co-Authors: Dhabaleswar K. Panda, Xiaoyi Lu
    Abstract:

    Significant growth has been witnessed during the last few years in HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (such as Infiniband, Omni-Path, iWARP, and RoCE). To alleviate the cost burden, sharing HPC cluster resources with end users through virtualization, for both scientific computing and Big Data processing, is becoming increasingly attractive. In this tutorial, we first provide an overview of popular virtualization system software in HPC cloud environments, such as hypervisors (e.g., KVM), containers (e.g., Docker, Singularity), OpenStack, Slurm, etc. Then we provide an overview of high-performance interconnects and communication mechanisms on HPC clouds, such as Infiniband, RDMA, SR-IOV, IVShmem, etc. We further discuss the opportunities and technical challenges of designing high-performance MPI runtimes for these environments. Next, we introduce our proposed novel approaches to enhance MPI library design on SR-IOV enabled Infiniband clusters with both virtual machines and containers. We also discuss how to integrate these designs into popular cloud management systems like OpenStack and HPC cluster resource managers like Slurm. Beyond HPC middleware and applications, we demonstrate how high-performance solutions can be designed to run Big Data and Deep Learning workloads (such as Hadoop, Spark, TensorFlow, CNTK, and Caffe) in HPC cloud environments.

  • scalable reduction collectives with data partitioning based multi leader design
    IEEE International Conference on High Performance Computing Data and Analytics, 2017
    Co-Authors: Mohammadreza Bayatpour, Hari Subramoni, Sourav Chakraborty, Dhabaleswar K. Panda
    Abstract:

    Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors such as Intel Xeon and Xeon Phi, or of the increased communication throughput and recent advances in high-end features offered by modern interconnects such as Infiniband and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI_Allreduce that can take advantage of the parallelism offered by multi-/many-core architectures in conjunction with the high throughput and high-end features offered by Infiniband and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems. We also model DPML-based designs to analyze the communication costs theoretically. Microbenchmark-level evaluations show that the proposed DPML-based designs deliver up to 3.5 times performance improvement for MPI_Allreduce on multiple HPC systems at scale. At the application level, communication time improves by up to 35% for HPCG and 60% for miniAMR.
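
    A minimal sketch of the data-partitioning, multi-leader idea described above, written against the standard MPI C API. It is not the authors' MVAPICH implementation: the leader count, the assumption that every node runs the same number of processes (and at least NUM_LEADERS of them), and all helper names are illustrative.

```c
/* Hypothetical sketch: dpml_allreduce.c
 * Compile: mpicc -O2 dpml_allreduce.c -o dpml_allreduce
 * Assumes every node hosts the same number of ranks, >= NUM_LEADERS. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_LEADERS 4   /* leaders per node; illustrative, not a tuned value */

/* Two-level allreduce in the spirit of a data-partitioning, multi-leader
 * scheme: each node-local leader owns one partition of the buffer. */
static void dpml_style_allreduce(double *buf, int count, MPI_Comm comm)
{
    MPI_Comm node_comm, lead_comm;
    int lrank, lsize;

    /* Group processes that share a physical node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &lrank);
    MPI_Comm_size(node_comm, &lsize);

    int nlead = NUM_LEADERS < lsize ? NUM_LEADERS : lsize;
    int base  = count / nlead;

    /* Leaders with the same node-local index form one cross-node communicator. */
    int color = (lrank < nlead) ? lrank : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, 0, &lead_comm);

    double *tmp = malloc((size_t)(base + count % nlead) * sizeof(double));

    for (int i = 0; i < nlead; i++) {
        int off = i * base;
        int len = (i == nlead - 1) ? count - off : base;

        /* Step 1: reduce partition i onto leader i of this node. */
        MPI_Reduce(buf + off, tmp, len, MPI_DOUBLE, MPI_SUM, i, node_comm);

        /* Step 2: leader i combines its partition across nodes. */
        if (lrank == i) {
            MPI_Allreduce(MPI_IN_PLACE, tmp, len, MPI_DOUBLE, MPI_SUM, lead_comm);
            for (int j = 0; j < len; j++) buf[off + j] = tmp[j];
        }

        /* Step 3: leader i hands the global result back to its node. */
        MPI_Bcast(buf + off, len, MPI_DOUBLE, i, node_comm);
    }

    free(tmp);
    if (lead_comm != MPI_COMM_NULL) MPI_Comm_free(&lead_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    enum { N = 1 << 20 };
    double *v = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) v[i] = 1.0;
    dpml_style_allreduce(v, N, MPI_COMM_WORLD);
    if (rank == 0) printf("v[0] = %.0f (expect world size)\n", v[0]);
    free(v);
    MPI_Finalize();
    return 0;
}
```

    Because each leader handles only its own partition, the inter-node traffic is spread over several concurrent reductions instead of being funneled through a single per-node root.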

  • high performance virtual machine migration framework for mpi applications on sr iov enabled Infiniband clusters
    International Parallel and Distributed Processing Symposium, 2017
    Co-Authors: Jie Zhang, Dhabaleswar K. Panda
    Abstract:

    High-speed interconnects (e.g. Infiniband) have been widely deployed on modern HPC clusters. With the emergence of HPC in the cloud, high-speed interconnects have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) technology, which is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, recent studies have shown that SR-IOV-based virtual networks prevent virtual machine migration, which is an essential virtualization capability for high availability and resource provisioning. Although several initial solutions have been proposed in the literature to solve this problem, our investigations show that these approaches still carry many restrictions, such as depending on specific network adapters and/or hypervisors, which limit their usage scope in HPC environments. In this paper, we propose a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled Infiniband clusters. Our proposed method does not need any modification to the hypervisor or the Infiniband drivers, and it can efficiently handle virtual machine (VM) migration with SR-IOV IB devices. Our evaluation results indicate that the proposed design not only achieves fast VM migration speed but also guarantees high performance for MPI applications during migration in the HPC cloud. At the application level, for the NPB LU benchmark running inside a VM, our proposed design is able to completely hide the migration overhead by overlapping computation with migration. Furthermore, our proposed design shows good scaling when migrating multiple VMs.

  • system level scalable checkpoint restart for petascale computing
    International Conference on Parallel and Distributed Systems, 2016
    Co-Authors: Jiajun Cao, Jerome Vienne, Hari Subramoni, Dhabaleswar K. Panda, Kapil Arya, Rohan Garg, Shawn Matott, Gene Cooperman
    Abstract:

    Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the Infiniband UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that Infiniband UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.
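
    The sketch below shows why the UD transport lends itself to the address-update mechanism described above: on UD, the destination is not fixed in the queue pair but supplied with every work request. This is only an illustrative fragment; all resource setup (device open, protection domain, completion queue, UD queue pair, memory registration, address resolution) is assumed to happen elsewhere, and only the libibverbs calls shown are real API.

```c
/* Sketch only: assumes the QP, MR, and destination information were created
 * elsewhere (ibv_open_device, ibv_alloc_pd, ibv_create_cq,
 * ibv_create_qp with IBV_QPT_UD, ibv_reg_mr, ibv_create_ah, ...). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* On the UD transport the peer is not baked into the QP: every work request
 * carries an address handle, remote QP number, and Q_Key.  A transparent
 * checkpoint/restart layer can therefore rebuild address handles after a
 * restart and simply fill in the updated destination on each send. */
int post_ud_send(struct ibv_qp *qp, struct ibv_mr *mr,
                 void *buf, size_t len,
                 struct ibv_ah *dest_ah,      /* may change after a restart */
                 uint32_t dest_qpn, uint32_t dest_qkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    /* Per-send destination: this is what lets a virtualization layer swap in
     * an updated remote address without touching the QP itself. */
    wr.wr.ud.ah          = dest_ah;
    wr.wr.ud.remote_qpn  = dest_qpn;
    wr.wr.ud.remote_qkey = dest_qkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```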

  • performance characterization of hypervisor and container based virtualization for hpc on sr iov enabled Infiniband clusters
    International Parallel and Distributed Processing Symposium, 2016
    Co-Authors: Jie Zhang, Dhabaleswar K. Panda
    Abstract:

    Hypervisor-based (e.g. KVM) virtualization has been used as a fundamental technology in cloud computing. However, it incurs inherent performance overhead in virtualized environments, most notably for virtualized I/O devices. To alleviate this overhead, PCI passthrough can be used to give a VM exclusive access to an I/O device; however, this prevents the device from being shared among multiple VMs. Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as Infiniband to address this sharing issue while delivering near-ideal performance. On the other hand, with the advances in container-based virtualization (e.g. Docker), it is also possible to reduce the virtualization overhead by deploying containers instead of VMs, so that near-native performance can be obtained. In order to build a high-performance HPC cloud, it is important to fully understand the performance characteristics of different virtualization solutions and virtualized I/O technologies on Infiniband clusters. In this paper, we conduct a comprehensive evaluation using IB verbs, MPI benchmarks, and applications. We characterize the performance of hypervisor- and container-based virtualization with PCI passthrough and SR-IOV for HPC on Infiniband clusters. Our evaluation results indicate that VM with PCI passthrough (VM-PT) outperforms VM with SR-IOV (VM-SR-IOV), while SR-IOV enables efficient resource sharing. Overall, the container-based solution can deliver better performance than the hypervisor-based solution. Compared with native performance, a container with PCI passthrough (Container-PT) incurs only up to 9% overhead on HPC applications.

Hari Subramoni - One of the best experts on this subject based on the ideXlab platform.

  • scalable reduction collectives with data partitioning based multi leader design
    IEEE International Conference on High Performance Computing Data and Analytics, 2017
    Co-Authors: Mohammadreza Bayatpour, Hari Subramoni, Sourav Chakraborty, Dhabaleswar K. Panda
    Abstract:

    Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors such as Intel Xeon and Xeon Phi, or of the increased communication throughput and recent advances in high-end features offered by modern interconnects such as Infiniband and Omni-Path. In this paper, we propose a high-performance and scalable Data Partitioning-based Multi-Leader (DPML) solution for MPI_Allreduce that can take advantage of the parallelism offered by multi-/many-core architectures in conjunction with the high throughput and high-end features offered by Infiniband and Omni-Path to significantly enhance the performance of MPI_Allreduce on modern HPC systems. We also model DPML-based designs to analyze the communication costs theoretically. Microbenchmark-level evaluations show that the proposed DPML-based designs deliver up to 3.5 times performance improvement for MPI_Allreduce on multiple HPC systems at scale. At the application level, communication time improves by up to 35% for HPCG and 60% for miniAMR.

  • system level scalable checkpoint restart for petascale computing
    International Conference on Parallel and Distributed Systems, 2016
    Co-Authors: Jiajun Cao, Jerome Vienne, Hari Subramoni, Dhabaleswar K. Panda, Kapil Arya, Rohan Garg, Shawn Matott, Gene Cooperman
    Abstract:

    Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the Infiniband UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that Infiniband UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.

  • mvapich prism a proxy based communication framework using Infiniband and scif for intel mic clusters
    IEEE International Conference on High Performance Computing Data and Analytics, 2013
    Co-Authors: Sreeram Potluri, Krishna Kandalla, Hari Subramoni, D Bureddy, Khaled Hamidouche, Akshay Venkatesh, Dhabaleswar K. Panda
    Abstract:

    Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOPS of performance on a single chip while providing x86_64 compatibility. Infiniband, on the other hand, is one of the most popular interconnect choices for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an Infiniband HCA on the node and thus provides a low-latency path for internode communication. However, drawbacks in state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes. They improve the performance of a 3D stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.

  • high performance design of hadoop rpc with rdma over Infiniband
    International Conference on Parallel Processing, 2013
    Co-Authors: Nusrat Sharmin Islam, Hari Subramoni, Jithin Jose, Hao Wang, Md Wasiurrahman, Dhabaleswar K. Panda
    Abstract:

    Hadoop RPC is the basic communication mechanism in the Hadoop ecosystem. It is used with other Hadoop components like MapReduce, HDFS, and HBase in real-world data centers, e.g. at Facebook and Yahoo!. However, the current Hadoop RPC design is built on the Java sockets interface, which limits its potential performance. The High Performance Computing community has exploited high-throughput, low-latency networks such as Infiniband for many years. In this paper, we first analyze the performance of the current Hadoop RPC design by unearthing buffer management and communication bottlenecks that are not apparent on slower networks. Then we propose a novel design (RPCoIB) of Hadoop RPC with RDMA over Infiniband networks. RPCoIB provides a JVM-bypassed buffer management scheme and utilizes message size locality to avoid multiple memory allocations and copies in data serialization and deserialization. Our performance evaluations reveal that the basic ping-pong latencies for varied data sizes are reduced by 42%-49% and 46%-50% compared with 10GigE and IPoIB QDR (32 Gbps), respectively, while the RPCoIB design also improves the peak throughput by 82% and 64% compared with 10GigE and IPoIB. Compared to default Hadoop over IPoIB QDR, our RPCoIB design improves the performance of the Sort benchmark on 64 compute nodes by 15% and the performance of the CloudBurst application by 10%. We also present thorough, integrated evaluations of our RPCoIB design with other research directions that optimize HDFS and HBase using RDMA over Infiniband. Compared with their best performance, we observe 10% improvement for HDFS-IB and 24% improvement for HBase-IB. To the best of our knowledge, this is the first such design of the Hadoop RPC system over high-performance networks such as Infiniband.
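
    The "message size locality" idea above can be illustrated language-neutrally: if RPC payload sizes cluster around a few values, rounding requests up to a small set of size classes and recycling buffers within each class avoids a fresh allocation and copy per message. The sketch below is a hypothetical single-threaded illustration in C, not the RPCoIB buffer manager (which lives inside the JVM).

```c
/* Illustrative size-class buffer pool (hypothetical, not RPCoIB code):
 * requests are rounded up to a power-of-two class and buffers of each class
 * are recycled, so messages of similar size reuse the same allocations. */
#include <stdlib.h>

#define MIN_SHIFT 10                     /* smallest class: 1 KiB  */
#define MAX_SHIFT 24                     /* largest class: 16 MiB  */
#define NCLASSES  (MAX_SHIFT - MIN_SHIFT + 1)

struct buf { struct buf *next; size_t cap; };

static struct buf *freelist[NCLASSES];   /* single-threaded sketch */

static int class_of(size_t len)
{
    int c = 0;
    while (((size_t)1 << (MIN_SHIFT + c)) < len && c < NCLASSES - 1)
        c++;
    return c;
}

void *pool_get(size_t len)
{
    if (len > ((size_t)1 << MAX_SHIFT))
        return NULL;                     /* oversized requests out of scope here */
    int c = class_of(len);
    if (freelist[c]) {                   /* reuse a cached buffer of this class */
        struct buf *b = freelist[c];
        freelist[c] = b->next;
        return b + 1;
    }
    size_t cap = (size_t)1 << (MIN_SHIFT + c);
    struct buf *b = malloc(sizeof(*b) + cap);
    if (!b) return NULL;
    b->cap = cap;
    return b + 1;
}

void pool_put(void *p)
{
    struct buf *b = (struct buf *)p - 1;
    int c = class_of(b->cap);
    b->next = freelist[c];
    freelist[c] = b;
}
```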

  • high performance rdma based design of hadoop mapreduce over Infiniband
    IEEE International Symposium on Parallel & Distributed Processing Workshops and Phd Forum, 2013
    Co-Authors: Nusrat Sharmin Islam, Hari Subramoni, Jithin Jose, Hao Wang, Dhabaleswar K. Panda
    Abstract:

    MapReduce is a very popular programming model used to handle large datasets in enterprise data centers and clouds. Although various implementations of MapReduce exist, Hadoop MapReduce is the most widely used in large data centers like Facebook, Yahoo!, and Amazon due to its portability and fault tolerance. Network performance plays a key role in determining the performance of data-intensive applications using Hadoop MapReduce, as data required by the map and reduce processes can be distributed across the cluster. In this context, data center designers have been looking at high-performance interconnects such as Infiniband to enhance the performance of their Hadoop MapReduce based applications. However, achieving better performance through the use of high-performance interconnects like Infiniband is a non-trivial task: it requires a careful redesign of the communication framework inside MapReduce, since several assumptions made for socket-based communication in the current framework do not hold for high-performance interconnects. In this paper, we propose the design of an RDMA-based Hadoop MapReduce over Infiniband with several new design elements: data shuffle over Infiniband, an in-memory merge mechanism for the Reducer, and data pre-fetching for the Mapper. We perform our experiments on native Infiniband using Remote Direct Memory Access (RDMA) and compare our results with those of Hadoop-A [1] and default Hadoop over different interconnects and protocols. For all these experiments, we perform network-level parameter tuning and use optimum values for each Hadoop design. Our performance results show that, for a 100GB TeraSort running on an eight-node cluster, we achieve a performance improvement of 32% over IP-over-Infiniband (IPoIB) and 21% over Hadoop-A. With multiple disks per node, this benefit rises to 39% over IPoIB and 31% over Hadoop-A.

Jithin Jose - One of the best experts on this subject based on the ideXlab platform.

  • High performance MPI library over SR-IOV enabled Infiniband clusters
    2014 21st International Conference on High Performance Computing (HiPC), 2014
    Co-Authors: Jie Zhang, Xiaoyi Lu, Jithin Jose, Mingzhe Li, Dhabaleswar K. Panda
    Abstract:

    Virtualization has become a central role in HPC Cloud due to easy management and low cost of computation and communication. Recently, Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as Infiniband and can attain near to native performance for inter-node communication. However, the SR-IOV scheme lacks locality aware communication support, which leads to performance overheads for inter-VM communication within a same physical node. To address this issue, this paper first proposes a high performance design of MPI library over SR-IOV enabled Infiniband clusters by dynamically detecting VM locality and coordinating data movements between SR-IOV and Inter-VM shared memory (IVShmem) channels. Through our proposed design, MPI applications running in virtualized mode can achieve efficient locality-aware communication on SR-IOV enabled Infiniband clusters. In addition, we optimize communications in IVShmem and SR-IOV channels by analyzing the performance impact of core mechanisms and parameters inside MPI library to deliver better performance in virtual machines. Finally, we conduct comprehensive performance studies by using point-to-point and collective benchmarks, and HPC applications. Experimental evaluations show that our proposed MPI library design can significantly improve the performance for point-to-point and collective operations, and MPI applications with different Infiniband transport protocols (RC and UD) by up to 158%, 76%, 43%, respectively, compared with SR-IOV. To the best of our knowledge, this is the first study to offer a high performance MPI library that supports efficient locality aware MPI communication over SR-IOV enabled Infiniband clusters.
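
    A hedged sketch of the locality-detection step described above: every rank publishes an identifier of the physical host it runs on, and peers with matching identifiers are routed over the shared-memory (IVShmem-style) channel instead of the SR-IOV virtual function. This is not the MVAPICH2-Virt code; in a real VM the runtime would need a host-provided identifier rather than the guest hostname used here for illustration.

```c
/* Hypothetical locality detection sketch.  Compile with mpicc; run one
 * process per VM.  gethostname() stands in for a hypervisor-exported
 * physical-host identifier. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum channel { CH_SHMEM, CH_NETWORK };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char id[64] = {0};
    gethostname(id, sizeof(id) - 1);

    /* Exchange host identifiers once at startup. */
    char *all = malloc((size_t)size * 64);
    MPI_Allgather(id, 64, MPI_CHAR, all, 64, MPI_CHAR, MPI_COMM_WORLD);

    /* Peers on the same physical host get the shared-memory channel. */
    for (int peer = 0; peer < size; peer++) {
        enum channel ch = (strncmp(id, all + (size_t)peer * 64, 64) == 0)
                              ? CH_SHMEM : CH_NETWORK;
        if (rank == 0)
            printf("peer %d -> %s\n", peer,
                   ch == CH_SHMEM ? "shared memory" : "SR-IOV network");
    }

    free(all);
    MPI_Finalize();
    return 0;
}
```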

  • high performance design of hadoop rpc with rdma over Infiniband
    International Conference on Parallel Processing, 2013
    Co-Authors: Nusrat Sharmin Islam, Hari Subramoni, Jithin Jose, Hao Wang, Md Wasiurrahman, Dhabaleswar K. Panda
    Abstract:

    Hadoop RPC is the basic communication mechanism in the Hadoop ecosystem. It is used with other Hadoop components like MapReduce, HDFS, and HBase in real-world data centers, e.g. at Facebook and Yahoo!. However, the current Hadoop RPC design is built on the Java sockets interface, which limits its potential performance. The High Performance Computing community has exploited high-throughput, low-latency networks such as Infiniband for many years. In this paper, we first analyze the performance of the current Hadoop RPC design by unearthing buffer management and communication bottlenecks that are not apparent on slower networks. Then we propose a novel design (RPCoIB) of Hadoop RPC with RDMA over Infiniband networks. RPCoIB provides a JVM-bypassed buffer management scheme and utilizes message size locality to avoid multiple memory allocations and copies in data serialization and deserialization. Our performance evaluations reveal that the basic ping-pong latencies for varied data sizes are reduced by 42%-49% and 46%-50% compared with 10GigE and IPoIB QDR (32 Gbps), respectively, while the RPCoIB design also improves the peak throughput by 82% and 64% compared with 10GigE and IPoIB. Compared to default Hadoop over IPoIB QDR, our RPCoIB design improves the performance of the Sort benchmark on 64 compute nodes by 15% and the performance of the CloudBurst application by 10%. We also present thorough, integrated evaluations of our RPCoIB design with other research directions that optimize HDFS and HBase using RDMA over Infiniband. Compared with their best performance, we observe 10% improvement for HDFS-IB and 24% improvement for HBase-IB. To the best of our knowledge, this is the first such design of the Hadoop RPC system over high-performance networks such as Infiniband.

  • high performance rdma based design of hadoop mapreduce over Infiniband
    IEEE International Symposium on Parallel & Distributed Processing Workshops and Phd Forum, 2013
    Co-Authors: Nusrat Sharmin Islam, Hari Subramoni, Jithin Jose, Hao Wang, Dhabaleswar K. Panda
    Abstract:

    MapReduce is a very popular programming model used to handle large datasets in enterprise data centers and clouds. Although various implementations of MapReduce exist, Hadoop MapReduce is the most widely used in large data centers like Facebook, Yahoo!, and Amazon due to its portability and fault tolerance. Network performance plays a key role in determining the performance of data-intensive applications using Hadoop MapReduce, as data required by the map and reduce processes can be distributed across the cluster. In this context, data center designers have been looking at high-performance interconnects such as Infiniband to enhance the performance of their Hadoop MapReduce based applications. However, achieving better performance through the use of high-performance interconnects like Infiniband is a non-trivial task: it requires a careful redesign of the communication framework inside MapReduce, since several assumptions made for socket-based communication in the current framework do not hold for high-performance interconnects. In this paper, we propose the design of an RDMA-based Hadoop MapReduce over Infiniband with several new design elements: data shuffle over Infiniband, an in-memory merge mechanism for the Reducer, and data pre-fetching for the Mapper. We perform our experiments on native Infiniband using Remote Direct Memory Access (RDMA) and compare our results with those of Hadoop-A [1] and default Hadoop over different interconnects and protocols. For all these experiments, we perform network-level parameter tuning and use optimum values for each Hadoop design. Our performance results show that, for a 100GB TeraSort running on an eight-node cluster, we achieve a performance improvement of 32% over IP-over-Infiniband (IPoIB) and 21% over Hadoop-A. With multiple disks per node, this benefit rises to 39% over IPoIB and 31% over Hadoop-A.

  • SR-IOV Support for Virtualization on Infiniband Clusters: Early Experience
    2013 13th IEEE ACM International Symposium on Cluster Cloud and Grid Computing, 2013
    Co-Authors: Jithin Jose, Xiaoyi Lu, Krishna Chaitanya Kandalla, Mark Daniel Arnold, Mingzhe Li, Dhabaleswar K. Panda
    Abstract:

    High Performance Computing (HPC) systems are becoming increasingly complex and are also associated with very high operational costs. The cloud computing paradigm, coupled with modern Virtual Machine (VM) technology, offers attractive techniques to easily manage large-scale systems while significantly bringing down the cost of computation, memory, and storage. However, running HPC applications on cloud systems still remains a major challenge. One of the biggest hurdles in realizing this objective is the performance offered by virtualized computing environments, more specifically, virtualized I/O devices. Since HPC applications and communication middlewares rely heavily on advanced features offered by modern high-performance interconnects such as Infiniband, the performance of virtualized Infiniband interfaces is crucial. Emerging hardware-based solutions, such as Single Root I/O Virtualization (SR-IOV), offer an attractive alternative to existing software-based solutions. The benefits of SR-IOV have been widely studied for GigE and 10GigE networks. However, with Infiniband networks being increasingly adopted in the cloud computing domain, it is critical to fully understand the performance benefits of SR-IOV in Infiniband networks, especially for exploring the performance characteristics and trade-offs of HPC communication middlewares (such as Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models) and applications. To the best of our knowledge, this is the first paper that offers an in-depth analysis of SR-IOV with Infiniband. Our experimental evaluations show that the performance of MPI and PGAS point-to-point communication benchmarks over SR-IOV with Infiniband is comparable to that of native Infiniband hardware for most message lengths. However, we observe that the performance of MPI collective operations over SR-IOV with Infiniband is inferior to the native (non-virtualized) mode. We also evaluate the trade-offs of various VM-to-CPU mapping policies on modern multi-core architectures and present our experiences.

  • high performance rdma based design of hdfs over Infiniband
    IEEE International Conference on High Performance Computing Data and Analytics, 2012
    Co-Authors: Nusrat Sharmin Islam, Hari Subramoni, Jithin Jose, Hao Wang, Md Wasiur Rahman, Raghunath Rajachandrasekar, Chet Murthy, Dhabaleswar K. Panda
    Abstract:

    The Hadoop Distributed File System (HDFS) acts as the primary storage of Hadoop and has been adopted by reputed organizations (Facebook, Yahoo!, etc.) due to its portability and fault tolerance. The existing implementation of HDFS uses the Java socket interface for communication, which delivers suboptimal performance in terms of latency and throughput. For data-intensive applications, network performance becomes a key component as the amount of data being stored and replicated to HDFS increases. In this paper, we present a novel design of HDFS using Remote Direct Memory Access (RDMA) over Infiniband via JNI interfaces. Experimental results show that, for 5GB HDFS file writes, the new design reduces the communication time by 87% and 30% over 1 Gigabit Ethernet (1GigE) and IP-over-Infiniband (IPoIB), respectively, on a QDR platform (32 Gbps). For HBase, the Put operation performance is improved by 26% with our design. To the best of our knowledge, this is the first design of HDFS over Infiniband networks.

Matthew J Koop - One of the best experts on this subject based on the ideXlab platform.

  • scalable mpi design over Infiniband using extended reliable connection
    International Conference on Cluster Computing, 2008
    Co-Authors: Matthew J Koop, Jaidev K Sridhar, Dhabaleswar K. Panda
    Abstract:

    A significant component of a high-performance cluster is the compute-node interconnect. Infiniband is one such interconnect and is enjoying wide success due to its low latency (1.0-3.0 μs), high bandwidth, and other features. The Message Passing Interface (MPI) is the dominant programming model for parallel scientific applications, so the MPI library and the interconnect play a significant role in scalability, a role that grows as these clusters continue to scale to ever-increasing sizes. As an example, the "Ranger" system at the Texas Advanced Computing Center (TACC) includes over 60,000 cores with nearly 4,000 Infiniband ports. Previous work has shown that memory usage simply for connections when using the Reliable Connection (RC) transport of Infiniband can reach hundreds of megabytes of memory per process at that scale. To address these scalability problems, a new Infiniband transport, eXtended Reliable Connection (XRC), has been introduced. In this paper we describe XRC and design an MPI over this new transport. We describe the variety of design choices that must be made as well as the various optimizations that XRC allows. We implement our designs and evaluate them on an Infiniband cluster against RC-based designs. The memory scalability in terms of both connection memory and memory efficiency for communication buffers is evaluated for all of the configurations. The connection-memory scalability evaluation shows a potential 100-fold improvement over a similarly configured RC-based design. Evaluation using NAMD shows a 10% performance improvement for our XRC-based prototype on the jac2000 benchmark.
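
    A back-of-envelope sketch of why XRC shrinks connection state: with RC, a fully connected process needs a queue pair per remote process, while with XRC it needs one per remote node, since receive state is shared on the destination node. The per-QP footprint below is an assumed figure for illustration, and this naive count only captures the per-process-to-per-node scaling (roughly a factor equal to the cores per node); the paper's reported savings reflect the full design.

```c
/* Back-of-envelope connection-state comparison (illustrative numbers). */
#include <stdio.h>

int main(void)
{
    long cores_per_node = 16;
    long nodes          = 4096;                /* ~64K processes, Ranger-scale */
    long procs          = cores_per_node * nodes;
    double qp_kib       = 8.0;                 /* assumed per-QP footprint (KiB) */

    long rc_qps_per_proc  = procs - cores_per_node;   /* one per remote process */
    long xrc_qps_per_proc = nodes - 1;                /* one per remote node    */

    printf("RC : %ld QPs/process  ~ %.1f MiB/process\n",
           rc_qps_per_proc,  rc_qps_per_proc  * qp_kib / 1024.0);
    printf("XRC: %ld QPs/process  ~ %.1f MiB/process\n",
           xrc_qps_per_proc, xrc_qps_per_proc * qp_kib / 1024.0);
    printf("reduction factor ~ %.0fx\n",
           (double)rc_qps_per_proc / xrc_qps_per_proc);
    return 0;
}
```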

  • performance analysis and evaluation of pcie 2 0 and quad data rate Infiniband
    High Performance Interconnects, 2008
    Co-Authors: Matthew J Koop, Wei Huang, Karthik Gopalakrishnan, Dhabaleswar K. Panda
    Abstract:

    High-performance systems are undergoing a major shift as commodity multi-core systems become increasingly prevalent. As the number of processes per compute node increases, the other parts of the system must also scale appropriately to maintain a balanced system. In the area of high-performance computing, one very important element of the overall system is the network interconnect that connects compute nodes. Infiniband is a popular interconnect for high-performance clusters. Unfortunately, due to the limited bandwidth of the PCI-Express fabric, Infiniband performance has remained limited. PCI-Express (PCIe) 2.0 has become available and has doubled the available transfer rates. This additional I/O bandwidth balances the system and makes higher data rates for external interconnects such as Infiniband feasible. As a result, the Infiniband quad data rate (QDR) mode has become available on the Mellanox Infiniband host channel adapter (HCA) with a 40 Gb/sec signaling rate. In this paper we perform an in-depth performance analysis of PCIe 2.0 and the effect of increased Infiniband signaling rates. We show that even using the double data rate (DDR) interface, PCIe 2.0 enables a 25% improvement in NAS Parallel Benchmark IS performance. Furthermore, we show that when using QDR on PCIe 2.0, network loopback can outperform a shared memory message passing implementation. We show that increased interconnect bandwidth significantly improves the overall system balance by lowering latency and increasing bandwidth.
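
    The system-balance argument above comes down to simple link arithmetic: both the Infiniband links and the PCIe generations in question use 8b/10b encoding, so the usable data rate is 80% of the signaling rate. The small program below works out the standard published figures and shows why a PCIe 1.x x8 slot saturates at DDR speeds while PCIe 2.0 x8 can feed a QDR HCA.

```c
/* Rough link-rate arithmetic behind the PCIe 2.0 / QDR pairing.
 * All links below use 8b/10b encoding: usable data rate = 0.8 x signaling. */
#include <stdio.h>

static double gbytes(double signal_gbps) { return signal_gbps * 0.8 / 8.0; }

int main(void)
{
    /* Infiniband 4x links: DDR = 5 Gb/s/lane, QDR = 10 Gb/s/lane. */
    printf("IB 4x DDR  : %4.1f GB/s per direction\n", gbytes(4 * 5.0));
    printf("IB 4x QDR  : %4.1f GB/s per direction\n", gbytes(4 * 10.0));

    /* PCIe x8 slots: 2.5 GT/s/lane (gen 1) vs 5 GT/s/lane (gen 2). */
    printf("PCIe 1.x x8: %4.1f GB/s per direction\n", gbytes(8 * 2.5));
    printf("PCIe 2.0 x8: %4.1f GB/s per direction\n", gbytes(8 * 5.0));
    return 0;
}
```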

  • can software reliability outperform hardware reliability on high performance interconnects a case study with mpi over Infiniband
    International Conference on Supercomputing, 2008
    Co-Authors: Matthew J Koop, Rahul Kumar, Dhabaleswar K. Panda
    Abstract:

    An important part of modern supercomputing platforms is the network interconnect. As the number of computing nodes in clusters has increased, the role of the interconnect has become more important. Modern interconnects, such as Infiniband, Quadrics, and Myrinet, have become popular due to their low latency and increased performance over traditional Ethernet. As these interconnects become more widely used and clusters continue to scale, design choices such as where data reliability should be provided become an important issue. In this work we address the issue of network reliability design, using Infiniband as a case study. Unlike other high-performance interconnects, Infiniband exposes both reliable and unreliable APIs. As part of our study we implement the Message Passing Interface (MPI) over the Unreliable Connection (UC) transport and compare it with the Reliable Connection (RC) and Unreliable Datagram (UD) transports for MPI. We detail the costs of reliability for different message patterns and show that providing reliability in software instead of hardware can increase performance by up to 25% in a molecular dynamics application (NAMD) on a 512-core Infiniband cluster.
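
    For readers unfamiliar with what "reliability in software" entails, the sketch below shows the minimal sender-side bookkeeping such a scheme needs: sequence numbers, a list of unacknowledged messages, cumulative ACK processing, and timeout-driven retransmission. It is a hypothetical illustration, not the paper's MVAPICH protocol; all names and the second-granularity timeout are invented.

```c
/* Minimal sender-side reliability sketch over an unreliable transport. */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

struct pending {
    uint64_t seq;
    void    *payload;
    size_t   len;
    time_t   sent_at;
    struct pending *next;
};

struct reli_state {
    uint64_t next_seq;      /* next sequence number to assign            */
    uint64_t acked_upto;    /* highest cumulative ACK seen from the peer */
    struct pending *unacked;
};

/* Record a message before posting it on the UC/UD queue pair. */
uint64_t reli_track(struct reli_state *s, void *payload, size_t len)
{
    struct pending *p = malloc(sizeof(*p));
    p->seq = s->next_seq++;
    p->payload = payload;
    p->len = len;
    p->sent_at = time(NULL);
    p->next = s->unacked;
    s->unacked = p;
    return p->seq;
}

/* Process a cumulative ACK piggybacked on peer traffic. */
void reli_ack(struct reli_state *s, uint64_t ack)
{
    if (ack > s->acked_upto) s->acked_upto = ack;
    struct pending **pp = &s->unacked;
    while (*pp) {
        if ((*pp)->seq <= s->acked_upto) {
            struct pending *done = *pp;
            *pp = done->next;
            free(done);
        } else {
            pp = &(*pp)->next;
        }
    }
}

/* Call periodically: invoke resend() for anything that timed out. */
void reli_check(struct reli_state *s, int timeout_s,
                void (*resend)(void *payload, size_t len))
{
    time_t now = time(NULL);
    for (struct pending *p = s->unacked; p; p = p->next) {
        if (now - p->sent_at >= timeout_s) {
            resend(p->payload, p->len);
            p->sent_at = now;
        }
    }
}
```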

  • mvapich aptus scalable high performance multi transport mpi over Infiniband
    International Parallel and Distributed Processing Symposium, 2008
    Co-Authors: Matthew J Koop, Terry Jones, Dhabaleswar K. Panda
    Abstract:

    The need for computational cycles continues to exceed availability, driving commodity clusters to increasing scales. Infiniband is a popular interconnect for upcoming clusters containing tens of thousands of cores, due to its low latency (1.5 μs) and high bandwidth (1.5 GB/sec). Since most scientific applications running on these clusters are written using the Message Passing Interface (MPI) as the parallel programming model, the MPI library plays a key role in the performance and scalability of the system. Nearly all MPIs implemented over Infiniband currently use the Reliable Connection (RC) transport of Infiniband to implement message passing. Using this transport exclusively, however, has been shown to potentially reach a memory footprint of over 200 MB/task at 16K tasks for the MPI library. The Unreliable Datagram (UD) transport, by contrast, offers higher scalability, but at the cost of medium and large message performance. In this paper we present a multi-transport MPI design, MVAPICH-Aptus, that uses both the RC and UD transports of Infiniband to deliver scalability and performance higher than that of a single-transport MPI design. Evaluation of our hybrid design on 512 cores shows a 12% improvement over an RC-based design and a 4% improvement over a UD-based design for the SMG2000 application benchmark. In addition, for the molecular dynamics application NAMD we show a 10% improvement over an RC-only design. To the best of our knowledge, this is the first such analysis and design of an optimized MPI using both UD and RC.
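
    A hedged sketch of what a hybrid RC/UD selection policy can look like: every peer starts on the connectionless UD path, and an RC connection is set up lazily once traffic to that peer crosses a threshold. The thresholds, names, and the connection callback are invented for illustration and are not MVAPICH-Aptus's actual policy.

```c
/* Hypothetical transport-selection sketch for a hybrid RC/UD MPI channel. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RC_MSG_THRESHOLD   64        /* promote after this many messages */
#define RC_BYTE_THRESHOLD  (1 << 20) /* ... or after this many bytes     */

enum transport { TRANSPORT_UD, TRANSPORT_RC };

struct peer {
    uint64_t msgs_sent;
    uint64_t bytes_sent;
    bool     rc_connected;           /* RC QP established to this peer?  */
};

/* Decide which transport the next message of 'len' bytes should use,
 * requesting a lazy RC connection once the peer becomes "hot". */
enum transport select_transport(struct peer *p, size_t len,
                                void (*connect_rc)(struct peer *))
{
    p->msgs_sent++;
    p->bytes_sent += len;

    if (!p->rc_connected &&
        (p->msgs_sent >= RC_MSG_THRESHOLD ||
         p->bytes_sent >= RC_BYTE_THRESHOLD)) {
        connect_rc(p);               /* assumed to set p->rc_connected */
    }
    return p->rc_connected ? TRANSPORT_RC : TRANSPORT_UD;
}
```

    Keeping cold peers on UD caps connection memory, while hot peers still get RC's better medium and large message performance.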

  • zero copy protocol for mpi using Infiniband unreliable datagram
    International Conference on Cluster Computing, 2007
    Co-Authors: Matthew J Koop, Sayantan Sur, Dhabaleswar K. Panda
    Abstract:

    Memory copies are widely regarded as detrimental to the overall performance of applications. High-performance systems make every effort to reduce the number of memory copies, especially the copies incurred during message passing. State-of-the-art implementations of message-passing libraries, such as MPI, utilize user-level networking protocols to reduce or eliminate memory copies. Infiniband is an emerging user-level networking technology that is gaining rapid acceptance in several domains, including HPC. In order to eliminate message copies while transferring large messages, MPI libraries over Infiniband employ "zero-copy" protocols which use Remote Direct Memory Access (RDMA). RDMA is available only in the connection-oriented transports of Infiniband, such as Reliable Connection (RC). However, the Unreliable Datagram (UD) transport of Infiniband has been shown to scale much better than the RC transport in regard to memory usage. In an optimal design, it should be possible to perform zero-copy message transfers over scalable transports (such as UD). In this paper, we present our design of a novel zero-copy protocol which is built directly over the scalable UD transport. Thus, our protocol achieves the twin objectives of scalability and good performance. Our analysis shows that uni-directional messaging bandwidth can be within 9% of what is achievable over RC for messages of 64 KB and above. Application benchmark evaluation shows that our design delivers a 21% speedup for the in.rhodo dataset for LAMMPS over a copy-based approach, giving performance within 1% of RC.

Jose Duato - One of the best experts on this subject based on the ideXlab platform.

  • Performance Evaluation of VBR Traffic in Infiniband
    2020
    Co-Authors: F J Alfaro, J L Sánchez, Luis Orozco, Jose Duato
    Abstract:

    The Infiniband Architecture (IBA

  • a new proposal to deal with congestion in Infiniband based fat trees
    Journal of Parallel and Distributed Computing, 2014
    Co-Authors: Jesus Escuderosahuquillo, Francisco J Quiles, Svenarne Reinemo, Tor Skeie, Olav Lysne, Pedro Javier Garcia, Jose Duato
    Abstract:

    The overall performance of High-Performance Computing applications may depend largely on the performance achieved by the network interconnecting the end-nodes; thus, high-speed interconnect technologies like Infiniband are used to provide high throughput and low latency. Nevertheless, network performance may be degraded due to congestion, so using techniques to deal with the problems derived from congestion has become practically mandatory. In this paper we propose a straightforward congestion-management method suitable for fat-tree topologies built from Infiniband components. Our proposal is based on a traffic-flow-to-service-level mapping that prevents, as far as possible with the resources available in current Infiniband components (basically Virtual Lanes), the negative impact of the two most common problems derived from congestion: head-of-line (HoL) blocking and buffer-hogging. We also provide a mathematical approach to analyze the efficiency of our proposal and several others by means of a set of analytical metrics. In certain traffic scenarios, we observe up to 68% of the ideal performance gain achievable from HoL-blocking and buffer-hogging prevention. Highlights: Cost-efficient network-interconnect design is a critical task for HPC systems. Congestion degrades network performance, so congestion management (CM) is required. Infiniband (IB)-based interconnection networks have a strong presence in HPC systems. Flow2SL is a new CM technique for IB fat-trees, based on mapping traffic flows to SLs. Flow2SL achieves up to 68% of the ideal performance gain.
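
    To make the "traffic-flow-to-service-level mapping" idea concrete, the toy function below spreads flows over the available service levels by destination, so that flows bound for different hot spots tend to land in different virtual lanes and a single congested flow is less likely to block the others at a shared port. This is an illustration only, not the paper's Flow2SL algorithm; the hash and the SL count are arbitrary choices.

```c
/* Illustration of spreading flows across Infiniband service levels. */
#include <stdint.h>
#include <stdio.h>

#define NUM_SLS 8          /* SLs used for data traffic; IBA defines 16 SLs,
                              and hardware often exposes fewer data VLs */

static uint8_t flow_to_sl(uint16_t src_lid, uint16_t dst_lid)
{
    /* Destination-major mapping: flows to the same destination share an SL,
       flows to different destinations tend to get different SLs. */
    return (uint8_t)((dst_lid ^ (src_lid >> 3)) % NUM_SLS);
}

int main(void)
{
    for (uint16_t dst = 1; dst <= 4; dst++)
        printf("flow 1 -> %u maps to SL %u\n", dst, flow_to_sl(1, dst));
    return 0;
}
```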

  • qos in Infiniband subnetworks
    IEEE Transactions on Parallel and Distributed Systems, 2004
    Co-Authors: Francisco J Alfaro, Jose L Sanchez, Jose Duato
    Abstract:

    The Infiniband Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the Infiniband Trade Association (IBTA) with the aim of providing the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to applications. In previous papers, we proposed a strategy to compute the Infiniband arbitration tables. In one of these, we presented and evaluated our proposal for treating traffic with bandwidth requirements. In another, we evaluated our strategy to compute the Infiniband arbitration tables for traffic with delay requirements, which is a more complex task. In this paper, we evaluate both of these proposals together. Furthermore, we also adapt these proposals to treat VBR traffic without QoS guarantees, achieving very good results nonetheless. Performance results show that, with a correct treatment of each traffic class in the arbitration of the output port, all traffic classes reach their QoS requirements.
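
    As a rough illustration of the kind of computation behind "computing the Infiniband arbitration tables": each virtual-lane arbitration entry carries an 8-bit weight (in 64-byte units), so per-class bandwidth requests can be scaled into that range while preserving their proportions. The normalization below is illustrative, not the authors' table-filling strategy.

```c
/* Illustrative VL-arbitration weight computation from bandwidth shares. */
#include <stdio.h>

#define NUM_VLS 4

struct vlarb_entry { int vl; int weight; };   /* weight unit: 64-byte credits */

static void fill_vlarb(const double share[NUM_VLS],
                       struct vlarb_entry table[NUM_VLS])
{
    double max = 0.0;
    for (int i = 0; i < NUM_VLS; i++)
        if (share[i] > max) max = share[i];

    for (int i = 0; i < NUM_VLS; i++) {
        table[i].vl = i;
        /* Largest class gets the maximum weight (255); others scale down. */
        int w = (int)(share[i] / max * 255.0 + 0.5);
        table[i].weight = w > 0 ? w : 1;
    }
}

int main(void)
{
    double share[NUM_VLS] = { 0.40, 0.30, 0.20, 0.10 };  /* requested fractions */
    struct vlarb_entry table[NUM_VLS];
    fill_vlarb(share, table);
    for (int i = 0; i < NUM_VLS; i++)
        printf("VL %d -> weight %d\n", table[i].vl, table[i].weight);
    return 0;
}
```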

  • evaluation of a subnet management mechanism for Infiniband networks
    International Conference on Parallel Processing, 2003
    Co-Authors: Aurelio Bermudez, Rafael Casado, Francisco J Quiles, T M Pinkston, Jose Duato
    Abstract:

    The Infiniband architecture is a high-performance network technology for the interconnection of processor nodes and I/O devices using a point-to-point, switch-based fabric. The Infiniband specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. Subnet management entities and functions are described, but the specification does not impose any particular implementation. We present and analyze a complete subnet management mechanism for this architecture, which allows us to anticipate future directions for obtaining efficient management protocols.

  • supporting fully adaptive routing in Infiniband networks
    International Parallel and Distributed Processing Symposium, 2003
    Co-Authors: J C Martinez, Jose Flich, P Lopez, A Robles, Jose Duato
    Abstract:

    Infiniband is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The Infiniband Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Although alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. Using adaptive routing, however, could help to circumvent the congested areas in the network, leading to an increase in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use of fully adaptive routing algorithms in IBA without requiring additional network resources. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance; in particular, network throughput increases by an average factor of up to 3.9.
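
    The core idea in this abstract can be sketched as a forwarding table that stores a small set of candidate output ports per destination LID, with the switch picking a port that is not currently busy and falling back to the deterministic choice otherwise. The data layout, the busy test, and the per-packet flag below are invented for illustration; the paper's actual encoding stays compatible with IBA forwarding tables.

```c
/* Sketch of a multi-port forwarding entry with busy-port avoidance. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ALT_PORTS 3

struct fwd_entry {
    uint8_t nports;                      /* valid candidates for this DLID */
    uint8_t port[MAX_ALT_PORTS];         /* port[0] = deterministic choice */
};

/* 'port_busy' abstracts the switch's per-port occupancy signal;
 * 'adaptive_allowed' models the per-packet enable/disable flag. */
static uint8_t select_port(const struct fwd_entry *e,
                           bool (*port_busy)(uint8_t port),
                           bool adaptive_allowed)
{
    if (adaptive_allowed)
        for (uint8_t i = 0; i < e->nports; i++)
            if (!port_busy(e->port[i]))
                return e->port[i];
    return e->port[0];                   /* deterministic fallback */
}

static bool demo_busy(uint8_t port) { return port == 2; } /* pretend port 2 is congested */

int main(void)
{
    struct fwd_entry e = { .nports = 3, .port = { 2, 5, 7 } };
    printf("deterministic: port %u\n", (unsigned)select_port(&e, demo_busy, false));
    printf("adaptive     : port %u\n", (unsigned)select_port(&e, demo_busy, true));
    return 0;
}
```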