Global Address Space

The experts below are selected from a list of 1,959 experts worldwide, ranked by the ideXlab platform.

P. Sadayappan - One of the best experts on this subject based on the ideXlab platform.

  • Work stealing for GPU-accelerated parallel programs in a Global Address Space framework
    Concurrency and Computation: Practice and Experience, 2016
    Co-Authors: Humayun Arafat, Sriram Krishnamoorthy, James Dinan, Pavan Balaji, P. Sadayappan
    Abstract:

    Task parallelism is an attractive approach to automatically load balancing the computation in a parallel system and adapting to the dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a Global Address Space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations, as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.
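
    A minimal sketch of the kind of distributed task pool with work stealing the abstract describes, assuming MPI-3 RMA in place of the Global Arrays/ARMCI layer the paper builds on; run_on_cpu, run_on_gpu, and the dispatch rule are hypothetical stand-ins for the size-based CPU/GPU choice.

      // Distributed work stealing over per-rank task counters (sketch).
      // MPI-3 RMA stands in for the Global Arrays/ARMCI layer of the paper;
      // run_on_cpu/run_on_gpu are hypothetical placeholders.
      #include <mpi.h>

      static void run_on_cpu(long task) { (void)task; /* placeholder: execute a small task */ }
      static void run_on_gpu(long task) { (void)task; /* placeholder: execute a large task */ }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int me, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          const long tasks_per_rank = 1000;   // each rank owns a block of task ids
          long *counter;                      // next unclaimed local task, visible to all
          MPI_Win win;
          MPI_Win_allocate(sizeof(long), sizeof(long), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &counter, &win);
          *counter = 0;
          MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
          MPI_Barrier(MPI_COMM_WORLD);

          // Drain my own queue first, then steal from each victim in turn.
          for (int tried = 0; tried < np; ++tried) {
              int victim = (me + tried) % np;
              while (true) {
                  long idx, one = 1;
                  MPI_Fetch_and_op(&one, &idx, MPI_LONG, victim, 0, MPI_SUM, win);
                  MPI_Win_flush(victim, win);
                  if (idx >= tasks_per_rank) break;          // victim has no work left
                  long task = victim * tasks_per_rank + idx; // claimed task id
                  // Stand-in for the size-based dispatch: large tasks go to the GPU.
                  if (task % 4 == 0) run_on_gpu(task); else run_on_cpu(task);
              }
          }

          MPI_Win_unlock_all(win);
          MPI_Barrier(MPI_COMM_WORLD);
          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }

    Stealing is just a remote fetch-and-add on the victim's counter, which is exactly the kind of one-sided operation a global address space runtime makes cheap; the data-movement question raised in the abstract is whether the stolen task's inputs then have to move between CPU and GPU memory domains.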

  • HiPC - A Global Address Space approach to automated data management for parallel Quantum Monte Carlo applications
    2012 19th International Conference on High Performance Computing, 2012
    Co-Authors: James Dinan, Sravya Tirukkovalur, Lubos Mitas, Lucas K. Wagner, P. Sadayappan
    Abstract:

    Quantum Monte Carlo (QMC) applications perform simulation with respect to an initial state of the quantum mechanical system, which is often captured by using a cubic B-spline basis. This representation is stored as a read-only table of coefficients, and accesses to the table are generated at random as part of the Monte Carlo simulation. Current QMC applications, such as QWalk and QMCPACK, replicate this table at every process or node, which limits scalability because increasing the number of processors does not enable larger systems to be run. We present a partitioned Global Address Space (PGAS) approach to transparently managing this data using Global Arrays in a manner that allows the memory of multiple nodes to be aggregated. We develop an automated data management system that significantly reduces communication overheads, enabling new capabilities for QMC codes. Experimental results with the QWalk application demonstrate the effectiveness of the data management system.
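
    A minimal sketch of the data management idea, assuming MPI-3 RMA as a stand-in for the Global Arrays interface the paper actually uses: the coefficient table is partitioned across ranks instead of replicated, and a random lookup becomes a one-sided read from whichever rank owns the index. The table size and index calculation are illustrative only.

      // Partitioned, read-only coefficient table accessed with one-sided reads
      // (sketch).  The paper uses Global Arrays; MPI-3 RMA is an illustration.
      #include <mpi.h>
      #include <cstdio>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int me, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          const long table_size = 1 << 20;              // total coefficients (illustrative;
          const long local_n    = table_size / np;      //  assumes np divides the table size)
          double *slice;
          MPI_Win win;
          MPI_Win_allocate(local_n * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &slice, &win);
          for (long i = 0; i < local_n; ++i)            // fill only my slice of the table
              slice[i] = 0.001 * (me * local_n + i);
          MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
          MPI_Barrier(MPI_COMM_WORLD);

          // One table lookup, as a Monte Carlo walker would generate at random.
          long idx   = (me * 7919L + 13) % table_size;  // arbitrary global index
          int  owner = (int)(idx / local_n);            // rank holding that index
          double coeff;
          MPI_Get(&coeff, 1, MPI_DOUBLE, owner, idx % local_n, 1, MPI_DOUBLE, win);
          MPI_Win_flush(owner, win);
          std::printf("rank %d read coefficient[%ld] = %f from rank %d\n",
                      me, idx, coeff, owner);

          MPI_Win_unlock_all(win);
          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }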

  • CLUSTER - Non-collective parallel I/O for Global Address Space programming models
    2007 IEEE International Conference on Cluster Computing, 2007
    Co-Authors: S. Krishnamoorthy, J. Nieplocha, Vinod Tipparaju, Juan Piernas Canovas, P. Sadayappan
    Abstract:

    Achieving high performance for out-of-core applications typically involves explicit management of the movement of data between the disk and the physical memory. We are developing a programming environment in which the different levels of the memory hierarchy are handled efficiently in a unified, transparent framework. In this paper, we present our experiences with implementing efficient non-collective I/O (GPC-IO) as part of this framework. As a generalization of the remote procedure call (RPC) that served as a foundation for the Sun NFS system, we developed a global procedure call (GPC) to invoke procedures on a remote node to handle non-collective I/O. We consider alternative approaches that can be employed in implementing this functionality. The approaches are evaluated using a representative computation from quantum chemistry. The results demonstrate that GPC-IO achieves better absolute execution times, strong scaling, and weak scaling than the alternatives considered.
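
    A hedged sketch of the global procedure call (GPC) pattern: the requesting process ships a small read procedure to the node that owns the file fragment, so no other process has to participate. The paper implements its own GPC layer inside a GAS runtime; UPC++'s rpc and the node-local file path below are stand-ins chosen only to keep the example self-contained.

      // "Global procedure call" for non-collective reads (sketch).  upcxx::rpc
      // and the node-local path are stand-ins; the paper builds its own GPC
      // layer rather than using UPC++.
      #include <upcxx/upcxx.hpp>
      #include <cstdio>
      #include <string>
      #include <vector>

      // Runs on the owner of the fragment: read `count` doubles at `offset`
      // from a node-local file and ship them back to the caller.
      static std::vector<double> read_fragment(long offset, long count) {
          std::vector<double> buf((size_t)count, 0.0);
          std::string path = "/local/scratch/frag." + std::to_string(upcxx::rank_me());
          if (FILE *f = std::fopen(path.c_str(), "rb")) {
              std::fseek(f, offset * (long)sizeof(double), SEEK_SET);
              size_t got = std::fread(buf.data(), sizeof(double), (size_t)count, f);
              buf.resize(got);
              std::fclose(f);
          }
          return buf;
      }

      int main() {
          upcxx::init();
          int owner = (upcxx::rank_me() + 1) % upcxx::rank_n();  // whoever owns the data

          // Non-collective: only this process and the owner take part in the read.
          auto fut = upcxx::rpc(owner, read_fragment, 0L, 128L);
          std::vector<double> data = fut.wait();
          std::printf("rank %d fetched %zu doubles from rank %d\n",
                      upcxx::rank_me(), data.size(), owner);

          upcxx::barrier();
          upcxx::finalize();
          return 0;
      }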

  • A Global Address Space framework for locality aware scheduling of block-sparse computations
    2007 IEEE International Parallel and Distributed Processing Symposium, 2007
    Co-Authors: Sriram Krishnamoorthy, Jarek Nieplocha, Atanas Rountev, Umit Catalyurek, P. Sadayappan
    Abstract:

    In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a Global Address Space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation into stages, together with the movement of the associated data between secondary storage and global memory, and between global memory and local memory, is automatically managed. A novel formulation of hypergraph partitioning is used to model the optimization problem of minimizing disk I/O. Experimental evaluation using a sub-computation from the quantum chemistry domain shows a reduction in the disk I/O cost by up to a factor of 11, and a reduction in turnaround time by up to 49%, as compared to alternative approaches used in state-of-the-art quantum chemistry codes.
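
    A toy sketch of the staged execution described above, with a trivial round-robin partition_tasks standing in for the paper's hypergraph partitioning formulation (which is the actual contribution): tasks are grouped into stages, each stage's blocks are brought into global memory once, and the number of distinct blocks loaded serves as a proxy for disk I/O.

      // Staged execution of block-sparse tasks (sketch).  partition_tasks is a
      // hypothetical placeholder for the paper's hypergraph partitioner.
      #include <cstdio>
      #include <set>
      #include <vector>

      struct Task { std::vector<int> blocks; };          // data blocks a task touches

      // Stand-in for the partitioner: assign each task to a stage round-robin.
      static std::vector<int> partition_tasks(const std::vector<Task> &tasks, int stages) {
          std::vector<int> stage(tasks.size());
          for (size_t i = 0; i < tasks.size(); ++i) stage[i] = (int)(i % stages);
          return stage;
      }

      int main() {
          std::vector<Task> tasks = {{{0, 1}}, {{1, 2}}, {{0, 2}}, {{3, 4}}, {{3, 5}}};
          int nstages = 2;
          std::vector<int> stage = partition_tasks(tasks, nstages);

          long blocks_loaded = 0;                         // proxy for disk I/O volume
          for (int s = 0; s < nstages; ++s) {
              std::set<int> resident;                     // blocks staged into global memory
              for (size_t i = 0; i < tasks.size(); ++i)
                  if (stage[i] == s)
                      for (int b : tasks[i].blocks) resident.insert(b);
              blocks_loaded += (long)resident.size();     // each block read once per stage
              // ... execute the tasks of stage s against the resident blocks ...
          }
          std::printf("blocks loaded from disk: %ld\n", blocks_loaded);
          return 0;
      }

    A good partition keeps tasks that share blocks in the same stage, so the same block is not re-read from disk in several stages; that is the cost the hypergraph formulation minimizes.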

Katherine Yelick - One of the best experts on this subject based on the ideXlab platform.

  • ARRAY@PLDI - A Local-View Array Library for Partitioned Global Address Space C++ Programs
    Proceedings of ACM SIGPLAN International Workshop on Libraries Languages and Compilers for Array Programming - ARRAY'14, 2014
    Co-Authors: Amir Kamil, Yili Zheng, Katherine Yelick
    Abstract:

    Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned Global Address Space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.
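
    A minimal sketch of the local-view model, using only core UPC++ primitives rather than the multidimensional array library the paper presents: each rank allocates and fills its own tile, publishes its base pointer through a dist_object, and pulls a neighbor's boundary strip with a one-sided bulk read. Tile size and ghost width are illustrative.

      // Local-view tiles with one-sided ghost exchange (sketch).  Core UPC++
      // primitives only; the paper's array library is not reproduced here.
      #include <upcxx/upcxx.hpp>
      #include <cstdio>

      int main() {
          upcxx::init();
          int me = upcxx::rank_me(), np = upcxx::rank_n();
          const size_t tile = 1024;                      // interior points per rank

          // Local view: every rank allocates and initializes only its own tile.
          upcxx::global_ptr<double> mine = upcxx::new_array<double>(tile);
          double *local = mine.local();
          for (size_t i = 0; i < tile; ++i) local[i] = me + 0.001 * i;

          // Make each tile's base pointer visible to the other ranks.
          upcxx::dist_object<upcxx::global_ptr<double>> dir(mine);
          upcxx::barrier();

          // Bulk one-sided read of the right neighbor's first 8 values (a ghost strip).
          int right = (me + 1) % np;
          upcxx::global_ptr<double> rtile = dir.fetch(right).wait();
          double ghost[8];
          upcxx::rget(rtile, ghost, 8).wait();
          std::printf("rank %d: neighbor %d starts with %f\n", me, right, ghost[0]);

          upcxx::barrier();
          upcxx::delete_array(mine);
          upcxx::finalize();
          return 0;
      }

    The point of the local view is visible even in this toy: no rank ever describes the whole global grid, only its own tile and the remote pointers it needs.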

  • Tuning collective communication for partitioned Global Address Space programming models
    Parallel Computing, 2011
    Co-Authors: Rajesh Nishtala, Paul Hargrove, Yili Zheng, Katherine Yelick
    Abstract:

    Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared-memory programming style combined with the locality control necessary to run on large-scale distributed-memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message-passing programs, and admit different implementation approaches that take advantage of the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory, and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and show that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect.
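
    A hedged illustration of the automatic tuning such a framework performs: two broadcast algorithms (a flat fan-out and a binomial tree) are timed on the target machine and the faster one is selected. GASNet's real collectives are one-sided and far richer; plain MPI point-to-point is used here only to keep the tuning loop self-contained.

      // Selecting among broadcast algorithms by measurement (sketch).
      #include <mpi.h>
      #include <cstdio>
      #include <vector>

      static void bcast_flat(double *buf, int n, MPI_Comm comm) {
          int me, np; MPI_Comm_rank(comm, &me); MPI_Comm_size(comm, &np);
          if (me == 0)
              for (int r = 1; r < np; ++r) MPI_Send(buf, n, MPI_DOUBLE, r, 0, comm);
          else
              MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
      }

      static void bcast_binomial(double *buf, int n, MPI_Comm comm) {
          int me, np; MPI_Comm_rank(comm, &me); MPI_Comm_size(comm, &np);
          int mask = 1;
          while (mask < np) mask <<= 1;            // smallest power of two >= np
          // Each nonzero rank receives once from its parent, then forwards down the tree.
          for (int step = mask >> 1; step >= 1; step >>= 1) {
              if (me % (2 * step) == step)
                  MPI_Recv(buf, n, MPI_DOUBLE, me - step, 0, comm, MPI_STATUS_IGNORE);
              if (me % (2 * step) == 0 && me + step < np)
                  MPI_Send(buf, n, MPI_DOUBLE, me + step, 0, comm);
          }
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int me; MPI_Comm_rank(MPI_COMM_WORLD, &me);
          std::vector<double> buf(4096, me == 0 ? 1.0 : 0.0);

          void (*candidates[])(double *, int, MPI_Comm) = {bcast_flat, bcast_binomial};
          const char *names[] = {"flat", "binomial"};
          double best = 1e30; int pick = 0;

          for (int c = 0; c < 2; ++c) {            // benchmark each algorithm
              MPI_Barrier(MPI_COMM_WORLD);
              double t0 = MPI_Wtime();
              for (int rep = 0; rep < 10; ++rep)
                  candidates[c](buf.data(), (int)buf.size(), MPI_COMM_WORLD);
              double t = MPI_Wtime() - t0, tmax;
              MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
              if (tmax < best) { best = tmax; pick = c; }
          }
          if (me == 0) std::printf("selected %s broadcast (%.6f s)\n", names[pick], best);
          MPI_Finalize();
          return 0;
      }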

  • Porting GASNet to Portals: partitioned Global Address Space (PGAS) language support for the Cray XT
    2009
    Co-Authors: Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick
    Abstract:

    Partitioned Global Address Space (PGAS) languages are an emerging alternative to MPI for HPC application development. The GASNet library from Lawrence Berkeley National Lab and the University of California at Berkeley provides the network runtime for multiple implementations of four PGAS languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Titanium, and Chapel. GASNet provides a low-overhead one-sided communication layer that has enabled portability and high performance of PGAS languages. This paper describes our experiences porting GASNet to the Portals network API on the Cray XT series.

  • Productivity and performance using partitioned Global Address Space languages
    Parallel Symbolic Computation, 2007
    Co-Authors: Katherine Yelick, Dan Bonachea, Weiyu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L Graham, Paul Hargrove, Paul N Hilfinger, Parry Husbands
    Abstract:

    Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium that boasts multiple proprietary and open-source compilers. Another PGAS language, Titanium, is a dialect of Java designed for high-performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared- and distributed-memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.

Alan D George - One of the best experts on this subject based on the ideXlab platform.

  • Parallel Performance Wizard: a performance system for the analysis of partitioned Global Address Space applications
    IEEE International Conference on High Performance Computing Data and Analytics, 2010
    Co-Authors: Hunghsun Su, Max Billingsley, Alan D George
    Abstract:

    Given the complexity of high-performance parallel programs, developers often must rely on performance analysis tools to help them improve the performance of their applications. While many tools support analysis of message-passing programs, tool support is limited for applications written in programming models that present a partitioned Global Address Space (PGAS) to the programmer, such as UPC and SHMEM. Existing tools that support message-passing models are difficult to extend to support PGAS models due to differences between the two paradigms and the techniques used in their implementations. In this paper, we present our work on Parallel Performance Wizard (PPW), a performance analysis system for PGAS and MPI application analysis. We discuss new concepts, namely the generic-operation-type abstraction and GASP-enabled data collection, developed to facilitate support for multiple programming models, and then give an overview of PPW’s automatic analysis and visualization capabilities. Finally, to show the usefulness of our system, we present results on PPW’s overhead, storage requirements, and scalability before demonstrating its effectiveness via application case studies.

  • Parallel performance wizard: A performance analysis tool for partitioned Global-Address-Space programming
    2008 IEEE International Symposium on Parallel and Distributed Processing, 2008
    Co-Authors: Hunghsun Su, Max Billingsley, Alan D George
    Abstract:

    Given the complexity of parallel programs, developers often must rely on performance analysis tools to help them improve the performance of their code. While many tools support the analysis of message-passing programs, no tool exists that fully supports programs written in programming models that present a partitioned Global Address Space (PGAS) to the programmer, such as UPC and SHMEM. Existing tools with support for message-passing models cannot be easily extended to support PGAS programming models, due to the differences between these paradigms. Furthermore, the inclusion of implicit and one-sided communication in PGAS models renders many of the analyses performed by existing tools irrelevant. For these reasons, there exists a need for a new performance tool capable of handling the challenges associated with PGAS models. In this paper, we first present background research and the framework for Parallel Performance Wizard (PPW), a modularized, event-based performance analysis tool for PGAS programming models. We then discuss features of PPW and how they are used in the analysis of PGAS applications. Finally, we illustrate how one would use PPW in the analysis and optimization of PGAS applications by presenting a small case study using the PPW version 1.0 implementation.

  • Parallel Performance Wizard: a performance analysis tool for partitioned Global Address Space programming models
    Conference on High Performance Computing (Supercomputing), 2006
    Co-Authors: Adam Leko, Dan Bonachea, Hunghsun Su, Bryan Golden, Max Billingsley, Alan D George
    Abstract:

    Scientific programmers must optimize the total time-to-solution, the combination of software development and refinement time and actual execution time. The increasing complexity at all levels of supercomputing architectures, coupled with advancements in sequential performance and a growing degree of hardware parallelism, has increasingly placed the bulk of the time-to-solution cost into the software development and tuning phase. Performance analysis tools have been useful for reducing the time-to-solution for message-passing applications; however, there is insufficient tool support for programs developed using Global-Address-Space (GAS) programming models. With the aim of maximizing user productivity, the Parallel Performance Wizard (PPW) fills this void by providing a full range of visualizations and analyses specifically designed for GAS models. To facilitate accurate instrumentation and measurement of GAS programs in PPW, a portable, model-independent performance tool interface (GASP) has been developed and successfully used with Berkeley UPC.

  • GASP: A Performance Analysis Tool Interface for Global Address Space Programming Models, Version 1.5
    Lawrence Berkeley National Laboratory, 2006
    Co-Authors: Adam Leko, Dan Bonachea, Hunghsun Su, Alan D George, Hans Sherburne
    Abstract:

    Due to the wide range of compilers and the lack of a standardized performance tool interface, writers of performance tools face many challenges when incorporating support for Global Address Space (GAS) programming models such as Unified Parallel C (UPC), Titanium, and Co-Array Fortran (CAF). This document presents a Global Address Space Performance tool interface (GASP) that is flexible enough to be adapted into current Global Address Space compiler and runtime infrastructures with little effort, while allowing performance analysis tools to gather much information about the performance of Global Address Space programs.
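
    A small sketch of the callback pattern that a GASP-like interface standardizes: the compiler or runtime emits enter/exit notifications around each communication operation and the tool aggregates them through a single entry point. Every name below is illustrative; it is not the actual GASP API.

      // Event-notification interface between a GAS runtime and a performance
      // tool (sketch).  All names are hypothetical.
      #include <cstdio>
      #include <map>
      #include <string>
      #include <chrono>

      enum class EventType { Enter, Exit };

      // Tool side: accumulate time per operation type ("put", "get", "barrier", ...).
      static std::map<std::string, double> total_seconds;
      static std::map<std::string, std::chrono::steady_clock::time_point> open_events;

      void tool_event_notify(const std::string &op, EventType type,
                             const char *file, int line) {
          auto now = std::chrono::steady_clock::now();
          if (type == EventType::Enter) {
              open_events[op] = now;
          } else {
              std::chrono::duration<double> d = now - open_events[op];
              total_seconds[op] += d.count();
              std::printf("%s at %s:%d took %.6f s\n", op.c_str(), file, line, d.count());
          }
      }

      // Runtime side: what an instrumented one-sided put would look like.
      void instrumented_put(/* ...target, data... */) {
          tool_event_notify("put", EventType::Enter, __FILE__, __LINE__);
          /* ... perform the actual one-sided put ... */
          tool_event_notify("put", EventType::Exit, __FILE__, __LINE__);
      }

      int main() {
          instrumented_put();
          return 0;
      }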

Dan Bonachea - One of the best experts on this subject based on the ideXlab platform.

  • Porting GASNet to Portals: partitioned Global Address Space (PGAS) language support for the Cray XT
    2009
    Co-Authors: Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick
    Abstract:

    Partitioned Global Address Space (PGAS) languages are an emerging alternative to MPI for HPC application development. The GASNet library from Lawrence Berkeley National Lab and the University of California at Berkeley provides the network runtime for multiple implementations of four PGAS languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Titanium, and Chapel. GASNet provides a low-overhead one-sided communication layer that has enabled portability and high performance of PGAS languages. This paper describes our experiences porting GASNet to the Portals network API on the Cray XT series.

  • PASCO - Productivity and performance using partitioned Global Address Space languages
    Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO '07, 2007
    Co-Authors: Katherine Yelick, Dan Bonachea, Weiyu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L Graham, Paul Hargrove, Paul N Hilfinger, Parry Husbands
    Abstract:

    Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium that boasts multiple proprietary and open-source compilers. Another PGAS language, Titanium, is a dialect of Java designed for high-performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared- and distributed-memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.

  • Automatic nonblocking communication for partitioned Global Address Space programs
    International Conference on Supercomputing, 2007
    Co-Authors: Weiyu Chen, Dan Bonachea, Costin Iancu, Katherine Yelick
    Abstract:

    Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
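
    A hedged sketch of the transformation the runtime applies automatically, written out by hand with MPI-3 request-based RMA as a stand-in for the UPC/GASNet machinery: the blocking remote read is hoisted and issued as a nonblocking MPI_Rget, independent computation overlaps the transfer, and the operation is completed only at the point of use. A real runtime would also perform the conflict checks noted in the comments.

      // Manual version of the automatic overlap transformation (sketch).
      #include <mpi.h>
      #include <cstdio>
      #include <vector>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int me, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          const int n = 1 << 16;
          double *shared;
          MPI_Win win;
          MPI_Win_allocate(n * sizeof(double), sizeof(double), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &shared, &win);
          for (int i = 0; i < n; ++i) shared[i] = me;
          MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
          MPI_Barrier(MPI_COMM_WORLD);

          // 1. Issue the remote get as early as possible (the runtime hoists it).
          std::vector<double> remote(n);
          MPI_Request req;
          int neighbor = (me + 1) % np;
          MPI_Rget(remote.data(), n, MPI_DOUBLE, neighbor, 0, n, MPI_DOUBLE, win, &req);

          // 2. Independent local computation overlaps with the transfer.  A real
          //    runtime would first check that this code cannot write the source.
          double local_sum = 0.0;
          for (int i = 0; i < n; ++i) local_sum += shared[i];

          // 3. Complete the transfer only at the point of use.
          MPI_Wait(&req, MPI_STATUS_IGNORE);
          double remote_sum = 0.0;
          for (int i = 0; i < n; ++i) remote_sum += remote[i];
          std::printf("rank %d: local %.0f, remote %.0f\n", me, local_sum, remote_sum);

          MPI_Win_unlock_all(win);
          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }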

Sriram Krishnamoorthy - One of the best experts on this subject based on the ideXlab platform.

  • Work stealing for GPU-accelerated parallel programs in a Global Address Space framework
    Concurrency and Computation: Practice and Experience, 2016
    Co-Authors: Humayun Arafat, Sriram Krishnamoorthy, James Dinan, Pavan Balaji, P. Sadayappan
    Abstract:

    Task parallelism is an attractive approach to automatically load balancing the computation in a parallel system and adapting to the dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a Global Address Space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations, as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.

  • Performance characterization of Global Address Space applications: a case study with NWChem
    Concurrency and Computation: Practice and Experience, 2012
    Co-Authors: Jeff R Hammond, Sriram Krishnamoorthy, Sameer Shende, Nichols A Romero, Allen D Malony
    Abstract:

    The use of Global Address Space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, the lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package, which depends on the Global Arrays/Aggregate Remote Memory Copy Interface suite for partitioned Global Address Space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. The research involved both the integration of performance instrumentation and measurement in the NWChem software and the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different generations of InfiniBand interconnects and x86 processors. The performance analysis and results show how subtle changes in runtime parameters related to the communication subsystem can have a significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks, which are already being tackled by computational chemists to improve NWChem performance. Copyright © 2011 John Wiley & Sons, Ltd.

  • Scalable transparent checkpoint-restart of Global Address Space applications on virtual machines over InfiniBand
    Computing Frontiers, 2009
    Co-Authors: Oreste Villa, Jarek Nieplocha, Sriram Krishnamoorthy, David M Brown
    Abstract:

    Checkpoint-restart is one of the most widely used software approaches to achieving fault tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable, transparent, system-level solution to address fault tolerance for applications based on Global Address Space (GAS) programming models on InfiniBand clusters. In addition to handling communication, the solution addresses transparent checkpointing of user-generated files. We exploit the support for the InfiniBand network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machine memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observe that the total checkpoint time is limited by disk I/O. Finally, we measured system-size-dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach on medium- and large-scale systems.
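
    A minimal sketch of the checkpoint coordination order described above, with hypothetical comm_suspend, comm_resume, and snapshot_node placeholders standing in for the ARMCI suspend/resume hooks and the Xen snapshot: communication is quiesced, all nodes synchronize, each node writes its image and user files, and execution resumes.

      // Coordinated checkpoint of a one-sided-communication application (sketch).
      // comm_suspend/comm_resume/snapshot_node are hypothetical placeholders.
      #include <mpi.h>
      #include <cstdio>

      static void comm_suspend() { /* placeholder: quiesce the one-sided comm library */ }
      static void comm_resume()  { /* placeholder: re-arm the one-sided comm library  */ }
      static void snapshot_node(int epoch) {
          // placeholder: flush local files and trigger the VM memory/disk snapshot
          std::printf("snapshot for epoch %d written\n", epoch);
      }

      static void checkpoint(int epoch, MPI_Comm comm) {
          comm_suspend();                 // no one-sided traffic may be in flight
          MPI_Barrier(comm);              // all nodes reach a consistent point
          snapshot_node(epoch);           // per-node image plus user-generated files
          MPI_Barrier(comm);              // wait until every snapshot is complete
          comm_resume();                  // application continues transparently
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          for (int epoch = 0; epoch < 3; ++epoch) {
              /* ... run a slice of the application ... */
              checkpoint(epoch, MPI_COMM_WORLD);
          }
          MPI_Finalize();
          return 0;
      }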

  • A Global Address Space framework for locality aware scheduling of block-sparse computations
    2007 IEEE International Parallel and Distributed Processing Symposium, 2007
    Co-Authors: Sriram Krishnamoorthy, Jarek Nieplocha, Atanas Rountev, Umit Catalyurek, P. Sadayappan
    Abstract:

    In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a Global Address Space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation into stages, together with the movement of the associated data between secondary storage and global memory, and between global memory and local memory, is automatically managed. A novel formulation of hypergraph partitioning is used to model the optimization problem of minimizing disk I/O. Experimental evaluation using a sub-computation from the quantum chemistry domain shows a reduction in the disk I/O cost by up to a factor of 11, and a reduction in turnaround time by up to 49%, as compared to alternative approaches used in state-of-the-art quantum chemistry codes.