Shared Virtual Memory

The experts below are selected from a list of 6879 experts worldwide, ranked by the ideXlab platform

Jaswinder Pal Singh - One of the best experts on this subject based on the ideXlab platform.

  • Shared Virtual Memory Across SMP Nodes Using Automatic Update: Protocols and Performance
    2007
    Co-Authors: Angelos Bilas, Liviu Iftode, David Martin, Jaswinder Pal Singh
    Abstract:

    As the workstation market moves from single processors to small-scale Shared Memory multiprocessors, it is very attractive to construct larger-scale multiprocessors by connecting widely available symmetric multiprocessors (SMPs) in a less tightly coupled way. Using a Shared Virtual Memory (SVM) layer for this purpose preserves the Shared Memory programming abstraction across nodes. We explore the feasibility and performance implications of one such approach by extending the AURC (Automatic Update Release Consistency) protocol, used in the SHRIMP multicomputer, to connect hardware-coherent SMPs rather than uniprocessors. We describe the extended AURC protocol, and compare its performance both with the AURC uniprocessor-node case and with an all-software Lazy Release Consistency (LRC) protocol extended for SMPs. We present results based on detailed simulations of two protocols (AURC and LRC) and two architectural configurations of a system with 16 processors: one with one processor per node (16 nodes) and one with four processors per node (4 nodes). We find that, unless the bandwidth of the network interface is increased, the network interface becomes the bottleneck in a clustered architecture, especially for AURC. While an LRC protocol can benefit from the reduction in per-processor communication in a clustered architecture, the write-through traffic in AURC significantly increases the communication demands per network interface. This causes more traffic contention and either prevents the performance of AURC from improving with SMP nodes or hurts it severely for applications with significant communication requirements. Thus, while AURC performs better than LRC, for applications with high communication needs the reverse may be true in clustered architectures. Among possible solutions, two are investigated in the paper: protocol changes and bandwidth increases. Further work is clearly needed on the systems and application sides to evaluate whether AURC can be extended for multiprocessor-node systems.
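
    The protocol contrast above comes down to how writes propagate: AURC relies on the network interface to automatically write each store through to a home copy of the page, whereas an all-software LRC protocol captures writes by twinning a page at the first write fault and shipping a diff of the changes at synchronization time. The sketch below illustrates only the generic software twin/diff step; the page size, names, and structure are illustrative assumptions, not code from the AURC or LRC implementations.

```c
/*
 * Minimal sketch of software write capture as used in LRC-style SVM
 * protocols: twin a page on the first write, then at release time diff
 * the dirty page against its twin and ship only the changed words.
 * All names and sizes are illustrative, not taken from any real protocol code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_WORDS 1024          /* assume 4 KB pages of 32-bit words */

typedef struct {
    uint32_t offset;             /* word offset within the page */
    uint32_t value;              /* new value to apply at the home copy */
} diff_entry;

/* On the first write fault, keep a pristine copy of the page (the "twin"). */
static uint32_t *make_twin(const uint32_t *page)
{
    uint32_t *twin = malloc(PAGE_WORDS * sizeof *twin);
    if (!twin)
        exit(1);
    memcpy(twin, page, PAGE_WORDS * sizeof *twin);
    return twin;
}

/* At release/flush time, encode only the words that changed since twinning. */
static size_t compute_diff(const uint32_t *page, const uint32_t *twin,
                           diff_entry *out)
{
    size_t n = 0;
    for (uint32_t i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i])
            out[n++] = (diff_entry){ .offset = i, .value = page[i] };
    return n;                    /* only these entries travel to the home node */
}

int main(void)
{
    uint32_t page[PAGE_WORDS] = {0};
    uint32_t *twin = make_twin(page);   /* triggered by a write fault in a real SVM */

    page[3] = 42;                        /* application writes after the fault */
    page[700] = 7;

    diff_entry diffs[PAGE_WORDS];
    size_t n = compute_diff(page, twin, diffs);
    printf("diff carries %zu changed words; an automatic-update NI would instead "
           "stream every store to the home copy as it happens\n", n);

    free(twin);
    return 0;
}
```

    The diff is what keeps LRC's per-node traffic low in the clustered case; AURC instead streams stores through the network interface, which is the write-through traffic the abstract identifies as the bottleneck.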

  • Application scaling under Shared Virtual Memory on a cluster of SMPs
    2003
    Co-Authors: Dongming Jiang, Angelos Bilas, Brian Kelley, Xiang Yu, Sanjeev Kumar, Jaswinder Pal Singh
    Abstract:

    In this paper we examine how application performance scales on a state-of-the-art Shared Virtual Memory (SVM) system on a cluster with 64 processors, comprising 4-way SMPs connected with a fast system area network. The protocol we use is home-based and takes advantage of general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that while the level of application restructuring needed is quite high compared to applications that perform well on a hardware-coherent system of this scale, and larger problem sizes are needed for good performance, SVM, surprisingly, performs quite well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore further application restructurings than those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.
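
    The scaling claim is phrased in terms of parallel efficiency. As background (the standard definition, not a formula taken from this paper), with $T_1$ the sequential time and $T_p$ the time on $p$ processors:

```latex
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
```

    "At least half the parallel efficiency of a high-end hardware-coherent system" thus means $E_{\mathrm{SVM}}(64) \ge \tfrac{1}{2}\,E_{\mathrm{HW}}(64)$ for the same application at the same processor count.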

  • Shared Virtual Memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems
    Journal of Parallel and Distributed Computing, 2003
    Co-Authors: Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh
    Abstract:

    Although the Shared Memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed Shared Memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. In this paper we examine if Shared Virtual Memory (SVM) clusters can bridge this gap by examining how application performance scales on a state-of-the-art Shared Virtual Memory cluster. We find that: (i) The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale and larger problem sizes are needed for good performance. (ii) However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale and often much more.

  • Using network interface support to avoid asynchronous protocol processing in Shared Virtual Memory systems
    International Symposium on Computer Architecture, 1999
    Co-Authors: Angelos Bilas, Cheng Liao, Jaswinder Pal Singh
    Abstract:

    The performance of page-based software Shared Virtual Memory (SVM) is still far from that achieved on hardware-coherent distributed Shared Memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node Memory system nor code instrumentation to identify Memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a Shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent Shared Memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well-performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.
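
    As a schematic of the decoupling idea (a toy model, not the GeNIMA interface): once the host has exported its page memory to the network interface, an incoming fetch can be served entirely by the NI, so the host CPU is neither interrupted nor required to poll. The thread below stands in for NI firmware; all names and the request format are hypothetical.

```c
/*
 * Schematic model of how an NI-serviced remote fetch avoids host interrupts:
 * the host exports page memory to the "NI" once, and a separate thread
 * standing in for NI firmware serves fetch requests by copying directly from
 * that memory. The host compute thread never handles the request.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static char exported_page[PAGE_SIZE];   /* memory the host registered with the NI */

typedef struct {
    char   reply[PAGE_SIZE];            /* where the "remote" requester wants the data */
    int    pending;                     /* request posted, not yet served */
    pthread_mutex_t lock;
    pthread_cond_t  cv;
} fetch_request;

static fetch_request req = {
    .pending = 0,
    .lock    = PTHREAD_MUTEX_INITIALIZER,
    .cv      = PTHREAD_COND_INITIALIZER,
};

/* Thread standing in for NI firmware: serves the fetch without host help. */
static void *ni_firmware(void *arg)
{
    fetch_request *r = arg;
    pthread_mutex_lock(&r->lock);
    while (!r->pending)
        pthread_cond_wait(&r->cv, &r->lock);
    memcpy(r->reply, exported_page, PAGE_SIZE);   /* direct data movement */
    r->pending = 0;
    pthread_cond_signal(&r->cv);
    pthread_mutex_unlock(&r->lock);
    return NULL;
}

int main(void)
{
    strcpy(exported_page, "page contents at the home node");

    pthread_t ni;
    pthread_create(&ni, NULL, ni_firmware, &req);

    /* Requester posts a fetch; the home node's host CPU is never interrupted. */
    pthread_mutex_lock(&req.lock);
    req.pending = 1;
    pthread_cond_signal(&req.cv);
    while (req.pending)
        pthread_cond_wait(&req.cv, &req.lock);
    pthread_mutex_unlock(&req.lock);

    pthread_join(ni, NULL);
    printf("fetched: %s\n", req.reply);
    return 0;
}
```

    In the real system the same principle extends to depositing data into remote memory and to synchronization support in the NI, which is what lets the protocol be restructured to run synchronously at the host.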

  • ISCA - Using network interface support to avoid asynchronous protocol processing in Shared Virtual Memory systems
    1999
    Co-Authors: Angelos Bilas, Cheng Liao, Jaswinder Pal Singh
    Abstract:

    The performance of page-based software Shared Virtual Memory (SVM) is still far from that achieved on hardware-coherent distributed Shared Memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node Memory system nor code instrumentation to identify Memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a Shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent Shared Memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well-performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.

Liviu Iftode - One of the best experts on this subject based on the ideXlab platform.

  • Shared Virtual Memory Across SMP Nodes Using Automatic Update: Protocols and Performance
    2007
    Co-Authors: Angelos Bilas, Liviu Iftode, David Martin, Jaswinder Pal Singh
    Abstract:

    As the workstation market moves from single processors to small-scale Shared Memory multiprocessors, it is very attractive to construct larger-scale multiprocessors by connecting widely available symmetric multiprocessors (SMPs) in a less tightly coupled way. Using a Shared Virtual Memory (SVM) layer for this purpose preserves the Shared Memory programming abstraction across nodes. We explore the feasibility and performance implications of one such approach by extending the AURC (Automatic Update Release Consistency) protocol, used in the SHRIMP multicomputer, to connect hardware-coherent SMPs rather than uniprocessors. We describe the extended AURC protocol, and compare its performance both with the AURC uniprocessor-node case and with an all-software Lazy Release Consistency (LRC) protocol extended for SMPs. We present results based on detailed simulations of two protocols (AURC and LRC) and two architectural configurations of a system with 16 processors: one with one processor per node (16 nodes) and one with four processors per node (4 nodes). We find that, unless the bandwidth of the network interface is increased, the network interface becomes the bottleneck in a clustered architecture, especially for AURC. While an LRC protocol can benefit from the reduction in per-processor communication in a clustered architecture, the write-through traffic in AURC significantly increases the communication demands per network interface. This causes more traffic contention and either prevents the performance of AURC from improving with SMP nodes or hurts it severely for applications with significant communication requirements. Thus, while AURC performs better than LRC, for applications with high communication needs the reverse may be true in clustered architectures. Among possible solutions, two are investigated in the paper: protocol changes and bandwidth increases. Further work is clearly needed on the systems and application sides to evaluate whether AURC can be extended for multiprocessor-node systems.

  • Shared Virtual Memory: progress and challenges
    Proceedings of the IEEE, 1999
    Co-Authors: Liviu Iftode, Jaswinder Pal Singh
    Abstract:

    Shared Virtual Memory, a technique for supporting a Shared address space in software on parallel systems, has undergone a decade of research, with significant maturing of protocols and communication layers having now been achieved. We provide a survey of the key developments in this research, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework. Four major research tracks are covered: relaxed consistency models; protocol laziness; architectural support; and application-driven research. Several related avenues are also discussed, such as fine grained software coherence, software protocols across multiprocessor nodes, and performance scalability. We summarize comparative performance results from the literature, discuss their limitations, and identify lessons learned so far, key outstanding questions, and important directions for future research in this area.

  • Shared Virtual Memory: Progress and challenges: Distributed Shared Memory systems
    1999
    Co-Authors: Liviu Iftode, Jaswinder Pal Singh
    Abstract:

    Shared Virtual Memory, a technique for supporting a Shared address space in software on parallel systems, has undergone a decade of research, with significant maturing of protocols and communication layers having now been achieved. We provide a survey of the key developments in this research, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework. Four major research tracks are covered: relaxed consistency models, protocol laziness, architectural support, and application-driven research. Several related avenues are also discussed, such as fine-grained software coherence, software protocols across multiprocessor nodes, and performance scalability. We summarize comparative performance results from the literature, discuss their limitations, and identify lessons learned so far, key outstanding questions, and important directions for future research in this area.

  • Monitoring Shared Virtual Memory performance on a Myrinet-based PC cluster
    International Conference on Supercomputing, 1998
    Co-Authors: Cheng Liao, Dongming Jiang, Liviu Iftode, Margaret Martonosi, Douglas W Clark
    Abstract:

    Network-connected clusters of PCs or workstations are becoming a widespread parallel computing platform. Performance methodologies that use either simulation or high-level software instrumentation cannot adequately measure the detailed behavior of such systems. The availability of new network technologies based on programmable network interfaces opens a new avenue of research in analyzing and improving the performance of software Shared Memory protocols. We have developed monitoring firmware embedded in the programmable network interfaces of a Myrinet-based PC cluster. Timestamps on network packets facilitate the collection of low-level statistics on, e.g., network latencies, interrupt handler times, and inter-node synchronization. This paper describes our use of the low-level software performance monitor to measure and understand the performance of a Shared Virtual Memory (SVM) system implemented on a Myrinet-based cluster, running the SPLASH-2 benchmarks. We measured time spent in various communication stages during the main protocol operations: remote page fetch, remote lock synchronization, and barriers. These data show that remote request contention in the network interface and hosts can serialize their handling and artificially increase the page miss time. This increase then dilates the critical section within which it occurs, increasing lock contention and causing lock serialization. Furthermore, lock serialization is reflected in the waiting time at barriers. These results of our study sharpen and deepen similar but higher-level speculations in previous simulation-based SVM performance research. Moreover, the insights gained on real systems about the different layers, including the communication architecture, the SVM protocol, and the applications, provide guidelines for better designs in those layers.
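
    A toy example of the kind of breakdown such per-packet timestamps enable (the values and field names are made up; the real monitor lives in the Myrinet NI firmware): decomposing one remote page fetch into wire time, queueing at the home node, and handler time, so that contention shows up as inflated queueing rather than being folded into an opaque "page miss time".

```c
/*
 * Illustrative breakdown of a remote page-fetch round trip from per-packet
 * timestamps. Values are hypothetical; a real monitor would also have to
 * make the timestamps comparable across nodes, which NI-resident firmware
 * can do far more cheaply than host-level instrumentation.
 */
#include <stdio.h>

typedef struct {
    double req_sent;        /* requester: request injected into the network */
    double req_at_home_ni;  /* home NI: request packet arrived */
    double handler_start;   /* home: protocol handler begins servicing */
    double handler_end;     /* home: reply handed back to the NI */
    double reply_received;  /* requester: reply packet arrived */
} fetch_timestamps;

int main(void)
{
    fetch_timestamps t = { 0.0, 18.5, 35.0, 60.0, 80.0 };  /* microseconds, made up */

    double request_wire   = t.req_at_home_ni - t.req_sent;
    double queue_and_wait = t.handler_start - t.req_at_home_ni; /* contention shows up here */
    double handler_time   = t.handler_end - t.handler_start;
    double reply_wire     = t.reply_received - t.handler_end;
    double total          = t.reply_received - t.req_sent;

    printf("request wire:                %.1f us\n", request_wire);
    printf("queueing/contention at home: %.1f us\n", queue_and_wait);
    printf("handler:                     %.1f us\n", handler_time);
    printf("reply wire:                  %.1f us\n", reply_wire);
    printf("page miss total:             %.1f us\n", total);
    return 0;
}
```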

  • Evaluation of hardware write propagation support for next-generation Shared Virtual Memory clusters
    International Conference on Supercomputing, 1998
    Co-Authors: Angelos Bilas, Liviu Iftode, Jaswinder Pal Singh
    Abstract:

    Clusters of symmetric multiprocessors (SMPs), connected by commodity system-area networks (SANs) and interfaces, are fast being adopted as platforms for parallel computing. Page-grained Shared Virtual Memory (SVM) is a popular way to support a coherent Shared address space programming model on these clusters. Previous research has identified several key bottlenecks in the communication, protocol, and application layers of a software SVM system that are not so significant in more mainstream, hardware-coherent multiprocessors. A key question for the communication layer is how much and what kind of hardware support is particularly valuable in improving the performance of such systems. This paper examines a popular form of hardware support, namely support for automatic, hardware propagation of writes to remote memories, discussing new design issues and evaluating performance in the context of emerging clusters. Since much of the performance difference is due to differences in contention effects in various parts of the system, performance is examined through very detailed simulation, utilizing the deep visibility into the simulated system to analyze the causes of observed effects.

Angelos Bilas - One of the best experts on this subject based on the ideXlab platform.

  • Shared Virtual Memory Across SMP Nodes Using Automatic Update: Protocols and Performance
    2007
    Co-Authors: Angelos Bilas, Liviu Iftode, David Martin, Jaswinder Pal Singh
    Abstract:

    As the workstation market moves from single processors to small-scale Shared Memory multiprocessors, it is very attractive to construct larger-scale multiprocessors by connecting widely available symmetric multiprocessors (SMPs) in a less tightly coupled way. Using a Shared Virtual Memory (SVM) layer for this purpose preserves the Shared Memory programming abstraction across nodes. We explore the feasibility and performance implications of one such approach by extending the AURC (Automatic Update Release Consistency) protocol, used in the SHRIMP multicomputer, to connect hardware-coherent SMPs rather than uniprocessors. We describe the extended AURC protocol, and compare its performance both with the AURC uniprocessor-node case and with an all-software Lazy Release Consistency (LRC) protocol extended for SMPs. We present results based on detailed simulations of two protocols (AURC and LRC) and two architectural configurations of a system with 16 processors: one with one processor per node (16 nodes) and one with four processors per node (4 nodes). We find that, unless the bandwidth of the network interface is increased, the network interface becomes the bottleneck in a clustered architecture, especially for AURC. While an LRC protocol can benefit from the reduction in per-processor communication in a clustered architecture, the write-through traffic in AURC significantly increases the communication demands per network interface. This causes more traffic contention and either prevents the performance of AURC from improving with SMP nodes or hurts it severely for applications with significant communication requirements. Thus, while AURC performs better than LRC, for applications with high communication needs the reverse may be true in clustered architectures. Among possible solutions, two are investigated in the paper: protocol changes and bandwidth increases. Further work is clearly needed on the systems and application sides to evaluate whether AURC can be extended for multiprocessor-node systems.

  • Using System Emulation to Model Next-Generation Shared Virtual Memory Clusters
    Cluster Computing, 2003
    Co-Authors: Angelos Bilas, Courtney R. Gibson, Reza Azimi, Rosalia Christodoulopoulou, Peter Jamieson
    Abstract:

    Recently much effort has been spent on providing a Shared address space abstraction on clusters of small-scale symmetric multiprocessors. However, advances in technology will soon make it possible to construct these clusters with larger-scale cc-NUMA nodes, connected with non-coherent networks that offer latencies and bandwidth comparable to interconnection networks used in hardware cache-coherent systems. The Shared Memory abstraction can be provided on these systems in software across nodes and hardware within nodes. Recent simulation results have demonstrated that certain features of modern system area networks can be used to greatly reduce Shared Virtual Memory (SVM) overheads [5,19]. In this work we leverage these results and we use detailed system emulation to investigate building future software Shared Memory clusters. We use an existing, large-scale hardware cache-coherent system with 64 processors to emulate a complete future cluster. We port our existing infrastructure (communication layer and Shared Memory protocol) on this system and study the behavior of a set of real applications. We present results for both 32- and 64-processor system configurations. We find that: (i) System emulation is invaluable in quantifying potential benefits from changes in the technology of commodity components. More importantly, it reveals potential problems in future systems that are easily overlooked in simulation studies. Thus, system emulation should be used along with other modeling techniques (e.g., simulation, implementation) to investigate future trends. (ii) Our work shows that current SVM protocols can only partially take advantage of faster interconnects and wider nodes due to operating system and architectural implications. We quantify the related issues and identify the areas where more research is required for future SVM clusters.

  • Application scaling under Shared Virtual Memory on a cluster of SMPs
    2003
    Co-Authors: Dongming Jiang, Angelos Bilas, Brian Kelley, Xiang Yu, Sanjeev Kumar, Jaswinder Pal Singh
    Abstract:

    In this paper we examine how application performance scales on a state-of-the-art Shared Virtual Memory (SVM) system on a cluster with 64 processors, comprising 4-way SMPs connected with a fast system area network. The protocol we use is home-based and takes advantage of general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that while the level of application restructuring needed is quite high compared to applications that perform well on a hardware-coherent system of this scale, and larger problem sizes are needed for good performance, SVM, surprisingly, performs quite well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore further application restructurings than those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.

  • Shared Virtual Memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems
    Journal of Parallel and Distributed Computing, 2003
    Co-Authors: Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh
    Abstract:

    Although the Shared Memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed Shared Memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. In this paper we examine if Shared Virtual Memory (SVM) clusters can bridge this gap by examining how application performance scales on a state-of-the-art Shared Virtual Memory cluster. We find that: (i) The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale and larger problem sizes are needed for good performance. (ii) However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale and often much more.

  • Using network interface support to avoid asynchronous protocol processing in Shared Virtual Memory systems
    International Symposium on Computer Architecture, 1999
    Co-Authors: Angelos Bilas, Cheng Liao, Jaswinder Pal Singh
    Abstract:

    The performance of page-based software Shared Virtual Memory (SVM) is still far from that achieved on hardware-coherent distributed Shared Memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node Memory system nor code instrumentation to identify Memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a Shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent Shared Memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well-performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.

Dongming Jiang - One of the best experts on this subject based on the ideXlab platform.

  • Application scaling under Shared Virtual Memory on a cluster of SMPs
    2003
    Co-Authors: Dongming Jiang, Angelos Bilas, Brian Kelley, Xiang Yu, Sanjeev Kumar, Jaswinder Pal Singh
    Abstract:

    In this paper we examine how application performance scales on a state-of-the-art Shared Virtual Memory (SVM) system on a cluster with 64 processors, comprising 4-way SMPs connected with a fast system area network. The protocol we use is home-based and takes advantage of general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that while the level of application restructuring needed is quite high compared to applications that perform well on a hardware-coherent system of this scale, and larger problem sizes are needed for good performance, SVM, surprisingly, performs quite well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore further application restructurings than those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.

  • Shared Virtual Memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems
    Journal of Parallel and Distributed Computing, 2003
    Co-Authors: Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh
    Abstract:

    Although the Shared Memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed Shared Memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. In this paper we examine if Shared Virtual Memory (SVM) clusters can bridge this gap by examining how application performance scales on a state-of-the-art Shared Virtual Memory cluster. We find that: (i) The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale and larger problem sizes are needed for good performance. (ii) However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale and often much more.

  • International Conference on Supercomputing - Application scaling under Shared Virtual Memory on a cluster of SMPs
    Proceedings of the 13th international conference on Supercomputing - ICS '99, 1999
    Co-Authors: Dongming Jiang, Angelos Bilas, Brian Kelley, Sanjeev Kumar, Jaswinder Pal Singh
    Abstract:

    In this paper we examine how application performance scales on a state-of-the-art Shared Virtual Memory (SVM) system on a cluster with 64 processors, comprising 4-way SMPs connected with a fast system area network. The protocol we use is home-based and takes advantage of general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that while the level of application restructuring needed is quite high compared to applications that perform well on a hardware-coherent system of this scale, and larger problem sizes are needed for good performance, SVM, surprisingly, performs quite well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore further application restructurings than those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.

  • Monitoring Shared Virtual Memory performance on a Myrinet-based PC cluster
    International Conference on Supercomputing, 1998
    Co-Authors: Cheng Liao, Dongming Jiang, Liviu Iftode, Margaret Martonosi, Douglas W Clark
    Abstract:

    Network-connected clusters of PCs or workstations are becoming a widespread parallel computing platform. Performance methodologies that use either simulation or high-level software instrumentation cannot adequately measure the detailed behavior of such systems. The availability of new network technologies based on programmable network interfaces opens a new avenue of research in analyzing and improving the performance of software Shared Memory protocols. We have developed monitoring firmware embedded in the programmable network interfaces of a Myrinet-based PC cluster. Timestamps on network packets facilitate the collection of low-level statistics on, e.g., network latencies, interrupt handler times, and inter-node synchronization. This paper describes our use of the low-level software performance monitor to measure and understand the performance of a Shared Virtual Memory (SVM) system implemented on a Myrinet-based cluster, running the SPLASH-2 benchmarks. We measured time spent in various communication stages during the main protocol operations: remote page fetch, remote lock synchronization, and barriers. These data show that remote request contention in the network interface and hosts can serialize their handling and artificially increase the page miss time. This increase then dilates the critical section within which it occurs, increasing lock contention and causing lock serialization. Furthermore, lock serialization is reflected in the waiting time at barriers. These results of our study sharpen and deepen similar but higher-level speculations in previous simulation-based SVM performance research. Moreover, the insights gained on real systems about the different layers, including the communication architecture, the SVM protocol, and the applications, provide guidelines for better designs in those layers.

  • International Conference on Supercomputing - Monitoring Shared Virtual Memory performance on a Myrinet-based PC cluster
    Proceedings of the 12th international conference on Supercomputing - ICS '98, 1998
    Co-Authors: Cheng Liao, Dongming Jiang, Liviu Iftode, Margaret Martonosi, Douglas W Clark
    Abstract:

    Network-connected clusters of PCs or workstations are becoming a widespread parallel computing platform. Performance methodologies that use either simulation or high-level software instrumentation cannot adequately measure the detailed behavior of such systems. The availability of new network technologies based on programmable network interfaces opens a new avenue of research in analyzing and improving the performance of software Shared Memory protocols. We have developed monitoring firmware embedded in the programmable network interfaces of a Myrinet-based PC cluster. Timestamps on network packets facilitate the collection of low-level statistics on, e.g., network latencies, interrupt handler times, and inter-node synchronization. This paper describes our use of the low-level software performance monitor to measure and understand the performance of a Shared Virtual Memory (SVM) system implemented on a Myrinet-based cluster, running the SPLASH-2 benchmarks. We measured time spent in various communication stages during the main protocol operations: remote page fetch, remote lock synchronization, and barriers. These data show that remote request contention in the network interface and hosts can serialize their handling and artificially increase the page miss time. This increase then dilates the critical section within which it occurs, increasing lock contention and causing lock serialization. Furthermore, lock serialization is reflected in the waiting time at barriers. These results of our study sharpen and deepen similar but higher-level speculations in previous simulation-based SVM performance research. Moreover, the insights gained on real systems about the different layers, including the communication architecture, the SVM protocol, and the applications, provide guidelines for better designs in those layers.

Jianping Zhu - One of the best experts on this subject based on the ideXlab platform.

  • Performance prediction. A case study using a scalable Shared-Virtual Memory machine
    IEEE Parallel & Distributed Technology: Systems & Applications, 1996
    Co-Authors: Xian-he Sun, Jianping Zhu
    Abstract:

    As computers with tens of thousands of processors successfully deliver high performance for solving some of the so-called "grand challenge" applications, scalability is becoming an important metric in the evaluation of parallel architectures and algorithms. The authors carefully investigate the prediction of scalability and its application. With a simple formula, they show the relation between scalability, single-processor computing power, and degradation of parallelism. They conduct a case study on a multi-ring KSR-1 Shared Virtual Memory machine. However, the prediction formula and methodology proposed in the study are not bound to any algorithm or architecture. They can be applied to any algorithm-machine combination. Experimental and theoretical results show that the influence of variation of ensemble size is predictable. Therefore, the performance of an algorithm on a sophisticated, hierarchical architecture can be predicted, and the best algorithm-machine combination can be selected for a given application.
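
    The abstract does not reproduce the formula itself. As background, a common isospeed-style formulation of scalability (stated here as context, not necessarily the paper's exact expression) fixes the average speed per processor and asks how much the work must grow when the machine grows:

```latex
a(W,p) = \frac{W}{p\,T(W,p)}, \qquad
\psi(p,p') = \frac{p'\,W}{p\,W'} \quad \text{with } W' \text{ chosen so that } a(W',p') = a(W,p)
```

    Perfect scalability gives $\psi = 1$, i.e. the work grows in proportion to the processor count; degradation of parallelism shows up as $W'$ growing faster than that, pushing $\psi$ below 1.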

  • Performance considerations of Shared Virtual Memory machines
    IEEE Transactions on Parallel and Distributed Systems, 1995
    Co-Authors: Xian-he Sun, Jianping Zhu
    Abstract:

    Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, we show that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in Shared Virtual Memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, an interesting relation between fixed-time and Memory-bounded speedup is revealed. Various causes of superlinear speedup are also presented.
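
    For reference, the two speedups being contrasted can be written as follows (standard formulations; the paper's asymptotic-speed refinement of the sequential term is paraphrased, not quoted):

```latex
S_{\mathrm{trad}}(p) = \frac{T_1(W)}{T_p(W)}, \qquad
S_{\mathrm{gen}}(p) = \frac{\text{parallel speed}}{\text{sequential speed}}
                    = \frac{W_p / T_p}{W_1 / T_1}
```

    With the same work on one and on $p$ processors ($W_p = W_1$) the two coincide. They diverge on Shared Virtual Memory machines such as the KSR-1 because a large problem does not fit in a single node's local Memory, so the sequential speed must instead be measured asymptotically on a problem that does fit, which is the efficiency-of-uniprocessor-processing issue the abstract refers to.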

  • IPPS - Shared Virtual Memory and generalized speedup
    Proceedings of the 8th International Parallel Processing Symposium, 1994
    Co-Authors: Xian-he Sun, Jianping Zhu
    Abstract:

    Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, it is shown that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in Shared Virtual Memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.