Private Memory

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 312 Experts worldwide ranked by ideXlab platform

Thomas M Stricker - One of the best experts on this subject based on the ideXlab platform.

  • The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor
    1992
    Co-Authors: Thomas R Gross, David R. O'hallaron, Susan Hinrichs, A. Hasegawa, Thomas M Stricker
    Abstract:

    Programs executing on a Private-Memory parallel system exchange data by explicitly sending and receiving messages. Two communication styles have been identified for such systems: Memory communication (each message exchanged between two processors is buffered in Memory, e.g. as in message passing) and systolic communication (each word of a message is transmitted directly from the sender processor to the receiver processor, without any buffering in Memory). The iWarp system supports both communication styles and therefore provides a platform that allows us to evaluate how the choice of communication style impacts the usage of processor resources. Parallel program generators map a machine-independent description of a computation onto a Private-Memory parallel system. We use two different parallel program generators that employ the two communication styles to map a set of application kernels onto iWarp. By using tools to generate the parallel programs, we are able to obtain realistic data on the execution of programs using the different communication styles. This paper reports on measurements of instruction format usage, the utilization of the communication ports (gates), and instruction frequencies on the iWarp system. It is a first step towards understanding how features and capabilities of parallel processors are actually used by parallel programs that have been mapped automatically.
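    The two styles contrasted in this abstract can be sketched in a toy model. Everything below is illustrative (the function names are invented, not the actual iWarp interface): memory communication buffers the whole message before the receiver consumes it, while systolic communication hands each word straight to the receiver's computation.

    ```python
    from collections import deque

    def memory_style_send(message, mailbox):
        """Memory communication: the whole message is buffered in the
        receiver's memory before the receiver consumes it."""
        mailbox.append(list(message))        # one copy lands in memory

    def memory_style_recv(mailbox):
        return mailbox.popleft()             # receiver reads the buffered copy

    def systolic_style_transfer(message, consume):
        """Systolic communication: each word goes straight from the sender
        to the receiver's compute logic, with no in-memory message buffer."""
        for word in message:
            consume(word)                    # word is used as it arrives

    # Usage: same data, two styles.
    mailbox = deque()
    memory_style_send([1, 2, 3], mailbox)
    received = memory_style_recv(mailbox)    # arrives via a memory buffer

    acc = []
    systolic_style_transfer([1, 2, 3], acc.append)  # no buffering in between
    ```

    The trade-off the paper measures follows directly from this shape: the memory style pays for the extra buffer copy, while the systolic style ties the receiver's compute unit to the arrival of each word.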

  • subset barrier synchronization on a Private Memory parallel system
    ACM Symposium on Parallel Algorithms and Architectures, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.
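    The O(log N) common-case bound can be illustrated with a small simulation of dissemination-style synchronization, where in round k each processor notifies the processor 2^k positions ahead on the ring. This is a generic sketch of that doubling pattern, not the paper's actual algorithm or its anonymous destination message passing primitives:

    ```python
    def dissemination_rounds(n):
        """Simulate dissemination-style barrier progress over n processors.
        known[i] holds the set of processors whose arrival processor i can
        transitively account for. Returns the number of rounds until every
        processor accounts for all n; the distance doubles each round, so
        this completes in ceil(log2(n)) rounds."""
        known = [{i} for i in range(n)]
        rounds = 0
        while any(len(k) < n for k in known):
            step = 1 << rounds                 # 1, 2, 4, ... hops per round
            known = [known[i] | known[(i - step) % n] for i in range(n)]
            rounds += 1
        return rounds
    ```

    For a 64-node system like the one measured in the paper, this doubling pattern needs only 6 rounds of message exchange.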

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

Youliang Yan - One of the best experts on this subject based on the ideXlab platform.

  • rhymes: a shared virtual Memory system for non-coherent tiled many-core architectures
    International Conference on Parallel and Distributed Systems, 2014
    Co-Authors: King Tin Lam, Jinghao Shi, Dominic Hung, Choli Wang, Zhiquan Lai, Wangbin Zhu, Youliang Yan
    Abstract:

    The rising core count per processor is pushing chip complexity to a level at which hardware-based cache coherency protocols may someday become too hard and costly to scale. We need new designs of many-core hardware and software, beyond traditional technologies, to keep up with ever-increasing scalability demands. A cluster-on-chip architecture, as exemplified by the Intel Single-chip Cloud Computer (SCC), promotes a software-oriented approach, instead of hardware support, to implementing shared Memory coherence. This paper presents a shared virtual Memory (SVM) system, dubbed Rhymes, tailored to new kinds of processors with non-coherent and hybrid Memory architectures. Rhymes features a two-way cache coherence protocol that enforces release consistency for pages allocated in shared physical Memory (SPM) and scope consistency for pages in per-core Private Memory. It also supports page remapping on a per-core basis to boost data locality. We implement and test Rhymes on the SCC port of the Barrelfish OS. Experimental results show that our SVM outperforms the pure SPM approach used by Intel's software managed coherence (SMC) library by up to 12 times through improved cache utilization for applications with strong data reuse patterns.
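    The release-consistency behaviour described here can be sketched as a toy protocol: writes stay in a core's private copy until a release, and other cores pick them up on their next acquire. The class and method names below are invented for illustration and are not the Rhymes API:

    ```python
    class SharedPage:
        """Authoritative copy of a page, standing in for shared physical
        memory in this toy model."""
        def __init__(self):
            self.master = {}

    class Core:
        """A core with a private cached copy of the page."""
        def __init__(self, page):
            self.page = page
            self.cache = {}
            self.dirty = set()

        def acquire(self):
            # On acquire, refresh the private copy from the master.
            self.cache = dict(self.page.master)
            self.dirty.clear()

        def write(self, key, value):
            self.cache[key] = value
            self.dirty.add(key)      # update stays local until release

        def release(self):
            # On release, push only the modified entries to the master.
            for key in self.dirty:
                self.page.master[key] = self.cache[key]
            self.dirty.clear()

    # Usage: core a's write becomes visible to core b only after
    # a releases and b re-acquires.
    page = SharedPage()
    a, b = Core(page), Core(page)
    a.acquire(); a.write("x", 1)
    b.acquire()
    visible_before_release = "x" in b.cache   # still private to a
    a.release(); b.acquire()
    visible_after_release = b.cache.get("x")
    ```

    Deferring propagation to release points is what lets a software protocol avoid per-write coherence traffic on hardware, like the SCC, that provides none.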

  • ICPADS - Rhymes: A shared virtual Memory system for non-coherent tiled many-core architectures
    2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2014
    Co-Authors: King Tin Lam, Jinghao Shi, Dominic Hung, Choli Wang, Zhiquan Lai, Wangbin Zhu, Youliang Yan
    Abstract:

    The rising core count per processor is pushing chip complexity to a level at which hardware-based cache coherency protocols may someday become too hard and costly to scale. We need new designs of many-core hardware and software, beyond traditional technologies, to keep up with ever-increasing scalability demands. A cluster-on-chip architecture, as exemplified by the Intel Single-chip Cloud Computer (SCC), promotes a software-oriented approach, instead of hardware support, to implementing shared Memory coherence. This paper presents a shared virtual Memory (SVM) system, dubbed Rhymes, tailored to new kinds of processors with non-coherent and hybrid Memory architectures. Rhymes features a two-way cache coherence protocol that enforces release consistency for pages allocated in shared physical Memory (SPM) and scope consistency for pages in per-core Private Memory. It also supports page remapping on a per-core basis to boost data locality. We implement and test Rhymes on the SCC port of the Barrelfish OS. Experimental results show that our SVM outperforms the pure SPM approach used by Intel's software managed coherence (SMC) library by up to 12 times through improved cache utilization for applications with strong data reuse patterns.

Anja Feldmann - One of the best experts on this subject based on the ideXlab platform.

  • subset barrier synchronization on a Private Memory parallel system
    ACM Symposium on Parallel Algorithms and Architectures, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

Thomas Gross - One of the best experts on this subject based on the ideXlab platform.

  • Generating communication for array statements: design, implementation, and evaluation
    Journal of Parallel and Distributed Computing, 1994
    Co-Authors: James M. Stichnoth, David R. O'hallaron, Thomas Gross
    Abstract:

    Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data-parallel program for a Private Memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern Private Memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node Private Memory iWarp system, using a number of different distributions.
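    The element sets in question can be illustrated for the HPF CYCLIC(b) family by brute-force enumeration. This sketch (function names are illustrative) enumerates indices rather than using the paper's closed-form derivation; BLOCK and CYCLIC are the two extremes of the block size b:

    ```python
    def owner(i, b, p):
        """Owning processor of global index i under a CYCLIC(b)
        distribution over p processors (pure CYCLIC is b == 1;
        BLOCK is b large enough that each processor gets one block)."""
        return (i // b) % p

    def local_index(i, b, p):
        """Position of global index i within its owner's local array."""
        return (i // (b * p)) * b + i % b

    def transfer_sets(n, b_src, b_dst, p):
        """For an elementwise assignment A(i) = B(i), with B distributed
        CYCLIC(b_src) and A distributed CYCLIC(b_dst), list the global
        indices that must move between each (sender, receiver) pair."""
        moves = {}
        for i in range(n):
            src, dst = owner(i, b_src, p), owner(i, b_dst, p)
            if src != dst:
                moves.setdefault((src, dst), []).append(i)
        return moves
    ```

    With the sender also computing `local_index` on the destination's distribution, each block can be shipped together with its target address, which is the sender-computes-receiver-address protocol the abstract describes.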

  • Exploiting task and data parallelism on a multicomputer
    Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93, 1993
    Co-Authors: Jaspal Subhlok, David R. O'hallaron, James M. Stichnoth, Thomas Gross
    Abstract:

    For many applications, achieving good performance on a Private Memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on either data parallelism or task parallelism. Therefore, to achieve the desired results, the programmer must program the data and task parallelism separately. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.

  • LCPC - Utilizing New Communication Features in Compilation for Private-Memory Machines
    Languages and Compilers for Parallel Computing, 1993
    Co-Authors: Susan Hinrichs, Thomas Gross
    Abstract:

    The communication system of some 3rd generation Private-Memory machines provides long-lived connections (which reserve communication resources like buffers between nodes) as well as direct access by the computation unit(s) of the node to the communication system. These features allow a compiler to find innovative solutions when compiling data-parallel programs for a Private-Memory machine. In this paper, we discuss some optimizations that can be included in a compiler for the iWarp system, an example Private-Memory parallel system with a novel communication architecture.

  • utilizing new communication features in compilation for Private Memory machines
    Languages and Compilers for Parallel Computing, 1992
    Co-Authors: Susan Hinrichs, Thomas Gross
    Abstract:

    The communication system of some 3rd generation Private-Memory machines provides long-lived connections (which reserve communication resources like buffers between nodes) as well as direct access by the computation unit(s) of the node to the communication system. These features allow a compiler to find innovative solutions when compiling data-parallel programs for a Private-Memory machine. In this paper, we discuss some optimizations that can be included in a compiler for the iWarp system, an example Private-Memory parallel system with a novel communication architecture.

David R. O'hallaron - One of the best experts on this subject based on the ideXlab platform.

  • Generating communication for array statements: design, implementation, and evaluation
    Journal of Parallel and Distributed Computing, 1994
    Co-Authors: James M. Stichnoth, David R. O'hallaron, Thomas Gross
    Abstract:

    Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data-parallel program for a Private Memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern Private Memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node Private Memory iWarp system, using a number of different distributions.

  • Exploiting task and data parallelism on a multicomputer
    Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93, 1993
    Co-Authors: Jaspal Subhlok, David R. O'hallaron, James M. Stichnoth, Thomas Gross
    Abstract:

    For many applications, achieving good performance on a Private Memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on either data parallelism or task parallelism. Therefore, to achieve the desired results, the programmer must program the data and task parallelism separately. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.

  • The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor
    1992
    Co-Authors: Thomas R Gross, David R. O'hallaron, Susan Hinrichs, A. Hasegawa, Thomas M Stricker
    Abstract:

    Programs executing on a Private-Memory parallel system exchange data by explicitly sending and receiving messages. Two communication styles have been identified for such systems: Memory communication (each message exchanged between two processors is buffered in Memory, e.g. as in message passing) and systolic communication (each word of a message is transmitted directly from the sender processor to the receiver processor, without any buffering in Memory). The iWarp system supports both communication styles and therefore provides a platform that allows us to evaluate how the choice of communication style impacts the usage of processor resources. Parallel program generators map a machine-independent description of a computation onto a Private-Memory parallel system. We use two different parallel program generators that employ the two communication styles to map a set of application kernels onto iWarp. By using tools to generate the parallel programs, we are able to obtain realistic data on the execution of programs using the different communication styles. This paper reports on measurements of instruction format usage, the utilization of the communication ports (gates), and instruction frequencies on the iWarp system. It is a first step towards understanding how features and capabilities of parallel processors are actually used by parallel programs that have been mapped automatically.

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.