Private Memory

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 312 Experts worldwide ranked by ideXlab platform

Thomas M Stricker - One of the best experts on this subject based on the ideXlab platform.

  • The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor
    1992
    Co-Authors: Thomas R Gross, David R. O'hallaron, Susan Hinrichs, A. Hasegawa, Thomas M Stricker
    Abstract:

    Programs executing on a Private-Memory parallel system exchange data by explicitly sending and receiving messages. Two communication styles have been identified for such systems: Memory communication (each message exchanged between two processors is buffered in Memory, e.g. as in message passing) and systolic communication (each word of a message is transmitted directly from the sender processor to the receiver processor, without any buffering in Memory). The iWarp system supports both communication styles and therefore provides a platform that allows us to evaluate how the choice of communication style impacts the usage of processor resources. Parallel program generators map a machine-independent description of a computation onto a Private-Memory parallel system. We use two different parallel program generators that employ the two communication styles to map a set of application kernels onto iWarp. By using tools to generate the parallel programs, we are able to obtain realistic data on the execution of programs using the different communication styles. This paper reports on measurements of instruction format usage, the utilization of the communication ports (gates), and instruction frequencies on the iWarp system. It is a first step towards understanding how features and capabilities of parallel processors are actually used by parallel programs that have been mapped automatically.
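    The two styles contrasted in this abstract can be sketched in a toy model. Everything below is illustrative (the function names are invented, not the actual iWarp interface): memory communication buffers the whole message before the receiver consumes it, while systolic communication hands each word straight to the receiver's computation.

    ```python
    from collections import deque

    def memory_style_send(message, mailbox):
        """Memory communication: the whole message is buffered in the
        receiver's memory before the receiver consumes it."""
        mailbox.append(list(message))        # one copy lands in memory

    def memory_style_recv(mailbox):
        return mailbox.popleft()             # receiver reads the buffered copy

    def systolic_style_transfer(message, consume):
        """Systolic communication: each word goes straight from the sender
        to the receiver's compute logic, with no in-memory message buffer."""
        for word in message:
            consume(word)                    # word is used as it arrives

    # Usage: same data, two styles.
    mailbox = deque()
    memory_style_send([1, 2, 3], mailbox)
    received = memory_style_recv(mailbox)    # arrives via a memory buffer

    acc = []
    systolic_style_transfer([1, 2, 3], acc.append)  # no buffering in between
    ```

    The trade-off the paper measures follows directly from this shape: the memory style pays for the extra buffer copy, while the systolic style ties the receiver's compute unit to the arrival of each word.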

  • subset barrier synchronization on a Private Memory parallel system
    ACM Symposium on Parallel Algorithms and Architectures, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.
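    The O(log N) common-case bound can be illustrated with a small simulation of dissemination-style synchronization, where in round k each processor notifies the processor 2^k positions ahead on the ring. This is a generic sketch of that doubling pattern, not the paper's actual algorithm or its anonymous destination message passing primitives:

    ```python
    def dissemination_rounds(n):
        """Simulate dissemination-style barrier progress over n processors.
        known[i] holds the set of processors whose arrival processor i can
        transitively account for. Returns the number of rounds until every
        processor accounts for all n; the distance doubles each round, so
        this completes in ceil(log2(n)) rounds."""
        known = [{i} for i in range(n)]
        rounds = 0
        while any(len(k) < n for k in known):
            step = 1 << rounds                 # 1, 2, 4, ... hops per round
            known = [known[i] | known[(i - step) % n] for i in range(n)]
            rounds += 1
        return rounds
    ```

    For a 64-node system like the one measured in the paper, this doubling pattern needs only 6 rounds of message exchange.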

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

Youliang Yan - One of the best experts on this subject based on the ideXlab platform.

  • rhymes: a shared virtual Memory system for non-coherent tiled many-core architectures
    International Conference on Parallel and Distributed Systems, 2014
    Co-Authors: King Tin Lam, Jinghao Shi, Dominic Hung, Choli Wang, Zhiquan Lai, Wangbin Zhu, Youliang Yan
    Abstract:

    The rising core count per processor is pushing chip complexity to a level at which hardware-based cache coherency protocols may someday become too hard and costly to scale. We need new designs of many-core hardware and software, beyond traditional technologies, to keep up with ever-increasing scalability demands. A cluster-on-chip architecture, as exemplified by the Intel Single-chip Cloud Computer (SCC), promotes a software-oriented approach, instead of hardware support, to implementing shared Memory coherence. This paper presents a shared virtual Memory (SVM) system, dubbed Rhymes, tailored to new kinds of processors with non-coherent and hybrid Memory architectures. Rhymes features a two-way cache coherence protocol that enforces release consistency for pages allocated in shared physical Memory (SPM) and scope consistency for pages in per-core Private Memory. It also supports page remapping on a per-core basis to boost data locality. We implement and test Rhymes on the SCC port of the Barrelfish OS. Experimental results show that our SVM outperforms the pure SPM approach used by Intel's software managed coherence (SMC) library by up to 12 times through improved cache utilization for applications with strong data reuse patterns.
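    The release-consistency behaviour described here can be sketched as a toy protocol: writes stay in a core's private copy until a release, and other cores pick them up on their next acquire. The class and method names below are invented for illustration and are not the Rhymes API:

    ```python
    class SharedPage:
        """Authoritative copy of a page, standing in for shared physical
        memory in this toy model."""
        def __init__(self):
            self.master = {}

    class Core:
        """A core with a private cached copy of the page."""
        def __init__(self, page):
            self.page = page
            self.cache = {}
            self.dirty = set()

        def acquire(self):
            # On acquire, refresh the private copy from the master.
            self.cache = dict(self.page.master)
            self.dirty.clear()

        def write(self, key, value):
            self.cache[key] = value
            self.dirty.add(key)      # update stays local until release

        def release(self):
            # On release, push only the modified entries to the master.
            for key in self.dirty:
                self.page.master[key] = self.cache[key]
            self.dirty.clear()

    # Usage: core a's write becomes visible to core b only after
    # a releases and b re-acquires.
    page = SharedPage()
    a, b = Core(page), Core(page)
    a.acquire(); a.write("x", 1)
    b.acquire()
    visible_before_release = "x" in b.cache   # still private to a
    a.release(); b.acquire()
    visible_after_release = b.cache.get("x")
    ```

    Deferring propagation to release points is what lets a software protocol avoid per-write coherence traffic on hardware, like the SCC, that provides none.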

  • ICPADS - Rhymes: A shared virtual Memory system for non-coherent tiled many-core architectures
    2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2014
    Co-Authors: King Tin Lam, Jinghao Shi, Dominic Hung, Choli Wang, Zhiquan Lai, Wangbin Zhu, Youliang Yan
    Abstract:

    The rising core count per processor is pushing chip complexity to a level at which hardware-based cache coherency protocols may someday become too hard and costly to scale. We need new designs of many-core hardware and software, beyond traditional technologies, to keep up with ever-increasing scalability demands. A cluster-on-chip architecture, as exemplified by the Intel Single-chip Cloud Computer (SCC), promotes a software-oriented approach, instead of hardware support, to implementing shared Memory coherence. This paper presents a shared virtual Memory (SVM) system, dubbed Rhymes, tailored to new kinds of processors with non-coherent and hybrid Memory architectures. Rhymes features a two-way cache coherence protocol that enforces release consistency for pages allocated in shared physical Memory (SPM) and scope consistency for pages in per-core Private Memory. It also supports page remapping on a per-core basis to boost data locality. We implement and test Rhymes on the SCC port of the Barrelfish OS. Experimental results show that our SVM outperforms the pure SPM approach used by Intel's software managed coherence (SMC) library by up to 12 times through improved cache utilization for applications with strong data reuse patterns.

Anja Feldmann - One of the best experts on this subject based on the ideXlab platform.

  • subset barrier synchronization on a Private Memory parallel system
    ACM Symposium on Parallel Algorithms and Architectures, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.

Thomas Gross - One of the best experts on this subject based on the ideXlab platform.

  • Generating communication for array statements: design, implementation, and evaluation
    Journal of Parallel and Distributed Computing, 1994
    Co-Authors: James M. Stichnoth, David R. O'hallaron, Thomas Gross
    Abstract:

    Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data-parallel program for a Private Memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern Private Memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node Private Memory iWarp system, using a number of different distributions.
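    The element sets in question can be illustrated for the HPF CYCLIC(b) family by brute-force enumeration. This sketch (function names are illustrative) enumerates indices rather than using the paper's closed-form derivation; BLOCK and CYCLIC are the two extremes of the block size b:

    ```python
    def owner(i, b, p):
        """Owning processor of global index i under a CYCLIC(b)
        distribution over p processors (pure CYCLIC is b == 1;
        BLOCK is b large enough that each processor gets one block)."""
        return (i // b) % p

    def local_index(i, b, p):
        """Position of global index i within its owner's local array."""
        return (i // (b * p)) * b + i % b

    def transfer_sets(n, b_src, b_dst, p):
        """For an elementwise assignment A(i) = B(i), with B distributed
        CYCLIC(b_src) and A distributed CYCLIC(b_dst), list the global
        indices that must move between each (sender, receiver) pair."""
        moves = {}
        for i in range(n):
            src, dst = owner(i, b_src, p), owner(i, b_dst, p)
            if src != dst:
                moves.setdefault((src, dst), []).append(i)
        return moves
    ```

    With the sender also computing `local_index` on the destination's distribution, each block can be shipped together with its target address, which is the sender-computes-receiver-address protocol the abstract describes.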

  • Exploiting task and data parallelism on a multicomputer
    Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93, 1993
    Co-Authors: Jaspal Subhlok, David R. O'hallaron, James M. Stichnoth, Thomas Gross
    Abstract:

    For many applications, achieving good performance on a Private Memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on either data parallelism or task parallelism. Therefore, to achieve the desired results, the programmer must program the data and task parallelism separately. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.

  • LCPC - Utilizing New Communication Features in Compilation for Private-Memory Machines
    Languages and Compilers for Parallel Computing, 1993
    Co-Authors: Susan Hinrichs, Thomas Gross
    Abstract:

    The communication system of some 3rd generation Private-Memory machines provides long-lived connections (which reserve communication resources like buffers between nodes) as well as direct access by the computation unit(s) of the node to the communication system. These features allow a compiler to find innovative solutions when compiling data-parallel programs for a Private-Memory machine. In this paper, we discuss some optimizations that can be included in a compiler for the iWarp system, an example Private-Memory parallel system with a novel communication architecture.

  • utilizing new communication features in compilation for Private Memory machines
    Languages and Compilers for Parallel Computing, 1992
    Co-Authors: Susan Hinrichs, Thomas Gross
    Abstract:

    The communication system of some 3rd generation Private-Memory machines provides long-lived connections (which reserve communication resources like buffers between nodes) as well as direct access by the computation unit(s) of the node to the communication system. These features allow a compiler to find innovative solutions when compiling data-parallel programs for a Private-Memory machine. In this paper, we discuss some optimizations that can be included in a compiler for the iWarp system, an example Private-Memory parallel system with a novel communication architecture.

David R. O'hallaron - One of the best experts on this subject based on the ideXlab platform.

  • Generating communication for array statements: design, implementation, and evaluation
    Journal of Parallel and Distributed Computing, 1994
    Co-Authors: James M. Stichnoth, David R. O'hallaron, Thomas Gross
    Abstract:

    Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data-parallel program for a Private Memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern Private Memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node Private Memory iWarp system, using a number of different distributions.

  • Exploiting task and data parallelism on a multicomputer
    Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93, 1993
    Co-Authors: Jaspal Subhlok, David R. O'hallaron, James M. Stichnoth, Thomas Gross
    Abstract:

    For many applications, achieving good performance on a Private Memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers focus on either data parallelism or task parallelism. Therefore, to achieve the desired results, the programmer must program the data and task parallelism separately. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.

  • The Impact of Communication Style on Machine Resource Usage for the iWarp Parallel Processor
    1992
    Co-Authors: Thomas R Gross, David R. O'hallaron, Susan Hinrichs, A. Hasegawa, Thomas M Stricker
    Abstract:

    Programs executing on a Private-Memory parallel system exchange data by explicitly sending and receiving messages. Two communication styles have been identified for such systems: Memory communication (each message exchanged between two processors is buffered in Memory, e.g. as in message passing) and systolic communication (each word of a message is transmitted directly from the sender processor to the receiver processor, without any buffering in Memory). The iWarp system supports both communication styles and therefore provides a platform that allows us to evaluate how the choice of communication style impacts the usage of processor resources. Parallel program generators map a machine-independent description of a computation onto a Private-Memory parallel system. We use two different parallel program generators that employ the two communication styles to map a set of application kernels onto iWarp. By using tools to generate the parallel programs, we are able to obtain realistic data on the execution of programs using the different communication styles. This paper reports on measurements of instruction format usage, the utilization of the communication ports (gates), and instruction frequencies on the iWarp system. It is a first step towards understanding how features and capabilities of parallel processors are actually used by parallel programs that have been mapped automatically.

  • SPAA - Subset barrier synchronization on a Private-Memory parallel system
    Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures - SPAA '92, 1992
    Co-Authors: Anja Feldmann, Thomas R Gross, David R. O'hallaron, Thomas M Stricker
    Abstract:

    A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straightforward; a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for Private-Memory parallel systems, the bounded buffer broadcast model and the anonymous destination message passing model, and presents algorithms for barrier synchronization in terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture, and can be supported efficiently by a modern Private-Memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a unidirectional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a Private-Memory parallel system, and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.