uniform memory access

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 4176 Experts worldwide ranked by ideXlab platform

Venkatesh Akella - One of the best experts on this subject based on the ideXlab platform.

  • Segmented bitline cache : Exploiting non-uniform memory access patterns
    Lecture Notes in Computer Science, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.
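
The dynamic re-mapping idea can be illustrated with a toy model (a sketch only, not the authors' implementation; the segment counts and the linear per-segment energy model are assumptions):

```python
# Toy model of segmented-bitline re-mapping: hot lines migrate toward
# segment 0, nearest the sense amplifiers, so their accesses discharge
# fewer bitline segments. All constants here are illustrative.

SEGMENTS = 4          # segments along the bitline (assumed)
LINES_PER_SEG = 4     # cache lines per segment (assumed)

class SegmentedCache:
    def __init__(self):
        # line id -> segment index; initially lines fill segments in order
        self.placement = {l: l // LINES_PER_SEG
                          for l in range(SEGMENTS * LINES_PER_SEG)}
        self.hits = {l: 0 for l in self.placement}
        self.energy = 0.0

    def access(self, line):
        seg = self.placement[line]
        # assumed energy model: 1 unit per active segment between
        # the accessed cell and the sense amp
        self.energy += seg + 1
        self.hits[line] += 1

    def remap(self):
        # place the most frequently accessed lines nearest the sense amps
        ranked = sorted(self.hits, key=self.hits.get, reverse=True)
        for rank, line in enumerate(ranked):
            self.placement[line] = rank // LINES_PER_SEG

cache = SegmentedCache()
for _ in range(100):            # skewed access pattern: line 15 is hot
    cache.access(15)
before = cache.energy
cache.remap()                   # line 15 now lives in segment 0
cache.energy = 0.0
for _ in range(100):
    cache.access(15)
print(before, cache.energy)     # → 400.0 100.0
```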

  • HiPC - Segmented bitline cache: exploiting non-uniform memory access patterns
    High Performance Computing - HiPC 2006, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

David A Bader - One of the best experts on this subject based on the ideXlab platform.

  • Using PRAM Algorithms on a uniform-memory-access Shared-memory Architecture
    Lecture Notes in Computer Science, 2001
    Co-Authors: David A Bader, Ajith K Illendula, Bernard M E Moret, Nina R. Weisse-Bernstein
    Abstract:

    The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
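
The PRAM-to-SMP porting style the paper describes can be illustrated with a classic PRAM primitive. A minimal sketch, using pointer-jumping list ranking as the example (the paper's own example is graph decomposition with a spanning-tree computation): each round below is the kind of synchronous parallel-for a shared-memory port would execute, with the PRAM's lockstep assumption replaced by a barrier between rounds.

```python
import math

def list_rank(succ):
    """succ[i] is the next node in the linked list; the tail points to itself.
    Returns each node's distance to the tail via O(log n) pointer jumping."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # one PRAM round: every "processor" i reads old values, writes new ones
        new_rank = [rank[i] + rank[succ[i]] for i in range(n)]
        new_succ = [succ[succ[i]] for i in range(n)]
        rank, succ = new_rank, new_succ   # barrier between rounds
    return rank

# list 0 -> 1 -> 2 -> 3 (tail)
print(list_rank([1, 2, 3, 3]))  # → [3, 2, 1, 0]
```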

  • Algorithm Engineering - Using PRAM Algorithms on a uniform-memory-access Shared-memory Architecture
    Algorithm Engineering, 2001
    Co-Authors: David A Bader, Ajith K Illendula, Bernard M E Moret, Nina R. Weisse-Bernstein
    Abstract:

    The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.

Xiaomei Guo - One of the best experts on this subject based on the ideXlab platform.

  • The Research of a memory accesses Behavior on Non-uniform memory access Architecture
    2019 10th International Conference on Information Technology in Medicine and Education (ITME), 2019
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good thread and data distribution scheme is important to the performance of data-parallel applications on a Non-uniform memory access (NUMA) architecture workstation. In this paper, we characterize an NUMA multiprocessor system and design experiments to compare two types of memory behavior, with and without a memory bandwidth load. We analyze the performance of the two cases and draw conclusions about how threads access memory.

  • The Research of Several Situations About memory accessing on Non-uniform memory access Architecture
    Annual ACIS International Conference on Computer and Information Science, 2018
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    In this paper, we focus on memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We investigate three types of memory operations: threads accessing local data, threads not accessing each other's data, and threads accessing each other's data. We compare and analyze the performance of the three cases when the system is idle and when it is not, and draw conclusions about how threads access memory.

  • ICIS - The Research of Several Situations About memory accessing on Non-uniform memory access Architecture
    2018 IEEE ACIS 17th International Conference on Computer and Information Science (ICIS), 2018
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    In this paper, we focus on memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We investigate three types of memory operations: threads accessing local data, threads not accessing each other's data, and threads accessing each other's data. We compare and analyze the performance of the three cases when the system is idle and when it is not, and draw conclusions about how threads access memory.

  • A good data allocation strategy on non-uniform memory access architecture
    Annual ACIS International Conference on Computer and Information Science, 2017
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good data distribution is important to the performance of applications on Non-uniform memory access (NUMA) shared-memory multiprocessors. In this paper, we investigate memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We reach two important conclusions: keep data local to the node where it is accessed, and avoid sharing data between threads running on different cores.
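
The locality conclusion can be illustrated with a toy latency model (a sketch only; the latency numbers are assumptions, not measurements from the paper):

```python
# Toy NUMA cost model: a thread pays a higher per-access latency when its
# data lives on a remote node. The two latencies below are illustrative
# placeholders, not measured Opteron numbers.

LOCAL_NS, REMOTE_NS = 100, 180   # assumed per-access latencies (ns)

def run_cost(thread_node, data_node, accesses):
    latency = LOCAL_NS if thread_node == data_node else REMOTE_NS
    return accesses * latency

# same workload, two placements: data on the thread's node vs. a remote node
local = run_cost(thread_node=0, data_node=0, accesses=1_000_000)
remote = run_cost(thread_node=0, data_node=1, accesses=1_000_000)
print(remote / local)  # remote placement costs 1.8x in this model
```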

  • ICIS - A good data allocation strategy on non-uniform memory access architecture
    2017 IEEE ACIS 16th International Conference on Computer and Information Science (ICIS), 2017
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good data distribution is important to the performance of applications on Non-uniform memory access (NUMA) shared-memory multiprocessors. In this paper, we investigate memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We reach two important conclusions: keep data local to the node where it is accessed, and avoid sharing data between threads running on different cores.

Ravishankar Rao - One of the best experts on this subject based on the ideXlab platform.

  • Segmented bitline cache : Exploiting non-uniform memory access patterns
    Lecture Notes in Computer Science, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

  • HiPC - Segmented bitline cache: exploiting non-uniform memory access patterns
    High Performance Computing - HiPC 2006, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

Alistair P. Rendell - One of the best experts on this subject based on the ideXlab platform.

  • A Simple Performance Model for Multithreaded Applications Executing on Non-uniform memory access Computers
    High Performance Computing and Communications, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth effects for single- and multi-threaded calculations within the Gaussian 03 computational chemistry code on a contemporary multi-core NUMA platform. Using the thread and memory placement APIs in Solaris, we present results for a set of calculations from which we analyze on-chip interconnect and intra-core bandwidth contention and show the importance of load balancing between threads. The extended model predicts single-threaded performance to within 1% error and most multi-threaded experiments to within 15% error. Our results and modeling show that accounting for bandwidth constraints within user-space code is beneficial.
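
The saturating-speedup behavior such a model captures can be sketched with a simple roofline-style formula (an illustration only; the formula and constants are assumptions, not the authors' exact model):

```python
# Toy bandwidth-aware performance model: runtime is bounded either by
# compute, which scales with thread count, or by shared memory bandwidth,
# which does not. All rates below are illustrative placeholders.

def runtime(flops, bytes_moved, threads,
            flops_per_s=2e9, mem_bw=10e9):
    compute_t = flops / (threads * flops_per_s)  # scales with threads
    memory_t = bytes_moved / mem_bw              # bandwidth is shared
    return max(compute_t, memory_t)

t1 = runtime(1e10, 1e10, threads=1)
t8 = runtime(1e10, 1e10, threads=8)
print(t1 / t8)  # → 5.0: speedup saturates once the memory term dominates
```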

  • HPCC - A Simple Performance Model for Multithreaded Applications Executing on Non-uniform memory access Computers
    2009 11th IEEE International Conference on High Performance Computing and Communications, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth effects for single- and multi-threaded calculations within the Gaussian 03 computational chemistry code on a contemporary multi-core NUMA platform. Using the thread and memory placement APIs in Solaris, we present results for a set of calculations from which we analyze on-chip interconnect and intra-core bandwidth contention and show the importance of load balancing between threads. The extended model predicts single-threaded performance to within 1% error and most multi-threaded experiments to within 15% error. Our results and modeling show that accounting for bandwidth constraints within user-space code is beneficial.

  • ISPAN - Effective Use of Dynamic Page Migration on NUMA Platforms: The Gaussian Chemistry Code on the SunFire X4600M2 System
    2009 10th International Symposium on Pervasive Systems Algorithms and Networks, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work we study the effect of data locality on the performance of the Gaussian 03 code running on a multi-core Non-uniform memory access (NUMA) system. A user-space protocol that affects runtime data locality through dynamic page migration and interleaving techniques is considered. Using this protocol yields a significant performance improvement. Results for parallel Gaussian 03 using up to 16 threads are presented. The overhead of page migration and the effect of dual-core contention are also examined.

  • International Conference on Computational Science - OpenMP and NUMA architectures I: Investigating memory placement on the SGI origin 3000
    Lecture Notes in Computer Science, 2003
    Co-Authors: Nathan Robertson, Alistair P. Rendell
    Abstract:

    The OpenMP programming model is based upon the assumption of uniform memory access. Virtually all current-day large-scale shared-memory computers exhibit some degree of Non-uniform memory access (NUMA). Should OpenMP be extended for NUMA architectures? This paper aims to quantify NUMA effects on the SGI Origin 3000 system as a prelude to answering this important question. We discuss the tools required to study NUMA effects and use them in a study of latency, bandwidth, and the solution of a 2-D heat diffusion problem.
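
The 2-D heat diffusion benchmark mentioned above can be sketched as a plain Jacobi stencil (an illustration only; the grid size, fixed boundary, and averaging constant are assumptions). On a NUMA machine, the point of the benchmark is that each block of rows should be first-touched, and later updated, by the same thread.

```python
# One Jacobi step of 2-D heat diffusion: each interior cell becomes the
# average of its four neighbours. Boundary cells are held fixed.

def heat_step(grid):
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):          # in an OpenMP port, this outer loop
        for j in range(1, n - 1):      # would be the parallel-for
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1])
    return new

g = [[0.0] * 5 for _ in range(5)]
g[2][2] = 100.0                        # a hot spot in the centre
g = heat_step(g)
print(g[2][1], g[2][2])                # → 25.0 0.0: heat spreads outward
```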