uniform memory access

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 4176 Experts worldwide ranked by ideXlab platform

Venkatesh Akella - One of the best experts on this subject based on the ideXlab platform.

  • Segmented bitline cache : Exploiting non-uniform memory access patterns
    Lecture Notes in Computer Science, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.
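
The dynamic re-mapping idea can be illustrated with a toy model (a sketch only, not the authors' implementation; the segment counts and the linear per-segment energy model are assumptions):

```python
# Toy model of segmented-bitline re-mapping: hot lines migrate toward
# segment 0, nearest the sense amplifiers, so their accesses discharge
# fewer bitline segments. All constants here are illustrative.

SEGMENTS = 4          # segments along the bitline (assumed)
LINES_PER_SEG = 4     # cache lines per segment (assumed)

class SegmentedCache:
    def __init__(self):
        # line id -> segment index; initially lines fill segments in order
        self.placement = {l: l // LINES_PER_SEG
                          for l in range(SEGMENTS * LINES_PER_SEG)}
        self.hits = {l: 0 for l in self.placement}
        self.energy = 0.0

    def access(self, line):
        seg = self.placement[line]
        # assumed energy model: 1 unit per active segment between
        # the accessed cell and the sense amp
        self.energy += seg + 1
        self.hits[line] += 1

    def remap(self):
        # place the most frequently accessed lines nearest the sense amps
        ranked = sorted(self.hits, key=self.hits.get, reverse=True)
        for rank, line in enumerate(ranked):
            self.placement[line] = rank // LINES_PER_SEG

cache = SegmentedCache()
for _ in range(100):            # skewed access pattern: line 15 is hot
    cache.access(15)
before = cache.energy
cache.remap()                   # line 15 now lives in segment 0
cache.energy = 0.0
for _ in range(100):
    cache.access(15)
print(before, cache.energy)     # → 400.0 100.0
```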

  • HiPC - Segmented bitline cache: exploiting non-uniform memory access patterns
    High Performance Computing - HiPC 2006, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

David A Bader - One of the best experts on this subject based on the ideXlab platform.

  • Using PRAM Algorithms on a uniform-memory-access Shared-memory Architecture
    Lecture Notes in Computer Science, 2001
    Co-Authors: David A Bader, Ajith K Illendula, Bernard M E Moret, Nina R. Weisse-Bernstein
    Abstract:

    The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
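
The PRAM-to-SMP porting style the paper describes can be illustrated with a classic PRAM primitive. A minimal sketch, using pointer-jumping list ranking as the example (the paper's own example is graph decomposition with a spanning-tree computation): each round below is the kind of synchronous parallel-for a shared-memory port would execute, with the PRAM's lockstep assumption replaced by a barrier between rounds.

```python
import math

def list_rank(succ):
    """succ[i] is the next node in the linked list; the tail points to itself.
    Returns each node's distance to the tail via O(log n) pointer jumping."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # one PRAM round: every "processor" i reads old values, writes new ones
        new_rank = [rank[i] + rank[succ[i]] for i in range(n)]
        new_succ = [succ[succ[i]] for i in range(n)]
        rank, succ = new_rank, new_succ   # barrier between rounds
    return rank

# list 0 -> 1 -> 2 -> 3 (tail)
print(list_rank([1, 2, 3, 3]))  # → [3, 2, 1, 0]
```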

  • Algorithm Engineering - Using PRAM Algorithms on a uniform-memory-access Shared-memory Architecture
    Algorithm Engineering, 2001
    Co-Authors: David A Bader, Ajith K Illendula, Bernard M E Moret, Nina R. Weisse-Bernstein
    Abstract:

    The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.

Xiaomei Guo - One of the best experts on this subject based on the ideXlab platform.

  • The Research of a memory accesses Behavior on Non-uniform memory access Architecture
    2019 10th International Conference on Information Technology in Medicine and Education (ITME), 2019
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good thread and data distribution scheme is important to the performance of data-parallel applications on a Non-uniform memory access (NUMA) architecture workstation. In this paper, we characterize an NUMA multiprocessor system and design experiments to compare two types of memory behavior, with and without a memory bandwidth load. We analyze the performance of the two cases and draw conclusions about how threads access memory.

  • The Research of Several Situations About memory accessing on Non-uniform memory access Architecture
    Annual ACIS International Conference on Computer and Information Science, 2018
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    In this paper, we focus on memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We investigate three types of memory operations: threads accessing local data, threads not accessing each other's data, and threads accessing each other's data. We compare and analyze the performance of the three cases when the system is idle and when it is not, and draw conclusions about how threads access memory.

  • ICIS - The Research of Several Situations About memory accessing on Non-uniform memory access Architecture
    2018 IEEE ACIS 17th International Conference on Computer and Information Science (ICIS), 2018
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    In this paper, we focus on memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We investigate three types of memory operations: threads accessing local data, threads not accessing each other's data, and threads accessing each other's data. We compare and analyze the performance of the three cases when the system is idle and when it is not, and draw conclusions about how threads access memory.

  • A good data allocation strategy on non-uniform memory access architecture
    Annual ACIS International Conference on Computer and Information Science, 2017
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good data distribution is important to the performance of applications on Non-uniform memory access (NUMA) shared-memory multiprocessors. In this paper, we investigate memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We reach two important conclusions: keep data local to the node where it is accessed, and avoid sharing data between threads running on different cores.
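
The locality conclusion can be illustrated with a toy latency model (a sketch only; the latency numbers are assumptions, not measurements from the paper):

```python
# Toy NUMA cost model: a thread pays a higher per-access latency when its
# data lives on a remote node. The two latencies below are illustrative
# placeholders, not measured Opteron numbers.

LOCAL_NS, REMOTE_NS = 100, 180   # assumed per-access latencies (ns)

def run_cost(thread_node, data_node, accesses):
    latency = LOCAL_NS if thread_node == data_node else REMOTE_NS
    return accesses * latency

# same workload, two placements: data on the thread's node vs. a remote node
local = run_cost(thread_node=0, data_node=0, accesses=1_000_000)
remote = run_cost(thread_node=0, data_node=1, accesses=1_000_000)
print(remote / local)  # remote placement costs 1.8x in this model
```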

  • ICIS - A good data allocation strategy on non-uniform memory access architecture
    2017 IEEE ACIS 16th International Conference on Computer and Information Science (ICIS), 2017
    Co-Authors: Xiaomei Guo, Haiyun Han
    Abstract:

    Choosing a good data distribution is important to the performance of applications on Non-uniform memory access (NUMA) shared-memory multiprocessors. In this paper, we investigate memory access behavior on an AMD Opteron cache-coherent Non-uniform memory access (ccNUMA) platform with multicore processors. We reach two important conclusions: keep data local to the node where it is accessed, and avoid sharing data between threads running on different cores.

Ravishankar Rao - One of the best experts on this subject based on the ideXlab platform.

  • Segmented bitline cache : Exploiting non-uniform memory access patterns
    Lecture Notes in Computer Science, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

  • HiPC - Segmented bitline cache: exploiting non-uniform memory access patterns
    High Performance Computing - HiPC 2006, 2006
    Co-Authors: Ravishankar Rao, Justin Wenck, Diana Franklin, Rajeevan Amirtharajah, Venkatesh Akella
    Abstract:

    On-chip caches in modern processors account for a sizable fraction of dynamic and leakage power. Much of this power is wasted: it is spent only because the memory cells farthest from the sense amplifiers must discharge a large capacitance on the bitlines. We reduce this capacitance by segmenting the memory cells along the bitlines and turning off the segmenters to lower the overall bitline capacitance. The success of this cache relies on accessing segments near the sense amps much more often than remote segments. We show that the access pattern to the first-level data and instruction caches is extremely skewed: only a small set of cache lines is accessed frequently. We exploit this non-uniform cache access pattern by mapping the frequently accessed cache lines closer to the sense amps. These lines are isolated by segmenting circuits on the bitlines and hence dissipate less power when accessed. Modifications to the address decoder enable dynamic re-mapping of cache lines to segments. In this paper, we explore the design space of segmenting the level-one data and instruction caches. The instruction and data caches show potential power savings of 10% and 6%, respectively, on the subset of benchmarks simulated.

Alistair P. Rendell - One of the best experts on this subject based on the ideXlab platform.

  • A Simple Performance Model for Multithreaded Applications Executing on Non-uniform memory access Computers
    High Performance Computing and Communications, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth effects for single- and multi-threaded calculations within the Gaussian 03 computational chemistry code on a contemporary multi-core NUMA platform. Using the thread and memory placement APIs in Solaris, we present results for a set of calculations from which we analyze on-chip interconnect and intra-core bandwidth contention and show the importance of load balancing between threads. The extended model predicts single-threaded performance to within 1% error and most multi-threaded experiments to within 15% error. Our results and modeling show that accounting for bandwidth constraints within user-space code is beneficial.
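
The saturating-speedup behavior such a model captures can be sketched with a simple roofline-style formula (an illustration only; the formula and constants are assumptions, not the authors' exact model):

```python
# Toy bandwidth-aware performance model: runtime is bounded either by
# compute, which scales with thread count, or by shared memory bandwidth,
# which does not. All rates below are illustrative placeholders.

def runtime(flops, bytes_moved, threads,
            flops_per_s=2e9, mem_bw=10e9):
    compute_t = flops / (threads * flops_per_s)  # scales with threads
    memory_t = bytes_moved / mem_bw              # bandwidth is shared
    return max(compute_t, memory_t)

t1 = runtime(1e10, 1e10, threads=1)
t8 = runtime(1e10, 1e10, threads=8)
print(t1 / t8)  # → 5.0: speedup saturates once the memory term dominates
```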

  • HPCC - A Simple Performance Model for Multithreaded Applications Executing on Non-uniform memory access Computers
    2009 11th IEEE International Conference on High Performance Computing and Communications, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth effects for single- and multi-threaded calculations within the Gaussian 03 computational chemistry code on a contemporary multi-core NUMA platform. Using the thread and memory placement APIs in Solaris, we present results for a set of calculations from which we analyze on-chip interconnect and intra-core bandwidth contention and show the importance of load balancing between threads. The extended model predicts single-threaded performance to within 1% error and most multi-threaded experiments to within 15% error. Our results and modeling show that accounting for bandwidth constraints within user-space code is beneficial.

  • ISPAN - Effective Use of Dynamic Page Migration on NUMA Platforms: The Gaussian Chemistry Code on the SunFire X4600M2 System
    2009 10th International Symposium on Pervasive Systems Algorithms and Networks, 2009
    Co-Authors: Rui Yang, Joseph Antony, Alistair P. Rendell
    Abstract:

    In this work we study the effect of data locality on the performance of the Gaussian 03 code running on a multi-core Non-uniform memory access (NUMA) system. A user-space protocol that affects runtime data locality through dynamic page migration and interleaving techniques is considered. Using this protocol yields a significant performance improvement. Results for parallel Gaussian 03 using up to 16 threads are presented. The overhead of page migration and the effect of dual-core contention are also examined.

  • International Conference on Computational Science - OpenMP and NUMA architectures I: Investigating memory placement on the SGI origin 3000
    Lecture Notes in Computer Science, 2003
    Co-Authors: Nathan Robertson, Alistair P. Rendell
    Abstract:

    The OpenMP programming model is based upon the assumption of uniform memory access. Virtually all current-day large-scale shared-memory computers exhibit some degree of Non-uniform memory access (NUMA). Should OpenMP be extended for NUMA architectures? This paper aims to quantify NUMA effects on the SGI Origin 3000 system as a prelude to answering this important question. We discuss the tools required to study NUMA effects and use them in a study of latency, bandwidth, and the solution of a 2-D heat diffusion problem.
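
The 2-D heat diffusion benchmark mentioned above can be sketched as a plain Jacobi stencil (an illustration only; the grid size, fixed boundary, and averaging constant are assumptions). On a NUMA machine, the point of the benchmark is that each block of rows should be first-touched, and later updated, by the same thread.

```python
# One Jacobi step of 2-D heat diffusion: each interior cell becomes the
# average of its four neighbours. Boundary cells are held fixed.

def heat_step(grid):
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):          # in an OpenMP port, this outer loop
        for j in range(1, n - 1):      # would be the parallel-for
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                + grid[i][j-1] + grid[i][j+1])
    return new

g = [[0.0] * 5 for _ in range(5)]
g[2][2] = 100.0                        # a hot spot in the centre
g = heat_step(g)
print(g[2][1], g[2][2])                # → 25.0 0.0: heat spreads outward
```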