Memory Hierarchy

The experts below are selected from a list of 23,235 experts worldwide, ranked by the ideXlab platform.

Viktor K Prasanna - One of the best experts on this subject based on the ideXlab platform.

  • Tiling, Block Data Layout, and Memory Hierarchy Performance
    IEEE Transactions on Parallel and Distributed Systems, 2003
    Co-Authors: Neungsoo Park, Bo Hong, Viktor K Prasanna
    Abstract:

    Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSPARC II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques; the total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout; experimental results show that matrix multiplication using block data layout is up to 15 percent faster than with Morton layout.
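
    As a rough illustration of the technique analyzed above (a sketch, not the authors' implementation), the C fragment below stores matrices in block data layout, with each B x B block contiguous in memory, and multiplies them tile by tile. N and B are placeholder values; the paper's selection algorithm would choose B.

      #include <stddef.h>

      #define N 1024   /* matrix dimension; assumed to be a multiple of B */
      #define B 64     /* block size; stand-in for the selection algorithm's choice */

      /* In block data layout the B*B elements of each block are contiguous,
       * so one tile touches only a few pages (and thus few TLB entries). */
      static double *blk(double *m, int bi, int bj) {
          return m + ((size_t)bi * (N / B) + bj) * B * B;
      }

      /* C += A * B with all three matrices stored in block data layout. */
      void matmul_blocked(double *a, double *b, double *c) {
          for (int bi = 0; bi < N / B; bi++)
              for (int bj = 0; bj < N / B; bj++)
                  for (int bk = 0; bk < N / B; bk++) {
                      double *ta = blk(a, bi, bk), *tb = blk(b, bk, bj);
                      double *tc = blk(c, bi, bj);
                      for (int i = 0; i < B; i++)
                          for (int j = 0; j < B; j++) {
                              double s = tc[i * B + j];
                              for (int k = 0; k < B; k++)
                                  s += ta[i * B + k] * tb[k * B + j];
                              tc[i * B + j] = s;
                          }
                  }
      }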

  • Analysis of Memory Hierarchy Performance of Block Data Layout
    International Conference on Parallel Processing, 2002
    Co-Authors: Neungsoo Park, Bo Hong, Viktor K Prasanna
    Abstract:

    Recently, several experimental studies have been conducted on block data layout as a data transformation technique used in conjunction with tiling to improve cache performance. We provide a theoretical analysis for the TLB and cache performance of block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on the number of TLB misses for any data layout and show that block data layout achieves this bound. We show that block data layout reduces TLB misses by a factor of O(B) compared with conventional data layouts, where B is the block size. This reduction contributes to the improvement in memory hierarchy performance. Using our TLB and cache analysis, we also discuss the impact of block size on overall memory hierarchy performance. These results are validated through simulations and experiments on state-of-the-art platforms.
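
    To make the O(B) TLB factor concrete, the hypothetical helpers below map a logical (i, j) index to a storage offset under row-major versus block data layout: walking a B x B tile touches up to B distinct rows (and hence pages) in the former, but one contiguous region in the latter.

      #include <stddef.h>

      /* Row-major: element (i, j) of an n x n matrix. A column walk of a
       * B x B tile crosses B different rows, hence up to B pages. */
      size_t rowmajor_off(size_t i, size_t j, size_t n) {
          return i * n + j;
      }

      /* Block data layout with b x b blocks: every block is contiguous,
       * so a tile spans roughly b*b*sizeof(elem)/page_size pages. */
      size_t blocked_off(size_t i, size_t j, size_t n, size_t b) {
          size_t block  = (i / b) * (n / b) + (j / b);  /* which block    */
          size_t within = (i % b) * b + (j % b);        /* offset inside  */
          return block * b * b + within;
      }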

William J Dally - One of the best experts on this subject based on the ideXlab platform.

  • SLIP: Reducing Wire Energy in the Memory Hierarchy
    International Symposium on Computer Architecture, 2015
    Co-Authors: Subhasis Das, Tor M Aamodt, William J Dally
    Abstract:

    Wire energy has become the major contributor to energy in large lower-level caches. While wire energy is related to wire latency, its costs are exposed differently in the memory hierarchy. We propose Sub-Level Insertion Policy (SLIP), a cache management policy which improves cache energy consumption by increasing the number of accesses served from energy-efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sub-levels of differing sizes. The recent reuse-distance distribution of a line is then used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements. Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 level and 22% at the L3 level, and performs 0.75% better than a regular cache hierarchy in a single-core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b of metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b of distribution metadata for each page in DRAM (an overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% of LLC energy and reduces traffic to DRAM by 5.5%.
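
    A loose software analogy of the SLIP decision (the paper's mechanism is a hardware predictor; the structures and numbers below are invented for illustration): given a line's recent reuse-distance histogram, pick the insertion sub-level that minimizes expected energy per access, so lines with short reuse distances land in the small, low-wire-energy sub-level.

      #include <stdint.h>

      #define NUM_SUBLEVELS 3

      /* Assumed per-access energy of each sub-level (arbitrary units);
       * smaller sub-levels are physically closer, hence cheaper wires. */
      static const double energy[NUM_SUBLEVELS] = { 1.0, 2.5, 6.0 };

      /* Hypothetical per-line metadata: reuse distances bucketed by the
       * smallest sub-level that could have captured each reuse. */
      typedef struct {
          uint32_t hits[NUM_SUBLEVELS];
          uint32_t misses;   /* reuses beyond the largest sub-level */
      } reuse_hist;

      /* Insert at the sub-level minimizing expected energy: reuses it can
       * capture cost energy[s]; the rest fall through at miss_energy. */
      int slip_insert_level(const reuse_hist *h, double miss_energy) {
          uint32_t total = h->misses;
          for (int s = 0; s < NUM_SUBLEVELS; s++) total += h->hits[s];
          if (total == 0) return NUM_SUBLEVELS - 1;   /* no history yet */
          int best = 0;
          double best_cost = 1e300;
          for (int s = 0; s < NUM_SUBLEVELS; s++) {
              uint32_t captured = 0;
              for (int t = 0; t <= s; t++) captured += h->hits[t];
              double cost = (captured * energy[s]
                             + (total - captured) * miss_energy) / total;
              if (cost < best_cost) { best_cost = cost; best = s; }
          }
          return best;
      }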

  • A Tuning Framework for Software-Managed Memory Hierarchies
    2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008
    Co-Authors: Ji Young Park, Mike Houston, Alex Aiken, William J Dally
    Abstract:

    Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine's particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters that require tuning, and manually searching the resulting large, non-linear space of program parameters is a tedious process of trial and error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3s.
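
    The search problem the paper addresses can be pictured with a toy exhaustive sweep (the actual framework is far more sophisticated; benchmark() and the parameter ranges here are invented for illustration):

      #include <stdio.h>

      /* Hypothetical measurement hook: run the kernel with the given tile
       * sizes on the target machine and return elapsed seconds. */
      extern double benchmark(int outer_tile, int inner_tile);

      /* Exhaustive sweep over two inter-dependent tiling parameters. Real
       * multi-level machines expose tens of such parameters, which is why
       * automated search beats manual trial and error. */
      void tune(void) {
          double best_time = 1e300;
          int best_outer = 0, best_inner = 0;
          for (int outer = 64; outer <= 1024; outer *= 2)
              for (int inner = 4; inner <= outer; inner *= 2) {
                  double t = benchmark(outer, inner);
                  if (t < best_time) {
                      best_time  = t;
                      best_outer = outer;
                      best_inner = inner;
                  }
              }
          printf("best: outer=%d inner=%d (%.3f s)\n",
                 best_outer, best_inner, best_time);
      }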

  • Sequoia: Programming the Memory Hierarchy
    Proceedings of the 2006 ACM IEEE conference on Supercomputing - SC '06, 2006
    Co-Authors: Kayvon Fatahalian, Mike Houston, Alex Aiken, Ji Young Park, Daniel Reiter Horn, Timothy James Knight, Larkhoon Leem, Mattan Erez, Manman Ren, William J Dally
    Abstract:

    We present Sequoia, a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed-memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
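
    Sequoia programs are trees of tasks, each tree level mapped onto one level of the machine's memory hierarchy. The C sketch below mimics only that shape (Sequoia is its own language; this is not its syntax, and copy_to_level is an invented stand-in for its vertical communication): a task splits its working set until a piece fits the innermost memory, copying data down one level at each leaf.

      #include <stddef.h>

      #define LEAF_N 4096   /* assumed capacity of the innermost memory */

      /* Invented explicit-transfer primitive standing in for Sequoia's
       * vertical communication between adjacent memory levels. */
      extern void copy_to_level(double *dst, const double *src,
                                size_t n, int level);

      /* Leaf task: runs entirely out of the innermost memory. */
      static void leaf_sum(const double *x, size_t n, double *out) {
          double s = 0.0;
          for (size_t i = 0; i < n; i++) s += x[i];
          *out += s;
      }

      /* Inner task: recursively split until the piece fits a leaf, then
       * move it down and compute -- the task tree mirrors the hierarchy. */
      void tree_sum(const double *x, size_t n, int level, double *out) {
          if (n <= LEAF_N) {
              double local[LEAF_N];                /* innermost memory  */
              copy_to_level(local, x, n, level);   /* explicit downward copy */
              leaf_sum(local, n, out);
              return;
          }
          tree_sum(x, n / 2, level - 1, out);
          tree_sum(x + n / 2, n - n / 2, level - 1, out);
      }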

  • Memory Hierarchy Design for Stream Computing
    2005
    Co-Authors: William J Dally, Nuwan Jayasena
    Abstract:

    Several classes of applications with abundant fine-grain parallelism, such as media and signal processing, graphics, and scientific computing, have become increasingly dominant consumers of computing resources. Prior research has shown that stream processors provide an energy-efficient, programmable approach to achieving high performance for these applications. However, given the strong compute capabilities of these processors, efficient utilization of bandwidth, particularly when accessing off-chip memory, is crucial to sustaining high performance. This thesis explores tradeoffs in, and techniques for, improving the efficiency of memory and bandwidth hierarchy utilization in stream processors. We first evaluate the appropriate granularity for expressing data-level parallelism, entire records or individual words, and show that record-granularity expression of parallelism leads to reduced intermediate state storage requirements and higher sustained bandwidths in modern memory systems. We also explore the effectiveness of software- and hardware-managed memories, and identify the relative merits of each type of memory within the context of stream computing. Software-managed memories are shown to efficiently support coarse-grain and producer-consumer data reuse, while hardware-managed memories are shown to effectively capture fine-grain and irregular temporal reuse. We introduce three new techniques for improving the efficiency of off-chip memory bandwidth utilization. First, we propose a stream register file architecture that enables indexed, arbitrary access patterns, allowing a wider range of data reuse to be captured in on-chip, software-managed memory compared to current stream processors. We then introduce epoch-based cache invalidation, a technique that actively identifies and invalidates dead data, to improve the performance of hardware-managed caches for stream computing. Finally, we propose a hybrid bandwidth hierarchy that incorporates both hardware- and software-managed memory, and allows dynamic reallocation of capacity between these two types of memories to better cater to application requirements. Our analyses and evaluations show that these techniques not only provide performance improvements for existing streaming applications but also broaden the capabilities of stream processors, enabling new classes of applications to be executed efficiently.
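
    One theme of the thesis, software-managed on-chip memory capturing coarse-grain and producer-consumer reuse, is commonly realized with double buffering: while the kernel consumes one buffer of the stream, the next chunk is fetched into the other. A generic C sketch under that assumption (fetch_async and fetch_wait are placeholders for whatever DMA interface the hardware exposes):

      #include <stddef.h>

      #define CHUNK 1024

      /* Placeholder DMA primitives for a software-managed scratchpad,
       * assuming a single outstanding transfer completed in FIFO order. */
      extern void fetch_async(double *dst, const double *src, size_t n);
      extern void fetch_wait(void);

      extern void consume(const double *chunk, size_t n);  /* stream kernel */

      /* Double-buffered streaming: overlap the off-chip transfer of
       * chunk i+1 with computation on chunk i. */
      void stream(const double *in, size_t total) {
          static double buf[2][CHUNK];
          size_t nchunks = total / CHUNK;
          if (nchunks == 0) return;
          fetch_async(buf[0], in, CHUNK);
          for (size_t i = 0; i < nchunks; i++) {
              fetch_wait();                        /* chunk i is ready */
              if (i + 1 < nchunks)
                  fetch_async(buf[(i + 1) & 1], in + (i + 1) * CHUNK, CHUNK);
              consume(buf[i & 1], CHUNK);
          }
      }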

Daniel Sanchez - One of the best experts on this subject based on the ideXlab platform.

  • Livia: Data-Centric Computing Throughout the Memory Hierarchy
    Architectural Support for Programming Languages and Operating Systems, 2020
    Co-Authors: Elliot Lockerman, Daniel Sanchez, Axel Feldmann, Mohammad Bakhshalipour, Alexandru Stanescu, Shashwat Gupta, Nathan Beckmann
    Abstract:

    In order to scale, future systems will need to dramatically reduce data movement. Data movement is expensive in current designs because (i) traditional memory hierarchies force computation to happen unnecessarily far away from data and (ii) processing-in-memory approaches fail to exploit locality. We propose Memory Services, a flexible programming model that enables data-centric computing throughout the memory hierarchy. In Memory Services, applications express functionality as graphs of simple tasks, each task indicating the data it operates on. We design and evaluate Livia, a new system architecture for Memory Services that dynamically schedules tasks and data at the location in the memory hierarchy that minimizes overall data movement. Livia adds less than 3% area overhead to a tiled multicore and accelerates challenging irregular workloads by 1.3× to 2.4× while reducing dynamic energy by 1.2× to 4.7×.
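
    In the Memory Services model, an application hands the hierarchy small tasks tagged with the data they touch, and the system runs each task wherever that data lives. A hypothetical C rendering of the idea (the names and API below are invented for illustration, not Livia's interface):

      #include <stddef.h>
      #include <stdint.h>

      /* A Memory Service task: a small function plus the datum it operates
       * on, so a runtime can execute it at whichever cache level or memory
       * controller holds that line, instead of hauling the data up. */
      typedef struct {
          void (*fn)(void *data, void *arg);
          void *data;   /* the single datum the task operates on */
          void *arg;
      } mem_task;

      /* Hypothetical runtime hook: schedule a task near its data and chain
       * a continuation, forming the task graphs the paper describes. */
      extern void ms_spawn(mem_task t, mem_task *continuation);

      /* Example task: bump a counter that lives far down the hierarchy;
       * only the small task descriptor travels, not the cache line. */
      static void bump(void *data, void *arg) {
          *(uint64_t *)data += (uintptr_t)arg;
      }

      void remote_increment(uint64_t *counter, uint64_t delta) {
          mem_task t = { bump, counter, (void *)(uintptr_t)delta };
          ms_spawn(t, NULL);   /* no continuation */
      }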

  • Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy
    Architectural Support for Programming Languages and Operating Systems, 2019
    Co-Authors: Po-An Tsai, Daniel Sanchez
    Abstract:

    Existing cache and main memory compression techniques compress data in small fixed-size blocks, typically cache lines. Moreover, they use simple compression algorithms that focus on exploiting redundancy within a block. These techniques work well for scientific programs that are dominated by arrays. However, they are ineffective on object-based programs because objects do not fall neatly into fixed-size blocks and have a more irregular layout. We present the first compressed memory hierarchy designed for object-based applications. We observe that (i) objects, not cache lines, are the natural unit of compression for these programs, as they traverse and operate on object pointers; and (ii) though redundancy within each object is limited, redundancy across objects of the same type is plentiful. We exploit these insights through Zippads, an object-based compressed memory hierarchy, and COCO, a cross-object-compression algorithm. Building on a recent object-based memory hierarchy, Zippads transparently compresses variable-sized objects and stores them compactly. As a result, Zippads consistently outperforms a state-of-the-art compressed memory hierarchy: on a mix of array- and object-dominated workloads, Zippads achieves a 1.63× higher compression ratio and improves performance by 17%.
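
    The core observation behind cross-object compression, redundancy across objects of the same type rather than within one object, can be sketched as byte-wise delta encoding against a per-type base object (a loose software analogy; the encoding and names below are invented, not COCO's actual format):

      #include <stddef.h>
      #include <stdint.h>

      /* Cross-object compression in miniature: objects of one type tend to
       * differ from a representative "base" object in only a few bytes, so
       * we store (offset, byte) pairs for just the differing bytes. */
      typedef struct { uint16_t off; uint8_t val; } diff;

      /* Encode obj as a delta from base; returns the number of diffs, or
       * -1 if the object diverges too much and should stay uncompressed. */
      int coco_encode(const uint8_t *obj, const uint8_t *base, size_t size,
                      diff *out, int max_diffs) {
          int n = 0;
          for (size_t i = 0; i < size; i++) {
              if (obj[i] != base[i]) {
                  if (n == max_diffs) return -1;
                  out[n].off = (uint16_t)i;
                  out[n].val = obj[i];
                  n++;
              }
          }
          return n;
      }

      /* Reconstruct the object: start from the base, replay the diffs. */
      void coco_decode(uint8_t *obj, const uint8_t *base, size_t size,
                       const diff *d, int n) {
          for (size_t i = 0; i < size; i++) obj[i] = base[i];
          for (int i = 0; i < n; i++) obj[d[i].off] = d[i].val;
      }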

  • Rethinking the Memory Hierarchy for Modern Languages
    International Symposium on Microarchitecture, 2018
    Co-Authors: Po-An Tsai, Yee Ling Gan, Daniel Sanchez
    Abstract:

    We present Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust. Memory-safe languages hide the memory layout from the programmer. This prevents memory corruption bugs and enables automatic memory management. Hotpads extends the same insight to the memory hierarchy: it hides the memory layout from software and takes control over it, dispensing with the conventional flat address space abstraction. This avoids the need for associative caches. Instead, Hotpads moves objects across a hierarchy of directly addressed memories. It rewrites pointers to avoid most associative lookups, provides hardware support for memory allocation, and unifies hierarchical garbage collection and data placement. As a result, Hotpads substantially improves memory performance and efficiency, and unlocks many new optimizations.
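
    A toy software model of the Hotpads idea (the paper describes hardware; everything below is invented for illustration): objects live in a small, directly addressed "pad" where a pointer is just a base plus an offset, and when an object is evicted to the next level, the pointer to it is rewritten so later dereferences need no associative lookup.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define PAD_SIZE (64 * 1024)

      /* A directly addressed pad: objects are bump-allocated contiguously,
       * so a reference is pad_base + offset -- no tag check needed. */
      typedef struct {
          uint8_t data[PAD_SIZE];
          size_t  top;
      } pad;

      void *pad_alloc(pad *p, size_t n) {
          if (p->top + n > PAD_SIZE) return NULL;  /* would trigger eviction */
          void *obj = p->data + p->top;
          p->top += n;
          return obj;
      }

      /* Evict an object to the next pad level and rewrite the (single, for
       * simplicity) pointer that refers to it; the real design rewrites
       * pointers as it walks them during hierarchical collection. */
      void pad_evict(pad *next, void **ref, size_t n) {
          void *moved = pad_alloc(next, n);
          if (moved == NULL) return;   /* next level full: would recurse */
          memcpy(moved, *ref, n);
          *ref = moved;                /* pointer now names the new home */
      }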