Cache Hierarchy

The experts below are selected from a list of 7,761 experts worldwide, ranked by the ideXlab platform.

Chen Ding - One of the best experts on this subject based on the ideXlab platform.

  • Rethinking a Heap Hierarchy as a Cache Hierarchy: A Higher-Order Theory of Memory Demand (HOTM)
    International Symposium on Memory Management, 2016
    Co-Authors: Hao Luo, Chen Ding
    Abstract:

    Modern memory allocators divide the available memory between different threads and object size classes. They use many parameters that are interrelated and mutually affecting. Existing solutions are based on heuristics, which cannot serve all applications equally well. This paper presents a theory of memory demand. The theory enables the global optimization of heap parameters for an application. The paper evaluates the theory and the optimization using multi-threaded micro-benchmarks as well as real applications, including Apache, the Ghostscript interpreter, and a database benchmarking tool, and shows that the global optimization theoretically outperforms three typical heuristics by 15% to 113%.
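
    As a toy illustration of how a demand theory can replace heuristics (a sketch, not the paper's HOTM formalism): compute each size class's memory demand, i.e. its peak live bytes over an allocation trace, and size the classes from those peaks instead of using a fixed heuristic split. The event format and the size-class mapping below are illustrative assumptions.

    #include <stddef.h>

    /* One trace event: positive size = allocation, negative = free of a
     * block of that size. Illustrative, minimal model. */
    typedef struct { long bytes; } event_t;

    #define NCLASSES 8

    /* Map a request size to a power-of-two size class (16, 32, ... bytes). */
    static int size_class(long bytes) {
        long sz = bytes < 0 ? -bytes : bytes;
        int c = 0;
        while ((16L << c) < sz && c < NCLASSES - 1) c++;
        return c;
    }

    /* Peak live bytes per class over the trace: the class's "memory
     * demand". Sizing each class to its peak is the trace-optimal split,
     * in contrast to dividing the heap by a fixed heuristic ratio. */
    void demand_per_class(const event_t *trace, size_t n, long peak[NCLASSES]) {
        long live[NCLASSES] = {0};
        for (int c = 0; c < NCLASSES; c++) peak[c] = 0;
        for (size_t i = 0; i < n; i++) {
            int c = size_class(trace[i].bytes);
            live[c] += trace[i].bytes;
            if (live[c] > peak[c]) peak[c] = live[c];
        }
    }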

  • Defensive Loop Tiling for Shared Cache
    Symposium on Code Generation and Optimization, 2013
    Co-Authors: Bin Bao, Chen Ding
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. The cache space available in the shared cache changes depending on the co-run applications. Furthermore, on machines with an inclusive cache hierarchy, interference in the shared cache can cause evictions in the private caches, a problem known as inclusion victims. This paper presents defensive tiling, a set of compiler techniques to estimate the effect of cache sharing and then choose tile sizes that provide robust performance in co-run environments. The goal of the transformation is to optimize the use of the cache while at the same time guarding against interference. It is an entirely static technique and does not require program profiling. The paper shows how it can be integrated into a production-quality compiler and evaluates its effect on a set of tiling benchmarks for both co-run and solo-run performance, using both simulation and testing on real systems.
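
    The defensive choice can be made concrete with a hand-tiled matrix multiply. The sketch below is not the paper's compiler transformation; it is a minimal C kernel in which the tile size is derived from an assumed per-core share of the LLC rather than from the whole cache, so the working set stays resident even when co-runners compete for the shared levels.

    #include <stddef.h>

    /* Assumed machine parameters (illustrative, not from the paper):
     * an 8 MB inclusive LLC shared by 4 cores. A conventional tiling
     * would target the whole LLC; a defensive tiling targets only the
     * share this core can rely on when co-run applications compete. */
    #define LLC_BYTES   (8u * 1024 * 1024)
    #define CORES       4u
    #define SAFE_BYTES  (LLC_BYTES / CORES)          /* defensive share */

    /* Three double tiles (A, B, C blocks) must fit in the safe share:
     * 3 * TILE * TILE * sizeof(double) <= SAFE_BYTES. For 2 MB this
     * gives TILE of about 295; rounding down to a multiple of 8 keeps
     * tile rows cache-line aligned. */
    #define TILE 288

    /* C = C + A * B, all n x n, row-major, hand-tiled. */
    void matmul_tiled(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    /* Mini-kernel over one tile triple. */
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }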

Luca Benini - One of the best experts on this subject based on the ideXlab platform.

  • DORY: Lightweight Memory Hierarchy Management for Deep NN Inference on IoT Endnodes (Work in Progress)
    International Conference on Hardware Software Codesign and System Synthesis (CODES+ISSS), 2019
    Co-Authors: Alessio Burrello, Francesco Conti, Angelo Garofalo, Davide Rossi, Luca Benini
    Abstract:

    IoT endnodes often couple a small and fast L1 scratchpad memory with a higher-capacity but lower-bandwidth and slower L2 background memory. The absence of a coherent hardware cache hierarchy saves energy but comes at the cost of labor-intensive explicit memory management, complicating the deployment of algorithms with a large data memory footprint, such as Deep Neural Network (DNN) inference. In this work, we present DORY, a lightweight software cache dedicated to DNN Deployment Oriented to memoRY. DORY leverages static data tiling and DMA-based double buffering to hide the complexity of manual L1-L2 memory traffic management. DORY enables storage of activations and weights in L2 with less than 4% performance overhead with respect to direct execution in L1. We show that a 142 kB DNN achieving 79.9% accuracy on CIFAR-10 runs 3.2x faster than its execution directly from L2 memory while consuming 1.9x less energy.
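
    The core mechanism, static tiling plus DMA double buffering, can be sketched in a few lines of C. The dma_start/dma_wait calls below are hypothetical stand-ins for a platform's asynchronous DMA driver (they are not DORY's actual interface); the point is the ping-pong schedule that overlaps the transfer of tile t+1 with the computation on tile t.

    #include <stddef.h>

    #define TILE_BYTES 4096u   /* illustrative L1 tile size */

    /* Hypothetical asynchronous DMA primitives; on a real endnode these
     * would be the platform's DMA driver calls, not this exact API. */
    typedef int dma_handle_t;
    dma_handle_t dma_start(void *dst, const void *src, size_t bytes);
    void dma_wait(dma_handle_t h);

    /* Placeholder for the per-tile DNN kernel. */
    void compute_tile(const unsigned char *tile, size_t bytes);

    /* Stream a large L2-resident buffer through two L1 ping-pong tiles,
     * overlapping the DMA of tile t+1 with the computation on tile t. */
    void process_double_buffered(const unsigned char *l2_buf, size_t total,
                                 unsigned char l1_ping[TILE_BYTES],
                                 unsigned char l1_pong[TILE_BYTES])
    {
        unsigned char *buf[2] = { l1_ping, l1_pong };
        size_t ntiles = total / TILE_BYTES;
        if (ntiles == 0)
            return;

        dma_handle_t pending = dma_start(buf[0], l2_buf, TILE_BYTES);
        for (size_t t = 0; t < ntiles; t++) {
            dma_wait(pending);                    /* tile t now in L1   */
            if (t + 1 < ntiles)                   /* prefetch tile t+1  */
                pending = dma_start(buf[(t + 1) & 1],
                                    l2_buf + (t + 1) * TILE_BYTES,
                                    TILE_BYTES);
            compute_tile(buf[t & 1], TILE_BYTES); /* overlaps with DMA  */
        }
    }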

Anand Raghunathan - One of the best experts on this subject based on the ideXlab platform.

  • Energy-Efficient All-Spin Cache Hierarchy Using Shift-Based Writes and Multilevel Storage
    ACM Journal on Emerging Technologies in Computing Systems, 2015
    Co-Authors: Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, Anand Raghunathan
    Abstract:

    Spintronic memories are considered promising candidates for future on-chip memories due to their high density, nonvolatility, and near-zero leakage. However, they also face challenges such as high write energy and latency, and limited read speed due to single-ended sensing. Further, the conflicting requirements of read and write operations lead to stringent design constraints that severely compromise their benefits. Recently, domain wall memory (DWM) was proposed as a spintronic memory with the potential for very high density, storing multiple bits in the domains of a ferromagnetic nanowire. While reliable operation of DWM with multiple domains faces many challenges, single-bit cells that utilize domain wall motion for writes have been experimentally demonstrated [Fukami et al. 2009]. This bit-cell, which we refer to as Domain Wall Memory with Shift-based Write (DWM-SW), achieves improved write efficiency and features decoupled read-write paths, enabling independent optimization of read and write operations. However, these benefits come at the cost of sacrificing the original goal of improved density. In this work, we explore multilevel storage as a new direction to enhance the density benefits of DWM-SW. At the device level, we propose a new device, multilevel DWM with shift-based write (ML-DWM-SW), that is capable of storing 2 bits in a single device. At the circuit level, we propose an ML-DWM-SW-based bit-cell design and layout. The ML-DWM-SW bit-cell incurs no additional area overhead compared to the DWM-SW bit-cell despite storing an additional bit, thereby achieving roughly twice the density. However, it requires a two-step write operation and has data-dependent read and write energies, which pose unique challenges. To address these issues, we propose suitable architectural optimizations: (i) intra-word interleaving and (ii) bit encoding. We design “all-spin” cache architectures using the proposed ML-DWM-SW bit-cell for both general-purpose processors and general-purpose graphics processing units (GPGPUs). We perform an iso-capacity replacement of SRAM with spintronic memories and study the energy and area benefits at iso-performance conditions. For general-purpose processors, the ML-DWM-SW cache achieves a 10X reduction in energy and a 4.4X reduction in cache area compared to an SRAM cache, and 2X and 1.7X reductions in energy and area, respectively, compared to an STT-MRAM cache. For GPGPUs, the ML-DWM-SW cache achieves a 5.3X reduction in energy and a 3.6X area reduction compared to SRAM, and a 3.5X energy reduction and a 1.9X area reduction compared to STT-MRAM.
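
    The abstract does not detail the bit-encoding scheme, so the sketch below shows one generic form such an optimization can take: flip-encoding, where a word is stored complemented (plus a one-bit flag) whenever that reduces the number of write-expensive 2-bit symbols. The assumption that state 0b11 is the costly one is illustrative, not from the paper.

    #include <stdint.h>

    /* Toy model: each 2-bit symbol of a 32-bit word maps to one
     * multilevel cell, and we assume (illustratively) that symbol 0b11
     * is the most expensive state to write. */
    static int count_expensive(uint32_t w) {
        int n = 0;
        for (int i = 0; i < 32; i += 2)
            if (((w >> i) & 0x3u) == 0x3u) n++;
        return n;
    }

    /* Store the complemented word, with a one-bit flag, whenever the
     * complement has fewer expensive symbols (0b11 becomes 0b00). */
    uint32_t encode_word(uint32_t w, int *flip) {
        *flip = count_expensive(~w) < count_expensive(w);
        return *flip ? ~w : w;
    }

    uint32_t decode_word(uint32_t stored, int flip) {
        return flip ? ~stored : stored;
    }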

  • STAG: Spintronic-Tape Architecture for GPGPU Cache Hierarchies
    International Symposium on Computer Architecture, 2014
    Co-Authors: Rangharajan Venkatesan, Kaushik Roy, Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Anand Raghunathan
    Abstract:

    General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge: the bits must be accessed sequentially by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift-aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).
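
    A small cost model makes the shift problem and two of these techniques concrete. The sketch below is an illustrative model, not STAG's implementation: each tape tracks its head position, an access costs the shift distance, a head-management policy pre-positions the head, and bit-interleaving moves all heads of a cluster in lockstep so a block access pays one shift distance rather than a serial walk.

    #include <stdlib.h>

    /* Toy DWM tape model: 64 domains per tape, one read/write head.
     * Accessing domain p costs |p - head| shift operations. */
    typedef struct {
        int head;       /* current head position      */
        long shifts;    /* accumulated shift count    */
    } tape_t;

    static void tape_access(tape_t *t, int pos) {
        t->shifts += labs((long)(pos - t->head));
        t->head = pos;
    }

    /* Head-management policy (illustrative): after servicing an access,
     * pre-shift the head to a predicted next position, e.g. the middle
     * of the tape, so a random next access costs at most half the tape. */
    static void tape_preshift(tape_t *t, int predicted) {
        t->shifts += labs((long)(predicted - t->head));
        t->head = predicted;
    }

    /* Bit-interleaved cluster: the 64 bits of a cache block are spread
     * across 64 tapes (bit i on tape i), so a block access moves every
     * head by the same distance in parallel -- one shift latency, not
     * a 64-step serial walk along a single tape. */
    void cluster_access_block(tape_t tapes[64], int block_index) {
        for (int i = 0; i < 64; i++)
            tape_access(&tapes[i], block_index);
    }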

  • DWM-TAPESTRI: An Energy-Efficient All-Spin Cache Using Domain Wall Shift-Based Writes
    Design Automation and Test in Europe, 2013
    Co-Authors: Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, Anand Raghunathan
    Abstract:

    Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach, shift-based writes, that offers a fast and energy-efficient alternative for performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift-based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the shift latency inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves an 8.2X improvement in energy and a 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around a 1.6X improvement in both area and energy under iso-performance conditions.

Alessio Burrello - One of the best experts on this subject based on the ideXlab platform.

  • DORY: Lightweight Memory Hierarchy Management for Deep NN Inference on IoT Endnodes (Work in Progress)
    International Conference on Hardware Software Codesign and System Synthesis (CODES+ISSS), 2019
    Co-Authors: Alessio Burrello, Francesco Conti, Angelo Garofalo, Davide Rossi, Luca Benini
    Abstract:

    IoT endnodes often couple a small and fast L1 scratchpad memory with a higher-capacity but lower-bandwidth and slower L2 background memory. The absence of a coherent hardware cache hierarchy saves energy but comes at the cost of labor-intensive explicit memory management, complicating the deployment of algorithms with a large data memory footprint, such as Deep Neural Network (DNN) inference. In this work, we present DORY, a lightweight software cache dedicated to DNN Deployment Oriented to memoRY. DORY leverages static data tiling and DMA-based double buffering to hide the complexity of manual L1-L2 memory traffic management. DORY enables storage of activations and weights in L2 with less than 4% performance overhead with respect to direct execution in L1. We show that a 142 kB DNN achieving 79.9% accuracy on CIFAR-10 runs 3.2x faster than its execution directly from L2 memory while consuming 1.9x less energy.

Bin Bao - One of the best experts on this subject based on the ideXlab platform.

  • Defensive Loop Tiling for Shared Cache
    Symposium on Code Generation and Optimization, 2013
    Co-Authors: Bin Bao, Chen Ding
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. The cache space available in the shared cache changes depending on the co-run applications. Furthermore, on machines with an inclusive cache hierarchy, interference in the shared cache can cause evictions in the private caches, a problem known as inclusion victims. This paper presents defensive tiling, a set of compiler techniques to estimate the effect of cache sharing and then choose tile sizes that provide robust performance in co-run environments. The goal of the transformation is to optimize the use of the cache while at the same time guarding against interference. It is an entirely static technique and does not require program profiling. The paper shows how it can be integrated into a production-quality compiler and evaluates its effect on a set of tiling benchmarks for both co-run and solo-run performance, using both simulation and testing on real systems.

  • Defensive Loop Tiling for Multi-core Processor
    Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, 2012
    Co-Authors: Bin Bao, Xiaoya Xiang
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. In this paper, we show that cache sharing requires special types of tiling depending on the co-run programs. We analyze the reasons for the performance differences and give a defensive strategy that performs consistently at or near the best. For example, when compared with conservative tiling, which tiles for the private cache, the performance of defensive tiling is similar in solo-runs but up to 20% higher in program co-runs, when tested on an Intel multicore processor.
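
    The strategies differ mainly in which capacity the tile-size formula targets. A minimal sketch, with assumed cache sizes that are not from the paper: conservative tiling sizes the working set for the private cache alone, aggressive tiling for the whole LLC, and a defensive choice for the private cache plus one core's share of the LLC.

    #include <stddef.h>
    #include <math.h>

    /* Illustrative capacities (bytes), not measurements from the paper. */
    #define L2_PRIVATE (256u * 1024)
    #define LLC_SHARED (8u * 1024 * 1024)
    #define CORES      4u

    /* Largest square tile edge such that three double-precision tiles
     * (e.g. the A, B, C blocks of a tiled matrix multiply) fit in the
     * targeted capacity: 3 * t * t * sizeof(double) <= cap. */
    static size_t tile_edge(size_t cap_bytes) {
        return (size_t)sqrt((double)cap_bytes / (3.0 * sizeof(double)));
    }

    size_t tile_conservative(void) { return tile_edge(L2_PRIVATE); }  /* private cache only */
    size_t tile_aggressive(void)   { return tile_edge(LLC_SHARED); }  /* whole shared LLC   */
    size_t tile_defensive(void)    { return tile_edge(L2_PRIVATE + LLC_SHARED / CORES); }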