Cache Hierarchy

The experts below are selected from a list of 7,761 experts worldwide, ranked by the ideXlab platform.

Chen Ding - One of the best experts on this subject based on the ideXlab platform.

  • Rethinking a Heap Hierarchy as a Cache Hierarchy: A Higher-Order Theory of Memory Demand (HOTM)
    International Symposium on Memory Management, 2016
    Co-Authors: Hao Luo, Chen Ding
    Abstract:

    Modern memory allocators divide the available memory between different threads and object size classes. They use many parameters that are interrelated and mutually affecting. Existing solutions are based on heuristics, which cannot serve all applications equally well. This paper presents a theory of memory demand. The theory enables the global optimization of heap parameters for an application. The paper evaluates the theory and the optimization using multi-threaded micro-benchmarks as well as real applications, including Apache, the Ghostscript interpreter, and a database benchmarking tool, and shows that the global optimization theoretically outperforms three typical heuristics by 15% to 113%.
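
    As a toy illustration of how a demand theory can replace heuristics (a sketch, not the paper's HOTM formalism): compute each size class's memory demand, i.e. its peak live bytes over an allocation trace, and size the classes from those peaks instead of using a fixed heuristic split. The event format and the size-class mapping below are illustrative assumptions.

    #include <stddef.h>

    /* One trace event: positive size = allocation, negative = free of a
     * block of that size. Illustrative, minimal model. */
    typedef struct { long bytes; } event_t;

    #define NCLASSES 8

    /* Map a request size to a power-of-two size class (16, 32, ... bytes). */
    static int size_class(long bytes) {
        long sz = bytes < 0 ? -bytes : bytes;
        int c = 0;
        while ((16L << c) < sz && c < NCLASSES - 1) c++;
        return c;
    }

    /* Peak live bytes per class over the trace: the class's "memory
     * demand". Sizing each class to its peak is the trace-optimal split,
     * in contrast to dividing the heap by a fixed heuristic ratio. */
    void demand_per_class(const event_t *trace, size_t n, long peak[NCLASSES]) {
        long live[NCLASSES] = {0};
        for (int c = 0; c < NCLASSES; c++) peak[c] = 0;
        for (size_t i = 0; i < n; i++) {
            int c = size_class(trace[i].bytes);
            live[c] += trace[i].bytes;
            if (live[c] > peak[c]) peak[c] = live[c];
        }
    }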

  • Defensive Loop Tiling for Shared Cache
    Symposium on Code Generation and Optimization, 2013
    Co-Authors: Bin Bao, Chen Ding
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. The cache space available in the shared cache changes depending on the co-run applications. Furthermore, on machines with an inclusive cache hierarchy, interference in the shared cache can cause evictions in the private caches, a problem known as inclusion victims. This paper presents defensive tiling, a set of compiler techniques to estimate the effect of cache sharing and then choose tile sizes that provide robust performance in co-run environments. The goal of the transformation is to optimize the use of the cache while at the same time guarding against interference. It is an entirely static technique and does not require program profiling. The paper shows how it can be integrated into a production-quality compiler and evaluates its effect on a set of tiling benchmarks for both co-run and solo-run performance, using both simulation and testing on real systems.
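
    The defensive choice can be made concrete with a hand-tiled matrix multiply. The sketch below is not the paper's compiler transformation; it is a minimal C kernel in which the tile size is derived from an assumed per-core share of the LLC rather than from the whole cache, so the working set stays resident even when co-runners compete for the shared levels.

    #include <stddef.h>

    /* Assumed machine parameters (illustrative, not from the paper):
     * an 8 MB inclusive LLC shared by 4 cores. A conventional tiling
     * would target the whole LLC; a defensive tiling targets only the
     * share this core can rely on when co-run applications compete. */
    #define LLC_BYTES   (8u * 1024 * 1024)
    #define CORES       4u
    #define SAFE_BYTES  (LLC_BYTES / CORES)          /* defensive share */

    /* Three double tiles (A, B, C blocks) must fit in the safe share:
     * 3 * TILE * TILE * sizeof(double) <= SAFE_BYTES. For 2 MB this
     * gives TILE of about 295; rounding down to a multiple of 8 keeps
     * tile rows cache-line aligned. */
    #define TILE 288

    /* C = C + A * B, all n x n, row-major, hand-tiled. */
    void matmul_tiled(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    /* Mini-kernel over one tile triple. */
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }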

Luca Benini - One of the best experts on this subject based on the ideXlab platform.

  • DORY: Lightweight Memory Hierarchy Management for Deep NN Inference on IoT Endnodes (Work in Progress)
    International Conference on Hardware Software Codesign and System Synthesis (CODES+ISSS), 2019
    Co-Authors: Alessio Burrello, Francesco Conti, Angelo Garofalo, Davide Rossi, Luca Benini
    Abstract:

    IoT endnodes often couple a small and fast L1 scratchpad memory with a higher-capacity but lower-bandwidth and slower L2 background memory. The absence of a coherent hardware cache hierarchy saves energy but comes at the cost of labor-intensive explicit memory management, complicating the deployment of algorithms with a large data memory footprint, such as Deep Neural Network (DNN) inference. In this work, we present DORY, a lightweight software cache dedicated to DNN Deployment Oriented to memoRY. DORY leverages static data tiling and DMA-based double buffering to hide the complexity of manual L1-L2 memory traffic management. DORY enables storage of activations and weights in L2 with less than 4% performance overhead with respect to direct execution in L1. We show that a 142 kB DNN achieving 79.9% accuracy on CIFAR-10 runs 3.2x faster than its execution directly from L2 memory while consuming 1.9x less energy.
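
    The core mechanism, static tiling plus DMA double buffering, can be sketched in a few lines of C. The dma_start/dma_wait calls below are hypothetical stand-ins for a platform's asynchronous DMA driver (they are not DORY's actual interface); the point is the ping-pong schedule that overlaps the transfer of tile t+1 with the computation on tile t.

    #include <stddef.h>

    #define TILE_BYTES 4096u   /* illustrative L1 tile size */

    /* Hypothetical asynchronous DMA primitives; on a real endnode these
     * would be the platform's DMA driver calls, not this exact API. */
    typedef int dma_handle_t;
    dma_handle_t dma_start(void *dst, const void *src, size_t bytes);
    void dma_wait(dma_handle_t h);

    /* Placeholder for the per-tile DNN kernel. */
    void compute_tile(const unsigned char *tile, size_t bytes);

    /* Stream a large L2-resident buffer through two L1 ping-pong tiles,
     * overlapping the DMA of tile t+1 with the computation on tile t. */
    void process_double_buffered(const unsigned char *l2_buf, size_t total,
                                 unsigned char l1_ping[TILE_BYTES],
                                 unsigned char l1_pong[TILE_BYTES])
    {
        unsigned char *buf[2] = { l1_ping, l1_pong };
        size_t ntiles = total / TILE_BYTES;
        if (ntiles == 0)
            return;

        dma_handle_t pending = dma_start(buf[0], l2_buf, TILE_BYTES);
        for (size_t t = 0; t < ntiles; t++) {
            dma_wait(pending);                    /* tile t now in L1   */
            if (t + 1 < ntiles)                   /* prefetch tile t+1  */
                pending = dma_start(buf[(t + 1) & 1],
                                    l2_buf + (t + 1) * TILE_BYTES,
                                    TILE_BYTES);
            compute_tile(buf[t & 1], TILE_BYTES); /* overlaps with DMA  */
        }
    }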

Anand Raghunathan - One of the best experts on this subject based on the ideXlab platform.

  • Energy-Efficient All-Spin Cache Hierarchy Using Shift-Based Writes and Multilevel Storage
    ACM Journal on Emerging Technologies in Computing Systems, 2015
    Co-Authors: Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, Anand Raghunathan
    Abstract:

    Spintronic memories are considered promising candidates for future on-chip memories due to their high density, nonvolatility, and near-zero leakage. However, they also face challenges such as high write energy and latency, and limited read speed due to single-ended sensing. Further, the conflicting requirements of read and write operations lead to stringent design constraints that severely compromise their benefits. Recently, domain wall memory (DWM) was proposed as a spintronic memory with the potential for very high density, storing multiple bits in the domains of a ferromagnetic nanowire. While reliable operation of DWM with multiple domains faces many challenges, single-bit cells that utilize domain wall motion for writes have been experimentally demonstrated [Fukami et al. 2009]. This bit-cell, which we refer to as Domain Wall Memory with Shift-based Write (DWM-SW), achieves improved write efficiency and features decoupled read-write paths, enabling independent optimization of read and write operations. However, these benefits come at the cost of sacrificing the original goal of improved density. In this work, we explore multilevel storage as a new direction to enhance the density benefits of DWM-SW. At the device level, we propose a new device, multilevel DWM with shift-based write (ML-DWM-SW), that is capable of storing 2 bits in a single device. At the circuit level, we propose an ML-DWM-SW-based bit-cell design and layout. The ML-DWM-SW bit-cell incurs no additional area overhead compared to the DWM-SW bit-cell despite storing an additional bit, thereby achieving roughly twice the density. However, it requires a two-step write operation and has data-dependent read and write energies, which pose unique challenges. To address these issues, we propose suitable architectural optimizations: (i) intra-word interleaving and (ii) bit encoding. We design “all-spin” cache architectures using the proposed ML-DWM-SW bit-cell for both general-purpose processors and general-purpose graphics processing units (GPGPUs). We perform an iso-capacity replacement of SRAM with spintronic memories and study the energy and area benefits at iso-performance conditions. For general-purpose processors, the ML-DWM-SW cache achieves a 10X reduction in energy and a 4.4X reduction in cache area compared to an SRAM cache, and 2X and 1.7X reductions in energy and area, respectively, compared to an STT-MRAM cache. For GPGPUs, the ML-DWM-SW cache achieves a 5.3X reduction in energy and a 3.6X area reduction compared to SRAM, and a 3.5X energy reduction and a 1.9X area reduction compared to STT-MRAM.
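
    The abstract does not detail the bit-encoding scheme, so the sketch below shows one generic form such an optimization can take: flip-encoding, where a word is stored complemented (plus a one-bit flag) whenever that reduces the number of write-expensive 2-bit symbols. The assumption that state 0b11 is the costly one is illustrative, not from the paper.

    #include <stdint.h>

    /* Toy model: each 2-bit symbol of a 32-bit word maps to one
     * multilevel cell, and we assume (illustratively) that symbol 0b11
     * is the most expensive state to write. */
    static int count_expensive(uint32_t w) {
        int n = 0;
        for (int i = 0; i < 32; i += 2)
            if (((w >> i) & 0x3u) == 0x3u) n++;
        return n;
    }

    /* Store the complemented word, with a one-bit flag, whenever the
     * complement has fewer expensive symbols (0b11 becomes 0b00). */
    uint32_t encode_word(uint32_t w, int *flip) {
        *flip = count_expensive(~w) < count_expensive(w);
        return *flip ? ~w : w;
    }

    uint32_t decode_word(uint32_t stored, int flip) {
        return flip ? ~stored : stored;
    }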

  • STAG: Spintronic-Tape Architecture for GPGPU Cache Hierarchies
    International Symposium on Computer Architecture, 2014
    Co-Authors: Rangharajan Venkatesan, Kaushik Roy, Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Anand Raghunathan
    Abstract:

    General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge: the bits must be accessed sequentially by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift-aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).
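
    A small cost model makes the shift problem and two of these techniques concrete. The sketch below is an illustrative model, not STAG's implementation: each tape tracks its head position, an access costs the shift distance, a head-management policy pre-positions the head, and bit-interleaving moves all heads of a cluster in lockstep so a block access pays one shift distance rather than a serial walk.

    #include <stdlib.h>

    /* Toy DWM tape model: 64 domains per tape, one read/write head.
     * Accessing domain p costs |p - head| shift operations. */
    typedef struct {
        int head;       /* current head position      */
        long shifts;    /* accumulated shift count    */
    } tape_t;

    static void tape_access(tape_t *t, int pos) {
        t->shifts += labs((long)(pos - t->head));
        t->head = pos;
    }

    /* Head-management policy (illustrative): after servicing an access,
     * pre-shift the head to a predicted next position, e.g. the middle
     * of the tape, so a random next access costs at most half the tape. */
    static void tape_preshift(tape_t *t, int predicted) {
        t->shifts += labs((long)(predicted - t->head));
        t->head = predicted;
    }

    /* Bit-interleaved cluster: the 64 bits of a cache block are spread
     * across 64 tapes (bit i on tape i), so a block access moves every
     * head by the same distance in parallel -- one shift latency, not
     * a 64-step serial walk along a single tape. */
    void cluster_access_block(tape_t tapes[64], int block_index) {
        for (int i = 0; i < 64; i++)
            tape_access(&tapes[i], block_index);
    }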

  • DWM-TAPESTRI: An Energy-Efficient All-Spin Cache Using Domain Wall Shift-Based Writes
    Design Automation and Test in Europe, 2013
    Co-Authors: Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, Anand Raghunathan
    Abstract:

    Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach, shift-based writes, that offers a fast and energy-efficient alternative for performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift-based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the shift latency inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves an 8.2X improvement in energy and a 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around a 1.6X improvement in both area and energy under iso-performance conditions.

Alessio Burrello - One of the best experts on this subject based on the ideXlab platform.

  • DORY: Lightweight Memory Hierarchy Management for Deep NN Inference on IoT Endnodes (Work in Progress)
    International Conference on Hardware Software Codesign and System Synthesis (CODES+ISSS), 2019
    Co-Authors: Alessio Burrello, Francesco Conti, Angelo Garofalo, Davide Rossi, Luca Benini
    Abstract:

    IoT endnodes often couple a small and fast L1 scratchpad memory with a higher-capacity but lower-bandwidth and slower L2 background memory. The absence of a coherent hardware cache hierarchy saves energy but comes at the cost of labor-intensive explicit memory management, complicating the deployment of algorithms with a large data memory footprint, such as Deep Neural Network (DNN) inference. In this work, we present DORY, a lightweight software cache dedicated to DNN Deployment Oriented to memoRY. DORY leverages static data tiling and DMA-based double buffering to hide the complexity of manual L1-L2 memory traffic management. DORY enables storage of activations and weights in L2 with less than 4% performance overhead with respect to direct execution in L1. We show that a 142 kB DNN achieving 79.9% accuracy on CIFAR-10 runs 3.2x faster than its execution directly from L2 memory while consuming 1.9x less energy.

Bin Bao - One of the best experts on this subject based on the ideXlab platform.

  • Defensive Loop Tiling for Shared Cache
    Symposium on Code Generation and Optimization, 2013
    Co-Authors: Bin Bao, Chen Ding
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. The cache space available in the shared cache changes depending on the co-run applications. Furthermore, on machines with an inclusive cache hierarchy, interference in the shared cache can cause evictions in the private caches, a problem known as inclusion victims. This paper presents defensive tiling, a set of compiler techniques to estimate the effect of cache sharing and then choose tile sizes that provide robust performance in co-run environments. The goal of the transformation is to optimize the use of the cache while at the same time guarding against interference. It is an entirely static technique and does not require program profiling. The paper shows how it can be integrated into a production-quality compiler and evaluates its effect on a set of tiling benchmarks for both co-run and solo-run performance, using both simulation and testing on real systems.

  • Defensive Loop Tiling for Multi-core Processor
    Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, 2012
    Co-Authors: Bin Bao, Xiaoya Xiang
    Abstract:

    Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. In this paper, we show that cache sharing requires special types of tiling depending on the co-run programs. We analyze the reasons for the performance differences and give a defensive strategy that performs consistently at or near the best. For example, when compared with conservative tiling, which tiles for the private cache, the performance of defensive tiling is similar in solo-runs but up to 20% higher in program co-runs, when tested on an Intel multicore processor.
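
    The strategies differ mainly in which capacity the tile-size formula targets. A minimal sketch, with assumed cache sizes that are not from the paper: conservative tiling sizes the working set for the private cache alone, aggressive tiling for the whole LLC, and a defensive choice for the private cache plus one core's share of the LLC.

    #include <stddef.h>
    #include <math.h>

    /* Illustrative capacities (bytes), not measurements from the paper. */
    #define L2_PRIVATE (256u * 1024)
    #define LLC_SHARED (8u * 1024 * 1024)
    #define CORES      4u

    /* Largest square tile edge such that three double-precision tiles
     * (e.g. the A, B, C blocks of a tiled matrix multiply) fit in the
     * targeted capacity: 3 * t * t * sizeof(double) <= cap. */
    static size_t tile_edge(size_t cap_bytes) {
        return (size_t)sqrt((double)cap_bytes / (3.0 * sizeof(double)));
    }

    size_t tile_conservative(void) { return tile_edge(L2_PRIVATE); }  /* private cache only */
    size_t tile_aggressive(void)   { return tile_edge(LLC_SHARED); }  /* whole shared LLC   */
    size_t tile_defensive(void)    { return tile_edge(L2_PRIVATE + LLC_SHARED / CORES); }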