Fully Associative Cache

The Experts below are selected from a list of 576 Experts worldwide ranked by ideXlab platform

Jae W. Lee - One of the best experts on this subject based on the ideXlab platform.

  • ISCA - A fully associative, tagless DRAM cache
    Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15, 2015
    Co-Authors: Yong-jun Lee, Jong-won Kim, Hakbeom Jang, Hyunggyun Yang, Jangwoo Kim, Jinkyu Jeong, Jae W. Lee
    Abstract:

    This paper introduces a tagless cache architecture for large in-package DRAM caches. The conventional die-stacked DRAM cache has both a TLB and a cache tag array, which are responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with the OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce the cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and the cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency, since a TLB access immediately returns the exact location of the requested block in the cache, hence saving a tag-checking operation. The remaining cache space is used as a victim cache for memory pages that are recently evicted from the cTLB. By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to a baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, owing to low hit latency and zero energy waste on cache tags.
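
As a rough illustration of the access flow described in this abstract, the Python sketch below models a cache-map TLB for a page-granularity DRAM cache: a translation returns a DRAM-cache location directly, and the miss handler allocates a frame and installs the virtual-to-cache mapping. All names (`TaglessDramCache`, `translate`, `PAGE_SIZE`) are hypothetical and eviction/victim handling is omitted; this is a sketch, not the authors' implementation.

```python
# Illustrative model of a cache-map TLB (cTLB) for a tagless DRAM cache.
# Hypothetical names; only the lookup/allocation flow from the abstract is shown.

PAGE_SIZE = 4096  # caching granularity aligned with the OS page size

class TaglessDramCache:
    def __init__(self, num_page_frames):
        self.free_frames = list(range(num_page_frames))
        self.ctlb = {}        # virtual page number -> DRAM-cache frame (no tag array)
        self.page_table = {}  # virtual page number -> DRAM-cache frame (backs the cTLB)

    def translate(self, vaddr):
        """Return the DRAM-cache address for vaddr, allocating on a cTLB miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        frame = self.ctlb.get(vpn)
        if frame is None:                       # cTLB miss: run the miss handler
            frame = self.page_table.get(vpn)
            if frame is None:                   # page not cached yet: allocate a frame
                frame = self.free_frames.pop()  # eviction into the victim region omitted
                self.page_table[vpn] = frame
            self.ctlb[vpn] = frame              # install the virtual-to-cache mapping
        # A cTLB hit yields the block's cache location directly: no tag check is needed.
        return frame * PAGE_SIZE + offset
```

Because translation returns a cache location rather than a physical address, a hit within the TLB reach needs no separate tag lookup, which is where the latency and energy savings come from.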

Yong-jun Lee - One of the best experts on this subject based on the ideXlab platform.

  • ISCA - A fully associative, tagless DRAM cache
    Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15, 2015
    Co-Authors: Yong-jun Lee, Jong-won Kim, Hakbeom Jang, Hyunggyun Yang, Jangwoo Kim, Jinkyu Jeong, Jae W. Lee
    Abstract:

    This paper introduces a tagless cache architecture for large in-package DRAM caches. The conventional die-stacked DRAM cache has both a TLB and a cache tag array, which are responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with the OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce the cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and the cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency, since a TLB access immediately returns the exact location of the requested block in the cache, hence saving a tag-checking operation. The remaining cache space is used as a victim cache for memory pages that are recently evicted from the cTLB. By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to a baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, owing to low hit latency and zero energy waste on cache tags.

Anup Dandapat - One of the best experts on this subject based on the ideXlab platform.

  • A 9-T 833-MHz 1.72-fJ/Bit/Search Quasi-Static Ternary Fully Associative Cache Tag With Selective Matchline Evaluation for Wire Speed Applications
    IEEE Transactions on Circuits and Systems, 2016
    Co-Authors: Sandeep Mishra, Telajala Venkata Mahendra, Anup Dandapat
    Abstract:

    The hardware search engine (HSE) plays a major role in speeding up search operations in wireless applications. Ternary content-addressable memory (TCAM) is such an engine, performing a search in a single clock cycle, but its separate content and mask storage, multiple wordlines for read/mask/write, and decoupled data/search lines require substantial design area and consume relatively high power. This article proposes the implementation of a state-of-the-art, energy-efficient, quasi-static ternary fully associative cache tag for wire-speed memory access. A 4-T static content storage and a dynamic mask storage are used with coupled data and search lines to reduce energy dissipation during search. The proposed 128 × 32-bit TCAM tag with a selective matchline evaluation scheme has been implemented in a predictive 45-nm CMOS process and simulated in SPECTRE at a supply voltage of 1.0 V. The design dissipates 1.72 fJ/bit/search and reduces cell area by 32% compared to a traditional TCAM.
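
A ternary CAM matches a search key against every stored entry in parallel, with masked bit positions treated as "don't care". The sketch below models only that matching behavior; it says nothing about the 9-T cell, the 4-T storage, or the selective matchline evaluation circuit, and all names are illustrative.

```python
# Behavioral sketch of a ternary CAM tag: each entry stores a content word and a
# care mask; masked-off bits are "don't care" during the fully associative search.
# Circuit-level details (9-T cell, selective matchline evaluation) are not modeled.

class TernaryCamTag:
    def __init__(self, width=32):
        self.width = width
        self.entries = []  # (content, care_mask) pairs; mask bit 1 = compare this bit

    def write(self, content, care_mask):
        self.entries.append((content & care_mask, care_mask))

    def search(self, key):
        """Return indices of all entries whose unmasked bits equal the key's bits."""
        return [i for i, (content, mask) in enumerate(self.entries)
                if (key & mask) == content]

tag = TernaryCamTag()
tag.write(0b1010, 0b1110)          # lowest bit is masked ("don't care")
assert tag.search(0b1011) == [0]   # matches despite the differing low bit
```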

Norman P Jouppi - One of the best experts on this subject based on the ideXlab platform.

  • ISCA - Cache write policies and performance
    Proceedings of the 20th annual international symposium on Computer architecture - ISCA '93, 1993
    Co-Authors: Norman P Jouppi
    Abstract:

    This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before the hit or miss outcome is known are considered. Depending on the combination of these policies chosen, the overall cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching, is proposed. Write caching places a small fully associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.
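
The "write cache" idea in this abstract, a small fully associative buffer behind a write-through cache, can be sketched as below: repeated writes to the same line coalesce, so only evictions generate downstream traffic. The entry count and the LRU-style replacement are assumptions, not details from the paper.

```python
# Sketch of write caching: a small fully associative write cache behind a
# write-through cache coalesces repeated writes to the same line, so only
# evicted lines produce traffic to the next level.  Sizes and the LRU-style
# replacement policy are illustrative assumptions.

from collections import OrderedDict

class WriteCache:
    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        self.lines = OrderedDict()        # line address -> most recent data
        self.writes_to_next_level = 0     # traffic that actually leaves the write cache

    def write(self, line_addr, data):
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)   # coalesce: no new downstream write
        elif len(self.lines) >= self.num_entries:
            self.lines.popitem(last=False)      # evict the least recently written line
            self.writes_to_next_level += 1
        self.lines[line_addr] = data
```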

  • Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
    International Symposium on Computer Architecture, 1990
    Co-Authors: Norman P Jouppi
    Abstract:

    Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully associative cache with the victim of a miss rather than the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
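
The victim-caching scheme lends itself to a compact behavioral model: a direct-mapped cache whose evicted block goes into a small fully associative victim cache, which is probed on every L1 miss. The sketch below is illustrative only; the sizes, the LRU replacement, and the `access()` interface are assumptions rather than details from the paper.

```python
# Behavioral sketch of a direct-mapped cache backed by a small fully associative
# victim cache: on an L1 miss the victim cache is probed, and the block evicted
# from L1 (the "victim") is placed into it.  Sizes and interfaces are illustrative.

from collections import OrderedDict

class DirectMappedWithVictimCache:
    def __init__(self, l1_lines=256, line_size=32, victim_entries=4):
        self.l1_lines = l1_lines
        self.line_size = line_size
        self.l1 = [None] * l1_lines       # index -> tag of the resident line
        self.victim = OrderedDict()       # full line address -> True (LRU order)
        self.victim_entries = victim_entries

    def access(self, addr):
        """Return 'l1_hit', 'victim_hit', or 'miss', updating both structures."""
        line_addr = addr // self.line_size
        index = line_addr % self.l1_lines
        tag = line_addr // self.l1_lines

        if self.l1[index] == tag:
            return "l1_hit"

        evicted_tag = self.l1[index]
        self.l1[index] = tag                       # fill L1 with the requested line

        if line_addr in self.victim:               # conflict miss caught by the victim cache
            del self.victim[line_addr]
            result = "victim_hit"
        else:
            result = "miss"                        # must go to the next memory level

        if evicted_tag is not None:                # store the victim, not the requested line
            victim_line = evicted_tag * self.l1_lines + index
            self.victim[victim_line] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)    # discard the LRU victim entry
        return result
```

In a trace-driven run, lines that ping-pong between two addresses mapping to the same L1 index show up as `victim_hit` instead of full misses, which is exactly the conflict-miss reduction the paper measures.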

Wei Zhang - One of the best experts on this subject based on the ideXlab platform.

  • Exploiting the replication cache to improve cache read bandwidth cost-effectively
    ACM SIGARCH Computer Architecture News, 2006
    Co-Authors: Bramha Allu, Wei Zhang, Mallik Kandala
    Abstract:

    Cache bandwidth and reliability are both of great importance for microprocessor design. Recently, the replication cache has been proposed to enhance data cache reliability against soft errors. The replication cache is a small fully associative cache that stores the replica(s) for every write to the L1 data cache. In addition to enhancing reliability, this paper proposes to use the replication cache to improve the performance of multiple-issue superscalar microprocessors by effectively enlarging the cache read bandwidth. Our experimental results show that exploiting a replication cache with only 8 entries can improve the performance of a 4-issue superscalar microprocessor by 9.4% on average without compromising the enhanced data integrity.
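
The read-bandwidth benefit comes from treating the replication cache as an extra read port: when one of two simultaneous loads finds its data replicated, the pair can issue together. The sketch below is a purely illustrative pairing check; `l1_data` and `replication_cache` are plain dictionaries standing in for the real structures.

```python
# Illustrative "dual load" check: a second load can be serviced from the
# replication cache (which holds replicas of recent L1 writes) instead of
# contending for the single L1 read port.  Dicts stand in for the real arrays.

def issue_load_pair(addr_a, addr_b, l1_data, replication_cache):
    """Try to service two loads in one cycle: one L1 port plus the replication cache."""
    if addr_b in replication_cache:
        return l1_data[addr_a], replication_cache[addr_b], "dual_issue"
    if addr_a in replication_cache:
        return replication_cache[addr_a], l1_data[addr_b], "dual_issue"
    # Neither address has a replica: the two loads share the L1 port over two cycles.
    return l1_data[addr_a], l1_data[addr_b], "serialized"
```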

  • Replication cache: a small fully associative cache to improve data cache reliability
    IEEE Transactions on Computers, 2005
    Co-Authors: Wei Zhang
    Abstract:

    Soft-error-conscious cache design has become increasingly crucial for reliable computing. The widely used ECC- or parity-based integrity-checking techniques have only limited capability for error detection and correction, while incurring a nontrivial penalty in area or performance. The N-modular redundancy (NMR) scheme is too costly for processors with stringent cost constraints. This paper proposes a cost-effective solution to enhance data reliability significantly with minimum impact on performance. The idea is to add a small fully associative cache to store a replica of every write to the L1 data cache. Due to data locality and its full associativity, the replication cache can be kept small while providing replicas for a significant fraction of read hits in L1, which can be used to enhance data integrity against soft errors. Our experiments show that a replication cache with eight blocks can provide replicas for 97.3 percent of read hits in L1 on average. Moreover, compared with the recently proposed in-cache replication schemes, the replication cache is more energy efficient, while significantly improving data integrity against soft errors.
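
The core mechanism, replicating every L1 write into a small fully associative cache and checking the replica on a read hit, can be sketched as below. The class and method names are hypothetical, and deciding which copy to trust on a mismatch (e.g. via parity) is simplified away.

```python
# Sketch of a replication cache for soft-error protection: every L1 write also
# installs a replica in a small fully associative cache, and a read hit that
# finds a replica can compare the two copies to detect corruption.  Names,
# sizes, and the LRU replacement are illustrative assumptions.

from collections import OrderedDict

class ReplicationCache:
    def __init__(self, num_blocks=8):
        self.num_blocks = num_blocks
        self.replicas = OrderedDict()             # block address -> replica data

    def on_l1_write(self, block_addr, data):
        """Replicate every write to the L1 data cache, evicting the LRU replica when full."""
        self.replicas[block_addr] = data
        self.replicas.move_to_end(block_addr)
        if len(self.replicas) > self.num_blocks:
            self.replicas.popitem(last=False)

    def check_on_read_hit(self, block_addr, l1_data):
        """Return (data, protected) for an L1 read hit."""
        replica = self.replicas.get(block_addr)
        if replica is None:
            return l1_data, False                 # no replica: fall back to parity/ECC alone
        if replica != l1_data:
            # Mismatch: a soft error was detected; a real design would use parity
            # to decide which copy is corrupted before correcting.
            return replica, True
        return l1_data, True
```

Because of the write locality noted in the abstract, even eight entries of this structure cover most L1 read hits, which is how the paper reports replicas for 97.3 percent of them.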

  • Exploiting the replication cache to improve performance for multiple-issue microprocessors
    ACM SIGARCH Computer Architecture News, 2005
    Co-Authors: Bramha Allu, Wei Zhang
    Abstract:

    Performance and reliability are both of great importance for microprocessor design. Recently, the replication cache has been proposed to enhance data cache reliability against soft errors. The replication cache is a small fully associative cache that stores a replica for every write to the L1 data cache. In addition to enhancing data reliability, this paper proposes several cost-effective techniques to improve the performance of multiple-issue microprocessors by exploiting the replication cache. The idea is to use the replication cache to increase cache bandwidth through dual load and to reduce the L1 data cache miss rate through partial victim caching. Built upon these two schemes, we also propose a hybrid approach that combines the benefits of both dual load and partial victim caching to improve performance further. Our experimental results show that exploiting a replication cache with only 8 entries can improve performance by 13.0% on average without compromising the enhanced data integrity.

  • ICS - Enhancing data cache reliability by the addition of a small fully-associative replication cache
    Proceedings of the 18th annual international conference on Supercomputing - ICS '04, 2004
    Co-Authors: Wei Zhang
    Abstract:

    Soft-error-conscious cache design is a necessity for reliable computing. The ECC- or parity-based integrity-checking techniques in use today either compromise performance for reliability or vice versa, and the N-modular redundancy (NMR) scheme is too costly for microprocessors and applications with stringent cost constraints. This paper proposes a novel and cost-effective solution to enhance data reliability with minimum impact on performance. The idea is to add a small fully associative cache to store the replica(s) of every write to the L1 data cache. The replicas can be used to detect and correct soft errors. The replication cache can also be used to increase performance by reducing the L1 data cache miss rate. Our experiments show that more than 97% of read hits in the L1 data cache can find replicas available in a replication cache of 8 blocks.
