L2 Cache

The Experts below are selected from a list of 21789 Experts worldwide ranked by ideXlab platform

Ana Sonia Leon - One of the best experts on this subject based on the ideXlab platform.

  • a power efficient high throughput 32 thread sparc processor
    International Solid-State Circuits Conference, 2006
    Co-Authors: Ana Sonia Leon, Jinuk Luke Shin, Francis X Schumacher, K W Tam, W Bryg, P Kongetira, D Weisner, A Strong
    Abstract:

    This first generation of "Niagara" SPARC processors implements a power-efficient Chip Multi-Threading (CMT) architecture which maximizes overall throughput performance for commercial workloads. The target performance is achieved by exploiting high bandwidth rather than high frequency, thereby reducing hardware complexity and power. The UltraSPARC T1 processor combines eight four-threaded 64-b cores, a floating-point unit, a high-bandwidth interconnect crossbar, a shared 3-MB L2 Cache, four DDR2 DRAM interfaces, and a system interface unit. Power and thermal monitoring techniques further enhance CMT performance benefits, increasing overall chip reliability. The 378-mm² die is fabricated in Texas Instruments' 90-nm CMOS technology with nine layers of copper interconnect. The chip contains 279 million transistors and consumes a maximum of 63 W at 1.2 GHz and 1.2 V. Key functional units employ special circuit techniques to provide the high bandwidth required by a CMT architecture while optimizing power and silicon area. These include a highly integrated integer register file, a high-bandwidth interconnect crossbar, the shared L2 Cache, and the IO subsystem. Key aspects of the physical design methodology are also discussed.

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    IEEE Journal of Solid-state Circuits, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.
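
    To put the four-cycle data latency in absolute terms, the short calculation below converts cycles to nanoseconds at the stated 1.4 GHz clock. The helper name `cycles_to_ns` is hypothetical, and the result is simple arithmetic on the abstract's figures, not a number reported by the authors.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds; one cycle at f GHz lasts 1/f ns."""
    return cycles / freq_ghz


# Figures taken from the abstract: four cycles at 1.4 GHz.
latency_ns = cycles_to_ns(4, 1.4)
print(f"{latency_ns:.2f} ns")  # prints "2.86 ns"
```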

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    Custom Integrated Circuits Conference, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.

Jihong Kim - One of the best experts on this subject based on the ideXlab platform.

  • exploiting replicated Cache blocks to reduce L2 Cache leakage in cmps
    IEEE Transactions on Very Large Scale Integration Systems, 2013
    Co-Authors: Hyunhee Kim, Jung Ho Ahn, Jihong Kim
    Abstract:

    Modern chip multiprocessors (CMPs) employ large L2 Caches to reduce the performance gap between processors and off-chip memory. However, as the size of an L2 Cache increases, its leakage power consumption also becomes a major contributor to the total power dissipation. Managing the leakage power of L2 Caches, therefore, is an important issue in realizing low-power CMPs. In CMPs with private L2 Caches, each processor makes a copy of the data in its local Cache in order to access the data faster, which is called replication. In this paper, we propose a novel leakage management technique that dynamically turns off replications in private L2 Caches for leakage power reduction by exploiting two key observations: 1) the cost of an extra Cache miss due to the turned-off replication is small because the same Cache block exists in another on-chip Cache and 2) turning off the replication incurs no extra Cache miss if it is invalidated by other processors in order to maintain Cache coherence. Since blindly turning off the frequently accessed replications can degrade performance, the proposed technique dynamically controls the number of turned-off replications. The proposed technique can be implemented by slightly modifying the MESI protocol with a new turned-off shared (TOS) coherence state. The TOS state indicates that the corresponding block is shared by other Caches but turned off. Experiments on a four-processor CMP with private L2 Caches show that the proposed technique reduces the energy consumption of the L2 Caches and the main memory by 19.4% on average, with less than 1% performance loss over the existing Cache leakage management technique.
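
    One way to picture the turned-off shared (TOS) state is as a small extension of a MESI state machine. The sketch below is illustrative only: the names `State`, `turn_off`, `on_local_read`, and `on_remote_invalidate` are hypothetical, and the transition rules are assumptions inferred from the two observations in the abstract, not the authors' protocol specification.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"
    TOS = "turned-off shared"  # shared by other caches, but this copy is powered down


def turn_off(state):
    """Power down a replica. Assumption: only a shared copy may enter TOS."""
    return State.TOS if state is State.S else state


def on_local_read(state):
    """Return (next_state, data_source) for a read by the local processor."""
    if state is State.TOS:
        # Observation 1: the extra miss is cheap because a peer on-chip cache holds the block.
        return State.S, "peer_cache"
    if state is State.I:
        # Simplified: a full MESI model would choose E when no other cache holds the block.
        return State.S, "next_level"
    return state, "local_hit"


def on_remote_invalidate(state):
    """Observation 2: an invalidated replica would have missed anyway, so TOS costs nothing."""
    return State.I
```

    In this simplified model, a block that goes S, TOS, I never pays for having been turned off, while a block that goes S, TOS, S pays one on-chip transfer.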

  • a leakage aware L2 Cache management technique for producer consumer sharing in low power chip multiprocessors
    Journal of Parallel and Distributed Computing, 2011
    Co-Authors: Hyunhee Kim, Jihong Kim
    Abstract:

    This paper proposes a novel leakage management technique for applications with producer-consumer sharing patterns. Although previous research has proposed leakage management techniques by turning off inactive Cache blocks, these techniques can be further improved by exploiting the various run-time characteristics of target applications in CMPs. By exploiting particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, our technique enables a more aggressive turn-off of L2 Cache blocks of these buffers. Experimental results using a CMP simulator show that our proposed technique reduces the energy consumption of on-chip L2 Caches, a shared bus, and off-chip memory by up to 31.3% over the existing Cache leakage power management techniques with no significant performance loss.

  • low power L2 Cache design for multi core processors
    Electronics Letters, 2010
    Co-Authors: C M Chung, Jihong Kim
    Abstract:

    A low-power set-associative L2 Cache design for a multi-core processor is proposed. Since this way-predicting L2 Cache (WP-L2) predicts a destination way and accesses only the predicted way, it consumes less energy than a conventional set-associative L2 Cache. Exploiting access patterns of an L2 Cache, WP-L2 is based on two prediction logics: a look-ahead buffer (LAB) predicts the next sequential Cache block, and a way-affinity table (WAT) records the way number of the previous L2 Cache access. Combining the logics, WP-L2 predicts correct ways for about 83% of L2 Cache accesses and reduces access latency by about 22% and energy consumption by about 44% compared to the conventional eight-way set-associative L2 Cache.
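
    The interplay of the two predictors can be sketched with a toy model. Everything below is an assumption made for illustration: the class `WayPredictor`, the function `prediction_rate`, the single-entry LAB, the single-entry stand-in for the WAT, and the rule that the LAB takes priority when it matches. The abstract does not specify these details.

```python
class WayPredictor:
    """Toy way predictor combining a look-ahead entry with a last-way record."""

    def __init__(self):
        self.lab = None    # (block, way) for the next sequential block, if it is cached
        self.last_way = 0  # WAT stand-in: way used by the previous access

    def predict(self, block):
        # Prefer the look-ahead entry when the access is the predicted sequential block.
        if self.lab is not None and self.lab[0] == block:
            return self.lab[1]
        return self.last_way

    def update(self, block, way, lookup_way):
        # Record the way just used, then look ahead to the next sequential block.
        self.last_way = way
        next_way = lookup_way(block + 1)
        self.lab = (block + 1, next_way) if next_way is not None else None


def prediction_rate(trace, placement):
    """Fraction of accesses whose way was predicted correctly.

    placement maps block number -> way holding it (all blocks assumed resident).
    """
    predictor = WayPredictor()
    correct = 0
    for block in trace:
        actual = placement[block]
        if predictor.predict(block) == actual:
            correct += 1
        predictor.update(block, actual, placement.get)
    return correct / len(trace)


# Hypothetical placement and access trace.
placement = {0: 3, 1: 5, 2: 1, 10: 1}
print(prediction_rate([0, 1, 2, 10], placement))  # prints 0.75
```

    Only the predicted way is probed on each access, which is where the energy saving would come from; a misprediction falls back to a conventional lookup, which this sketch does not model.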

  • reusability aware Cache memory sharing for chip multiprocessors with private L2 Caches
    Journal of Systems Architecture, 2009
    Co-Authors: Hyunhee Kim, Sungjun Youn, Jihong Kim
    Abstract:

    In this paper, we propose a novel on-chip L2 Cache organization for chip multiprocessors (CMPs) with private L2 Caches. The proposed approach, called reusability-aware Cache sharing (RACS), combines the advantages of both a private L2 Cache and a shared L2 Cache. Since a private L2 Cache organization has a short access latency, the RACS scheme employs a private L2 Cache organization. However, when a Cache block in a private L2 Cache is selected for eviction, RACS first evaluates its reusability. If the block is likely to be reused in the near future, it may be saved to a peer L2 Cache which has space available. In this way, the RACS scheme effectively simulates the larger capacity of a shared L2 Cache. Simulation results show that RACS reduced the number of off-chip memory accesses by 24% compared to a pure private L2 Cache organization on average for the SPLASH 2 multi-threaded benchmarks, and by 16% for multi-programmed benchmarks.
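
    The eviction path described above can be sketched as follows. This is a minimal illustration under stated assumptions: `PeerCache`, `handle_eviction`, the reuse counter, and the `reuse_threshold` parameter are all hypothetical, since the abstract does not say how reusability is measured or how a peer is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class PeerCache:
    """A peer private L2, modeled only by its capacity and resident blocks."""
    capacity: int
    blocks: set = field(default_factory=set)

    def has_space(self):
        return len(self.blocks) < self.capacity


def handle_eviction(block, reuse_count, peers, reuse_threshold=2):
    """Place an evicted block in a peer L2 if it looks reusable.

    Returns the index of the peer that received the block, or None if the
    block leaves the chip. Reusability is approximated by a reuse counter.
    """
    if reuse_count < reuse_threshold:
        return None  # unlikely to be reused: do not spend peer capacity on it
    for index, peer in enumerate(peers):
        if peer.has_space():
            peer.blocks.add(block)
            return index
    return None  # reusable, but no peer has room


# Hypothetical example: the first peer is full, the second has room.
peers = [PeerCache(capacity=1, blocks={0xA}), PeerCache(capacity=2)]
print(handle_eviction(0xB, reuse_count=3, peers=peers))  # prints 1
```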

  • a leakage aware Cache sharing technique for low power chip multi processors cmps with private L2 Caches
    Memory Performance: Dealing With Applications Systems And Architecture, 2008
    Co-Authors: Hyunhee Kim, Sungjun Youn, Jihong Kim
    Abstract:

    Power dissipation has become an important issue in modern microprocessors such as chip multiprocessors (CMPs). As process technology advances below 90 nm, leakage power consumption becomes dominant in the total power dissipation, so reducing it is an important design goal for low-power CMPs. In particular, since most CMPs employ a large L2 Cache, reducing the leakage power consumption of the L2 Cache is critical in realizing low-power CMPs. In this paper, we propose a leakage-aware on-chip L2 Cache organization called LACS. The proposed LACS, like the existing RACS organization, is based on a private L2 Cache organization with an inter-L2 Cache sharing support. However, unlike the RACS organization, which determines a peer L2 Cache block for an inter-L2 Cache sharing based on the reusability of the evicted L2 block and performance implications of peer L2 Cache blocks, the LACS organization considers both the performance and leakage. The LACS organization reduces the leakage power consumption significantly over the leakage-oblivious RACS organization while achieving a similar performance gain over a private L2 Cache organization. Experimental results show that the proposed LACS technique reduces the energy consumption by 23.6% and improves the energy delay product by 18.6% on average over the existing RACS scheme.

Jinuk Luke Shin - One of the best experts on this subject based on the ideXlab platform.

  • a 40nm 16 core 128 thread cmt sparc soc processor
    International Solid-State Circuits Conference, 2010
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Kenway Tam, Dawei Huang, Ha Pham, Changku Hwang, Alan Smith, Timothy Johnson, Francis X Schumacher, David J Greenhill
    Abstract:

    This next generation of Chip Multithreaded (CMT) SPARC SoC processor, code-named Rainbow Falls, doubles on-chip thread count over its predecessor, the UltraSparc T2+. The chip offers high levels of integration and scalability with twice the number of cores, a larger L2 Cache, and higher maximum I/O bandwidth, within the same power envelope. Sixteen 8-threaded enhanced SPARC cores (SPC) provide 128 threads in a single die, delivering the highest thread count for a general-purpose microprocessor. The new Cache coherency further allows up to 4-way glueless systems with a total of 512 threads. Each core communicates with the unified 6MB L2 Cache through a crossbar (CCX) delivering 461GB/s (Fig. 5.2.1). A gasket (CXG) is also introduced to manage the congestion and synchronization of the massive interconnect between the 16 cores and the crossbar. This facilitates a synchronized delay control between any core and any L2 bank for partial core product binning and testing.

  • a power efficient high throughput 32 thread sparc processor
    International Solid-State Circuits Conference, 2006
    Co-Authors: Ana Sonia Leon, Jinuk Luke Shin, Francis X Schumacher, K W Tam, W Bryg, P Kongetira, D Weisner, A Strong
    Abstract:

    This first generation of "Niagara" SPARC processors implements a power-efficient Chip Multi-Threading (CMT) architecture which maximizes overall throughput performance for commercial workloads. The target performance is achieved by exploiting high bandwidth rather than high frequency, thereby reducing hardware complexity and power. The UltraSPARC T1 processor combines eight four-threaded 64-b cores, a floating-point unit, a high-bandwidth interconnect crossbar, a shared 3-MB L2 Cache, four DDR2 DRAM interfaces, and a system interface unit. Power and thermal monitoring techniques further enhance CMT performance benefits, increasing overall chip reliability. The 378-mm² die is fabricated in Texas Instruments' 90-nm CMOS technology with nine layers of copper interconnect. The chip contains 279 million transistors and consumes a maximum of 63 W at 1.2 GHz and 1.2 V. Key functional units employ special circuit techniques to provide the high bandwidth required by a CMT architecture while optimizing power and silicon area. These include a highly integrated integer register file, a high-bandwidth interconnect crossbar, the shared L2 Cache, and the IO subsystem. Key aspects of the physical design methodology are also discussed.

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    IEEE Journal of Solid-state Circuits, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    Custom Integrated Circuits Conference, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.

  • design and implementation of an embedded 512kb level 2 Cache subsystem
    Custom Integrated Circuits Conference, 2004
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Howard L Levy, Jinseung Son, Mandeep Singh, Vikas Mathur, Jungcheng Yeh, Heesung Choi, Vipin Gupta, T Ziaja
    Abstract:

    Dual on-chip 512 kB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13 μm technology. Each 512 kB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of data and tag SRAMs along with datapaths, controller and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64b microprocessors, with a data latency of only 4 cycles including ECC correction for 128-bit data. The design solutions to build this integrated short latency L2 Cache are discussed.

Paul A. Reed - One of the best experts on this subject based on the ideXlab platform.

  • a 250 mhz 5 w powerpc microprocessor with on chip L2 Cache controller
    IEEE Journal of Solid-state Circuits, 1997
    Co-Authors: Gianfranco Gerosa, M. Alexander, Jose Alvarez, C. Croxton, C. Nicoletta, M Daddeo, A R Kennedy, J P Nissen, R Philip, Paul A. Reed
    Abstract:

    This RISC microprocessor is a new, high-performance, PowerPC microprocessor designed specifically for the mobile and high volume desktop personal computer markets. It is an advanced superscalar design with six execution units, aggressive upstream branch processing, out-of-order instruction execution, and a tightly integrated "backside" L2 Cache. This dual-issue engine has a four-stage pipeline with dual 32-kB eight-way set-associative L1 Caches and an integrated L2 controller with on-chip L2 tag supporting up to 1 MB of external SRAM. A thermal assist unit and an instruction Cache throttling mechanism are included for thermal management in mobile applications. A 60X system bus and L2 interface speeds of 100 and 250 MHz are achieved, respectively. This microprocessor achieves workstation class performance (estimated 10 SPECint95 and 9 SPECfp95) while only dissipating 5 W at 250 MHz. The 6.35-million transistor 66.5-mm² die is fabricated in a 2.5-V, 0.3-μm, five-layer metal CMOS process.

  • A 250 MHz 5 W RISC microprocessor with on-chip L2 Cache controller
    1997 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, 1997
    Co-Authors: Paul A. Reed, M. Alexander, Jose Alvarez, Michael L. Brauer, Chai-chin Chao, C. Croxton, Lee Evan Eisen, Tai Ngo, C. Nicoletta
    Abstract:

    This superscalar microprocessor is a 32b implementation of the PowerPC Architecture(TM) specification based on a micro-architecture designed for high performance and low power. Two instructions per cycle can be dispatched in this superscalar design. The processor includes dual 32kB 8-way instruction and data Caches, a floating-point unit, two integer units, a branch unit, a load/store unit, and a system unit. An L2 tag and Cache controller with a dedicated L2 bus interface are added to provide a low-cost L2 Cache solution using commodity SRAMs for the data.

Bruce Petrick - One of the best experts on this subject based on the ideXlab platform.

  • a 40nm 16 core 128 thread cmt sparc soc processor
    International Solid-State Circuits Conference, 2010
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Kenway Tam, Dawei Huang, Ha Pham, Changku Hwang, Alan Smith, Timothy Johnson, Francis X Schumacher, David J Greenhill
    Abstract:

    This next generation of Chip Multithreaded (CMT) SPARC SoC processor, code-named Rainbow Falls, doubles on-chip thread count over its predecessor, the UltraSparc T2+. The chip offers high levels of integration and scalability with twice the number of cores, a larger L2 Cache, and higher maximum I/O bandwidth, within the same power envelope. Sixteen 8-threaded enhanced SPARC cores (SPC) provide 128 threads in a single die, delivering the highest thread count for a general-purpose microprocessor. The new Cache coherency further allows up to 4-way glueless systems with a total of 512 threads. Each core communicates with the unified 6MB L2 Cache through a crossbar (CCX) delivering 461GB/s (Fig. 5.2.1). A gasket (CXG) is also introduced to manage the congestion and synchronization of the massive interconnect between the 16 cores and the crossbar. This facilitates a synchronized delay control between any core and any L2 bank for partial core product binning and testing.

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    IEEE Journal of Solid-state Circuits, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.

  • design and implementation of an embedded 512 kb level 2 Cache subsystem
    Custom Integrated Circuits Conference, 2005
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Mandeep Singh, Ana Sonia Leon
    Abstract:

    Dual on-chip 512-KB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 Cache are discussed.

  • design and implementation of an embedded 512kb level 2 Cache subsystem
    Custom Integrated Circuits Conference, 2004
    Co-Authors: Jinuk Luke Shin, Bruce Petrick, Howard L Levy, Jinseung Son, Mandeep Singh, Vikas Mathur, Jungcheng Yeh, Heesung Choi, Vipin Gupta, T Ziaja
    Abstract:

    Dual on-chip 512 kB unified second level (L2) Caches for an UltraSparc processor are implemented using 0.13 μm technology. Each 512 kB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85 °C. This fully integrated subsystem is composed of data and tag SRAMs along with datapaths, controller and test engines. The unit achieves one of the shortest on-chip L2 Cache latencies reported for 64b microprocessors, with a data latency of only 4 cycles including ECC correction for 128-bit data. The design solutions to build this integrated short latency L2 Cache are discussed.