Cache Line Size

The Experts below are selected from a list of 1,779 Experts worldwide, ranked by the ideXlab platform

Riko Jacob - One of the best experts on this subject based on the ideXlab platform.

  • WADS - Tight bounds for low dimensional star stencils in the external memory model
    Lecture Notes in Computer Science, 2013
    Co-Authors: Philipp Hupp, Riko Jacob
    Abstract:

    Stencil computations on low dimensional grids are kernels of many scientific applications, including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations are limited by the performance of the memory subsystem, namely by the bandwidth between main memory and the Cache. This work considers the computation of star stencils, like the 5-point and 7-point stencils, in the external memory model. The analysis focuses on the constant of the leading term of the non-compulsory I/Os. Optimizing stencil computations is an active field of research, but so far there has been a significant gap between the lower bounds and the performance of the algorithms. In two dimensions, matching constants for lower and upper bounds are provided, closing a gap of a factor of 4. In three dimensions, the bounds match up to a factor of $\sqrt{2}$, improving the known results by a factor of $2\sqrt{3}\sqrt{B}$, where $B$ is the block (Cache Line) Size of the external memory model. For higher dimensions $n$, the presented lower bounds improve the previously known ones by a factor between 4 and 6, leaving a gap of $\sqrt[n-1]{n!} \approx \frac{n}{e}$.
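
    To make the 5-point case concrete, here is a minimal C sketch of one Jacobi-style sweep of the 5-point star stencil; the grid side N and the averaging update are illustrative assumptions, not parameters from the paper. Each update touches five points, and the I/O cost in the external memory model is governed by how many distinct blocks of Size B those five accesses fall into.

        /* One Jacobi-style sweep of a 5-point star stencil (illustrative). */
        #include <stdio.h>

        #define N 1024  /* grid side length: an assumption for the sketch */

        static double a[N][N], b[N][N];

        int main(void) {
            for (int i = 0; i < N; i++)          /* arbitrary initial data */
                for (int j = 0; j < N; j++)
                    a[i][j] = (double)(i + j);

            for (int i = 1; i < N - 1; i++)      /* interior points only */
                for (int j = 1; j < N - 1; j++)
                    b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                                      a[i][j - 1] + a[i][j + 1]);

            printf("b[1][1] = %f\n", b[1][1]);
            return 0;
        }

    A plain row-wise sweep keeps roughly three rows of a live in the Cache at once; once those rows no longer fit, non-compulsory I/Os appear, and it is the constant in front of exactly that term which the paper's bounds pin down.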

Rajesh Gupta - One of the best experts on this subject based on the ideXlab platform.

  • Line Size adaptivity analysis of parameterized loop nests for direct mapped data Cache
    IEEE Transactions on Computers, 2005
    Co-Authors: Paolo D'Alberto, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    Caches are crucial components of modern processors; they allow high-performance processors to access data quickly and, due to their small Sizes, they enable low-power processors to save energy by circumventing memory accesses. We examine efficient utilization of data Caches in an adaptive memory hierarchy. We exploit data reuse through the static analysis of Cache-Line Size adaptivity. We present an approach that enables the quantification of data misses with respect to Cache-Line Size at compile time using (parametric) equations, which model interference. Our approach aims at the analysis of perfect loop nests in scientific applications; it is applied to a direct-mapped Cache, and it is an extension and generalization of the Cache Miss Equation (CME) proposed by Ghosh et al. (1999). Part of this analysis is implemented in a software package, STAMINA. We present analytical results in comparison with simulation-based methods and show evidence of both the expressiveness and the practicability of the analysis.
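
    As a rough empirical stand-in for this analysis, the sketch below replays the address trace of a small loop nest through a simulated direct-mapped Cache and tabulates misses for several Line Sizes; the Cache capacity, array bases, and the nest itself are invented for illustration, whereas STAMINA derives such counts analytically at compile time.

        /* Count direct-mapped Cache misses for a loop nest at several Line Sizes. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        #define CACHE_BYTES 4096        /* total capacity: an assumption */
        #define MAX_LINES   4096

        static uint64_t tags[MAX_LINES];
        static int      valid[MAX_LINES];
        static long     misses;

        static void access_addr(uint64_t addr, int line_size) {
            int nlines = CACHE_BYTES / line_size;
            uint64_t block = addr / (uint64_t)line_size;
            int set = (int)(block % (uint64_t)nlines);  /* direct mapped: one Line per set */
            if (!valid[set] || tags[set] != block) {
                misses++;
                valid[set] = 1;
                tags[set] = block;
            }
        }

        int main(void) {
            enum { N = 64 };
            uint64_t base_a = 0x10000, base_b = 0x90000; /* hypothetical array bases */
            for (int ls = 16; ls <= 256; ls *= 2) {
                misses = 0;
                memset(valid, 0, sizeof valid);
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++) {
                        access_addr(base_a + 8u * (uint64_t)(i * N + j), ls); /* a[i][j] */
                        access_addr(base_b + 8u * (uint64_t)(j * N + i), ls); /* b[j][i] */
                    }
                printf("Line Size %3d B: %ld misses\n", ls, misses);
            }
            return 0;
        }

    The row-major accesses to a profit from longer Lines, while the transposed accesses to b gain little and start to interfere; the parametric equations capture exactly this trade-off without running the program.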

  • Static analysis of parameterized loop nests for energy efficient use of data Caches
    Compilers and Operating Systems for Low Power, 2003
    Co-Authors: Paolo D'Alberto, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    Caches are an important part of architectural and compiler low-power strategies because they reduce both the number of memory accesses and the energy per access. In this chapter, we examine efficient utilization of data Caches for low power in an adaptive memory hierarchy. We focus on the optimization of data reuse through the static analysis of Line Size adaptivity. We present an approach that enables the quantification of data misses with respect to Cache Line Size at compile time. This analysis is implemented in a software package, STAMINA. Experimental results demonstrate the effectiveness and accuracy of the analytical results compared to alternative simulation-based methods.
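
    To illustrate the energy angle, a toy selection step: once the analysis predicts misses per candidate Line Size, a compiler can weigh them against per-access costs. All numbers below (miss counts, energy constants) are invented; an analysis like STAMINA's would supply the miss counts.

        /* Pick the Line Size minimizing a toy energy model. All constants invented. */
        #include <stdio.h>

        int main(void) {
            int    sizes[]  = { 16, 32, 64, 128 };
            long   misses[] = { 9000, 5200, 3100, 2900 }; /* hypothetical predictions */
            long   accesses = 100000;
            double e_hit = 0.5;                           /* nJ per hit: assumption */
            double best = 1e30;
            int    best_size = sizes[0];

            for (int k = 0; k < 4; k++) {
                double e_miss = 2.0 + 0.05 * sizes[k];    /* wider refills cost more energy */
                double energy = accesses * e_hit + misses[k] * e_miss;
                printf("L = %3d B: %.0f nJ\n", sizes[k], energy);
                if (energy < best) { best = energy; best_size = sizes[k]; }
            }
            printf("pick Line Size %d B\n", best_size);
            return 0;
        }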

  • Intelligent Memory Systems - Compiler-Directed Cache Line Size Adaptivity
    Intelligent Memory Systems, 2001
    Co-Authors: Dan Nicolaescu, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    The performance of a computer system is highly dependent on the performance of the Cache memory system. The traditional Cache memory system has an organization with a Line Size that is fixed at design time. Miss rates for different applications could be improved if the Line Size could be adjusted dynamically at run time. We propose a system where the compiler can set the Cache Line Size for different portions of the program, and we show that the miss rate is greatly reduced as a result of this dynamic resizing.
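
    The usage pattern this implies can be sketched as follows; set_cache_line_size() is a hypothetical intrinsic standing in for whatever hardware hook a real implementation would expose, and in the proposed system the compiler, not the programmer, would insert such calls per program region.

        /* Sketch of compiler-directed Line Size adaptivity (intrinsic is hypothetical). */
        #include <stddef.h>
        #include <stdio.h>

        static void set_cache_line_size(size_t bytes) {
            /* placeholder: a real system would write a hardware control register */
            printf("[cache] Line Size -> %zu bytes\n", bytes);
        }

        static double sum_rows(const double *m, size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n * n; i++)
                s += m[i];                        /* streaming: good spatial locality */
            return s;
        }

        static double sum_cols(const double *m, size_t n) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                for (size_t i = 0; i < n; i++)
                    s += m[i * n + j];            /* strided: poor spatial locality */
            return s;
        }

        int main(void) {
            enum { N = 256 };
            static double m[N * N];

            set_cache_line_size(256);             /* large Lines for the streaming region */
            double r = sum_rows(m, N);

            set_cache_line_size(32);              /* small Lines for the strided region */
            double c = sum_cols(m, N);

            printf("%f %f\n", r, c);
            return 0;
        }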

Zehra Sura - One of the best experts on this subject based on the ideXlab platform.

  • HPCA - Design and implementation of software-managed Caches for multicores with local memory
    2009 IEEE 15th International Symposium on High Performance Computer Architecture, 2009
    Co-Authors: Sangmin Seo, Jaejin Lee, Zehra Sura
    Abstract:

    Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have Caches for their accelerator cores because coherence traffic, Cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to hold their code and data. Programmers of such heterogeneous multicore architectures must explicitly manage data transfers between the local memory of a core and the globally shared main memory. This is a tedious and error-prone programming task. A software-managed Cache (SMC), implemented in local memory, can be programmed to automatically handle data transfers at runtime, thus simplifying the task of the programmer. In this paper, we propose a new software-managed Cache design, called the extended set-index Cache (ESC). It has the benefits of both set-associative and fully associative Caches: its tag search speed is comparable to the set-associative Cache, and its miss rate is comparable to the fully associative Cache. We examine various Line replacement policies for SMCs and discuss their trade-offs. In addition, we propose adaptive execution strategies that select the optimal Cache Line Size and replacement policy for each program region at runtime. To evaluate the effectiveness of our approach, we implement the ESC and other SMC designs on the Cell BE architecture and measure their performance with 8 OpenMP applications. The evaluation results show that the ESC outperforms the other SMC designs. The results also show that our adaptive execution strategies work well with the ESC. In fact, our approach is applicable to all cores with access to both local and global memory in a multicore architecture.
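
    A rough C sketch of the lookup path of a software-managed Cache kept in local memory follows; it models a generic 4-way set-associative SMC with DMA approximated by memcpy from a stand-in global buffer. It is a baseline sketch rather than the ESC itself, whose extended set index decouples tag sets from data Lines to approach fully associative miss rates; all sizes are assumptions.

        /* Generic software-managed Cache (SMC) lookup; not the ESC design itself. */
        #include <stdint.h>
        #include <string.h>
        #include <stdio.h>

        #define LINE 128                      /* Cache Line Size in bytes: assumption */
        #define SETS 32
        #define WAYS 4

        static uint8_t  global_mem[1 << 20];          /* stand-in for main memory */
        static uint8_t  lines[SETS][WAYS][LINE];      /* data store in local memory */
        static uint32_t tags[SETS][WAYS];
        static int      valid[SETS][WAYS];
        static long     use_cnt[SETS][WAYS];

        static void *smc_get(uint32_t addr) {
            uint32_t block = addr / LINE, set = block % SETS, off = addr % LINE;
            int w, victim = 0;
            for (w = 0; w < WAYS; w++)                /* tag search within the set */
                if (valid[set][w] && tags[set][w] == block) goto hit;
            for (w = 1; w < WAYS; w++)                /* evict the least-used way */
                if (use_cnt[set][w] < use_cnt[set][victim]) victim = w;
            w = victim;
            memcpy(lines[set][w], &global_mem[block * LINE], LINE); /* "DMA" fill */
            tags[set][w] = block;
            valid[set][w] = 1;
        hit:
            use_cnt[set][w]++;                        /* cheap stand-in for LRU bookkeeping */
            return &lines[set][w][off];
        }

        int main(void) {
            for (uint32_t i = 0; i < 1024; i++) global_mem[i] = (uint8_t)i;
            uint8_t *p = smc_get(300);
            printf("%d\n", *p);                       /* prints 44, i.e. 300 mod 256 */
            return 0;
        }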

Kazuaki Murakami - One of the best experts on this subject based on the ideXlab platform.

  • Reducing On-Chip DRAM Energy via Data Transfer Size Optimization
    IEICE Transactions on Electronics, 2009
    Co-Authors: Takatsugu Ono, Koji Inoue, Kazuaki Murakami, Kenji Yoshida
    Abstract:

    This paper proposes a software-controllable variable Line-Size (SC-VLS) Cache architecture for low-power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks to realize this high bandwidth: an ASIC and a specific DRAM are mounted onto a silicon interposer, and each chip is connected to the interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small Cache memory to improve performance, and we exploit that Cache to reduce the DRAM energy consumption. During program execution, the Cache Line Size that produces the lowest Cache miss ratio varies because the amount of spatial locality of memory references changes. If we employ a large Cache Line Size, we can expect a prefetching effect; however, the DRAM energy consumption is then larger than with a small Line Size because a larger number of banks are accessed. The SC-VLS Cache is able to change its Line Size to an adequate one at runtime with small area and power overheads. We analyze the adequate Line Size and insert Line Size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS Cache reduces the DRAM energy consumption by up to 88% compared to a conventional Cache with fixed 256 B Lines.
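
    The selection step the paper describes (analyze, then insert Line Size change instructions at function entries) can be sketched as below; the function names, miss ratios, and the scvls_set mnemonic are all invented for illustration.

        /* Offline SC-VLS planning: pick the lowest-miss-ratio Line Size per function. */
        #include <stdio.h>

        int main(void) {
            const char *fn[] = { "fir_filter", "huffman_decode", "stream_copy" };
            int sizes[]      = { 32, 64, 128, 256 };
            /* miss ratio (%) per function per candidate Line Size: hypothetical data */
            double miss[3][4] = { { 4.1, 2.2, 1.3, 1.5 },
                                  { 3.0, 2.8, 2.9, 3.4 },
                                  { 8.0, 4.1, 2.1, 1.1 } };

            for (int f = 0; f < 3; f++) {
                int best = 0;
                for (int k = 1; k < 4; k++)
                    if (miss[f][k] < miss[f][best]) best = k;
                printf("insert 'scvls_set(%d)' at entry of %s()\n", sizes[best], fn[f]);
            }
            return 0;
        }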

  • Adaptive Cache-Line Size management on 3D integrated microprocessors
    2009 International SoC Design Conference (ISOCC), 2009
    Co-Authors: Takatsugu Ono, Koji Inoue, Kazuaki Murakami
    Abstract:

    The memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on the processor cores and connecting them by wide on-chip buses composed of through-silicon vias (TSVs). The 3D stacking makes it possible to reduce the Cache miss penalty because a large amount of data can be transferred from the main memory to the Cache at a time. If a large Cache Line Size is employed, we can expect a prefetching effect; however, it might worsen system performance if programs do not have enough spatial locality of memory references. To solve this problem, we introduce a software-controllable variable Line-Size Cache scheme. In this paper, we apply it to an L1 data Cache with a 3D stacked DRAM organization. In our evaluation, it is observed that our approach reduces the L1 data Cache and stacked DRAM energy consumption by up to 75% compared to a conventional Cache.
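
    The intuition for why the Line Size should vary per region can be sketched as a small rule: track what fraction of each fetched Line a region actually touches, then widen or narrow the Size accordingly. The thresholds, bounds, and per-region statistics below are assumptions, not values from the paper.

        /* Toy Line Size adjustment from measured spatial locality (illustrative). */
        #include <stdio.h>

        static int next_line_size(int cur, double used_fraction) {
            if (used_fraction > 0.75 && cur < 256) return cur * 2; /* rich locality: widen */
            if (used_fraction < 0.25 && cur > 32)  return cur / 2; /* poor locality: narrow */
            return cur;
        }

        int main(void) {
            int ls = 128;
            double used[] = { 0.9, 0.8, 0.2, 0.1, 0.5 };  /* invented per-region stats */
            for (int i = 0; i < 5; i++) {
                ls = next_line_size(ls, used[i]);
                printf("region %d: Line Size %d B\n", i, ls);
            }
            return 0;
        }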

  • A high-performance/low-power on-chip memory-path architecture with variable Cache-Line Size
    IEICE Transactions on Electronics, 2000
    Co-Authors: Koji Inoue, Koji Kai, Kazuaki Murakami
    Abstract:

    This paper proposes an on-chip memory-path architecture employing the dynamically variable Line-Size (D-VLS) Cache for high performance and low energy consumption. The D-VLS Cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large Cache Line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the Cache-Line Size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to a replaced Cache-Line Size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS Cache, improves the ED (Energy-Delay) product by more than 75% over a conventional memory-path model.
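
    The replacement-time decision at the heart of a variable Line-Size Cache can be sketched as follows: per-subline reference bits record which parts of a wide physical Line were touched, and the count picks the next fetch width so that only the matching DRAM subarrays are activated. The subline Size, Line widths, and decision thresholds are illustrative assumptions.

        /* D-VLS-style fetch-width decision from subline reference bits (illustrative). */
        #include <stdio.h>

        #define SUBLINE  32    /* bytes per subline: assumption */
        #define SUBLINES 4     /* 4 sublines = 128 B maximum Line */

        static int popcount_bits(unsigned bits) {
            int n = 0;
            for (int i = 0; i < SUBLINES; i++) n += (bits >> i) & 1u;
            return n;
        }

        /* decide the next fetch width from the evicted Line's reference bitmap */
        static int next_fetch_bytes(unsigned ref_bits) {
            int used = popcount_bits(ref_bits);
            if (used >= 3) return 4 * SUBLINE;   /* rich spatial locality: fetch wide */
            if (used == 2) return 2 * SUBLINE;
            return SUBLINE;                      /* poor locality: fetch one subline */
        }

        int main(void) {
            unsigned evicted[] = { 0xF, 0x3, 0x1, 0x5 }; /* sample reference bitmaps */
            for (int i = 0; i < 4; i++) {
                int bytes = next_fetch_bytes(evicted[i]);
                printf("ref bits 0x%X -> fetch %d B (%d DRAM subarrays)\n",
                       evicted[i], bytes, bytes / SUBLINE);
            }
            return 0;
        }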

Philipp Hupp - One of the best experts on this subject based on the ideXlab platform.

  • WADS - Tight bounds for low dimensional star stencils in the external memory model
    Lecture Notes in Computer Science, 2013
    Co-Authors: Philipp Hupp, Riko Jacob
    Abstract:

    Stencil computations on low dimensional grids are kernels of many scientific applications, including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations are limited by the performance of the memory subsystem, namely by the bandwidth between main memory and the Cache. This work considers the computation of star stencils, like the 5-point and 7-point stencils, in the external memory model. The analysis focuses on the constant of the leading term of the non-compulsory I/Os. Optimizing stencil computations is an active field of research, but so far there has been a significant gap between the lower bounds and the performance of the algorithms. In two dimensions, matching constants for lower and upper bounds are provided, closing a gap of a factor of 4. In three dimensions, the bounds match up to a factor of $\sqrt{2}$, improving the known results by a factor of $2\sqrt{3}\sqrt{B}$, where $B$ is the block (Cache Line) Size of the external memory model. For higher dimensions $n$, the presented lower bounds improve the previously known ones by a factor between 4 and 6, leaving a gap of $\sqrt[n-1]{n!} \approx \frac{n}{e}$.