Cache Line Size

The Experts below are selected from a list of 1,779 Experts worldwide, ranked by the ideXlab platform

Riko Jacob - One of the best experts on this subject based on the ideXlab platform.

  • WADS - Tight bounds for low dimensional star stencils in the external memory model
    Lecture Notes in Computer Science, 2013
    Co-Authors: Philipp Hupp, Riko Jacob
    Abstract:

    Stencil computations on low dimensional grids are kernels of many scientific applications, including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations are limited by the performance of the memory subsystem, namely by the bandwidth between main memory and the Cache. This work considers the computation of star stencils, like the 5-point and 7-point stencils, in the external memory model. The analysis focuses on the constant of the leading term of the non-compulsory I/Os. Optimizing stencil computations is an active field of research, but so far there has been a significant gap between the lower bounds and the performance of the algorithms. In two dimensions, matching constants for lower and upper bounds are provided, closing a gap of a factor of 4. In three dimensions, the bounds match up to a factor of $\sqrt{2}$, improving the known results by a factor of $2\sqrt{3}\sqrt{B}$, where $B$ is the block (Cache Line) Size of the external memory model. For higher dimensions $n$, the presented lower bounds improve the previously known ones by a factor between 4 and 6, leaving a gap of $\sqrt[n-1]{n!} \approx \frac{n}{e}$.
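
    To make the 5-point case concrete, here is a minimal C sketch of one Jacobi-style sweep of the 5-point star stencil; the grid side N and the averaging update are illustrative assumptions, not parameters from the paper. Each update touches five points, and the I/O cost in the external memory model is governed by how many distinct blocks of Size B those five accesses fall into.

        /* One Jacobi-style sweep of a 5-point star stencil (illustrative). */
        #include <stdio.h>

        #define N 1024  /* grid side length: an assumption for the sketch */

        static double a[N][N], b[N][N];

        int main(void) {
            for (int i = 0; i < N; i++)          /* arbitrary initial data */
                for (int j = 0; j < N; j++)
                    a[i][j] = (double)(i + j);

            for (int i = 1; i < N - 1; i++)      /* interior points only */
                for (int j = 1; j < N - 1; j++)
                    b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                                      a[i][j - 1] + a[i][j + 1]);

            printf("b[1][1] = %f\n", b[1][1]);
            return 0;
        }

    A plain row-wise sweep keeps roughly three rows of a live in the Cache at once; once those rows no longer fit, non-compulsory I/Os appear, and it is the constant in front of exactly that term which the paper's bounds pin down.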

Rajesh Gupta - One of the best experts on this subject based on the ideXlab platform.

  • Line Size adaptivity analysis of parameterized loop nests for direct mapped data Cache
    IEEE Transactions on Computers, 2005
    Co-Authors: Paolo D'Alberto, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    Caches are crucial components of modern processors; they allow high-performance processors to access data quickly and, due to their small Sizes, they enable low-power processors to save energy by circumventing memory accesses. We examine efficient utilization of data Caches in an adaptive memory hierarchy. We exploit data reuse through the static analysis of Cache-Line Size adaptivity. We present an approach that enables the quantification of data misses with respect to Cache-Line Size at compile time using (parametric) equations, which model interference. Our approach aims at the analysis of perfect loop nests in scientific applications; it is applied to a direct-mapped Cache, and it is an extension and generalization of the Cache Miss Equation (CME) proposed by Ghosh et al. (1999). Part of this analysis is implemented in a software package, STAMINA. We present analytical results in comparison with simulation-based methods and show evidence of both the expressiveness and the practicability of the analysis.
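
    As a rough empirical stand-in for this analysis, the sketch below replays the address trace of a small loop nest through a simulated direct-mapped Cache and tabulates misses for several Line Sizes; the Cache capacity, array bases, and the nest itself are invented for illustration, whereas STAMINA derives such counts analytically at compile time.

        /* Count direct-mapped Cache misses for a loop nest at several Line Sizes. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        #define CACHE_BYTES 4096        /* total capacity: an assumption */
        #define MAX_LINES   4096

        static uint64_t tags[MAX_LINES];
        static int      valid[MAX_LINES];
        static long     misses;

        static void access_addr(uint64_t addr, int line_size) {
            int nlines = CACHE_BYTES / line_size;
            uint64_t block = addr / (uint64_t)line_size;
            int set = (int)(block % (uint64_t)nlines);  /* direct mapped: one Line per set */
            if (!valid[set] || tags[set] != block) {
                misses++;
                valid[set] = 1;
                tags[set] = block;
            }
        }

        int main(void) {
            enum { N = 64 };
            uint64_t base_a = 0x10000, base_b = 0x90000; /* hypothetical array bases */
            for (int ls = 16; ls <= 256; ls *= 2) {
                misses = 0;
                memset(valid, 0, sizeof valid);
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++) {
                        access_addr(base_a + 8u * (uint64_t)(i * N + j), ls); /* a[i][j] */
                        access_addr(base_b + 8u * (uint64_t)(j * N + i), ls); /* b[j][i] */
                    }
                printf("Line Size %3d B: %ld misses\n", ls, misses);
            }
            return 0;
        }

    The row-major accesses to a profit from longer Lines, while the transposed accesses to b gain little and start to interfere; the parametric equations capture exactly this trade-off without running the program.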

  • Static analysis of parameterized loop nests for energy efficient use of data Caches
    Compilers and Operating Systems for Low Power, 2003
    Co-Authors: Paolo D'Alberto, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    Caches are an important part of architectural and compiler low-power strategies because they reduce both the number of memory accesses and the energy per access. In this chapter, we examine efficient utilization of data Caches for low power in an adaptive memory hierarchy. We focus on the optimization of data reuse through the static analysis of Line Size adaptivity. We present an approach that enables the quantification of data misses with respect to Cache Line Size at compile time. This analysis is implemented in a software package, STAMINA. Experimental results demonstrate the effectiveness and accuracy of the analytical results compared to alternative simulation-based methods.
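
    To illustrate the energy angle, a toy selection step: once the analysis predicts misses per candidate Line Size, a compiler can weigh them against per-access costs. All numbers below (miss counts, energy constants) are invented; an analysis like STAMINA's would supply the miss counts.

        /* Pick the Line Size minimizing a toy energy model. All constants invented. */
        #include <stdio.h>

        int main(void) {
            int    sizes[]  = { 16, 32, 64, 128 };
            long   misses[] = { 9000, 5200, 3100, 2900 }; /* hypothetical predictions */
            long   accesses = 100000;
            double e_hit = 0.5;                           /* nJ per hit: assumption */
            double best = 1e30;
            int    best_size = sizes[0];

            for (int k = 0; k < 4; k++) {
                double e_miss = 2.0 + 0.05 * sizes[k];    /* wider refills cost more energy */
                double energy = accesses * e_hit + misses[k] * e_miss;
                printf("L = %3d B: %.0f nJ\n", sizes[k], energy);
                if (energy < best) { best = energy; best_size = sizes[k]; }
            }
            printf("pick Line Size %d B\n", best_size);
            return 0;
        }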

  • Intelligent Memory Systems - Compiler-Directed Cache Line Size Adaptivity
    Intelligent Memory Systems, 2001
    Co-Authors: Dan Nicolaescu, Alexandru Nicolau, Alexander V. Veidenbaum, Rajesh Gupta
    Abstract:

    The performance of a computer system is highly dependent on the performance of the Cache memory system. The traditional Cache memory system has an organization with a Line Size that is fixed at design time. Miss rates for different applications could be improved if the Line Size could be adjusted dynamically at run time. We propose a system where the compiler can set the Cache Line Size for different portions of the program, and we show that the miss rate is greatly reduced as a result of this dynamic resizing.
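
    The usage pattern this implies can be sketched as follows; set_cache_line_size() is a hypothetical intrinsic standing in for whatever hardware hook a real implementation would expose, and in the proposed system the compiler, not the programmer, would insert such calls per program region.

        /* Sketch of compiler-directed Line Size adaptivity (intrinsic is hypothetical). */
        #include <stddef.h>
        #include <stdio.h>

        static void set_cache_line_size(size_t bytes) {
            /* placeholder: a real system would write a hardware control register */
            printf("[cache] Line Size -> %zu bytes\n", bytes);
        }

        static double sum_rows(const double *m, size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n * n; i++)
                s += m[i];                        /* streaming: good spatial locality */
            return s;
        }

        static double sum_cols(const double *m, size_t n) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                for (size_t i = 0; i < n; i++)
                    s += m[i * n + j];            /* strided: poor spatial locality */
            return s;
        }

        int main(void) {
            enum { N = 256 };
            static double m[N * N];

            set_cache_line_size(256);             /* large Lines for the streaming region */
            double r = sum_rows(m, N);

            set_cache_line_size(32);              /* small Lines for the strided region */
            double c = sum_cols(m, N);

            printf("%f %f\n", r, c);
            return 0;
        }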

Zehra Sura - One of the best experts on this subject based on the ideXlab platform.

  • HPCA - Design and implementation of software-managed Caches for multicores with local memory
    2009 IEEE 15th International Symposium on High Performance Computer Architecture, 2009
    Co-Authors: Sangmin Seo, Jaejin Lee, Zehra Sura
    Abstract:

    Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have Caches for their accelerator cores because coherence traffic, Cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to hold their code and data. Programmers of such heterogeneous multicore architectures must explicitly manage data transfers between the local memory of a core and the globally shared main memory. This is a tedious and error-prone programming task. A software-managed Cache (SMC), implemented in local memory, can be programmed to automatically handle data transfers at runtime, thus simplifying the task of the programmer. In this paper, we propose a new software-managed Cache design, called the extended set-index Cache (ESC). It has the benefits of both set-associative and fully associative Caches: its tag search speed is comparable to the set-associative Cache, and its miss rate is comparable to the fully associative Cache. We examine various Line replacement policies for SMCs and discuss their trade-offs. In addition, we propose adaptive execution strategies that select the optimal Cache Line Size and replacement policy for each program region at runtime. To evaluate the effectiveness of our approach, we implement the ESC and other SMC designs on the Cell BE architecture and measure their performance with 8 OpenMP applications. The evaluation results show that the ESC outperforms the other SMC designs. The results also show that our adaptive execution strategies work well with the ESC. In fact, our approach is applicable to all cores with access to both local and global memory in a multicore architecture.
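
    A rough C sketch of the lookup path of a software-managed Cache kept in local memory follows; it models a generic 4-way set-associative SMC with DMA approximated by memcpy from a stand-in global buffer. It is a baseline sketch rather than the ESC itself, whose extended set index decouples tag sets from data Lines to approach fully associative miss rates; all sizes are assumptions.

        /* Generic software-managed Cache (SMC) lookup; not the ESC design itself. */
        #include <stdint.h>
        #include <string.h>
        #include <stdio.h>

        #define LINE 128                      /* Cache Line Size in bytes: assumption */
        #define SETS 32
        #define WAYS 4

        static uint8_t  global_mem[1 << 20];          /* stand-in for main memory */
        static uint8_t  lines[SETS][WAYS][LINE];      /* data store in local memory */
        static uint32_t tags[SETS][WAYS];
        static int      valid[SETS][WAYS];
        static long     use_cnt[SETS][WAYS];

        static void *smc_get(uint32_t addr) {
            uint32_t block = addr / LINE, set = block % SETS, off = addr % LINE;
            int w, victim = 0;
            for (w = 0; w < WAYS; w++)                /* tag search within the set */
                if (valid[set][w] && tags[set][w] == block) goto hit;
            for (w = 1; w < WAYS; w++)                /* evict the least-used way */
                if (use_cnt[set][w] < use_cnt[set][victim]) victim = w;
            w = victim;
            memcpy(lines[set][w], &global_mem[block * LINE], LINE); /* "DMA" fill */
            tags[set][w] = block;
            valid[set][w] = 1;
        hit:
            use_cnt[set][w]++;                        /* cheap stand-in for LRU bookkeeping */
            return &lines[set][w][off];
        }

        int main(void) {
            for (uint32_t i = 0; i < 1024; i++) global_mem[i] = (uint8_t)i;
            uint8_t *p = smc_get(300);
            printf("%d\n", *p);                       /* prints 44, i.e. 300 mod 256 */
            return 0;
        }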

Kazuaki Murakami - One of the best experts on this subject based on the ideXlab platform.

  • Reducing On-Chip DRAM Energy via Data Transfer Size Optimization
    IEICE Transactions on Electronics, 2009
    Co-Authors: Takatsugu Ono, Koji Inoue, Kazuaki Murakami, Kenji Yoshida
    Abstract:

    This paper proposes a software-controllable variable Line-Size (SC-VLS) Cache architecture for low-power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks to realize this high bandwidth: an ASIC and a specific DRAM are mounted onto a silicon interposer, and each chip is connected to the interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small Cache memory to improve performance, and we exploit that Cache to reduce the DRAM energy consumption. During program execution, the Cache Line Size that produces the lowest Cache miss ratio varies because the amount of spatial locality of memory references changes. If we employ a large Cache Line Size, we can expect a prefetching effect; however, the DRAM energy consumption is then larger than with a small Line Size because a larger number of banks are accessed. The SC-VLS Cache is able to change its Line Size to an adequate one at runtime with small area and power overheads. We analyze the adequate Line Size and insert Line Size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS Cache reduces the DRAM energy consumption by up to 88% compared to a conventional Cache with fixed 256 B Lines.
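
    The selection step the paper describes (analyze, then insert Line Size change instructions at function entries) can be sketched as below; the function names, miss ratios, and the scvls_set mnemonic are all invented for illustration.

        /* Offline SC-VLS planning: pick the lowest-miss-ratio Line Size per function. */
        #include <stdio.h>

        int main(void) {
            const char *fn[] = { "fir_filter", "huffman_decode", "stream_copy" };
            int sizes[]      = { 32, 64, 128, 256 };
            /* miss ratio (%) per function per candidate Line Size: hypothetical data */
            double miss[3][4] = { { 4.1, 2.2, 1.3, 1.5 },
                                  { 3.0, 2.8, 2.9, 3.4 },
                                  { 8.0, 4.1, 2.1, 1.1 } };

            for (int f = 0; f < 3; f++) {
                int best = 0;
                for (int k = 1; k < 4; k++)
                    if (miss[f][k] < miss[f][best]) best = k;
                printf("insert 'scvls_set(%d)' at entry of %s()\n", sizes[best], fn[f]);
            }
            return 0;
        }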

  • Adaptive Cache-Line Size management on 3D integrated microprocessors
    2009 International SoC Design Conference (ISOCC), 2009
    Co-Authors: Takatsugu Ono, Koji Inoue, Kazuaki Murakami
    Abstract:

    The memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on the processor cores and connecting them by wide on-chip buses composed of through-silicon vias (TSVs). The 3D stacking makes it possible to reduce the Cache miss penalty because a large amount of data can be transferred from the main memory to the Cache at a time. If a large Cache Line Size is employed, we can expect a prefetching effect; however, it might worsen system performance if programs do not have enough spatial locality of memory references. To solve this problem, we introduce a software-controllable variable Line-Size Cache scheme. In this paper, we apply it to an L1 data Cache with a 3D stacked DRAM organization. In our evaluation, it is observed that our approach reduces the L1 data Cache and stacked DRAM energy consumption by up to 75% compared to a conventional Cache.
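
    The intuition for why the Line Size should vary per region can be sketched as a small rule: track what fraction of each fetched Line a region actually touches, then widen or narrow the Size accordingly. The thresholds, bounds, and per-region statistics below are assumptions, not values from the paper.

        /* Toy Line Size adjustment from measured spatial locality (illustrative). */
        #include <stdio.h>

        static int next_line_size(int cur, double used_fraction) {
            if (used_fraction > 0.75 && cur < 256) return cur * 2; /* rich locality: widen */
            if (used_fraction < 0.25 && cur > 32)  return cur / 2; /* poor locality: narrow */
            return cur;
        }

        int main(void) {
            int ls = 128;
            double used[] = { 0.9, 0.8, 0.2, 0.1, 0.5 };  /* invented per-region stats */
            for (int i = 0; i < 5; i++) {
                ls = next_line_size(ls, used[i]);
                printf("region %d: Line Size %d B\n", i, ls);
            }
            return 0;
        }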

  • A high-performance/low-power on-chip memory-path architecture with variable Cache-Line Size
    IEICE Transactions on Electronics, 2000
    Co-Authors: Koji Inoue, Koji Kai, Kazuaki Murakami
    Abstract:

    This paper proposes an on-chip memory-path architecture employing the dynamically variable Line-Size (D-VLS) Cache for high performance and low energy consumption. The D-VLS Cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large Cache Line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the Cache-Line Size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to a replaced Cache-Line Size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS Cache, improves the ED (Energy-Delay) product by more than 75% over a conventional memory-path model.
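
    The replacement-time decision at the heart of a variable Line-Size Cache can be sketched as follows: per-subline reference bits record which parts of a wide physical Line were touched, and the count picks the next fetch width so that only the matching DRAM subarrays are activated. The subline Size, Line widths, and decision thresholds are illustrative assumptions.

        /* D-VLS-style fetch-width decision from subline reference bits (illustrative). */
        #include <stdio.h>

        #define SUBLINE  32    /* bytes per subline: assumption */
        #define SUBLINES 4     /* 4 sublines = 128 B maximum Line */

        static int popcount_bits(unsigned bits) {
            int n = 0;
            for (int i = 0; i < SUBLINES; i++) n += (bits >> i) & 1u;
            return n;
        }

        /* decide the next fetch width from the evicted Line's reference bitmap */
        static int next_fetch_bytes(unsigned ref_bits) {
            int used = popcount_bits(ref_bits);
            if (used >= 3) return 4 * SUBLINE;   /* rich spatial locality: fetch wide */
            if (used == 2) return 2 * SUBLINE;
            return SUBLINE;                      /* poor locality: fetch one subline */
        }

        int main(void) {
            unsigned evicted[] = { 0xF, 0x3, 0x1, 0x5 }; /* sample reference bitmaps */
            for (int i = 0; i < 4; i++) {
                int bytes = next_fetch_bytes(evicted[i]);
                printf("ref bits 0x%X -> fetch %d B (%d DRAM subarrays)\n",
                       evicted[i], bytes, bytes / SUBLINE);
            }
            return 0;
        }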

Philipp Hupp - One of the best experts on this subject based on the ideXlab platform.

  • WADS - Tight bounds for low dimensional star stencils in the external memory model
    Lecture Notes in Computer Science, 2013
    Co-Authors: Philipp Hupp, Riko Jacob
    Abstract:

    Stencil computations on low dimensional grids are kernels of many scientific applications, including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations are limited by the performance of the memory subsystem, namely by the bandwidth between main memory and the Cache. This work considers the computation of star stencils, like the 5-point and 7-point stencils, in the external memory model. The analysis focuses on the constant of the leading term of the non-compulsory I/Os. Optimizing stencil computations is an active field of research, but so far there has been a significant gap between the lower bounds and the performance of the algorithms. In two dimensions, matching constants for lower and upper bounds are provided, closing a gap of a factor of 4. In three dimensions, the bounds match up to a factor of $\sqrt{2}$, improving the known results by a factor of $2\sqrt{3}\sqrt{B}$, where $B$ is the block (Cache Line) Size of the external memory model. For higher dimensions $n$, the presented lower bounds improve the previously known ones by a factor between 4 and 6, leaving a gap of $\sqrt[n-1]{n!} \approx \frac{n}{e}$.