Cache Miss Rate


The Experts below are selected from a list of 216 Experts worldwide ranked by ideXlab platform

Quansheng Yang - One of the best experts on this subject based on the ideXlab platform.

  • An Improved Design on Performance and Cache Miss Rate for Set-Associative I-Cache
    Computational Intelligence, 2009
    Co-Authors: Quansheng Yang
    Abstract:

    Access with way prediction is an effective approach to reducing the power consumption of a set-associative I-cache, but it also degrades performance. This paper improves the LAW (Last Accessed Way) based I-cache design by taking both the replacement policy and the predictive access policy into account. By replacing and accessing along the last accessed way, both the CMR (cache miss rate) and PMR (predict miss rate) are significantly improved. We extend the applicability of the LAW-based policy by resolving the remaining uncertain cases with MRU (Most Recently Used) way prediction. Simulations show that, compared with the original LAW approach, a 16KB, 4-way, 32B-line-size I-cache improves performance by 0.833% on average, with energy consumption increased by only 1.498%. Compared with a conventional set-associative I-cache, the proposed design, called eLAW, reduces CMR by almost 17.93% and energy consumption by 43%-59% on average.
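The replace-and-access-along-the-last-accessed-way policy can be illustrated with a toy simulator. This is a hypothetical sketch (class name and parameters are invented, and the real eLAW design adds the MRU fallback and energy modeling not shown here): each set first probes its last-accessed way, counts a predict miss when that guess is wrong, and refills along the last-accessed way on a true miss.

```python
class LawICache:
    """Toy model of LAW-style way prediction for a set-associative
    I-cache (an invented sketch, not the authors' eLAW design)."""

    def __init__(self, sets=128, ways=4, line=32):
        self.sets, self.ways, self.line = sets, ways, line
        self.tags = [[None] * ways for _ in range(sets)]
        self.last_way = [0] * sets          # LAW predictor state, per set
        self.accesses = self.misses = self.predict_misses = 0

    def access(self, addr):
        self.accesses += 1
        s = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        guess = self.last_way[s]
        if self.tags[s][guess] == tag:      # prediction hit: one way probed
            return
        self.predict_misses += 1            # wrong way guessed, or a miss
        for w in range(self.ways):
            if self.tags[s][w] == tag:      # hit, but in another way
                self.last_way[s] = w
                return
        self.misses += 1                    # true miss: refill along LAW
        victim = self.last_way[s]
        self.tags[s][victim] = tag
        self.last_way[s] = victim
```

After a run, `misses / accesses` gives the CMR and `predict_misses / accesses` the PMR that the abstract reports.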

Tulika Mitra - One of the best experts on this subject based on the ideXlab platform.

  • Instruction Cache Locking Using Temporal Reuse Profile
    Design Automation Conference, 2010
    Co-Authors: Yun Liang, Tulika Mitra
    Abstract:

    The performance of most embedded systems depends critically on the average memory access latency, so improving the cache hit rate can have a significant positive impact on application performance. Modern embedded processors often feature cache locking mechanisms that allow memory blocks to be locked in the cache under software control. Cache locking was primarily designed to offer timing predictability for hard real-time applications; hence, compiler optimization techniques have focused on employing cache locking to improve worst-case execution time. However, cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. In this paper, we explore static instruction cache locking to improve average-case program performance. We introduce the temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache, and we propose an optimal algorithm and a heuristic that use the profile to determine the most beneficial memory blocks to lock. Experimental results show that the locking heuristic achieves close-to-optimal results and can reduce the cache miss rate by up to 24% across a suite of real-world benchmarks. Moreover, our heuristic provides significant improvement over the state-of-the-art locking algorithm in terms of both performance and efficiency.
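The cost/benefit idea behind profile-driven locking can be sketched with a deliberately simplified greedy selection over a reuse profile. This is not the paper's optimal algorithm or its heuristic, and the function and argument names are invented: a locked block turns all of its references into hits, so its benefit is the extra hits gained over leaving it unlocked.

```python
def pick_blocks_to_lock(reuse_profile, unlocked_hits, capacity):
    """Simplified greedy sketch of profile-driven cache locking
    (invented illustration, not the paper's algorithm).

    reuse_profile[b]  -- total references to block b in the profile
    unlocked_hits[b]  -- hits b already gets without locking
    capacity          -- number of cache lines available for locking
    """
    # Benefit of locking b: references that would become hits.
    benefit = {b: reuse_profile[b] - unlocked_hits.get(b, 0)
               for b in reuse_profile}
    # Lock the highest-benefit blocks that actually gain something.
    ranked = sorted(benefit, key=benefit.get, reverse=True)
    return [b for b in ranked[:capacity] if benefit[b] > 0]
```

A block that is hot but already cache-resident (high `unlocked_hits`) correctly loses out to a cooler block that keeps getting evicted, which is the intuition the temporal reuse profile captures more precisely.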

  • Improved Procedure Placement for Set Associative Caches
    2010
    Co-Authors: Yun Liang, Tulika Mitra
    Abstract:

    The performance of most embedded systems depends critically on the memory hierarchy; in particular, a higher cache hit rate can provide a significant performance boost to an embedded application. Procedure placement is a popular technique that aims to improve the instruction cache hit rate by reducing cache conflicts through compile/link-time reordering of procedures. However, existing procedure placement techniques make reordering decisions based on imprecise conflict information. This imprecision leads to limited and sometimes negative performance gains, especially for set-associative caches. In this paper, we introduce the intermediate blocks profile (IBP) to accurately but compactly model the cost and benefit of procedure placement for both direct-mapped and set-associative caches. We propose an efficient algorithm that exploits IBP to place procedures in memory such that cache conflicts are minimized. Experimental results demonstrate that our approach provides substantial improvement in cache performance over existing procedure placement techniques. Furthermore, we observe that a code layout tuned for a specific cache configuration is not portable across cache configurations. To solve this problem, we propose an algorithm that exploits IBP to place procedures in memory such that the average cache miss rate across a set of cache configurations is minimized.
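The conflict-driven objective can be sketched with a toy cost model: lay procedures out contiguously in a candidate order, then charge every frequently interleaved pair of procedures for each cache set their footprints share. This is an invented stand-in for illustration, not the paper's IBP model, which accounts for execution order within sets far more precisely.

```python
def cache_sets_of(start, size, line, sets):
    """Cache sets touched by code occupying [start, start + size)."""
    first, last = start // line, (start + size - 1) // line
    return {blk % sets for blk in range(first, last + 1)}

def conflict_cost(order, sizes, affinity, line=32, sets=256):
    """Toy conflict cost for a procedure placement (illustrative sketch,
    not the paper's IBP model).

    order            -- procedures in memory-layout order
    sizes[p]         -- code size of procedure p in bytes
    affinity[(p, q)] -- how often calls to p and q interleave
    """
    start, placed = 0, {}
    for p in order:                       # contiguous layout
        placed[p] = cache_sets_of(start, sizes[p], line, sets)
        start += sizes[p]
    # Interleaved pairs pay for every set where they can evict each other.
    return sum(cnt * len(placed[p] & placed[q])
               for (p, q), cnt in affinity.items())
```

Note how the same layout can be conflict-free in one configuration and expensive in another (e.g. two 64-byte procedures overlap completely with 2 sets but not at all with 4), which is exactly the portability problem the abstract's final algorithm addresses.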

L Rudolph - One of the best experts on this subject based on the ideXlab platform.

  • A new memory monitoring scheme for memory-aware scheduling and partitioning
    Proceedings Eighth International Symposium on High Performance Computer Architecture, 2002
    Co-Authors: Srini Devadas, L Rudolph
    Abstract:

    We propose a low-overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss rate of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache so as to minimize the overall miss rate. The data collected by the monitors can also be fed to an analytical model of cache and memory behavior to produce a more accurate overall miss rate for the collection of processes sharing a cache in both time and space. This overall miss rate can in turn be used to improve scheduling and partitioning schemes.
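The marginal-gain counters have a classic software analogue: under LRU, the hits gained by growing the cache by one line equal the number of references at each LRU stack distance, so a single pass over a trace yields the miss rate at every cache size. Below is a minimal fully associative sketch of that idea (the paper's counters are hardware and operate per set):

```python
def marginal_gains(trace, max_size):
    """gains[i] = extra hits from growing an LRU cache from i to i+1
    lines, i.e. the number of references with stack distance i.
    Fully associative software sketch of the paper's hardware counters."""
    stack, gains = [], [0] * max_size
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)       # LRU stack distance of this hit
            if d < max_size:            # distance >= max_size still misses
                gains[d] += 1
            stack.pop(d)
        stack.insert(0, addr)           # move line to the MRU position
    return gains

def miss_rate(trace, size):
    """Miss rate of an LRU cache holding 'size' lines, from the counters:
    every reference with stack distance < size is a hit."""
    return 1 - sum(marginal_gains(trace, size)) / len(trace)
```

One profiling pass thus answers "what would the miss rate be at any size?", which is what makes the counters useful for scheduling and cache partitioning decisions.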

Richard A. Regueiro - One of the best experts on this subject based on the ideXlab platform.

  • Comparison between pure MPI and hybrid MPI-OpenMP parallelism for Discrete Element Method (DEM) of ellipsoidal and poly-ellipsoidal particles
    Computational Particle Mechanics, 2019
    Co-Authors: Richard A. Regueiro
    Abstract:

    Parallel computing of 3D Discrete Element Method (DEM) simulations can be carried out in different modes; two of them are pure MPI and hybrid MPI-OpenMP. The hybrid MPI-OpenMP mode allows flexibly combined mapping schemes on contemporary multiprocessing supercomputers. This paper profiles the computational components and floating-point operation features of complex-shaped 3D DEM, develops a space-decomposition-based MPI parallelism and various thread-based OpenMP parallelisms, and carries out performance comparison and analysis from intranode to internode scales across four orders of magnitude of problem size (number of particles). The influences of the memory/cache hierarchy, process/thread pinning, variation of the hybrid MPI-OpenMP mapping scheme, and ellipsoidal versus poly-ellipsoidal particle shape are carefully examined. It is found that OpenMP achieves high efficiency in interparticle contact detection, but the unparallelizable code prevents it from reaching the same efficiency overall; pure MPI achieves not only lower computational granularity (thus higher spatial locality of particles) but also lower communication granularity (thus faster MPI transmission) than hybrid MPI-OpenMP on the same computational resources; the cache miss rate is sensitive to the shrinkage of memory consumption per processor, and the last-level cache contributes most significantly to the strong superlinear speedup among the three cache levels of modern microprocessors; in hybrid MPI-OpenMP mode, as the number of MPI processes increases (and the number of threads per MPI process decreases accordingly), the total execution time decreases until the maximum performance is reached in pure MPI mode; process/thread pinning on NUMA architectures improves performance significantly when there are multiple threads per process, whereas the improvement becomes less pronounced as the number of threads per process decreases; and both communication time and computation time increase substantially from ellipsoids to poly-ellipsoids. Overall, pure MPI outperforms hybrid MPI-OpenMP in 3D DEM modeling of ellipsoidal and poly-ellipsoidal particles.

Peter Petrov - One of the best experts on this subject based on the ideXlab platform.

  • Off-chip memory bandwidth minimization through Cache partitioning for multi-core platforms
    Design Automation Conference, 2010
    Co-Authors: Chenjie Yu, Peter Petrov
    Abstract:

    We present a methodology for off-chip memory bandwidth minimization through application-driven L2 cache partitioning in multi-core systems. A major challenge in multi-core system design is the widening gap between the memory demand generated by the processor cores and the limited off-chip memory bandwidth and memory service speed. This gap severely restricts the number of cores that can be integrated into a multi-core system and the parallelism that can actually be achieved and efficiently exploited, not only for memory-demanding applications but also for workloads consisting of many tasks that utilize a large number of cores and thus exceed the available off-chip bandwidth. Last-level shared cache partitioning has been shown to be a promising technique for enhancing cache utilization and reducing miss rates. While most cache partitioning techniques focus on cache miss rates, our work takes a different approach in which the tasks' memory bandwidth requirements are taken into account when identifying a cache partitioning for multi-programmed and/or multithreaded workloads. Cache resources are allocated with the objective of minimizing the overall system bandwidth requirement for the target workload. The key insight is that cache miss-rate information alone may severely misrepresent the actual bandwidth demand of a task, which ultimately determines overall system performance and power consumption.
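The key insight, that the largest miss-rate drop is not necessarily the largest bandwidth drop, can be sketched with a greedy way allocator. This is an invented illustration, not the paper's algorithm: a task's bandwidth demand is its miss rate times its access rate times the line size, so each extra way goes to the task whose bandwidth falls the most.

```python
def partition_for_bandwidth(miss_curves, access_rates, total_ways, line=64):
    """Greedy sketch of bandwidth-driven cache partitioning (invented
    illustration, not the paper's method).

    miss_curves[t][w] -- miss rate of task t when given w ways
                         (each curve must cover 0..total_ways ways)
    access_rates[t]   -- accesses per second issued by task t
    """
    alloc = {t: 1 for t in miss_curves}      # every task gets one way

    def bw_drop(t):
        # Bytes/sec of off-chip traffic saved by giving t one more way.
        w = alloc[t]
        return (miss_curves[t][w] - miss_curves[t][w + 1]) \
               * access_rates[t] * line

    for _ in range(total_ways - len(alloc)):
        best = max(alloc, key=bw_drop)       # biggest bandwidth saving
        alloc[best] += 1
    return alloc
```

A task with a modest miss-rate improvement but a very high access rate can still win a way over a task with a steeper miss-rate curve, which is exactly where a purely miss-rate-driven partitioner would choose differently.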