Cache Coherence

The Experts below are selected from a list of 7239 Experts worldwide ranked by ideXlab platform

Alberto Ros - One of the best experts on this subject based on the ideXlab platform.

  • Cache Coherence Protocols for Many-Core CMPs
    'IntechOpen', 2021
    Co-Authors: Alberto Ros, Manuel E. Acacio, Jose M. Garcia
    Abstract:

    Tiled CMP architectures have recently emerged as a scalable alternative to current small-scale CMP designs, and will probably be the architecture of choice for future many-core CMPs. On the other hand, although a great deal of attention was devoted to scalable Cache Coherence protocols in the last decades in the context of shared-memory multiprocessors, the technological parameters and constraints entailed by CMPs demand new solutions to the Cache Coherence problem. New Cache Coherence protocols, like Token-CMP and DiCo-CMP, have been recently proposed to cope with the indirection problem of traditional protocols. However, neither Token-CMP nor DiCo-CMP scales efficiently with the number of cores, and future Cache Coherence protocols need to be efficient in terms of execution time, network traffic generated, and area requirements. In this chapter, we take these three constraints into consideration, and we discuss and evaluate both protocols that are used nowadays, such as Hammer and Directory, and novel indirection-aware protocols, such as Token-CMP and DiCo-CMP. In this way, we perform a detailed evaluation of a wide range of Cache Coherence protocols for many-core CMPs in a common framework. We also study several implementations of DiCo-CMP that differ in the amount of Coherence information that they store in order to achieve the best trade-off among the three constraints considered. In particular, we show that DiCo-LP-1, which only stores the identity of one sharer along with the data block, DiCo-BT, which encodes the directory information using just three bits, and DiCo-NoSC, which does not store any Coherence information in the data Caches (and so does not need to modify the structure of the Caches), are the alternatives that achieve the best trade-off. For example, DiCo-BT requires less area than all other evaluated protocols except Hammer-CMP, generates network traffic similar to that of Directory-CMP, and has a low average execution time (just 1% higher than that of the best approach, DiCo-FM).

  • A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence
    IEEE Transactions on Parallel and Distributed Systems, 2016
    Co-Authors: Alberto Ros, Alexandra Jimborean
    Abstract:

    Traditional Cache Coherence protocols manage all memory accesses equally and ensure the strongest memory model, namely, sequential consistency. Recent Cache Coherence protocols based on self-invalidation advocate for the model sequential consistency for data-race-free, which enables powerful optimizations for race-free code. However, for racy code these Cache Coherence protocols provide sub-optimal performance compared to traditional protocols. This paper proposes SPEL++, a dual-consistency Cache Coherence protocol that supports two execution modes: a traditional sequential-consistent protocol and a protocol that provides weak consistency (or sequential consistency for data-race-free). SPEL++ exploits a static-dynamic hybrid classification of memory accesses based on (i) a compile-time identification of extended data-race-free code regions for OpenMP applications and (ii) a runtime classification of accesses based on the operating system's memory page management. By executing racy code under the sequential-consistent protocol and race-free code under the Cache Coherence protocol that provides sequential consistency for data-race-free, the end result is an efficient execution of the applications while still providing sequential consistency. Compared to a traditional protocol, we show improvements in performance from 19 to 38 percent and reductions in energy consumption from 47 to 53 percent, on average for different benchmark suites, on a 64-core chip multiprocessor.

  • VIPS: simple, efficient, and scalable Cache Coherence
    2016
    Co-Authors: Alberto Ros
    Abstract:

    Directory-based Cache Coherence is the de facto standard for scalable shared-memory multi/many-cores, and significant effort is invested in reducing its overhead. However, directory area and complexity optimizations are often antithetical to each other. This talk presents VIPS, a family of Cache Coherence protocols based on self-invalidation and self-downgrade. VIPS protocols remove the complexity and cost associated with directories in their entirety, thus increasing multiprocessor scalability, and at the same time provide better performance and energy efficiency than traditional directory-based protocols.

  • A Dual-Consistency Cache Coherence Protocol
    International Parallel and Distributed Processing Symposium, 2015
    Co-Authors: Alberto Ros, Alexandra Jimborean
    Abstract:

    Weak memory consistency models can maximize system performance by enabling hardware and compiler optimizations, but increase programming complexity since they do not match programmers' intuition. The design of an efficient system with an intuitive memory model is an open challenge. This paper proposes SPEL, a dual-consistency Cache Coherence protocol which simultaneously guarantees the strongest memory consistency model provided by the hardware and yields improvements in both performance and energy consumption. The design of the protocol exploits a compile-time identification of code regions which can be executed under a less restrictive, thus optimized protocol, without harming correctness. Outside these regions, code is executed under a more restrictive protocol which enforces sequential consistency. Compared to a standard directory protocol, we show improvements in performance of 24% and reductions in energy consumption of 32%, on average, for a 64-core chip multiprocessor.

  • Efficient Cache Coherence Protocol in Tiled Chip Multiprocessors
    2015
    Co-Authors: Alberto Ros, Manuel E. Acacio, José M. García
    Abstract:

    Although directory-based Cache Coherence protocols are the best choice when designing large-scale chip multiprocessors (CMPs), they introduce indirection to access directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a Cache Coherence protocol aimed at avoiding indirection to the directory information. In DiCo-CMP, the role of storing up-to-date sharing information and ensuring totally ordered accesses for every memory block is assigned to the Cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending Coherence messages directly from the requesting Caches to those that must observe them, and reduces the network traffic compared to broadcast-based protocols by sending just one request message for each miss. Using an extended version of the GEMS simulator, we show that DiCo-CMP achieves improvements in execution time of up to 8% on average over a directory protocol, and reductions in network traffic of up to 42% on average compared to Token-CMP.
    Keywords: Tiled CMPs, Cache Coherence protocols, DiCo-CMP, direct Coherence, indirection.
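
The indirection that the DiCo-CMP abstract describes can be illustrated by counting network hops on the critical path of a cache-to-cache miss. The following is only a simplified sketch, not the protocol from the paper: it assumes a 2D mesh of tiles with Manhattan-distance routing and a requester that already knows which cache owns the block. The function names and the tile numbering are hypothetical.

```python
def hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh that is `width` tiles wide."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def directory_miss_hops(requester, home, owner, width):
    # Three legs: requester -> home directory -> owner cache -> requester.
    return (hops(requester, home, width)
            + hops(home, owner, width)
            + hops(owner, requester, width))


def direct_miss_hops(requester, owner, width):
    # Two legs: requester -> owner cache -> requester.
    # Assumption: the requester correctly identifies the owner.
    return 2 * hops(requester, owner, width)


width = 4                          # 4x4 mesh, tiles numbered row by row
requester, home, owner = 0, 15, 1  # owner is adjacent, home tile is far away
print(directory_miss_hops(requester, home, owner, width))  # 12
print(direct_miss_hops(requester, owner, width))           # 2
```

Because mesh distance obeys the triangle inequality, the direct path is never longer than the path through the home tile in this model; the gap is largest when the owner is close to the requester and the home tile is far away.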

Vijay Nagarajan - One of the best experts on this subject based on the ideXlab platform.

  • Verification of a Lazy Cache Coherence Protocol against a Weak Memory Model
    arXiv: Logic in Computer Science, 2017
    Co-Authors: Christopher J Banks, Marco Elver, Ruth Hoffmann, Susmit Sarkar, Paul B Jackson, Vijay Nagarajan
    Abstract:

    In this paper we verify a modern lazy Cache Coherence protocol, TSO-CC, against the memory consistency model it was designed for, TSO. We achieve this by first showing a weak simulation relation between TSO-CC (with a fixed number of processors) and a novel finite-state operational model which exhibits the laziness of TSO-CC and satisfies TSO. We then extend this by an existing parameterisation technique, allowing verification for an unlimited number of processors. The approach is executed entirely within a model checker: no external tool is needed, and the verifier needs very little in-depth knowledge of formal verification methods.

  • RC3: Consistency Directed Cache Coherence for x86-64 with RC Extensions
    International Conference on Parallel Architectures and Compilation Techniques, 2015
    Co-Authors: Marco Elver, Vijay Nagarajan
    Abstract:

    The recent convergence towards programming language based memory consistency models has sparked renewed interest in lazy Cache Coherence protocols. These protocols exploit synchronization information by enforcing Coherence only at synchronization boundaries via self-invalidation. In effect, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily make use of lazy Coherence protocols without either: changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture, which is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy codes and older microarchitectures. Second, we propose RC3, a scalable hardware Cache Coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers, and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3's storage overhead per Cache line scales logarithmically with increasing core count, and reduces on-chip Coherence storage overheads by 45% compared to a related approach specifically targeting TSO.

  • TSO-CC: Consistency Directed Cache Coherence for TSO
    High-Performance Computer Architecture, 2014
    Co-Authors: Marco Elver, Vijay Nagarajan
    Abstract:

    Traditional directory Coherence protocols are designed for the strictest consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that support relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually this comes at the cost of scalability (in terms of per core storage), which poses a problem with increasing number of cores in today's CMPs, most of which no longer are sequentially consistent. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we propose TSO-CC, a Cache Coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers, and instead relies on self-invalidation and detection of potential acquires using timestamps to satisfy the TSO memory consistency model lazily. Our results show that TSO-CC achieves average performance comparable to a MESI directory protocol, while TSO-CC's storage overhead per Cache line scales logarithmically with increasing core count.
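
The lazy protocols in the abstracts above avoid sharer tracking by making each private cache responsible for its own freshness at synchronization points. The sketch below illustrates only that basic idea: self-invalidation on acquire and write-back on release. It assumes a single flat shared store and deliberately omits the timestamps that TSO-CC and RC3 use to limit self-invalidations; the class and method names are hypothetical.

```python
class SharedMemory:
    """Backing store shared by all cores (stands in for the shared last-level cache)."""

    def __init__(self):
        self.data = {}


class LazyL1:
    """Private cache that is made coherent only at synchronization points."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # addr -> locally cached value
        self.dirty = set()  # addrs written locally and not yet written back

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch from shared memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]     # hit: may be stale until the next acquire

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def release(self):
        # Publish local writes so that later acquirers can observe them.
        for addr in self.dirty:
            self.memory.data[addr] = self.lines[addr]
        self.dirty.clear()

    def acquire(self):
        # Self-invalidation: drop clean copies so later reads refetch fresh data.
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


mem = SharedMemory()
producer, consumer = LazyL1(mem), LazyL1(mem)
consumer.read(0x10)          # consumer caches the initial value 0
producer.write(0x10, 42)
producer.release()
stale = consumer.read(0x10)  # still 0: nobody sent the consumer an invalidation
consumer.acquire()
fresh = consumer.read(0x10)  # 42: the acquire forced a refetch
print(stale, fresh)
```

No component in this sketch records which caches hold a copy of a block, which is why per-line storage does not need a sharer list; the cost is the extra misses after each acquire, which the timestamp mechanisms mentioned in the abstracts aim to reduce.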

Linda A. Ness - One of the best experts on this subject based on the ideXlab platform.

  • Verification of the Futurebus+ Cache Coherence protocol
    Formal Methods in System Design, 1995
    Co-Authors: Edmund M. Clarke, Orna Grumberg, Hiromi Hiraishi, Somesh Jha, David E. Long, Kenneth L. McMillan, Linda A. Ness
    Abstract:

    We used a hardware description language to construct a formal model of the Cache Coherence protocol described in the IEEE Futurebus+ standard. By applying temporal logic model checking techniques, we found errors in the standard. The result of our project is a concise, comprehensible and unambiguous model of the protocol that should be useful both to the Futurebus+ Working Group members, who are responsible for the protocol, and to actual designers of Futurebus+ boards.

  • Verification of the Futurebus+ Cache Coherence Protocol
    Formal Methods, 1995
    Co-Authors: Edmund M. Clarke, Orna Grumberg, Hiromi Hiraishi, Somesh Jha, David E. Long, Kenneth L. McMillan, Linda A. Ness
    Abstract:

    We used a hardware description language to construct a formal model of the Cache Coherence protocol described in the draft IEEE Futurebus+ standard. By applying temporal logic model checking techniques, we found several errors in the standard. The result of our project is a concise, comprehensible and unambiguous model of the protocol that should be useful both to the Futurebus+ Working Group members, who are responsible for the protocol, and to actual designers of Futurebus+ boards.
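
The way model checking can expose protocol errors can be sketched with a plain explicit-state reachability search. This is an illustration only: it is not the Futurebus+ model and does not use temporal logic or symbolic techniques as the work above did. The two-cache protocol with states I, S and M, the invariant, and the deliberately broken variant are all hypothetical.

```python
from collections import deque

N = 2  # number of caches in the toy model


def successors(state):
    """Yield every state reachable in one atomic step; state[i] is 'I', 'S' or 'M'."""
    for i in range(N):
        if state[i] == "I":
            # Read by cache i: a modified copy elsewhere is downgraded to shared.
            yield tuple("S" if j == i or s == "M" else s for j, s in enumerate(state))
        if state[i] != "M":
            # Write by cache i: every other copy is invalidated.
            yield tuple("M" if j == i else "I" for j in range(N))
        if state[i] != "I":
            # Eviction of cache i's copy.
            yield tuple("I" if j == i else s for j, s in enumerate(state))


def buggy_successors(state):
    """Same protocol, plus a write path that forgets to invalidate other copies."""
    yield from successors(state)
    for i in range(N):
        if state[i] != "M":
            yield tuple("M" if j == i else s for j, s in enumerate(state))


def coherent(state):
    """Single-writer/multiple-reader invariant."""
    writers, readers = state.count("M"), state.count("S")
    return writers == 0 or (writers == 1 and readers == 0)


def check(initial, step=successors):
    """Breadth-first search; returns (visited states, a violating state or None)."""
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        if not coherent(s):
            return seen, s
        for t in step(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen, None


reachable, violation = check(("I", "I"))
print(len(reachable), violation)                 # 6 None
print(check(("I", "I"), buggy_successors)[1])    # a state that breaks the invariant
```

The search enumerates every reachable combination of cache states, so a rule that is wrong only in a rare interleaving is still found. Real protocol models have far larger state spaces, which is where symbolic methods such as those used in the work above become necessary.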

Kenneth L. McMillan - One of the best experts on this subject based on the ideXlab platform.

  • Parameterized Verification of the FLASH Cache Coherence Protocol by Compositional Model Checking
    Lecture Notes in Computer Science, 2001
    Co-Authors: Kenneth L. McMillan
    Abstract:

    We consider the formal verification of the Cache Coherence protocol of the Stanford FLASH multiprocessor for N processors. The proof uses the SMV proof assistant, a proof system based on symbolic model checking. The proof process is described step by step. The protocol model is derived from an earlier proof of the FLASH protocol, using the PVS system, allowing a direct comparison between the two methods.

  • Verification of the Futurebus+ Cache Coherence protocol
    Formal Methods in System Design, 1995
    Co-Authors: Edmund M. Clarke, Orna Grumberg, Hiromi Hiraishi, Somesh Jha, David E. Long, Kenneth L. McMillan, Linda A. Ness
    Abstract:

    We used a hardware description language to construct a formal model of the Cache Coherence protocol described in the IEEE Futurebus+ standard. By applying temporal logic model checking techniques, we found errors in the standard. The result of our project is a concise, comprehensible and unambiguous model of the protocol that should be useful both to the Futurebus+ Working Group members, who are responsible for the protocol, and to actual designers of Futurebus+ boards.

  • Verification of the Futurebus+ Cache Coherence Protocol
    Formal Methods, 1995
    Co-Authors: Edmund M. Clarke, Orna Grumberg, Hiromi Hiraishi, Somesh Jha, David E. Long, Kenneth L. McMillan, Linda A. Ness
    Abstract:

    We used a hardware description language to construct a formal model of the Cache Coherence protocol described in the draft IEEE Futurebus+ standard. By applying temporal logic model checking techniques, we found several errors in the standard. The result of our project is a concise, comprehensible and unambiguous model of the protocol that should be useful both to the Futurebus+ Working Group members, who are responsible for the protocol, and to actual designers of Futurebus+ boards.

David J Lilja - One of the best experts on this subject based on the ideXlab platform.

  • Cache Coherence in Large-Scale Shared-Memory Multiprocessors: Issues and Comparisons
    ACM Computing Surveys, 1993
    Co-Authors: David J Lilja
    Abstract:

    Private data Caches have not been as effective in reducing the average memory delay in multiprocessors as in uniprocessors due to data spreading among the processors, and due to the Cache Coherence problem. A wide variety of mechanisms have been proposed for maintaining Cache Coherence in large-scale shared-memory multiprocessors, making it difficult to compare their performance and implementation implications. To help the computer architect understand some of the trade-offs involved, this paper surveys current Cache Coherence mechanisms and identifies several issues critical to their design. These design issues include: 1) the Coherence detection strategy, through which possibly incoherent memory accesses are detected either statically at compile time or dynamically at run time; 2) the Coherence enforcement strategy, such as updating or invalidating, that is used to ensure that stale Cache entries are never referenced by a processor; 3) how the precision of block sharing information can be changed to trade off implementation cost against the performance of the Coherence mechanism; and 4) how the Cache block size affects the performance of the memory system. Trace-driven simulations are used to compare the performance and implementation impacts of these different issues. In addition, hybrid strategies are presented that can enhance the performance of the multiprocessor memory system by combining several different Coherence mechanisms into a single system.
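
The difference between the two enforcement strategies named in the survey abstract, invalidating and updating, can be illustrated with a very small trace-driven count. This is a toy sketch under strong assumptions (one memory block, infinite caches, one message per remote copy on each write) and is not the simulation methodology of the paper; the function name and the traces are hypothetical.

```python
def simulate(trace, policy):
    """Count (misses, coherence messages) for one block.

    `trace` is a list of (core, op) pairs with op in {'r', 'w'};
    `policy` is 'invalidate' or 'update'.
    """
    sharers = set()  # cores that currently hold a valid copy
    misses = messages = 0
    for core, op in trace:
        if core not in sharers:
            misses += 1                  # fetch the block
            sharers.add(core)
        if op == "w":
            messages += len(sharers - {core})  # one message per other valid copy
            if policy == "invalidate":
                sharers = {core}         # the other copies become invalid
            # under 'update' the other copies stay valid with the new value
    return misses, messages


producer_consumer = [(0, "w"), (1, "r")] * 3    # each write is read by core 1
write_burst = [(1, "r")] + [(0, "w")] * 4       # core 1 reads once, then core 0 writes

print(simulate(producer_consumer, "invalidate"))  # (4, 2)
print(simulate(producer_consumer, "update"))      # (2, 2)
print(simulate(write_burst, "invalidate"))        # (2, 1)
print(simulate(write_burst, "update"))            # (2, 4)
```

In this toy model, updating saves misses when every written value is read by another core, while invalidating saves messages when one core writes repeatedly without intervening remote reads. That kind of workload dependence is one motivation for combining mechanisms in a hybrid strategy.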