Memory Reference

The Experts below are selected from a list of 252 Experts worldwide, ranked by the ideXlab platform.

Michael M. Swift - One of the best experts on this subject based on the ideXlab platform.

  • Revisiting virtual Memory
    2013
    Co-Authors: Mark D. Hill, Michael M. Swift, Arkaprava Basu
    Abstract:

    Page-based virtual Memory (paging) is a crucial piece of Memory management in today's computing systems. However, I find that the needs, purpose and design constraints of virtual Memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently-used address translations: (a) physical Memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every Memory Reference. As a result, large workloads waste considerable execution time on TLB misses and all workloads spend energy on frequent TLB accesses. In this thesis I argue that it is now time to reevaluate virtual Memory management. I reexamine the virtual Memory subsystem in light of the ever-growing latency overhead of address translation and of energy dissipation, and develop three results. First, I propose direct segments to reduce the latency overhead of address translation for emerging big-Memory workloads. Many big-Memory workloads allocate most of their Memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB misses for many of these workloads. Second, I propose opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each Memory Reference burns significant energy, and virtual Memory's page size constrains L1-cache designs to be highly associative, burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads. Third, large pages are likely to be more appropriate than direct segments for reducing TLB misses under frequent Memory allocations/deallocations. Unfortunately, prevalent chip designs, like Intel's, statically partition TLB resources among multiple page sizes, which can lead to performance pathologies when using large pages. I propose the merged-associative TLB to avoid such pathologies and reduce the TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.
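
    To make the direct-segment idea above concrete, here is a minimal Python sketch of the translation path it describes: addresses inside a contiguous [BASE, LIMIT) region are translated by simple arithmetic and bypass the TLB and page walk, while everything else falls back to paging. The register names, constants, and the single-level page-walk stub are illustrative assumptions, not the thesis's hardware interface.

      PAGE_SIZE = 4096

      def translate(va, base, limit, offset, page_table):
          """Return the physical address for virtual address va."""
          if base <= va < limit:
              # Direct segment hit: no TLB lookup, no page-table walk.
              return va + offset
          # Fall back to ordinary paging (simplified to a one-level walk).
          vpn, page_off = divmod(va, PAGE_SIZE)
          ppn = page_table[vpn]          # a missing entry would be a "page fault"
          return ppn * PAGE_SIZE + page_off

      # Example: a 1 GiB direct segment mapped contiguously in physical memory.
      BASE, LIMIT, OFFSET = 0x1000_0000, 0x1000_0000 + (1 << 30), 0x7000_0000
      print(hex(translate(0x1234_5678, BASE, LIMIT, OFFSET, {})))   # direct-segment hit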

  • ISCA - Reducing Memory Reference energy with opportunistic virtual caching
    ACM SIGARCH Computer Architecture News, 2012
    Co-Authors: Arkaprava Basu, Mark D. Hill, Michael M. Swift
    Abstract:

    Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every Memory access. These designs often hide the TLB lookup latency by overlapping it with the L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some Memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with an analysis finding that virtual-cache problems exist but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. In experiments with PARSEC and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
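
    A small Python sketch of the opportunistic policy described above, under the assumption that the OS exposes a per-page "safe for virtual caching" flag and that the toy cache and TLB are plain dictionaries: marked pages probe the L1 with the virtual address and touch the TLB only on an L1 miss, while all other pages take the conventional translate-then-lookup path. The class and field names are illustrative, not the paper's interface.

      class Core:
          def __init__(self, virt_ok_pages, l1, tlb):
              self.virt_ok = virt_ok_pages   # virtual page numbers the OS marked safe
              self.l1 = l1                   # toy cache: tag -> data
              self.tlb = tlb                 # toy TLB: VPN -> PPN
              self.tlb_lookups = 0

          def load(self, va, page_size=4096):
              vpn, off = divmod(va, page_size)
              if vpn in self.virt_ok:
                  # Virtual caching: probe the L1 with the virtual address first.
                  if va in self.l1:
                      return self.l1[va]             # hit: no TLB energy spent
                  self.tlb_lookups += 1              # translate only on the miss path
                  pa = self.tlb[vpn] * page_size + off
                  self.l1[va] = f"line@{hex(pa)}"    # fill, tagged virtually
                  return self.l1[va]
              # Conventional path: translate on every access, probe the L1 physically.
              self.tlb_lookups += 1
              pa = self.tlb[vpn] * page_size + off
              if pa not in self.l1:
                  self.l1[pa] = f"line@{hex(pa)}"
              return self.l1[pa]

      core = Core(virt_ok_pages={0x10}, l1={}, tlb={0x10: 0x80, 0x20: 0x90})
      core.load(0x10_1A0); core.load(0x10_1A0)   # second access hits virtually
      print("TLB lookups:", core.tlb_lookups)    # 1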

Arkaprava Basu - One of the best experts on this subject based on the ideXlab platform.

  • Revisiting virtual Memory
    2013
    Co-Authors: Mark D. Hill, Michael M. Swift, Arkaprava Basu
    Abstract:

    Page-based virtual Memory (paging) is a crucial piece of Memory management in today's computing systems. However, I find that the needs, purpose and design constraints of virtual Memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently-used address translations: (a) physical Memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every Memory Reference. As a result, large workloads waste considerable execution time on TLB misses and all workloads spend energy on frequent TLB accesses. In this thesis I argue that it is now time to reevaluate virtual Memory management. I reexamine the virtual Memory subsystem in light of the ever-growing latency overhead of address translation and of energy dissipation, and develop three results. First, I propose direct segments to reduce the latency overhead of address translation for emerging big-Memory workloads. Many big-Memory workloads allocate most of their Memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB misses for many of these workloads. Second, I propose opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each Memory Reference burns significant energy, and virtual Memory's page size constrains L1-cache designs to be highly associative, burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads. Third, large pages are likely to be more appropriate than direct segments for reducing TLB misses under frequent Memory allocations/deallocations. Unfortunately, prevalent chip designs, like Intel's, statically partition TLB resources among multiple page sizes, which can lead to performance pathologies when using large pages. I propose the merged-associative TLB to avoid such pathologies and reduce the TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.
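
    The third contribution mentioned above, the merged-associative TLB, motivates a sketch of a TLB whose entries are shared across page sizes rather than statically partitioned per size. Because the page size of an address is not known before translation, the lookup below probes once per supported size; this is one plausible organization offered for illustration, not necessarily the thesis's exact design.

      PAGE_SIZES = [4096, 2 * 1024 * 1024]           # 4 KiB and 2 MiB pages

      class MergedTLB:
          def __init__(self, capacity=64):
              self.capacity = capacity
              self.entries = {}                       # (page_size, vpn) -> ppn, one shared pool

          def lookup(self, va):
              for size in PAGE_SIZES:                 # probe each supported page size
                  vpn, off = divmod(va, size)
                  ppn = self.entries.get((size, vpn))
                  if ppn is not None:
                      return ppn * size + off         # hit
              return None                             # miss: hardware would walk the page table

          def insert(self, va, pa, size):
              if len(self.entries) >= self.capacity:  # simplistic eviction from the shared pool
                  self.entries.pop(next(iter(self.entries)))
              self.entries[(size, va // size)] = pa // size

      tlb = MergedTLB()
      tlb.insert(0x0040_0000, 0x1234_5000, 4096)             # a 4 KiB mapping
      tlb.insert(0x8000_0000, 0x4000_0000, 2 * 1024 * 1024)  # a 2 MiB mapping
      print(hex(tlb.lookup(0x8000_0042)))                    # hits the 2 MiB entry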

  • ISCA - Reducing Memory Reference energy with opportunistic virtual caching
    ACM SIGARCH Computer Architecture News, 2012
    Co-Authors: Arkaprava Basu, Mark D. Hill, Michael M. Swift
    Abstract:

    Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every Memory access. These designs often hide the TLB lookup latency by overlapping it with the L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some Memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with an analysis finding that virtual-cache problems exist but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. In experiments with PARSEC and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
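
    As a back-of-the-envelope reading of the numbers quoted above, the snippet below combines the reported TLB and L1 savings into a single lookup-energy figure. The 20/80 split of lookup energy between TLB and L1 is an assumption made up for illustration; only the 94% and 23% figures come from the abstract.

      # Hypothetical energy split per access; only the savings fractions are from the abstract.
      tlb_share, l1_share = 0.20, 0.80       # assumed fraction of lookup energy per access
      tlb_saved, l1_saved = 0.94, 0.23       # lower-bound savings reported in the abstract

      total_saved = tlb_share * tlb_saved + l1_share * l1_saved
      print(f"combined lookup-energy reduction = {total_saved:.0%}")   # about 37%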

Kevin Skadron - One of the best experts on this subject based on the ideXlab platform.

  • Memory Reference reuse latency: accelerated warmup for sampled microarchitecture simulation
    International Symposium on Performance Analysis of Systems and Software, 2003
    Co-Authors: J W Haskins, Kevin Skadron
    Abstract:

    This paper proposes to speed up sampled microprocessor simulations by reducing warmup times without sacrificing simulation accuracy. It exploits the observation that, of the Memory References that precede a sample cluster, those that occur nearest to the cluster are the most likely to be germane to the execution of the cluster itself. Hence, while modeling all cache and branch predictor interactions that precede a sample cluster would reliably establish their state, this is overkill and leads to long-running simulations. Instead, accurately establishing simulated cache and branch predictor state can be accomplished quickly by modeling only a subset of the Memory References and control-flow instructions immediately preceding a sample cluster. Our technique measures Memory Reference reuse latencies (MRRLs) - the number of completed instructions between consecutive References to each unique Memory location - and uses these data to choose a point prior to each cluster at which to engage cache hierarchy and branch predictor modeling. By starting cache and branch predictor modeling late in the pre-cluster instruction stream, we were able to reduce overall simulation running times by an average of 90.62% of the maximum potential speedup (achieved by performing no pre-cluster warmup at all), while generating an average error in IPC of less than 1%, both relative to the IPC generated by warming up all pre-cluster cache and branch predictor interactions.
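
    The MRRL definition above translates directly into a few lines of Python: record, for each Memory location, the number of completed instructions since its previous Reference. The trace format (pairs of instruction count and address) is an assumption for illustration.

      from collections import defaultdict

      def mrrl_histogram(trace):
          """trace: iterable of (instr_count, address); returns the list of reuse latencies."""
          last_seen = {}
          latencies = []
          for icount, addr in trace:
              if addr in last_seen:
                  latencies.append(icount - last_seen[addr])
              last_seen[addr] = icount
          return latencies

      trace = [(0, 0xA0), (3, 0xB0), (10, 0xA0), (12, 0xB0), (40, 0xA0)]
      print(mrrl_histogram(trace))   # [10, 9, 30]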

  • Memory Reference reuse latency: accelerated sampled microarchitecture simulation
    2002
    Co-Authors: J W Haskins, Kevin Skadron
    Abstract:

    This paper explores techniques for speeding up sampled microprocessor simulations by exploiting the observation that, of the Memory References that precede a sample, those that occur nearest to the sample are the most likely to be germane during the sample itself. This means that accurately warming up simulated cache and branch predictor state requires modeling only a subset of the Memory References and control-flow instructions immediately preceding a simulation sample. Our technique measures the Memory Reference reuse latencies (MRRLs) and uses these data to choose a point prior to each sample at which to engage cache hierarchy and branch predictor modeling. By starting cache and branch predictor modeling late in the pre-sample instruction stream, rather than modeling these interactions for all pre-sample instructions, we are able to save the time cost of modeling them. This savings reduces overall simulation running times by an average of 25%, while generating an average error in IPC of less than 0.7%.
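
    Continuing the sketch from the 2003 entry above, the measured reuse latencies can then be used to decide how far before each sample to begin warming the caches and branch predictor: a window long enough to cover most observed reuses is chosen. The 99% coverage threshold below is an assumed parameter, not a value taken from the paper.

      def warmup_window(latencies, coverage=0.99):
          """Smallest window length (in instructions) covering `coverage` of observed reuses."""
          ordered = sorted(latencies)
          idx = min(int(coverage * len(ordered)), len(ordered) - 1)
          return ordered[idx]

      latencies = [10, 9, 30, 12, 7, 2500, 18]
      w = warmup_window(latencies)
      print(f"start warming caches {w} instructions before the sample")   # 2500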

  • TSpec: A Notation for Describing Memory Reference Traces
    2000
    Co-Authors: Dee A. B. Weikle, Kevin Skadron, Sally A. Mckee, William A. Wulf
    Abstract:

    Interpreting Reference patterns in the output of a processor is complicated by the lack of a succinct notation for humans to use when communicating about them. Since an actual trace is simply an incredibly long list of numbers, it is difficult to see the underlying patterns inherent in it. The source code, while simpler to look at, does not include the effects of compiler optimizations such as loop unrolling, and so can be misleading as to the actual References and order seen by the Memory system. To simplify communication of traces between researchers and to understand them more completely, we have developed a notation for representing them that is easy for humans to read, write, and analyze. We call this notation TSpec, for trace specification notation. It has been designed for use in cache design with four goals in mind. First, it is intended to assist in communication between people, especially with respect to understanding the patterns inherent in Memory Reference traces. Second, it is the object on which the cache filter model operates. Specifically, the trace and state of the cache are represented in TSpec; these are then the inputs to a function that models the cache, and the result of that function is a modified trace and state that are also represented in TSpec. Third, it supports the future creation of a machine-readable version that could be used to generate traces to drive simulators, or for use in tools (such as translators from assembly language to TSpec). Finally, it can be used to represent different levels of abstraction in benchmark analysis.
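
    To illustrate why a compact trace notation helps, the toy pattern language below expands a short description (base address, stride, count) into the raw address list a simulator would consume. This is not actual TSpec syntax; it is a hypothetical stand-in for the kind of compression of regular Reference patterns the notation is designed to express.

      def expand(pattern):
          """pattern: list of (base, stride, count) tuples -> flat address trace."""
          trace = []
          for base, stride, count in pattern:
              trace.extend(base + i * stride for i in range(count))
          return trace

      # "Walk an 8-element array of 8-byte words twice" as a compact description:
      loop_body = [(0x1000, 8, 8), (0x1000, 8, 8)]
      print([hex(a) for a in expand(loop_body)])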

Mohammed Atiquzzaman - One of the best experts on this subject based on the ideXlab platform.

  • Multiple-bus multiprocessor under unbalanced traffic
    Computers & Electrical Engineering, 1999
    Co-Authors: M.a. Sayeed, Mohammed Atiquzzaman
    Abstract:

    Performance evaluation of multiple-bus multiprocessor systems is usually carried out under the assumption of a uniform Memory Reference model. Hot spots arising in multiprocessor systems due to the use of shared variables, synchronization primitives, etc., give rise to non-uniform Memory Reference patterns. The objective of this paper is to study the performance of a multiple-bus multiprocessor system in the presence of hot spots. Analytical expressions for the average Memory bandwidth and the probability of acceptance of prioritized processors have been derived. Two new phenomena, coined the bumping and knee effects, have been observed in the acceptance probabilities of the processors. The analytical results are validated by simulation.
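
    A Monte-Carlo sketch of the system studied above: N processors request M Memory modules over B shared buses, with a fraction of requests skewed toward one hot module, and per-cycle bandwidth equal to the number of distinct requested modules capped by the number of buses. It is an illustrative simulation with assumed parameters (rejected requests are simply dropped), not the paper's analytical expressions.

      import random

      def bandwidth(N=16, M=16, B=12, p_request=1.0, hot_rate=0.2, cycles=20000, seed=1):
          rng = random.Random(seed)
          served = 0
          for _ in range(cycles):
              modules = set()
              for _ in range(N):
                  if rng.random() >= p_request:
                      continue                            # processor idle this cycle
                  if rng.random() < hot_rate:
                      modules.add(0)                      # request to the hot module
                  else:
                      modules.add(rng.randrange(M))       # uniform Reference
              served += min(len(modules), B)              # at most B buses per cycle
          return served / cycles

      print(f"uniform : {bandwidth(hot_rate=0.0):.2f} words/cycle")
      print(f"hot spot: {bandwidth(hot_rate=0.2):.2f} words/cycle")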

  • MASCOTS - Performance of multiple-bus multiprocessor under non-uniform Memory Reference
    Proceedings of International Workshop on Modeling Analysis and Simulation of Computer and Telecommunication Systems, 1994
    Co-Authors: M.a. Sayeed, Mohammed Atiquzzaman
    Abstract:

    Performance evaluation of multiple-bus multiprocessor systems is usually carried out under the assumption of a uniform Memory Reference model. The objective of this paper is to study the performance of a multiple-bus multiprocessor system in the presence of hot spots. Analytical expressions for the average Memory bandwidth and the probability of acceptance of prioritized processors have been derived. Two new phenomena, coined the bumping and knee effects, have been observed in the acceptance probabilities of the processors. The analytical results are validated by simulation.
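
    A complementary sketch for the per-processor acceptance probability studied above: processors are arbitrated in fixed priority order, and a request is accepted only if its module is still free and a bus remains available. Again a toy simulation with assumed parameters, not the paper's analytical model.

      import random

      def acceptance_probabilities(N=8, M=8, B=4, hot_rate=0.3, cycles=20000, seed=2):
          rng = random.Random(seed)
          accepted = [0] * N
          for _ in range(cycles):
              taken_modules, buses_left = set(), B
              for p in range(N):                       # p = 0 has the highest priority
                  m = 0 if rng.random() < hot_rate else rng.randrange(M)
                  if m not in taken_modules and buses_left > 0:
                      taken_modules.add(m)
                      buses_left -= 1
                      accepted[p] += 1
          return [a / cycles for a in accepted]

      for p, prob in enumerate(acceptance_probabilities()):
          print(f"processor {p}: P(accept) = {prob:.2f}")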

  • Effect of nonuniform traffic on the performance of multistage interconnection networks
    IEE Proceedings - Computers and Digital Techniques, 1994
    Co-Authors: Mohammed Atiquzzaman, M. Shaheer Akhtar
    Abstract:

    Multistage interconnection networks are used to connect processors to memories in shared-Memory multiprocessor systems. The performance evaluation of such networks is usually based on the assumption of a uniform Memory Reference pattern. Hot spots in such networks give rise to a nonuniform Memory Reference pattern and result in a degradation in performance. A comparison of the performance of unbuffered and buffered networks under a nonuniform traffic pattern is given. Analytical models have been developed for the evaluation of performance: the model for unbuffered networks is developed in this paper, while the model for buffered networks is presented elsewhere. Results from the models are used to find the impact of different degrees of hot-spot traffic and network size on the performance of the network. It is shown that an unbuffered network may perform better than a buffered network under a nonuniform traffic pattern. Finally, a hybrid-mode network is suggested for optimum performance under different traffic conditions.
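
    For reference, the classical analytical model for an unbuffered multistage network built from k-by-k switches under uniform traffic is a one-line recurrence: if each switch input carries a request with probability q, each output carries one with probability 1 - (1 - q/k)^k, applied stage by stage. The paper's contribution is the extension to nonuniform (hot-spot) traffic; the sketch below shows only this uniform baseline.

      def uniform_throughput(k=2, stages=6, q0=1.0):
          """Probability that a request survives all stages of an unbuffered MIN."""
          q = q0
          for _ in range(stages):
              q = 1.0 - (1.0 - q / k) ** k
          return q

      # 64-input network built from 2x2 switches (6 stages):
      print(f"accepted fraction = {uniform_throughput():.2f}")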

  • Effect of hot-spots on the performance of crossbar multiprocessor systems
    Parallel Computing, 1993
    Co-Authors: Mohammed Atiquzzaman, Mohammad M. Banat
    Abstract:

    Atiquzzaman, M. and M.M. Banat, Effect of hot-spots on the performance of crossbar multiprocessor systems, Parallel Computing 19 (1993) 455-461. Previous studies on the performance of crossbar multiprocessor systems have assumed a uniform Memory Reference model. Hot-spots arising in multiprocessor systems due to the use of shared variables, synchronization primitives, etc., give rise to a nonuniform Memory Reference pattern. This paper presents an analytical expression for calculating the average Memory bandwidth of a crossbar multiprocessor system having a hot Memory module. The expression is verified by simulation results.
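
    Under the usual independence assumptions, a closed-form estimate in the spirit of the expression described above can be written down: each of N processors issues a request with probability r, directs it to the hot module with probability h and otherwise uniformly across the M modules, and bandwidth is the expected number of modules receiving at least one request. This is a textbook-style derivation offered for illustration, not necessarily the paper's exact expression.

      def crossbar_bandwidth(N, M, r=1.0, h=0.0):
          p_hot   = r * (h + (1.0 - h) / M)      # prob. a given processor hits the hot module
          p_other = r * (1.0 - h) / M            # prob. it hits a particular non-hot module
          hot_busy   = 1.0 - (1.0 - p_hot) ** N
          other_busy = 1.0 - (1.0 - p_other) ** N
          return hot_busy + (M - 1) * other_busy # expected number of busy modules per cycle

      print(f"uniform : {crossbar_bandwidth(16, 16):.2f} modules busy/cycle")
      print(f"10% hot : {crossbar_bandwidth(16, 16, h=0.10):.2f} modules busy/cycle")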

  • ISCA - Performance of Multiple-Bus Multiprocessor Under Non-Uniform Memory Reference Model
    [1992] Proceedings the 19th Annual International Symposium on Computer Architecture, 1992
    Co-Authors: M.a. Sayeed, Mohammed Atiquzzaman
    Abstract:

    Performance evaluation of multiple-bus multiprocessor systems is usually carried out under the assumption of a uniform Memory Reference model. Hot spots arising in multiprocessor systems due to the use of shared variables, synchronization primitives, etc., give rise to non-uniform Memory Reference patterns. The objective of this paper is to study the performance of a multiple-bus multiprocessor system in the presence of hot spots. Analytic expressions for the average Memory bandwidth and the probability of acceptance of prioritized processors have been derived. The analytical results are validated by simulation.

Mark D. Hill - One of the best experts on this subject based on the ideXlab platform.

  • Revisiting virtual Memory
    2013
    Co-Authors: Mark D. Hill, Michael M. Swift, Arkaprava Basu
    Abstract:

    Page-based virtual Memory (paging) is a crucial piece of Memory management in today's computing systems. However, I find that the needs, purpose and design constraints of virtual Memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently-used address translations: (a) physical Memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every Memory Reference. As a result, large workloads waste considerable execution time on TLB misses and all workloads spend energy on frequent TLB accesses. In this thesis I argue that it is now time to reevaluate virtual Memory management. I reexamine the virtual Memory subsystem in light of the ever-growing latency overhead of address translation and of energy dissipation, and develop three results. First, I propose direct segments to reduce the latency overhead of address translation for emerging big-Memory workloads. Many big-Memory workloads allocate most of their Memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB misses for many of these workloads. Second, I propose opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each Memory Reference burns significant energy, and virtual Memory's page size constrains L1-cache designs to be highly associative, burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads. Third, large pages are likely to be more appropriate than direct segments for reducing TLB misses under frequent Memory allocations/deallocations. Unfortunately, prevalent chip designs, like Intel's, statically partition TLB resources among multiple page sizes, which can lead to performance pathologies when using large pages. I propose the merged-associative TLB to avoid such pathologies and reduce the TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.

  • ISCA - Reducing Memory Reference energy with opportunistic virtual caching
    ACM SIGARCH Computer Architecture News, 2012
    Co-Authors: Arkaprava Basu, Mark D. Hill, Michael M. Swift
    Abstract:

    Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every Memory access. These designs often hide the TLB lookup latency by overlapping it with the L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some Memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with an analysis finding that virtual-cache problems exist but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. In experiments with PARSEC and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
