cycles per instruction

The experts below are selected from a list of 888 experts worldwide, ranked by the ideXlab platform.

Mark Leone - One of the best experts on this subject based on the ideXlab platform.

  • PLDI - Optimizing ML with run-time code generation
    Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation - PLDI '96, 1996
    Co-Authors: Peter Lee, Mark Leone
    Abstract:

    We describe the design and implementation of a compiler that automatically translates ordinary programs written in a subset of ML into code that generates native code at run time. Run-time code generation can make use of values and invariants that cannot be exploited at compile time, yielding code that is often superior to statically optimal code. But the cost of optimizing and generating code at run time can be prohibitive. We demonstrate how compile-time specialization can reduce the cost of run-time code generation by an order of magnitude without greatly affecting code quality. Several benchmark programs are examined, which exhibit an average cost of only six cycles per instruction generated at run time.

John Wilkes - One of the best experts on this subject based on the ideXlab platform.

  • EuroSys - CPI²: CPU performance isolation for shared compute clusters
    Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys '13, 2013
    Co-Authors: Xiao Zhang, Eric S. Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, John Wilkes
    Abstract:

    Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior. Our solution, CPI², uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI² to all of Google's shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
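    As a rough illustration of the idea in this abstract, the sketch below computes CPI from two counter readings and flags tasks whose CPI is far above that of their peers in the same job. The threshold rule (mean plus two standard deviations), the function names, and the sample numbers are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction from two hardware-counter readings."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def find_anomalous_tasks(task_cpi: dict[str, float], k: float = 2.0) -> list[str]:
    """Return tasks whose CPI exceeds mean + k * stddev across the job.

    The mean-plus-k-sigma rule is an assumed, simplified threshold.
    """
    values = list(task_cpi.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted(name for name, value in task_cpi.items() if value > threshold)


# Hypothetical (cycles, instructions) readings for eight tasks of one job.
samples = {f"task-{i}": (1_000_000 + 10_000 * i, 1_000_000) for i in range(7)}
samples["task-7"] = (3_000_000, 1_000_000)  # a task that appears to be slowed down

task_cpi = {name: cpi(c, n) for name, (c, n) in samples.items()}
suspected_victims = find_anomalous_tasks(task_cpi)
print(suspected_victims)
```

    A task flagged this way is a candidate victim; deciding which co-located task is the likely perpetrator is a separate step that this sketch does not attempt.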

Sandhya Dwarkadas - One of the best experts on this subject based on the ideXlab platform.

  • Dynamic frequency and voltage control for a multiple clock domain microarchitecture
    International Symposium on Microarchitecture, 2002
    Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott
    Abstract:

    We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in energy per instruction (EPI), a 3.2% increase in cycles per instruction (CPI), a 16.7% improvement in energy-delay product, and a power savings to performance degradation ratio of 4.6. Traditional frequency/voltage scaling techniques, which apply reductions globally to a fully synchronous processor, achieve a power savings to performance degradation ratio of only 2-3. Our energy-delay product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites using an algorithm we show to require minimal hardware resources.
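    To see roughly how the reported figures relate, the sketch below treats the energy-delay product per instruction as energy per instruction multiplied by delay, and takes the 3.2% CPI increase as the slowdown. Both are simplifying assumptions made for illustration. The abstract reports averages across benchmarks, so combining the average ratios only approximates its 16.7% figure.

```python
def edp_ratio(energy_ratio: float, delay_ratio: float) -> float:
    """Energy-delay product relative to baseline, given energy and delay ratios."""
    return energy_ratio * delay_ratio


energy_ratio = 1.0 - 0.190  # 19.0% lower energy per instruction (from the abstract)
delay_ratio = 1.0 + 0.032   # 3.2% higher CPI, assumed here to equal the slowdown

improvement_pct = (1.0 - edp_ratio(energy_ratio, delay_ratio)) * 100.0
print(f"Approximate EDP improvement: {improvement_pct:.1f}%")
```

    Under these assumptions the result is about 16.4%, close to but not identical to the per-benchmark average quoted in the abstract.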

  • Memory hierarchy reconfiguration for energy and performance in general purpose processor architectures
    International Symposium on Microarchitecture, 2000
    Co-Authors: Rajeev Balasubramonian, David H Albonesi, Alper Buyuktosunoglu, Sandhya Dwarkadas
    Abstract:

    Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 µm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 µm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

D. Bhandarkar - One of the best experts on this subject based on the ideXlab platform.

  • ISCA - Characterization of Alpha AXP performance using TP and SPEC workloads
    1994
    Co-Authors: Z. Cvetanovic, D. Bhandarkar
    Abstract:

    The characteristics of several commercial and technical workloads on the DEC 7000 AXP system are compared using built-in hardware monitors. The data analyzed include total instructions, cycles, multiple-issued instructions, stall components, cache misses, and instruction types. The data indicates that the two classes of workloads have vastly different characteristics and impose different requirements on the system design. Compared to VAX, Alpha AXP takes advantage of lower cycles per instruction and cycle time to achieve a significant performance advantage. The cache and memory interconnect subsystems are expected to play a crucial role in the performance of future systems. A simple model for evaluating the effects of various design tradeoffs based on the data collected by using hardware monitors is proposed.
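    The comparison in this abstract rests on the standard relation: execution time = instruction count × CPI × cycle time. A minimal sketch of that relation follows. The two machines and all their numbers are made up for illustration and are not measurements from the paper.

```python
def execution_time(instructions: float, cpi: float, cycle_time_ns: float) -> float:
    """Execution time in seconds: instructions * CPI * cycle time."""
    return instructions * cpi * cycle_time_ns * 1e-9


# Hypothetical machines A and B running the same program (illustrative values only).
time_a = execution_time(instructions=1e9, cpi=10.0, cycle_time_ns=20.0)
time_b = execution_time(instructions=2e9, cpi=1.5, cycle_time_ns=5.0)

speedup = time_a / time_b
print(f"A: {time_a:.1f} s, B: {time_b:.1f} s, speedup of B over A: {speedup:.1f}x")
```

    In this made-up case machine B executes twice as many instructions yet finishes sooner, because its lower CPI and shorter cycle time more than compensate.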

  • HPCA - Performance characterization of the Alpha 21164 microprocessor using TP and SPEC workloads
    Proceedings. Second International Symposium on High-Performance Computer Architecture, 1
    Co-Authors: Z. Cvetanovic, D. Bhandarkar
    Abstract:

    This paper compares the performance characteristics of the Alpha 21164 to the previous-generation 21064 microprocessor. Measurements on the 21164-based AlphaServer 8200 system are compared to the 21064-based DEC 7000 server using several commercial and technical workloads. The data analyzed includes cycles per instruction, multiple-issued instructions, branch predictions, stall components, cache misses, and instruction frequencies. The AlphaServer 8200 provides 2 to 3 times the performance of the DEC 7000 server based on the faster clock, larger on-chip cache, expanded multiple-issuing, and lower cache/memory latencies and higher bandwidth.

Michael L Scott - One of the best experts on this subject based on the ideXlab platform.

  • MICRO - Dynamic frequency and voltage control for a multiple clock domain microarchitecture
    35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), 2002
    Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott
    Abstract:

    We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in energy per instruction (EPI), a 3.2% increase in cycles per instruction (CPI), a 16.7% improvement in energy-delay product, and a power savings to performance degradation ratio of 4.6. Traditional frequency/voltage scaling techniques, which apply reductions globally to a fully synchronous processor, achieve a power savings to performance degradation ratio of only 2-3. Our energy-delay product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites using an algorithm we show to require minimal hardware resources.