cycles per instruction

The experts below are selected from a list of 888 experts worldwide ranked by the ideXlab platform.

Mark Leone - One of the best experts on this subject based on the ideXlab platform.

Optimizing ML with run-time code generation

Programming Language Design and Implementation, 1996

Co-Authors: Mark Leone

Abstract:

We describe the design and implementation of a compiler that automatically translates ordinary programs written in a subset of ML into code that generates native code at run time. Run-time code generation can make use of values and invariants that cannot be exploited at compile time, yielding code that is often superior to statically optimal code. But the cost of optimizing and generating code at run time can be prohibitive. We demonstrate how compile-time specialization can reduce the cost of run-time code generation by an order of magnitude without greatly affecting code quality. Several benchmark programs are examined, which exhibit an average cost of only six cycles per instruction generated at run time.

PLDI - Optimizing ML with run-time code generation

Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation - PLDI '96, 1996

Co-Authors: Peter Lee, Mark Leone


John Wilkes - One of the best experts on this subject based on the ideXlab platform.

EuroSys - CPI²: CPU performance isolation for shared compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys '13, 2013

Co-Authors: Xiao Zhang, Eric S. Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, John Wilkes

Abstract:

Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior. Our solution, CPI², uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI² to all of Google's shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
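The detection step described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the sample values, and the "mean plus two standard deviations" threshold are assumptions chosen for illustration. It only shows how CPI is derived from two counter readings and how a task might be compared against the behavior learned from other tasks in the same job.

```python
from statistics import mean, stdev


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction from two hardware counter readings."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def flag_outliers(reference_cpis, current_cpis, num_stddevs=2.0):
    """Return names of tasks whose CPI exceeds the learned threshold.

    reference_cpis: CPI samples aggregated from tasks of the same job.
    current_cpis:   mapping of task name -> latest CPI sample.
    num_stddevs:    hypothetical cutoff, an assumption of this sketch.
    """
    threshold = mean(reference_cpis) + num_stddevs * stdev(reference_cpis)
    return sorted(name for name, value in current_cpis.items() if value > threshold)


# Hypothetical samples: five well-behaved readings and two current tasks.
reference = [1.0, 1.1, 0.9, 1.0, 1.05]
current = {"task-a": cpi(1_020, 1_000), "task-b": cpi(2_400, 1_000)}
victims = flag_outliers(reference, current)
print(victims)
```

A flagged task is only a candidate victim; selecting the likely perpetrator and throttling it, as the abstract describes, are separate steps that this sketch does not cover.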

CPI²: CPU performance isolation for shared compute clusters

2013

Co-Authors: Xiao Zhang, Eric S. Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, John Wilkes, Google Inc


Sandhya Dwarkadas - One of the best experts on this subject based on the ideXlab platform.

Dynamic frequency and voltage control for a multiple clock domain microarchitecture

International Symposium on Microarchitecture, 2002

Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott

Abstract:

We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy per instruction (EPI), a 3.2% increase in cycles per instruction (CPI), a 16.7% improvement in Energy-Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques which apply reductions globally to a fully synchronous processor achieve a Power Savings to Performance Degradation ratio of only 2-3. Our Energy-Delay Product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites using an algorithm we show to require minimal hardware resources.

Memory hierarchy reconfiguration for energy and performance in general purpose processor architectures

International Symposium on Microarchitecture, 2000

Co-Authors: Rajeev Balasubramonian, David H Albonesi, Alper Buyuktosunoglu, Sandhya Dwarkadas

Abstract:

Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 µm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 µm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.
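The relation between the two percentages in this abstract can be made explicit with a short worked equation. This is an illustrative reading, not a result from the paper: it assumes CPI splits into a core part and a memory part, and that the reconfiguration changes only the memory part. The symbols below are introduced here for illustration.

```latex
\begin{align*}
\mathrm{CPI} &= \mathrm{CPI}_{\text{core}} + \mathrm{CPI}_{\text{mem}} \\
\Delta\mathrm{CPI} &= \Delta\mathrm{CPI}_{\text{mem}}
  \quad\text{(assumption: only the memory component changes)} \\
0.15\,\mathrm{CPI} &\approx 0.27\,\mathrm{CPI}_{\text{mem}}
  \;\Longrightarrow\;
  \frac{\mathrm{CPI}_{\text{mem}}}{\mathrm{CPI}} \approx \frac{0.15}{0.27} \approx 0.56
\end{align*}
```

Under these assumptions, memory stalls would account for roughly half of total CPI in the baseline. Because the abstract reports averages across applications, the per-application ratios need not match this figure.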

MICRO - Dynamic frequency and voltage control for a multiple clock domain microarchitecture

35th Annual IEEE ACM International Symposium on Microarchitecture 2002. (MICRO-35). Proceedings., 1

Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott


D. Bhandarkar - One of the best experts on this subject based on the ideXlab platform.

ISCA - Characterization of Alpha AXP performance using TP and SPEC workloads

1994

Co-Authors: Z. Cvetanovic, D. Bhandarkar

Abstract:

The characteristics of several commercial and technical workloads on the DEC 7000 AXP system are compared using built-in hardware monitors. The data analyzed include total instructions, cycles, multiple-issued instructions, stall components, cache misses, and instruction types. The data indicates that the two classes of workloads have vastly different characteristics and impose different requirements on the system design. Compared to VAX, Alpha AXP takes advantage of lower cycles per instruction and cycle time to achieve a significant performance advantage. The cache and memory interconnect subsystems are expected to play a crucial role in the performance of future systems. A simple model for evaluating the effects of various design tradeoffs based on the data collected by using hardware monitors is proposed.
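The remark that lower cycles per instruction and a shorter cycle time combine into a performance advantage follows from the standard relation: execution time = instruction count × CPI × cycle time. A minimal Python sketch of that relation follows. All numbers are hypothetical placeholders, not measurements from the paper, and the function names are introduced here for illustration.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, cycle_time_ns: float) -> float:
    """Execution time = instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_ns / 1e9


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate runs than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical machines running the same hypothetical workload.
# A machine with lower CPI may need more instructions for the same work,
# so the instruction counts are allowed to differ.
machine_a = cpu_time_seconds(instruction_count=1_000_000_000, cpi=8.0, cycle_time_ns=20.0)
machine_b = cpu_time_seconds(instruction_count=2_000_000_000, cpi=2.0, cycle_time_ns=5.0)
print(f"A: {machine_a:.1f} s, B: {machine_b:.1f} s, speedup: {speedup(machine_a, machine_b):.1f}x")
```

The three factors trade off against each other, which is why the abstract lists instruction counts, cycles, and stall components together instead of reporting CPI alone.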

HPCA - Performance characterization of the Alpha 21164 microprocessor using TP and SPEC workloads

Proceedings. Second International Symposium on High-Performance Computer Architecture, 1

Co-Authors: Z. Cvetanovic, D. Bhandarkar

Abstract:

This paper compares the performance characteristics of the Alpha 21164 to the previous-generation 21064 microprocessor. Measurements on the 21164-based AlphaServer 8200 system are compared to the 21064-based DEC 7000 server using several commercial and technical workloads. The data analyzed includes cycles per instruction, multiple-issued instructions, branch predictions, stall components, cache misses, and instruction frequencies. The AlphaServer 8200 provides 2 to 3 times the performance of the DEC 7000 server based on the faster clock, larger on-chip cache, expanded multiple-issuing, and lower cache/memory latencies and higher bandwidth.


Michael L Scott - One of the best experts on this subject based on the ideXlab platform.

Dynamic frequency and voltage control for a multiple clock domain microarchitecture

International Symposium on Microarchitecture, 2002

Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott

Abstract:

We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy per instruction (EPI), a 3.2% increase in cycles per instruction (CPI), a 16.7% improvement in Energy-Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques which apply reductions globally to a fully synchronous processor achieve a Power Savings to Performance Degradation ratio of only 2-3. Our Energy-Delay Product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites using an algorithm we show to require minimal hardware resources.

MICRO - Dynamic frequency and voltage control for a multiple clock domain microarchitecture

35th Annual IEEE ACM International Symposium on Microarchitecture 2002. (MICRO-35). Proceedings., 1

Co-Authors: Greg Semeraro, David H Albonesi, Steven Dropsho, Grigorios Magklis, Sandhya Dwarkadas, Michael L Scott

