Out-of-Order Processor

The Experts below are selected from a list of 981 Experts worldwide, ranked by the ideXlab platform.

Lieven Eeckhout - One of the best experts on this subject based on the ideXlab platform.

  • A performance counter architecture for computing accurate CPI components
    ACM SIGPLAN Notices, 2020
    Co-Authors: Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, James E. Smith
    Abstract:

    A common way of representing Processor performance is to use Cycles per Instruction (CPI) 'stacks', which break performance into a baseline CPI plus a number of individual miss-event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar Out-of-Order Processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions). This paper shows that meaningful and accurate CPI stacks can be computed for superscalar Out-of-Order Processors. Using interval analysis, a novel method for analyzing Out-of-Order Processor performance, we gain insight into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures, while being significantly more accurate than previous approaches.
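
    The CPI-stack idea can be made concrete with a small sketch. This is only an illustration of how miss-event cycle counters are turned into stacked CPI components; the counter names, penalty attribution and values below are assumptions, not the counter architecture proposed in the paper.

    ```python
    # Minimal sketch of building a CPI stack from (hypothetical) event counters.
    # The counter names and values are illustrative, not the paper's architecture.

    def cpi_stack(total_cycles, instructions, stall_cycles_by_event):
        """Split total CPI into a base component plus one component per miss event."""
        total_cpi = total_cycles / instructions
        components = {event: cycles / instructions
                      for event, cycles in stall_cycles_by_event.items()}
        components["base"] = total_cpi - sum(components.values())
        return total_cpi, components

    counters = {
        "L2_miss": 1_200_000,        # cycles attributed to long-latency cache misses
        "TLB_miss": 150_000,         # cycles attributed to TLB misses
        "branch_mispredict": 400_000,
    }
    total, stack = cpi_stack(total_cycles=5_000_000, instructions=3_000_000,
                             stall_cycles_by_event=counters)
    for name, cpi in sorted(stack.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {cpi:.3f}")
    print(f"{'total CPI':18s} {total:.3f}")
    ```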

  • Analytical Processor Performance and Power Modeling Using Micro-Architecture Independent Characteristics
    IEEE Transactions on Computers, 2016
    Co-Authors: Sam Van Den Steen, Stijn Eyerman, Trevor E. Carlson, Sander De Pestel, Moncef Mechri, David Black-schaffer, Erik Hagersten, Lieven Eeckhout
    Abstract:

    Optimizing Processors for one or more specific applications can substantially improve energy efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific Processors requires fast design space exploration tools to optimize for the targeted applications. Analytical models can be a good fit for such design space exploration, as they provide fast performance and power estimates and insight into the interaction between an application's characteristics and the micro-architecture of a Processor. Unfortunately, prior analytical models for superscalar Out-of-Order Processors require micro-architecture-dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration of interest, which is far more time-consuming than evaluating the analytical performance models. In this work we present a micro-architecture-independent profiler and associated analytical models that allow us to produce performance and power estimates across a large superscalar Out-of-Order Processor design space almost instantaneously. We show that using a micro-architecture-independent profile leads to a speedup of 300x compared to detailed simulation for our evaluated design space. Over a large design space, the model has a 9.3 percent average error for performance and a 4.3 percent average error for power, compared to detailed cycle-level simulation. The model is able to accurately determine the optimal Processor configuration for different applications under power or performance constraints, and provides insight into performance through cycle stacks.
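
    To give a feel for what a mechanistic analytical model computes, here is a deliberately simplified first-order estimate in the spirit of interval modeling: an ideal dispatch-limited component plus additive miss-event penalties, with memory-level parallelism overlapping long-latency misses. The parameter names and numbers are illustrative assumptions; the paper's model and its micro-architecture-independent profiler are far more detailed.

    ```python
    # A much-simplified first-order mechanistic performance estimate, in the spirit of
    # interval-style analytical models. All parameters are illustrative assumptions.

    def estimate_cycles(instructions, dispatch_width,
                        branch_misses, branch_penalty,
                        l2_misses, mem_latency, mlp):
        base = instructions / dispatch_width        # ideal, no miss events
        branch = branch_misses * branch_penalty     # pipeline refill after each flush
        memory = (l2_misses / mlp) * mem_latency    # misses overlapped by MLP
        return base + branch + memory

    cycles = estimate_cycles(instructions=3_000_000, dispatch_width=4,
                             branch_misses=20_000, branch_penalty=15,
                             l2_misses=30_000, mem_latency=200, mlp=2.0)
    print(f"estimated CPI: {cycles / 3_000_000:.2f}")
    ```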

  • A first-order mechanistic model for architectural vulnerability factor
    2012 39th Annual International Symposium on Computer Architecture (ISCA), 2012
    Co-Authors: Arun Arvind Nair, Stijn Eyerman, Lieven Eeckhout, Lizy Kurian John
    Abstract:

    Soft error reliability has become a first-order design criterion for modern microprocessors. Architectural Vulnerability Factor (AVF) modeling is often used to capture the probability that a radiation-induced fault in a hardware structure will manifest as an error at the program output. AVF estimation requires detailed microarchitectural simulations, which are time-consuming and typically present only aggregate metrics. Moreover, it requires a large number of simulations to derive insight into the impact of microarchitectural events on AVF. In this work we present a first-order mechanistic analytical model for computing AVF by estimating the occupancy of correct-path state in important microarchitecture structures through inexpensive profiling. We show that the model estimates the AVF for the reorder buffer, issue queue, load and store queue, and functional units in a 4-wide issue machine with a mean absolute error of less than 0.07. The model is constructed from the first principles of Out-of-Order Processor execution in order to provide novel insight into the interaction of the workload with the microarchitecture in determining AVF. We demonstrate that the model can be used to perform design space explorations to understand trade-offs between soft error rate and performance, to study the impact of scaling microarchitectural structures on AVF and performance, and to characterize workloads for AVF.
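
    The occupancy-based intuition behind such a first-order AVF model can be sketched with Little's law: the average number of ACE (architecturally correct execution) entries resident in a structure is the ACE arrival rate times the average residency, and AVF is that occupancy divided by the structure size. The concrete inputs below are illustrative assumptions, not the paper's model.

    ```python
    # Sketch of an occupancy-based AVF estimate for a single structure.
    # The inputs are illustrative assumptions, not values from the paper.

    def avf_estimate(ace_arrival_rate, avg_residency_cycles, structure_size):
        """Little's law: average occupancy = arrival rate x residence time.
        AVF is the fraction of the structure holding ACE (correct-path, needed) state."""
        occupancy = ace_arrival_rate * avg_residency_cycles
        return min(occupancy / structure_size, 1.0)

    # e.g. a 128-entry ROB receiving ~2.5 correct-path ACE micro-ops per cycle,
    # each resident for ~30 cycles on average
    print(f"ROB AVF ~ {avf_estimate(2.5, 30, 128):.2f}")
    ```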

  • AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors
    2010 43rd Annual IEEE ACM International Symposium on Microarchitecture, 2010
    Co-Authors: Arun Arvind Nair, Lizy Kurian John, Lieven Eeckhout
    Abstract:

    Soft error reliability is increasingly becoming a first-order design concern for microprocessors, as a result of higher transistor counts, shrinking device geometries and lower operating voltages. It is important for designers to be able to validate whether the Soft Error Rate (SER) targets of their design have been met, and to help end users select the Processor best suited to their reliability goals. Knowledge of the observable worst-case SER allows designers to select their design point, and to bound the worst-case vulnerability at that design point. We highlight the lack of a methodology for evaluating the overall observable worst-case SER. Hence, there is a clear need for a so-called stressmark that can demonstrably approach the observable worst-case SER. The worst case thus obtained can be used to identify reliability bottlenecks, validate the safety margins used for reliability design, and identify inadequacies in benchmark suites used to evaluate SER. Starting from a comprehensive study of how microarchitecture-dependent program characteristics affect soft errors, we derive the insights needed to develop an automated and flexible methodology for generating a stressmark that approaches the maximum SER of an Out-of-Order Processor. We demonstrate how our methodology enables architects to quantify the impact of SER-mitigation mechanisms on the worst-case SER of the Processor. The stressmark achieves 1.4X higher SER in the core, 2.5X higher SER in the DL1 and DTLB, and 1.5X higher SER in the L2, compared to the highest SER induced by SPEC CPU2006 and MiBench programs.
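
    The accounting behind statements such as "1.4X higher SER in the core" is typically the standard product of a raw, circuit-level FIT rate and the AVF of each structure, summed over structures. The sketch below uses made-up FIT rates and AVFs purely for illustration.

    ```python
    # Standard soft-error accounting: per-structure SER = raw FIT rate x AVF,
    # summed over structures. All numbers below are made up for illustration.

    raw_fit_per_bit = 0.001          # assumed raw circuit-level FIT per bit
    structures = {
        #          bits,     AVF under some workload
        "ROB":  (128 * 80,   0.35),
        "IQ":   (64 * 60,    0.25),
        "LSQ":  (48 * 100,   0.30),
    }
    ser = sum(bits * raw_fit_per_bit * avf for bits, avf in structures.values())
    print(f"estimated SER: {ser:.2f} FIT")
    ```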

  • HiPEAC - Studying compiler optimizations on superscalar Processors through interval analysis
    High Performance Embedded Architectures and Compilers, 2008
    Co-Authors: Stijn Eyerman, Lieven Eeckhout, James E. Smith
    Abstract:

    Understanding the performance impact of compiler optimizations on superscalar Processors is complicated because compiler optimizations interact with the microarchitecture in complex ways. This paper analyzes this interaction using interval analysis, an analytical Processor model that allows total execution time to be broken into cycle components. By studying the impact of compiler optimizations on the various cycle components, one can gain insight into how compiler optimizations affect Out-of-Order Processor performance. The analysis provided in this paper reveals several interesting insights and suggests directions for future work on compiler optimizations for Out-of-Order Processors. In addition, we contrast the effect compiler optimizations have on Out-of-Order versus in-order Processors.

Stijn Eyerman - One of the best experts on this subject based on the ideXlab platform.

  • A performance counter architecture for computing accurate CPI components
    ACM SIGPLAN Notices, 2020
    Co-Authors: Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, James E. Smith
    Abstract:

    A common way of representing Processor performance is to use Cycles per Instruction (CPI) 'stacks', which break performance into a baseline CPI plus a number of individual miss-event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar Out-of-Order Processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions). This paper shows that meaningful and accurate CPI stacks can be computed for superscalar Out-of-Order Processors. Using interval analysis, a novel method for analyzing Out-of-Order Processor performance, we gain insight into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures, while being significantly more accurate than previous approaches.

  • Breaking In-Order Branch Miss Recovery
    IEEE Computer Architecture Letters, 2020
    Co-Authors: Stijn Eyerman, Wim Heirman, Sam Van Den Steen
    Abstract:

    Despite very accurate branch predictors, branch misses remain an important performance limiter, especially for irregular applications. To ensure in-order commit, branch miss recovery is done in order: all instructions after the oldest branch miss are flushed, even if they eventually reconverge with the correct path. We propose a technique to limit flushing to real wrong-path instructions only, allowing the resolution of newer branch misses while an older one is not yet resolved. Our technique involves minimal additions to a conventional Out-of-Order Processor, by reusing existing checkpoint mechanisms and relying on programmer/compiler-inserted hints to detect data and control independence. We evaluate the technique on graph benchmarks, resulting in up to a 2× increase in performance.
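
    The core idea, that only instructions between a mispredicted branch and its reconvergence point are truly wrong-path, can be illustrated with a toy sketch. The in-flight-window representation and the compiler-supplied reconvergence PC below are invented for illustration; the paper's actual mechanism reuses existing checkpoints and hints rather than this data structure.

    ```python
    # Toy illustration of control-independence-aware recovery. Given a mispredicted
    # branch and a (compiler-supplied) reconvergence PC, only instructions fetched
    # before the reconvergence point are wrong-path; later, data-independent
    # instructions could be preserved instead of flushed.

    def split_on_reconvergence(in_flight, mispredicted_branch_idx, reconvergence_pc):
        """Return (instructions to squash, instructions that are control-independent)."""
        younger = in_flight[mispredicted_branch_idx + 1:]
        for i, (pc, _) in enumerate(younger):
            if pc == reconvergence_pc:
                return younger[:i], younger[i:]
        return younger, []   # never reconverged in the window: conventional full flush

    window = [(0x400, "br"), (0x404, "add"), (0x408, "mul"),
              (0x420, "load"), (0x424, "add")]
    squash, keep = split_on_reconvergence(window, mispredicted_branch_idx=0,
                                          reconvergence_pc=0x420)
    print("squash:", squash)
    print("keep  :", keep)
    ```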

  • Analytical Processor Performance and Power Modeling Using Micro-Architecture Independent Characteristics
    IEEE Transactions on Computers, 2016
    Co-Authors: Sam Van Den Steen, Stijn Eyerman, Trevor E. Carlson, Sander De Pestel, Moncef Mechri, David Black-schaffer, Erik Hagersten, Lieven Eeckhout
    Abstract:

    Optimizing Processors for one or more specific applications can substantially improve energy efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific Processors requires fast design space exploration tools to optimize for the targeted applications. Analytical models can be a good fit for such design space exploration, as they provide fast performance and power estimates and insight into the interaction between an application's characteristics and the micro-architecture of a Processor. Unfortunately, prior analytical models for superscalar Out-of-Order Processors require micro-architecture-dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration of interest, which is far more time-consuming than evaluating the analytical performance models. In this work we present a micro-architecture-independent profiler and associated analytical models that allow us to produce performance and power estimates across a large superscalar Out-of-Order Processor design space almost instantaneously. We show that using a micro-architecture-independent profile leads to a speedup of 300x compared to detailed simulation for our evaluated design space. Over a large design space, the model has a 9.3 percent average error for performance and a 4.3 percent average error for power, compared to detailed cycle-level simulation. The model is able to accurately determine the optimal Processor configuration for different applications under power or performance constraints, and provides insight into performance through cycle stacks.

  • A first-order mechanistic model for architectural vulnerability factor
    2012 39th Annual International Symposium on Computer Architecture (ISCA), 2012
    Co-Authors: Arun Arvind Nair, Stijn Eyerman, Lieven Eeckhout, Lizy Kurian John
    Abstract:

    Soft error reliability has become a first-order design criterion for modern microprocessors. Architectural Vulnerability Factor (AVF) modeling is often used to capture the probability that a radiation-induced fault in a hardware structure will manifest as an error at the program output. AVF estimation requires detailed microarchitectural simulations, which are time-consuming and typically present only aggregate metrics. Moreover, it requires a large number of simulations to derive insight into the impact of microarchitectural events on AVF. In this work we present a first-order mechanistic analytical model for computing AVF by estimating the occupancy of correct-path state in important microarchitecture structures through inexpensive profiling. We show that the model estimates the AVF for the reorder buffer, issue queue, load and store queue, and functional units in a 4-wide issue machine with a mean absolute error of less than 0.07. The model is constructed from the first principles of Out-of-Order Processor execution in order to provide novel insight into the interaction of the workload with the microarchitecture in determining AVF. We demonstrate that the model can be used to perform design space explorations to understand trade-offs between soft error rate and performance, to study the impact of scaling microarchitectural structures on AVF and performance, and to characterize workloads for AVF.

  • HiPEAC - Studying compiler optimizations on superscalar Processors through interval analysis
    High Performance Embedded Architectures and Compilers, 2008
    Co-Authors: Stijn Eyerman, Lieven Eeckhout, James E. Smith
    Abstract:

    Understanding the performance impact of compiler optimizations on superscalar Processors is complicated because compiler optimizations interact with the microarchitecture in complex ways. This paper analyzes this interaction using interval analysis, an analytical Processor model that allows total execution time to be broken into cycle components. By studying the impact of compiler optimizations on the various cycle components, one can gain insight into how compiler optimizations affect Out-of-Order Processor performance. The analysis provided in this paper reveals several interesting insights and suggests directions for future work on compiler optimizations for Out-of-Order Processors. In addition, we contrast the effect compiler optimizations have on Out-of-Order versus in-order Processors.

Diana Marculescu - One of the best experts on this subject based on the ideXlab platform.

  • ICCAD - Power efficiency of voltage scaling in multiple clock multiple voltage cores
    Proceedings of the 2002 IEEE ACM international conference on Computer-aided design - ICCAD '02, 2002
    Co-Authors: A Iyer, Diana Marculescu
    Abstract:

    Due to increasing clock speeds, increasing design sizes and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous Locally Synchronous (GALS) organization for a superscalar Out-of-Order Processor, both in terms of power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment we were able to assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV Processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independently scaling the voltage of the various clock domains dynamically, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Our results show that MCDV cores consume 22% less power at an average 12% performance loss.
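
    The leverage of per-domain voltage scaling comes from the first-order dynamic-power relation P ≈ αCV²f: a lightly loaded clock domain can run at a lower voltage and frequency without slowing the critical domain. The domain parameters in the sketch below are illustrative assumptions, not values from the paper.

    ```python
    # First-order dynamic power of independently scaled clock domains:
    # P ~ alpha * C * V^2 * f. Domain parameters are illustrative assumptions.

    def dynamic_power(activity, capacitance, voltage, frequency):
        return activity * capacitance * voltage**2 * frequency

    domains = {
        #             activity, C (F), V (V), f (Hz)
        "front_end": (0.3,      2e-9,  1.2,   2.0e9),
        "exec_core": (0.5,      3e-9,  1.2,   2.0e9),
        "memory":    (0.2,      2e-9,  0.9,   1.0e9),   # scaled down independently
    }
    total = sum(dynamic_power(*params) for params in domains.values())
    print(f"total dynamic power: {total:.2f} W")
    ```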

  • Power efficiency of voltage scaling in multiple clock multiple voltage cores
    IEEE ACM International Conference on Computer Aided Design 2002. ICCAD 2002., 2002
    Co-Authors: A Iyer, Diana Marculescu
    Abstract:

    Due to increasing clock speeds, increasing design sizes and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous Locally Synchronous (GALS) organization for a superscalar Out-of-Order Processor, both in terms of power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment we were able to assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV Processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independently scaling the voltage of the various clock domains dynamically, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Our results show that MCDV cores consume 22% less power at an average 12% performance loss.

A Iyer - One of the best experts on this subject based on the ideXlab platform.

  • ICCAD - Power efficiency of voltage scaling in multiple clock multiple voltage cores
    Proceedings of the 2002 IEEE ACM international conference on Computer-aided design - ICCAD '02, 2002
    Co-Authors: A Iyer, Diana Marculescu
    Abstract:

    Due to increasing clock speeds, increasing design sizes and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous Locally Synchronous (GALS) organization for a superscalar Out-of-Order Processor, both in terms of power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment we were able to assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV Processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independently scaling the voltage of the various clock domains dynamically, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Our results show that MCDV cores consume 22% less power at an average 12% performance loss.

  • Power efficiency of voltage scaling in multiple clock multiple voltage cores
    IEEE ACM International Conference on Computer Aided Design 2002. ICCAD 2002., 2002
    Co-Authors: A Iyer, Diana Marculescu
    Abstract:

    Due to increasing clock speeds, increasing design sizes and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous Locally Synchronous (GALS) organization for a superscalar Out-of-Order Processor, both in terms of power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment we were able to assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV Processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independently scaling the voltage of the various clock domains dynamically, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Our results show that MCDV cores consume 22% less power at an average 12% performance loss.

Antonio González - One of the best experts on this subject based on the ideXlab platform.

  • SBAC-PAD - A Power-Efficient Co-designed Out-of-Order Processor
    2011 23rd International Symposium on Computer Architecture and High Performance Computing, 2011
    Co-Authors: Josep Maria Codina, Antonio González
    Abstract:

    A co-designed Processor helps cut down both complexity and power consumption by co-designing certain key performance enablers. In this paper, we propose a FIFO-based co-designed Out-of-Order Processor. Multiple FIFOs are added in order to dynamically schedule the micro-ops in a complexity-effective manner. We propose commit logic that commits the program state atomically as a superblock commits. This enables us to get rid of the Reorder Buffer (ROB) entirely. Instead, to maintain the correct program state, we propose a four-/eight-entry Superblock Ordering Buffer (SOB). We also propose a per-superblock Register Rename Table (SRRT) that holds the register state pertaining to the superblock. Our proposed Processor dissipates 6% less power and obtains a 12% speedup for SPECFP; as a result, it consumes less energy. Furthermore, we propose an enhanced steering heuristic and an early release mechanism to increase the performance of a FIFO-based Out-of-Order Processor. We obtain performance improvements of nearly 25% and 70% for four-FIFO and two-FIFO configurations, respectively. We also show that a Processor using our proposed steering heuristic consumes 10% less energy than one using the previously proposed steering heuristic.
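
    One plausible way to steer micro-ops into multiple FIFOs is dependence-based steering: place a micro-op behind the producer of one of its sources so that dependent chains stay in program order within a single FIFO, and start a new chain in an empty FIFO otherwise. The sketch below illustrates that baseline idea; it is not the enhanced steering heuristic or early-release mechanism proposed in the paper.

    ```python
    # Illustrative dependence-based steering for FIFO-based scheduling: dependent
    # chains share a FIFO, independent chains start new FIFOs. Not the paper's
    # enhanced heuristic; purely a sketch of the baseline idea.

    def steer(fifos, uop_dst, uop_srcs):
        """fifos: list of lists of destination registers currently queued (oldest first)."""
        for fifo in fifos:
            if fifo and fifo[-1] in uop_srcs:   # tail produces one of our sources
                fifo.append(uop_dst)
                return fifo
        for fifo in fifos:
            if not fifo:                        # otherwise start a new chain
                fifo.append(uop_dst)
                return fifo
        fifos[0].append(uop_dst)                # all FIFOs busy: fall back to FIFO 0
        return fifos[0]

    fifos = [[], [], [], []]
    for dst, srcs in [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r2", "r3"])]:
        steer(fifos, dst, srcs)
    print(fifos)   # dependent chain r1 -> r2 -> r4 shares a FIFO; r3 starts its own
    ```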

  • Interaction between Compilers and Computer Architectures - A Co-designed HW/SW Approach to General Purpose Program Acceleration Using a Programmable Functional Unit
    2011 15th Workshop on Interaction between Compilers and Computer Architectures, 2011
    Co-Authors: Josep M. Codina, Antonio González
    Abstract:

    In this paper, we propose a novel programmable functional unit (PFU) to accelerate general-purpose application execution on a modern Out-of-Order x86 Processor in a complexity-effective way. Code is transformed and instructions are generated to run on the PFU using a co-designed virtual machine (Cd-VM). Groups of frequently executed micro-operations (micro-ops) are identified and fused into a macro-op (MOP) by the Cd-VM. The MOPs are executed on the PFU. Results presented in this paper show that this HW/SW co-designed approach produces average speedups of 17% in SPECFP and 10% in SPECINT, and up to 33%, over a modern Out-of-Order Processor. Moreover, we also show that the proposed scheme not only outperforms dynamic vectorization using SIMD accelerators but also outperforms an 8-wide-issue Out-of-Order Processor.
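
    A co-designed VM of this kind typically starts by finding hot, dependent micro-op sequences worth fusing. The sketch below shows one simple way to count adjacent dependent micro-op pairs and pick fusion candidates; the trace representation, dependence test and threshold are assumptions for illustration, not the Cd-VM described in the paper.

    ```python
    # Sketch of identifying macro-op (MOP) fusion candidates: count hot, adjacent,
    # dependent micro-op pairs in a trace and keep the most frequent ones.
    # Trace format, dependence test and threshold are illustrative assumptions.

    from collections import Counter

    def fusion_candidates(trace, min_count=2):
        """trace: list of (opcode, dst, srcs). A pair qualifies if the second uop
        consumes the first uop's result (simple adjacency + dependence test)."""
        pairs = Counter()
        for (op1, dst1, _), (op2, _, srcs2) in zip(trace, trace[1:]):
            if dst1 in srcs2:
                pairs[(op1, op2)] += 1
        return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

    hot_trace = [("load", "r1", ["r9"]), ("add", "r2", ["r1", "r3"]),
                 ("load", "r4", ["r9"]), ("add", "r5", ["r4", "r3"]),
                 ("mul",  "r6", ["r5", "r2"])]
    print(fusion_candidates(hot_trace))   # ('load', 'add') pairs dominate
    ```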

  • A Power-Efficient Co-designed Out-of-Order Processor
    2011 23rd International Symposium on Computer Architecture and High Performance Computing, 2011
    Co-Authors: Josep Maria Codina, Antonio González
    Abstract:

    A co-designed Processor helps cut down both complexity and power consumption by co-designing certain key performance enablers. In this paper, we propose a FIFO-based co-designed Out-of-Order Processor. Multiple FIFOs are added in order to dynamically schedule the micro-ops in a complexity-effective manner. We propose commit logic that commits the program state atomically as a superblock commits. This enables us to get rid of the Reorder Buffer (ROB) entirely. Instead, to maintain the correct program state, we propose a four-/eight-entry Superblock Ordering Buffer (SOB). We also propose a per-superblock Register Rename Table (SRRT) that holds the register state pertaining to the superblock. Our proposed Processor dissipates 6% less power and obtains a 12% speedup for SPECFP; as a result, it consumes less energy. Furthermore, we propose an enhanced steering heuristic and an early release mechanism to increase the performance of a FIFO-based Out-of-Order Processor. We obtain performance improvements of nearly 25% and 70% for four-FIFO and two-FIFO configurations, respectively. We also show that a Processor using our proposed steering heuristic consumes 10% less energy than one using the previously proposed steering heuristic.

  • A Co-designed HW/SW Approach to General Purpose Program Acceleration Using a Programmable Functional Unit
    2011 15th Workshop on Interaction between Compilers and Computer Architectures, 2011
    Co-Authors: Josep Maria Codina, Antonio González
    Abstract:

    In this paper, we propose a novel programmable functional unit (PFU) to accelerate general-purpose application execution on a modern Out-of-Order x86 Processor in a complexity-effective way. Code is transformed and instructions are generated to run on the PFU using a co-designed virtual machine (Cd-VM). Groups of frequently executed micro-operations (micro-ops) are identified and fused into a macro-op (MOP) by the Cd-VM. The MOPs are executed on the PFU. Results presented in this paper show that this HW/SW co-designed approach produces average speedups of 17% in SPECFP and 10% in SPECINT, and up to 33%, over a modern Out-of-Order Processor. Moreover, we also show that the proposed scheme not only outperforms dynamic vectorization using SIMD accelerators but also outperforms an 8-wide-issue Out-of-Order Processor.