Deep Pipeline

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 171 Experts worldwide ranked by the ideXlab platform

Yale N Patt - One of the best experts on this subject based on the ideXlab platform.

  • two level adaptive training branch prediction
    International Symposium on Microarchitecture, 1991
    Co-Authors: Tse-yu Yeh, Yale N Patt
    Abstract:

    High-performance microarchitectures use, among other structures, Deep Pipelines to help speed up execution. The importance of a good branch predictor to the effectiveness of a Deep Pipeline in the presence of conditional branches is well-known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions. This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run-time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, compared to less than 93 percent for other schemes. Since a prediction miss requires flushing of the speculative execution already in progress, the relevant metric is the miss rate. The miss rate is 3 percent for the Two-Level Adaptive Training scheme vs. 7 percent (best case) for the other schemes. This represents more than a 100 percent improvement in reducing the number of Pipeline flushes required.
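
    The two-level idea described above can be sketched as a toy predictor: a per-branch history register selects a 2-bit saturating counter in a pattern history table. The table size, history length, and repeating branch pattern below are illustrative assumptions, not the paper's configurations.

```python
# Toy two-level adaptive branch predictor (illustrative sketch only;
# sizes and parameters are assumptions, not the paper's configurations).

class TwoLevelPredictor:
    """A per-branch history register indexes a pattern history table
    of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = {}                     # branch PC -> history register
        self.pht = [2] * (1 << history_bits)  # counters start weakly taken

    def predict(self, pc):
        h = self.history.get(pc, 0)
        return self.pht[h] >= 2               # counter >= 2 means predict taken

    def update(self, pc, taken):
        h = self.history.get(pc, 0)
        if taken:
            self.pht[h] = min(3, self.pht[h] + 1)
        else:
            self.pht[h] = max(0, self.pht[h] - 1)
        mask = (1 << self.history_bits) - 1
        self.history[pc] = ((h << 1) | int(taken)) & mask  # shift in outcome

p = TwoLevelPredictor()
outcomes = [True, True, False] * 20           # repeating taken-taken-not pattern
hits = 0
for taken in outcomes:
    hits += p.predict(0x400) == taken
    p.update(0x400, taken)
print(hits, len(outcomes))
```

After a short warm-up the history register distinguishes each phase of the repeating pattern, so the predictor learns it almost perfectly.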

  • MICRO - Two-level adaptive training branch prediction
    Proceedings of the 24th annual international symposium on Microarchitecture - MICRO 24, 1991
    Co-Authors: Tse-yu Yeh, Yale N Patt
    Abstract:

    High-performance microarchitectures use, among other structures, Deep Pipelines to help speed up execution. The importance of a good branch predictor to the effectiveness of a Deep Pipeline in the presence of conditional branches is well-known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions. This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run-time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, compared to less than 93 percent for other schemes. Since a prediction miss requires flushing of the speculative execution already in progress, the relevant metric is the miss rate. The miss rate is 3 percent for the Two-Level Adaptive Training scheme vs. 7 percent (best case) for the other schemes. This represents more than a 100 percent improvement in reducing the number of Pipeline flushes required.

André Seznec - One of the best experts on this subject based on the ideXlab platform.

  • design tradeoffs for the alpha ev8 conditional branch predictor
    International Symposium on Computer Architecture, 2002
    Co-Authors: André Seznec, Stephen Felix, Venkata Krishnan, Yiannakis Sazeides
    Abstract:

    This paper presents the Alpha EV8 conditional branch predictor. The Alpha EV8 microprocessor project, canceled in June 2001 in a late phase of development, envisioned an aggressive 8-wide issue out-of-order superscalar microarchitecture featuring a very Deep Pipeline and simultaneous multithreading. Performance of such a processor is highly dependent on the accuracy of its branch predictor, and consequently a very large silicon area was devoted to branch prediction on EV8. The Alpha EV8 branch predictor relies on global history and features a total of 352 Kbits. The focus of this paper is on the different trade-offs performed to overcome various implementation constraints for the EV8 branch predictor. One such instance is the pipelining of the predictor over two cycles to facilitate the prediction of up to 16 branches per cycle from any two dynamically successive 8-instruction fetch blocks. This resulted in the use of three-fetch-block-old compressed branch history information for accessing the predictor. Implementation constraints also restricted the composition of the index functions for the predictor and forced the usage of only single-ported memory cells. Nevertheless, we show that the Alpha EV8 branch predictor achieves prediction accuracy in the same range as state-of-the-art academic global history branch predictors that do not consider implementation constraints in great detail.
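
    As a rough illustration of the global-history approach discussed above, the sketch below implements a generic gshare-style predictor (branch PC XORed with a global history register indexes a table of 2-bit counters). This is a textbook scheme chosen for simplicity, not the EV8's actual organization or sizes.

```python
# Generic gshare-style global-history predictor (illustrative sketch;
# NOT the EV8's actual design, which used a different organization).

TABLE_BITS = 12
ghr = 0                                    # global history register
table = [2] * (1 << TABLE_BITS)            # 2-bit counters, weakly taken

def predict(pc):
    idx = (pc ^ ghr) & ((1 << TABLE_BITS) - 1)   # fold history into the index
    return table[idx] >= 2

def update(pc, taken):
    global ghr
    idx = (pc ^ ghr) & ((1 << TABLE_BITS) - 1)
    table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    ghr = ((ghr << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)

# A loop branch taken 9 times, then not taken once, repeated:
trace = ([True] * 9 + [False]) * 30
correct = 0
for taken in trace:
    correct += predict(0x7c4) == taken
    update(0x7c4, taken)
print(correct, len(trace))
```

Because the history register gives each phase of the loop a distinct index, the exit branch becomes predictable once its counter is trained.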

  • ISCA - Design tradeoffs for the alpha EV8 conditional branch predictor
    ACM SIGARCH Computer Architecture News, 2002
    Co-Authors: André Seznec, Stephen Felix, Venkata Krishnan, Yiannakis Sazeides
    Abstract:

    This paper presents the Alpha EV8 conditional branch predictor. The Alpha EV8 microprocessor project, canceled in June 2001 in a late phase of development, envisioned an aggressive 8-wide issue out-of-order superscalar microarchitecture featuring a very Deep Pipeline and simultaneous multithreading. Performance of such a processor is highly dependent on the accuracy of its branch predictor, and consequently a very large silicon area was devoted to branch prediction on EV8. The Alpha EV8 branch predictor relies on global history and features a total of 352 Kbits. The focus of this paper is on the different trade-offs performed to overcome various implementation constraints for the EV8 branch predictor. One such instance is the pipelining of the predictor over two cycles to facilitate the prediction of up to 16 branches per cycle from any two dynamically successive 8-instruction fetch blocks. This resulted in the use of three-fetch-block-old compressed branch history information for accessing the predictor. Implementation constraints also restricted the composition of the index functions for the predictor and forced the usage of only single-ported memory cells. Nevertheless, we show that the Alpha EV8 branch predictor achieves prediction accuracy in the same range as state-of-the-art academic global history branch predictors that do not consider implementation constraints in great detail.

  • MICRO - MIDEE: smoothing branch and instruction cache miss penalties on Deep Pipelines
    1993
    Co-Authors: Nathalie Drach, André Seznec
    Abstract:

    Pipelining is a major technique used in high-performance processors, but its effectiveness is reduced by branch instructions. A new organization for implementing branch instructions is presented: the Multiple Instruction Decode Effective Execution (MIDEE) organization. All Pipeline depths may be addressed using this organization. MIDEE is based on the use of double fetch and decode, early computation of the target address for branch instructions, and two instruction queues. The double fetch-decode concerns a pair of instructions stored at consecutive addresses. These instructions are then decoded simultaneously, but no execution hardware is duplicated; only useful instructions are effectively executed. A pair of instruction queues is used between the fetch-decode stages and the execution stages; this allows hiding the branch penalty and most of the instruction cache miss penalty. Trace-driven simulations show that the performance of a Deep Pipeline processor may be dramatically improved when the MIDEE organization is implemented: the branch penalty is reduced, and the Pipeline stall delay due to instruction cache misses is also decreased.
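
    The decoupling idea the abstract relies on, a queue between the fetch-decode stages and the execute stages that buffers instructions across fetch stalls, can be illustrated with a toy cycle-level model. All parameters below (fetch width, miss frequency, miss penalty, queue size) are invented for illustration and are not the paper's values.

```python
# Toy model of decoupling fetch from execute with an instruction queue.
# All numbers are illustrative assumptions, not the paper's parameters.
from collections import deque

def run(queue_size, miss_every=8, miss_penalty=3, n_instr=100):
    """Cycles to execute n_instr: fetch 2 instructions per cycle into a
    bounded queue; every miss_every-th fetch stalls fetching for
    miss_penalty cycles (a simulated I-cache miss); execute 1 per cycle."""
    q = deque()
    fetched = executed = cycles = stall = 0
    while executed < n_instr:
        cycles += 1
        # Fetch stage: up to two instructions, unless stalled or queue full.
        if stall > 0:
            stall -= 1
        else:
            for _ in range(2):
                if fetched < n_instr and len(q) < queue_size:
                    fetched += 1
                    q.append(fetched)
                    if fetched % miss_every == 0:
                        stall = miss_penalty
        # Execute stage: one instruction per cycle if one is available.
        if q:
            q.popleft()
            executed += 1
    return cycles

print(run(queue_size=1), run(queue_size=8))
```

With a deeper queue, the surplus fetch bandwidth builds a backlog that keeps the execute stage busy through fetch stalls, so the total cycle count drops.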

  • MIDEE: smoothing branch and instruction cache miss penalties on Deep Pipelines
    1993
    Co-Authors: Nathalie Drach, André Seznec
    Abstract:

    Pipelining is a major technique used in high-performance processors, but a fundamental drawback of pipelining is the time lost due to branch instructions. A new organization for implementing branch instructions is presented: the Multiple Instruction Decode Effective Execution (MIDEE) organization. All Pipeline depths may be addressed using this organization. MIDEE is based on the use of double fetch and decode, early computation of the target address for branch instructions, and two instruction queues. The double fetch-decode concerns a pair of instructions stored at consecutive addresses. These instructions are then decoded simultaneously, but no execution hardware is duplicated; only useful instructions are effectively executed. A pair of instruction queues is used between the fetch-decode stages and the execution stages; this allows hiding the branch penalty and most of the instruction cache miss penalty. Trace-driven simulations show that the performance of a Deep Pipeline processor may be dramatically improved when the MIDEE organization is implemented: the branch penalty is reduced, and the Pipeline stall delay due to instruction cache misses is also decreased.

Shaochong Zhang - One of the best experts on this subject based on the ideXlab platform.

  • FPGA - Understanding Performance Differences of FPGAs and GPUs: (Abstract Only)
    Proceedings of the 2018 ACM SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, 2018
    Co-Authors: Jason Cong, Michael Lo, Zhenman Fang, Jingxian Xu, Hanrui Wang, Shaochong Zhang
    Abstract:

    The notorious power wall has significantly limited the scaling of general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, have emerged to achieve better performance and energy efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited to FPGAs, which to GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and provide more insights to the community. We intentionally start with a widely used GPU-friendly benchmark suite, Rodinia, and port 11 of the benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis C. We provide a simple five-step strategy for FPGA accelerator designs that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. Then we propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each Pipeline, and effective parallel factor (effective_para_factor), to compare the performance of GPU and FPGA accelerator designs. We find that for 6 out of the 15 kernels, today's FPGAs can provide comparable or even better performance, while consuming only about 1/10 of the GPUs' power (both on the same technology node). We observe that FPGAs usually have higher OPC_norm in most kernels in light of their customized Deep Pipelines but lower effective_para_factor due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.
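
    One plausible reading of the two metrics named above can be sketched as follows. The formulas and all numbers are assumptions made for illustration from the abstract's wording; the paper's exact definitions and measurements may differ.

```python
# Sketch of the two metrics described above, under an assumed reading.
# All numbers are hypothetical, not measurements from the paper.

def opc_norm(ops_executed, cycles, pipelines):
    """Average operations completed per cycle by a single pipeline."""
    return ops_executed / (cycles * pipelines)

def effective_para_factor(ops_executed, cycles, peak_ops_per_cycle_one_pipe):
    """How many fully utilized pipelines the measured throughput amounts to."""
    return (ops_executed / cycles) / peak_ops_per_cycle_one_pipe

# Hypothetical kernel: a few deep custom FPGA pipelines vs. many GPU lanes.
fpga = dict(ops=3.6e9, cycles=1.0e9, pipes=4)
gpu  = dict(ops=8.0e9, cycles=2.0e9, pipes=2048)

print(opc_norm(fpga["ops"], fpga["cycles"], fpga["pipes"]))  # per-pipe efficiency
print(opc_norm(gpu["ops"], gpu["cycles"], gpu["pipes"]))
print(effective_para_factor(fpga["ops"], fpga["cycles"], 1.0))
print(effective_para_factor(gpu["ops"], gpu["cycles"], 1.0))
```

With these assumed numbers, the FPGA's few deep pipelines run close to one operation per cycle each, while the GPU's many lanes are individually far less utilized, mirroring the qualitative observation in the abstract.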

  • FCCM - Understanding Performance Differences of FPGAs and GPUs
    2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018
    Co-Authors: Jason Cong, Zhenman Fang, Hanrui Wang, Shaochong Zhang
    Abstract:

    This paper aims to better understand the performance differences between FPGAs and GPUs. We intentionally begin with a widely used GPU-friendly benchmark suite, Rodinia, and port 15 of the kernels onto FPGAs using HLS C. Then we propose an analytical model to compare their performance. We find that for 6 out of the 15 ported kernels, today's FPGAs can provide performance comparable to or even better than the GPU, while consuming an average of 28% of the GPU power. Besides a lower clock frequency, FPGAs usually achieve a higher number of operations per cycle in each customized Deep Pipeline but a lower effective parallel factor, due to the far lower off-chip memory bandwidth. With 4x more memory bandwidth, 8 of the 15 FPGA kernels are projected to achieve at least half of the GPU kernel performance.

Iswarya Chintakunta Pabbathi - One of the best experts on this subject based on the ideXlab platform.

  • Adder structures architecture for Deep Pipeline & massive parallel Using SSTA to find ultra-low energy
    International Journal of Research, 2016
    Co-Authors: S. Haroon Rasheed, Iswarya Chintakunta Pabbathi
    Abstract:

    Adders are basic functional units in computer arithmetic. Binary adders are used in microprocessors for addition and subtraction operations as well as for floating-point multiplication and division. Therefore, adders are fundamental components, and improving their performance is one of the major challenges in digital design. We have analyzed the latency, energy consumption, and effects of process variation on different structures with respect to the design structure and logic depth, to propose architectures with higher throughput, lower energy consumption, and smaller performance loss caused by process variation in application-specific integrated circuit design. We have exploited adders as different implementations of a processing unit, and propose architectural guidelines for finer technologies in subthreshold which are applicable to any other architecture. The results show that smaller computing building blocks have better energy efficiency and less performance degradation because of variation effects. In contrast, their computation throughput will be moderate or worse unless proper solutions, such as Pipelined or parallel structures, are used. Therefore, our proposed solution to improve the throughput loss while reducing sensitivity to process variations is to use simpler elements in Deep Pipelined designs or massively parallel structures.
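
    The pipelining-versus-parallelism tradeoff the abstract argues for can be illustrated with a toy throughput model: cutting a slow adder into pipeline stages shortens the critical path per cycle (at the cost of register overhead), while replicating it multiplies throughput directly. The latencies and overheads below are invented numbers, not results from the paper.

```python
# Toy throughput model for pipelined vs. parallel adder structures.
# All timing numbers are illustrative assumptions, not measurements.

def throughput(latency_ns, stages=1, copies=1, register_overhead_ns=0.05):
    """Results per ns. Pipelining divides the combinational delay across
    stages but adds register overhead to each cycle; parallel copies
    multiply the throughput of a single unit."""
    cycle_ns = latency_ns / stages + register_overhead_ns
    return copies / cycle_ns

flat      = throughput(2.0)              # one slow adder, no pipelining
pipelined = throughput(2.0, stages=8)    # same adder cut into 8 stages
parallel  = throughput(2.0, copies=8)    # 8 copies of the slow adder
print(flat, pipelined, parallel)
```

Both restructurings raise throughput well above the flat design, which is the abstract's point: simple, slow building blocks recover throughput through Deep Pipelines or massive parallelism.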

Jason Cong - One of the best experts on this subject based on the ideXlab platform.

  • FPGA - Understanding Performance Differences of FPGAs and GPUs: (Abstract Only)
    Proceedings of the 2018 ACM SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, 2018
    Co-Authors: Jason Cong, Michael Lo, Zhenman Fang, Jingxian Xu, Hanrui Wang, Shaochong Zhang
    Abstract:

    The notorious power wall has significantly limited the scaling of general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, have emerged to achieve better performance and energy efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited to FPGAs, which to GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and provide more insights to the community. We intentionally start with a widely used GPU-friendly benchmark suite, Rodinia, and port 11 of the benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis C. We provide a simple five-step strategy for FPGA accelerator designs that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. Then we propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each Pipeline, and effective parallel factor (effective_para_factor), to compare the performance of GPU and FPGA accelerator designs. We find that for 6 out of the 15 kernels, today's FPGAs can provide comparable or even better performance, while consuming only about 1/10 of the GPUs' power (both on the same technology node). We observe that FPGAs usually have higher OPC_norm in most kernels in light of their customized Deep Pipelines but lower effective_para_factor due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.

  • FCCM - Understanding Performance Differences of FPGAs and GPUs
    2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018
    Co-Authors: Jason Cong, Zhenman Fang, Hanrui Wang, Shaochong Zhang
    Abstract:

    This paper aims to better understand the performance differences between FPGAs and GPUs. We intentionally begin with a widely used GPU-friendly benchmark suite, Rodinia, and port 15 of the kernels onto FPGAs using HLS C. Then we propose an analytical model to compare their performance. We find that for 6 out of the 15 ported kernels, today's FPGAs can provide performance comparable to or even better than the GPU, while consuming an average of 28% of the GPU power. Besides a lower clock frequency, FPGAs usually achieve a higher number of operations per cycle in each customized Deep Pipeline but a lower effective parallel factor, due to the far lower off-chip memory bandwidth. With 4x more memory bandwidth, 8 of the 15 FPGA kernels are projected to achieve at least half of the GPU kernel performance.