Processor Family

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 4542 Experts worldwide ranked by ideXlab platform

G. Lowney - One of the best experts on this subject based on the ideXlab platform.

  • Ispike: a post-link optimizer for the Intel® Itanium® architecture
    Symposium on Code Generation and Optimization, 2004
    Co-Authors: Robert Muth, Harish Patil, R. Cohn, G. Lowney
    Abstract:

    Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential for post-link optimization to be practical. At the same time, the predication and bundling features of IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations such as code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, along with strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium® 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel® Electron compiler and by the GCC compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.
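The code-layout optimization described above can be illustrated with a small sketch: given edge profiles (such as those gathered from performance counters), basic blocks are greedily chained along their hottest control-flow edges so that hot successors become fall-throughs. This is a minimal sketch of the general technique (in the spirit of Pettis-Hansen code positioning), not Ispike's actual implementation; the block names and counts below are hypothetical.

```python
# Sketch of profile-guided code layout: greedily chain basic blocks along
# their hottest control-flow edges so hot paths become straight-line code.
# Illustration of the general technique only, not Ispike's algorithm.

def layout_blocks(edges):
    """edges: list of (src_block, dst_block, execution_count) tuples."""
    chains = {}          # block -> the chain (a list) it currently belongs to
    for src, dst, _ in sorted(edges, key=lambda e: -e[2]):  # hottest first
        a = chains.setdefault(src, [src])
        b = chains.setdefault(dst, [dst])
        # Merge only when src ends one chain and dst starts another,
        # so the hot edge becomes a fall-through in the final layout.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chains[blk] = a
    # Emit each distinct chain once, preserving discovery order.
    seen, order = set(), []
    for c in chains.values():
        if id(c) not in seen:
            seen.add(id(c))
            order.extend(c)
    return order

# Hot path A->B->C with a cold side exit A->D lays out as A, B, C, D:
print(layout_blocks([("A", "B", 900), ("B", "C", 850), ("A", "D", 100)]))
```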

J. Gregory Steffan - One of the best experts on this subject based on the ideXlab platform.

  • A Multithreaded VLIW Soft Processor Family
    2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, 2013
    Co-Authors: Kalin Ovtcharov, Ilian Tili, J. Gregory Steffan
    Abstract:

    Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs, targeting both compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling, they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs), it is relatively straightforward to achieve good utilization, and it is not overly detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy, even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly parameterizable template architecture for a multithreaded FPGA-based compute engine designed to highly utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts, (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling.
    We focus on deeply pipelined, IEEE-754 floating-point FUs of widely varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural trade-offs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating, we measure compute density (application throughput per unit of FPGA area) and report which architectural choices lead to the most computationally dense designs. We find that the most computationally dense design is not necessarily the one with the highest throughput, and that (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation, e.g., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.
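The compute-density metric used above can be made concrete with a toy comparison: dividing each design's throughput by its area and taking the maximum can select a different configuration than taking peak throughput alone. The configuration names and numbers below are invented placeholders, not measurements from the paper.

```python
# Toy illustration of compute density (throughput per unit area) as a
# selection metric: the densest design need not be the fastest one.
# All names and numbers are hypothetical, not results from the paper.

def best_by_density(configs):
    """configs: list of (name, throughput, area) tuples; highest density wins."""
    return max(configs, key=lambda c: c[1] / c[2])

configs = [
    ("small, 4 threads",  10.0, 100.0),  # density 0.100
    ("large, 32 threads", 60.0, 400.0),  # density 0.150  <- densest
    ("huge, 64 threads",  70.0, 900.0),  # highest raw throughput, density ~0.078
]
print(best_by_density(configs)[0])  # -> "large, 32 threads"
```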

Y. Zemach - One of the best experts on this subject based on the ideXlab platform.

  • IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems
    International Symposium on Microarchitecture, 2003
    Co-Authors: L. Baraz, T. Devor, A. Skaletsky, Opher Etzion, Shalom Goldenberg, Yun Wang, Y. Zemach
    Abstract:

    IA-32 Execution Layer (IA-32 EL) is a new technology that executes IA-32 applications on Intel Itanium Processor Family systems. Currently, support for IA-32 applications on Itanium-based platforms is achieved using hardware circuitry on the Itanium processors. This capability will be enhanced with IA-32 EL, software that will ship with Itanium-based operating systems and will convert IA-32 instructions into Itanium instructions via dynamic translation. In this paper, we describe aspects of the IA-32 Execution Layer technology, including the general two-phase translation architecture and the use of a single translator for multiple operating systems. The paper provides details of some of the technical challenges, such as precise exceptions; emulation of FP, MMX, and Intel Streaming SIMD Extensions instructions; and misalignment handling. Finally, the paper presents some performance results.
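The two-phase structure described above, a fast initial translation followed by optimizing retranslation of hot code, can be sketched as follows. This is a generic hot-path tiering skeleton under assumed names and a made-up hotness threshold, not IA-32 EL's actual design.

```python
# Sketch of a two-phase dynamic translator: blocks are first translated
# cheaply ("cold"), execution counts are gathered, and blocks that cross
# a hotness threshold are retranslated with optimization ("hot").
# Illustrative structure only; the threshold and names are hypothetical.

HOT_THRESHOLD = 3  # made-up value for illustration

class TwoPhaseTranslator:
    def __init__(self):
        self.cache = {}   # block address -> (tier, translated code)
        self.counts = {}  # block address -> execution count

    def translate_cold(self, addr):
        return f"fast-translated[{addr:#x}]"   # phase 1: cheap translation

    def translate_hot(self, addr):
        return f"optimized[{addr:#x}]"         # phase 2: optimizing retranslation

    def run_block(self, addr):
        self.counts[addr] = self.counts.get(addr, 0) + 1
        tier, code = self.cache.get(addr, ("none", None))
        if tier != "hot" and self.counts[addr] >= HOT_THRESHOLD:
            tier, code = "hot", self.translate_hot(addr)
        elif code is None:
            tier, code = "cold", self.translate_cold(addr)
        self.cache[addr] = (tier, code)
        return code

t = TwoPhaseTranslator()
for _ in range(4):
    print(t.run_block(0x401000))  # cold twice, then hot from the 3rd run on
```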

Robert Muth - One of the best experts on this subject based on the ideXlab platform.

  • Ispike: a post-link optimizer for the Intel® Itanium® architecture
    Symposium on Code Generation and Optimization, 2004
    Co-Authors: Robert Muth, Harish Patil, R. Cohn, G. Lowney
    Abstract:

    Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential for post-link optimization to be practical. At the same time, the predication and bundling features of IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations such as code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, along with strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium® 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel® Electron compiler and by the GCC compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.

Kalin Ovtcharov - One of the best experts on this subject based on the ideXlab platform.

  • TILT: A Multithreaded VLIW Soft Processor Family
    Field-Programmable Logic and Applications, 2013
    Co-Authors: Kalin Ovtcharov, Ilian Tili, J. Gregory Steffan
    Abstract:

    We propose TILT, an FPGA-based compute engine designed to highly-utilize multiple, varied, and deeply-pipelined functional units by leveraging thread-level parallelism and static compiler analysis and scheduling. For this work we focus on deeply-pipelined floating-point functional units of widely-varying latency, executing Hodgkin-Huxley neuron simulation as an example application, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural trade-offs by measuring area and throughput for designs with varying numbers of functional units, thread contexts, and memory banks.
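The role of multiple thread contexts in keeping deeply pipelined FUs busy can be shown with a toy round-robin model: if each thread's next operation depends on its previous result, an FU of latency L only reaches its full issue rate once roughly L independent threads are available. This is an illustrative model only, not TILT's compiler-scheduled static datapath; the parameters are hypothetical.

```python
# Sketch: why multiple thread contexts keep a deeply pipelined FU busy.
# Each thread issues a dependent op only after its previous result retires,
# so an FU of latency L is fully utilized once T threads cover its depth.
# Illustrative round-robin model; TILT itself is statically scheduled.

def fu_utilization(threads, latency, cycles=1000):
    ready = [0] * threads  # cycle at which each thread's next op may issue
    issued = 0
    t = 0
    for cycle in range(cycles):
        if ready[t] <= cycle:            # previous op of this thread retired
            issued += 1
            ready[t] = cycle + latency   # dependent op waits for the result
            t = (t + 1) % threads        # round-robin to the next thread
    return issued / cycles

print(fu_utilization(threads=2, latency=8))  # -> 0.25 (FU mostly idle)
print(fu_utilization(threads=8, latency=8))  # -> 1.0  (latency fully hidden)
```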

  • A Multithreaded VLIW Soft Processor Family
    2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, 2013
    Co-Authors: Kalin Ovtcharov, Ilian Tili, J. Gregory Steffan
    Abstract:

    Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs, targeting both compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling, they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs), it is relatively straightforward to achieve good utilization, and it is not overly detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy, even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly parameterizable template architecture for a multithreaded FPGA-based compute engine designed to highly utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts, (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling.
    We focus on deeply pipelined, IEEE-754 floating-point FUs of widely varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural trade-offs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating, we measure compute density (application throughput per unit of FPGA area) and report which architectural choices lead to the most computationally dense designs. We find that the most computationally dense design is not necessarily the one with the highest throughput, and that (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation, e.g., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.