Loop Instruction

14,000,000 Leading Edge Experts on the ideXlab platform

The Experts below are selected from a list of 90 Experts worldwide ranked by ideXlab platform

Ji Gu - One of the best experts on this subject based on the ideXlab platform.

  • Reducing Power and Energy Overhead in Instruction Prefetching for Embedded Processor Systems
    Mobile and Handheld Computing Solutions for Organizations and End-Users, 2020
    Co-Authors: Ji Gu
    Abstract:

    Instruction prefetching is an effective way to improve the performance of pipelined processors. However, existing Instruction prefetching schemes increase performance at a significant energy cost, making them unsuitable for embedded and ubiquitous systems where both high performance and low energy consumption are demanded. This paper proposes reducing the energy overhead of Instruction prefetching through a simple hardware/software design and an efficient prefetching operation scheme. Two approaches are investigated: Decoded Loop Instruction Cache-based Prefetching (DLICP), which is most effective for Loop-intensive applications, and DLICP enhanced with the popular existing Next Line Prefetching (NLP), for applications with a moderate number of Loops. The experimental results show that both DLICP and the enhanced DLICP deliver improved performance at a much reduced energy overhead.
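
    The Next Line Prefetching (NLP) scheme combined with DLICP above can be illustrated with a toy model: whenever line i is fetched, line i+1 is brought into a small prefetch buffer, so the next sequential fetch is served without touching the I-cache. This is an illustrative sketch under assumed names, not the paper's hardware.

```python
# Toy model of Next Line Prefetching (NLP): after fetching line i,
# line i+1 is placed in a one-entry prefetch buffer so a sequential
# fetch hits the buffer instead of the I-cache. Illustrative only.

def simulate_nlp(fetch_lines):
    """Count fetches served by the prefetch buffer vs. the I-cache."""
    prefetch_buffer = None   # line number prefetched ahead of time
    buffer_hits = cache_fetches = 0
    for line in fetch_lines:
        if line == prefetch_buffer:
            buffer_hits += 1          # sequential fetch: buffer hit
        else:
            cache_fetches += 1        # non-sequential: go to I-cache
        prefetch_buffer = line + 1    # always prefetch the next line
    return buffer_hits, cache_fetches

# Straight-line runs benefit; each backward Loop branch breaks the streak.
hits, misses = simulate_nlp([0, 1, 2, 3, 0, 1, 2, 3])   # hits=6, misses=2
```

    The two I-cache fetches in the trace occur at the two non-sequential points (the start and the backward branch), which is why NLP alone helps less on Loop-heavy code and pairs naturally with a Loop cache.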

  • DLIC: Decoded Loop Instructions caching for energy-aware embedded processors
    ACM Transactions on Embedded Computing Systems, 2013
    Co-Authors: Ji Gu, Tohru Ishihara
    Abstract:

    With the explosive proliferation of embedded systems, especially through the countless portable devices and wireless equipment in use, embedded systems have become indispensable to modern society and people's lives. Those devices are often battery driven. Therefore, low energy consumption in embedded processors is important and becomes critical as system complexity grows. The on-chip Instruction cache (I-cache) is usually the most energy-consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, existing Loop cache approaches use a tiny decoded cache to filter the I-cache access and Instruction decode activity for repeated Loop iterations. However, such designs are effective only for small and simple Loops, and are only suitable for DSP kernel-like applications. They are not effective for many embedded applications where complex Loops are common. In this article, we propose a decoded Loop Instruction cache (DLIC) that is small, hence energy efficient, yet can capture most Loops, including large nested ones with branch executions, so that a significant amount of I-cache access and Instruction decoding can be eliminated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87% compared to a normal cache-only design. On average, 66% of the energy spent on Instruction fetching and decoding can be saved, at a performance overhead of only 1.4%.
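
    The filtering idea behind a decoded Loop cache can be sketched in a few lines: on a Loop's first iteration, instructions go through the normal fetch-and-decode path and their decoded forms are captured; later iterations are served from the small decoded cache, skipping both the I-cache access and the decode. The class below is an illustrative model, not the DLIC hardware design.

```python
# Illustrative sketch (not the paper's implementation) of a decoded
# Loop cache: decoded instructions are captured on first fetch, and
# subsequent iterations hit the small cache, bypassing I-cache + decode.

class DecodedLoopCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}            # pc -> decoded instruction
        self.icache_accesses = 0
        self.decodes = 0
        self.loop_cache_hits = 0

    def fetch(self, pc, raw_instruction):
        if pc in self.entries:       # iteration 2+: serve decoded form
            self.loop_cache_hits += 1
            return self.entries[pc]
        self.icache_accesses += 1    # iteration 1: normal fetch path
        self.decodes += 1
        decoded = ("decoded", raw_instruction)
        if len(self.entries) < self.capacity:
            self.entries[pc] = decoded   # capture for later iterations
        return decoded

# A 3-instruction Loop body iterated 10 times: only the first pass
# touches the I-cache and decoder; the other 27 fetches are filtered.
dlc = DecodedLoopCache(capacity=8)
for _ in range(10):
    for pc in (0, 4, 8):
        dlc.fetch(pc, "op")
```

    The energy argument follows directly: the 27 filtered fetches hit a structure far smaller than the I-cache, and the decode stage stays idle for them.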

  • ESTImedia - Loop Instruction caching for energy-efficient embedded multitasking processors
    2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia, 2012
    Co-Authors: Ji Gu, Tohru Ishihara
    Abstract:

    With the exponential increase of power consumption across processor generations, energy dissipation has become one of the most critical constraints in system design. Cache memories are usually the most energy-consuming components on the processor chip due to their large die area and frequent access operations. Furthermore, in step with the increased complexity of modern embedded applications, microprocessors increasingly execute multitasking workloads. In multitasking processors, the conventional L1 Instruction cache (I-cache) is usually shared by multiple tasks and thereby suffers highly intensive read/write operations, which can be even more energy-consuming than in a single-task system. This paper presents an energy-efficient shared multitasking Loop Instruction cache (SMLIC), designed to address task sharing and context-switch issues so that it can be efficiently utilized to reduce I-cache accesses for energy savings in multitasking processors. Experiments on a set of multitasking applications demonstrate that the proposed SMLIC scheme can reduce I-cache accesses by 12~86% and the energy consumption of Instruction supply by 11~79% for multitasking systems, depending on the frequency of context switches.
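
    One way to make a Loop cache survive context switches is to tag entries with a task ID, so Loops captured for one task remain resident while another task runs. The sketch below illustrates that sharing idea under assumed names; it is not the SMLIC design itself.

```python
# Hypothetical sketch of a shared multitasking Loop cache: entries are
# tagged (task_id, pc), so a context switch does not flush Loops that
# another task has already captured. Illustrative, not the SMLIC design.

class SharedLoopCache:
    def __init__(self):
        self.entries = {}          # (task_id, pc) -> decoded instruction
        self.icache_accesses = 0
        self.loop_cache_hits = 0

    def fetch(self, task_id, pc):
        key = (task_id, pc)
        if key in self.entries:
            self.loop_cache_hits += 1    # resident across task switches
        else:
            self.icache_accesses += 1    # first fetch: capture entry
            self.entries[key] = ("decoded", pc)

# Two tasks alternate three times, each looping over two instructions:
# only the first round misses; later rounds hit despite the switches.
slc = SharedLoopCache()
for _ in range(3):
    for task in (0, 1):
        for pc in (0, 4):
            slc.fetch(task, pc)
```

    A per-task flush-on-switch design would instead miss on every round, which is the overhead the tagging avoids.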

  • Enabling large decoded Instruction Loop caching for energy-aware embedded processors
    Compilers Architecture and Synthesis for Embedded Systems, 2010
    Co-Authors: Ji Gu
    Abstract:

    Low energy consumption in embedded processors is increasingly important as system complexity grows. The on-chip Instruction cache (I-cache) is usually the most energy-consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, existing Loop cache approaches use a tiny decoded cache to filter the I-cache access and Instruction decode activity for repeated Loop iterations. However, such designs are effective only for small and simple Loops, and are only suitable for DSP kernel-like applications. They are not effective for many embedded applications where complex Loops are common. In this paper, we propose a decoded Loop Instruction cache (DLIC) that is small, hence energy efficient, yet can capture most Loops, including large, nested ones with branch executions, so that a significant amount of I-cache access and Instruction decoding can be eliminated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87%. On average, 66% of the energy spent on Instruction fetching and decoding can be saved, at a performance overhead of only 1.4%.

Hiroshi Kadota - One of the best experts on this subject based on the ideXlab platform.

  • OHMEGA: a VLSI superscalar processor architecture for numerical applications
    International Symposium on Computer Architecture, 1991
    Co-Authors: Masaitsu Nakajima, Hiraku Nakano, Yasuhiro Nakakura, Tadahiro Yoshida, Reiji Segawa, Yuji Nakai, Takeshi Kishida, Hiroshi Kadota
    Abstract:

    This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture performs Instruction-level scheduling statically by the compiler, and performs out-of-order issue and execution of Instructions to decrease the pipeline stalls that occur dynamically during execution. In this architecture, a pair of Instructions is fetched every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. For ease of Instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of Instructions, including Store-Store and Load-Store pairs; ii) a simple, low-latency, and easily paired execution pipeline structure; and iii) high-capacity multi-ported floating-point and integer registers. Enhanced performance through the dynamic reduction of pipeline hazards is achieved by i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer capability through a pipelined data cache and a 128-bit-wide bus. The efficient data dependency resolution mechanism, realized by the DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order Instruction issue and execution. The idea of the DTC method is similar to that of dynamic data-flow architectures with tagged tokens. The non-penalty branches are realized by three techniques: the delayed branch; a Loop Instruction that executes counter decrement, compare, and branch in one clock cycle; and non-penalty conditional branches with predicted condition codes. These techniques reduce the pipeline stalls occurring at run time. The architecture can achieve a peak performance of 80 MFLOPS/80 MIPS at a 40 MHz clock and sustain 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.
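
    The single-cycle Loop Instruction described above fuses counter decrement, compare, and branch into one operation. A minimal model of its semantics, with illustrative register and address names (not the OHMEGA encoding):

```python
# Semantics of a fused decrement-compare-branch Loop Instruction:
# decrement the count register, then branch back to the Loop top
# while the count is still positive. Names are illustrative.

def loop_instruction(count_reg, loop_top, fallthrough):
    """Return (new_count, next_pc) for one execution of the instruction."""
    count_reg -= 1
    next_pc = loop_top if count_reg > 0 else fallthrough
    return count_reg, next_pc

# Driving a 4-iteration Loop whose body sits at pc=0x100:
count, pc = 4, 0x100
iterations = 0
while True:
    iterations += 1                       # ... Loop body executes ...
    count, pc = loop_instruction(count, loop_top=0x100, fallthrough=0x140)
    if pc == 0x140:
        break
```

    Because the decrement, compare, and branch all resolve in one cycle, the Loop back-edge costs no separate compare or branch Instructions and incurs no branch penalty.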

Mikko H. Lipasti - One of the best experts on this subject based on the ideXlab platform.

  • HPCA - Revolver: Processor architecture for power efficient Loop execution
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014
    Co-Authors: Mitchell Hayenga, Vignyan Reddy Kothinti Naresh, Mikko H. Lipasti
    Abstract:

    With the rise of mobile and cloud-based computing, modern processor design has become the task of achieving maximum power efficiency at specific performance targets. This trend, coupled with dwindling improvements in single-threaded performance, has led architects to focus predominantly on energy efficiency. In this paper we note that for the majority of benchmarks, a substantial portion of execution time is spent executing simple Loops. Capitalizing on the frequency of Loops, we design an out-of-order processor architecture that achieves an aggressive level of performance while minimizing the energy consumed during the execution of Loops. The Revolver architecture achieves energy efficiency during Loop execution by enabling “in-place execution” of Loops within the processor's out-of-order backend. Essentially, a few static instances of each Loop Instruction are dispatched to the out-of-order execution core by the processor frontend. The static Instruction instances may each be executed multiple times in order to complete all necessary Loop iterations. During Loop execution the processor frontend, including Instruction fetch, branch prediction, decode, allocation, and dispatch logic, can be completely clock gated. Additionally, we propose a mechanism to pre-execute load Instructions of future Loop iterations, thereby realizing parallelism beyond the Loop iterations currently executing within the processor core. Employing Revolver across three benchmark suites, we eliminate 20%, 55%, and 84% of all frontend Instruction dispatches. Overall, we find Revolver maintains performance while yielding a 5.3%-18.3% energy-delay benefit over Loop buffers or micro-op cache techniques alone.
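
    The dispatch savings of in-place execution follow from a simple count: the frontend dispatches each static Loop Instruction once, and the backend re-executes those resident instances for every iteration while the frontend is clock gated. The model below is a toy illustration of that accounting, not Revolver's actual microarchitecture.

```python
# Toy model of "in-place execution": static Loop Instruction instances
# are dispatched once by the frontend, then re-executed in the backend
# for every iteration. Names and structure are illustrative only.

def execute_loop(static_instructions, iterations):
    """Return (frontend_dispatches, backend_executions) for one Loop."""
    frontend_dispatches = len(static_instructions)  # dispatched once
    backend_executions = 0
    for _ in range(iterations):          # frontend clock gated here
        for _inst in static_instructions:
            backend_executions += 1      # backend re-executes in place
    return frontend_dispatches, backend_executions

# A 5-instruction Loop body run 100 times: 5 dispatches instead of 500,
# so fetch, predict, decode, and rename energy is paid only once.
dispatches, executed = execute_loop(["ld", "add", "st", "cmp", "br"], 100)
```

    In a conventional pipeline the frontend would dispatch all 500 dynamic Instructions; the 100x reduction here is the source of the frontend energy savings the abstract reports.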

Tohru Ishihara - One of the best experts on this subject based on the ideXlab platform.

  • DLIC: Decoded Loop Instructions caching for energy-aware embedded processors
    ACM Transactions on Embedded Computing Systems, 2013
    Co-Authors: Ji Gu, Tohru Ishihara
    Abstract:

    With the explosive proliferation of embedded systems, especially through the countless portable devices and wireless equipment in use, embedded systems have become indispensable to modern society and people's lives. Those devices are often battery driven. Therefore, low energy consumption in embedded processors is important and becomes critical as system complexity grows. The on-chip Instruction cache (I-cache) is usually the most energy-consuming component on the processor chip due to its large size and frequent access operations. To reduce such energy consumption, existing Loop cache approaches use a tiny decoded cache to filter the I-cache access and Instruction decode activity for repeated Loop iterations. However, such designs are effective only for small and simple Loops, and are only suitable for DSP kernel-like applications. They are not effective for many embedded applications where complex Loops are common. In this article, we propose a decoded Loop Instruction cache (DLIC) that is small, hence energy efficient, yet can capture most Loops, including large nested ones with branch executions, so that a significant amount of I-cache access and Instruction decoding can be eliminated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87% compared to a normal cache-only design. On average, 66% of the energy spent on Instruction fetching and decoding can be saved, at a performance overhead of only 1.4%.

  • ESTImedia - Loop Instruction caching for energy-efficient embedded multitasking processors
    2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia, 2012
    Co-Authors: Ji Gu, Tohru Ishihara
    Abstract:

    With the exponential increase of power consumption across processor generations, energy dissipation has become one of the most critical constraints in system design. Cache memories are usually the most energy-consuming components on the processor chip due to their large die area and frequent access operations. Furthermore, in step with the increased complexity of modern embedded applications, microprocessors increasingly execute multitasking workloads. In multitasking processors, the conventional L1 Instruction cache (I-cache) is usually shared by multiple tasks and thereby suffers highly intensive read/write operations, which can be even more energy-consuming than in a single-task system. This paper presents an energy-efficient shared multitasking Loop Instruction cache (SMLIC), designed to address task sharing and context-switch issues so that it can be efficiently utilized to reduce I-cache accesses for energy savings in multitasking processors. Experiments on a set of multitasking applications demonstrate that the proposed SMLIC scheme can reduce I-cache accesses by 12~86% and the energy consumption of Instruction supply by 11~79% for multitasking systems, depending on the frequency of context switches.

Masaitsu Nakajima - One of the best experts on this subject based on the ideXlab platform.

  • OHMEGA: a VLSI superscalar processor architecture for numerical applications
    International Symposium on Computer Architecture, 1991
    Co-Authors: Masaitsu Nakajima, Hiraku Nakano, Yasuhiro Nakakura, Tadahiro Yoshida, Reiji Segawa, Yuji Nakai, Takeshi Kishida, Hiroshi Kadota
    Abstract:

    This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture performs Instruction-level scheduling statically by the compiler, and performs out-of-order issue and execution of Instructions to decrease the pipeline stalls that occur dynamically during execution. In this architecture, a pair of Instructions is fetched every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. For ease of Instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of Instructions, including Store-Store and Load-Store pairs; ii) a simple, low-latency, and easily paired execution pipeline structure; and iii) high-capacity multi-ported floating-point and integer registers. Enhanced performance through the dynamic reduction of pipeline hazards is achieved by i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer capability through a pipelined data cache and a 128-bit-wide bus. The efficient data dependency resolution mechanism, realized by the DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order Instruction issue and execution. The idea of the DTC method is similar to that of dynamic data-flow architectures with tagged tokens. The non-penalty branches are realized by three techniques: the delayed branch; a Loop Instruction that executes counter decrement, compare, and branch in one clock cycle; and non-penalty conditional branches with predicted condition codes. These techniques reduce the pipeline stalls occurring at run time. The architecture can achieve a peak performance of 80 MFLOPS/80 MIPS at a 40 MHz clock and sustain 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.
