Loop Unrolling

The Experts below are selected from a list of 2295 Experts worldwide ranked by ideXlab platform

Peter Marwedel - One of the best experts on this subject based on the ideXlab platform.

  • ECRTS - Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization
    2009 21st Euromicro Conference on Real-Time Systems, 2009
    Co-Authors: Paul Lokuciejewski, Peter Marwedel
    Abstract:

    Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the optimization loop unrolling has been shown over the past decades to be highly effective, achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.
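
As a rough illustration of the idea in this abstract (not the authors' WCET-driven heuristic), the sketch below shows a loop unrolled by a factor of 4 with a remainder loop, plus a toy factor selection that uses a known iteration count and a code-size budget as a stand-in for instruction cache capacity. The names `sum_unrolled_by_4`, `pick_unroll_factor`, `body_size` and `size_budget` are hypothetical, and Python is used only to make the transformation visible; a compiler would apply it to machine-level code.

```python
def sum_rolled(data):
    """Reference loop: one addition per iteration."""
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total


def sum_unrolled_by_4(data):
    """Same computation with the body replicated four times."""
    n = len(data)
    total = 0
    main_end = n - (n % 4)  # last index covered by the unrolled loop
    for i in range(0, main_end, 4):
        total += data[i]
        total += data[i + 1]
        total += data[i + 2]
        total += data[i + 3]
    # Remainder loop for the iterations that do not fill a group of four.
    for i in range(main_end, n):
        total += data[i]
    return total


def pick_unroll_factor(iteration_count, body_size, size_budget, max_factor=8):
    """Toy selection rule (an assumption, not the paper's heuristic):
    take the largest factor that divides the known iteration count, so no
    remainder loop is needed, and whose unrolled body fits the size budget."""
    best = 1
    for factor in range(2, max_factor + 1):
        divides = iteration_count % factor == 0
        fits = factor * body_size <= size_budget
        if divides and fits:
            best = factor
    return best
```

With a precise iteration count of 12, a body of 10 size units and a budget of 64, the toy rule settles on a factor of 6; with 7 iterations under the same budget it falls back to no unrolling.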

Paul Lokuciejewski - One of the best experts on this subject based on the ideXlab platform.

  • ECRTS - Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization
    2009 21st Euromicro Conference on Real-Time Systems, 2009
    Co-Authors: Paul Lokuciejewski, Peter Marwedel
    Abstract:

    Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the optimization loop unrolling has been shown over the past decades to be highly effective, achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.

Albert Cohen - One of the best experts on this subject based on the ideXlab platform.

  • On the Effectiveness of Register Moves to Minimise Post-Pass Unrolling in Software Pipelined Loops
    2012
    Co-Authors: Mounira Bachir, Albert Cohen, Sid Touati
    Abstract:

    Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [12], [11]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations, but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases, with no increase of the II.
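
The trade-off this abstract describes, register moves versus post-pass unrolling for values that stay alive across kernel iterations, can be seen in a toy example. This is only an illustrative sketch, not the authors' algorithm: the hypothetical functions below compute `out[i] = x[i] + x[i-2]`, treating the local variables `r0`, `r1`, `r2` as if they were machine registers. The first version keeps a fixed register assignment and pays for two move operations per iteration; the second removes the moves by unrolling the body three times and rotating register roles between the copies.

```python
def delayed_sum_with_moves(x):
    """Fixed register roles; two moves per iteration shift the values along.
    Assumes len(x) >= 2."""
    out = []
    r1, r2 = x[1], x[0]  # r1 holds x[i-1], r2 holds x[i-2]
    for i in range(2, len(x)):
        r0 = x[i]            # load the new value
        out.append(r0 + r2)  # use the value loaded two iterations ago
        r2 = r1              # move
        r1 = r0              # move
    return out


def delayed_sum_unrolled(x):
    """No moves: the body is unrolled three times and each copy writes a
    different register, so the value from two iterations ago is still intact.
    Assumes len(x) >= 2."""
    n = len(x)
    out = []
    r1, r2 = x[0], x[1]  # roles rotate between the three copies below
    i = 2
    while i < n:
        r0 = x[i]            # copy 0: writes r0, reads r1 (= x[i-2])
        out.append(r0 + r1)
        i += 1
        if i >= n:
            break
        r1 = x[i]            # copy 1: writes r1, reads r2 (= x[i-2])
        out.append(r1 + r2)
        i += 1
        if i >= n:
            break
        r2 = x[i]            # copy 2: writes r2, reads r0 (= x[i-2])
        out.append(r2 + r0)
        i += 1
    return out
```

Both versions use three registers. The version with moves keeps the code small but adds operations to every iteration, while the unrolled version adds no operations but triples the size of the loop body.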

  • Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
    2011
    Co-Authors: Mounira Bachir, Sid Touati, Albert Cohen
    Abstract:

    This article studies an important open problem in backend compilation regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, variables can stay alive across more than one kernel iteration, which is challenging for code generation. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori (13; 12). However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations. However, inserting those operations may increase the initiation interval (II) and nullify the benefits of software pipelining itself. We propose in this article a new technique to minimise the loop unrolling degree generated after periodic register allocation. This technique consists in decomposing the generated meeting graph circuits by inserting move instructions without compromising the throughput benefits of software pipelining. Our experiments show that the execution time is acceptable and that good results can be produced when many functional units can execute move operations.

  • HPCS - Loop Unrolling minimisation in the presence of multiple register types: A viable alternative to modulo variable expansion
    2011 International Conference on High Performance Computing & Simulation, 2011
    Co-Authors: Mounira Bachir, Sid-ahmed-ali Touati, Frederic Brault, Albert Cohen
    Abstract:

    Modulo Variable Expansion (MVE) [1] used with software pipelining (SWP) may sacrifice the register optimality (MAXLIVE) and in general may lead to unnecessary spills or move operations negating the benefits of SWP. In contrast, bigger loop unrolling can be performed to meet the MAXLIVE registers requirement [2, 3]. However, the degree of unrolling should be minimised to control code size and hence I-cache performance. In our previous work, we designed a post-pass unrolling algorithm which minimises the unrolling degree while adjusting the length of reuse circuits through the usage of additional (free) registers [4]. In this paper, we complete our study with an improved algorithm for minimising kernel loop unrolling resulting from cyclic register allocation in the presence of multiple register types, showing that considering all register types in conjunction provides a lower unrolling degree than considering each register type in isolation. In addition, we integrate our solution within a real-world embedded system compiler, st200cc for the ST2xx family of VLIW embedded processors, and compare it to MVE. Our large set of experiments on both high-performance and embedded benchmarks (SPEC2000, SPEC2006, MEDIABENCH and FFMPEG) demonstrates the practical applicability and the benefits of our approach.

  • Post-Pass Periodic Register Allocation to Minimise Loop Unrolling Degree
    2008
    Co-Authors: Mounira Bachir, Sid Touati, Albert Cohen
    Abstract:

    This paper solves an open problem regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates reuse circuits spanning multiple loop iterations. These circuits require periodic register allocation, which in turn yields a code generation challenge, generally addressed through: (1) hardware support (rotating register files), deemed too expensive for embedded processors, (2) insertion of register moves, with a high risk of reducing the computation throughput (the initiation interval, II) of software pipelining, and (3) post-pass loop unrolling, which does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers are sufficient for periodic register allocation [2, 3, 5]; yet the only heuristic to control the amount of post-pass loop unrolling does not achieve this bound and leads to undesired register spills [4, 7]. We propose a periodic register allocation technique allowing a software-only code generation that does not trade the optimality of the II for compactness of the generated code. Our idea is based on using the remaining registers: calling Rarch the number of architectural registers of the target processor, the number of remaining registers that can be used for minimising the unrolling degree is equal to Rarch − MAXLIVE. We provide a complete formalisation of the problem and algorithm, followed by extensive experiments. We achieve practical loop unrolling degrees in most cases, with no increase of the II, while state-of-the-art techniques would either induce register spilling, degrade the II or lead to unacceptable code growth.
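
The role of the Rarch − MAXLIVE remaining registers can be sketched with a small brute-force search. This is a hedged illustration, not the paper's algorithm: it assumes, as is common in this line of work, that the post-pass unrolling degree is the least common multiple of the reuse circuit lengths, and that each spare register can lengthen one circuit by one. The function names and the exhaustive search are hypothetical and only practical for tiny inputs.

```python
from itertools import product
from math import lcm  # multi-argument lcm requires Python 3.9 or later


def unroll_degree(circuit_lengths):
    """Assumed model: unrolling degree = lcm of the reuse circuit lengths."""
    return lcm(*circuit_lengths)


def minimise_unroll_degree(circuit_lengths, r_arch, maxlive):
    """Try every way of handing out the spare registers (r_arch - maxlive)
    to the circuits and keep the assignment with the smallest lcm."""
    spare = r_arch - maxlive
    if spare < 0:
        raise ValueError("fewer architectural registers than MAXLIVE")
    best_degree = unroll_degree(circuit_lengths)
    best_lengths = list(circuit_lengths)
    for extra in product(range(spare + 1), repeat=len(circuit_lengths)):
        if sum(extra) > spare:
            continue  # uses more registers than are left over
        lengths = [length + e for length, e in zip(circuit_lengths, extra)]
        degree = lcm(*lengths)
        if degree < best_degree:
            best_degree, best_lengths = degree, lengths
    return best_degree, best_lengths
```

For circuits of length 2, 3 and 5 the assumed unrolling degree is 30. With two spare registers (for instance `r_arch=12`, `maxlive=10`), lengthening the third circuit to 6 brings the degree down to 6 while using only one of them.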

Mounira Bachir - One of the best experts on this subject based on the ideXlab platform.

  • On the Effectiveness of Register Moves to Minimise Post-Pass Unrolling in Software Pipelined Loops
    2012
    Co-Authors: Mounira Bachir, Albert Cohen, Sid Touati
    Abstract:

    Software pipelining is a powerful technique to expose fine-grain parallelism, but it results in variables staying alive across more than one kernel iteration. It requires periodic register allocation and is challenging for code generation: the lack of a reliable solution currently restricts the applicability of software pipelining. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori [12], [11]. However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations, but this may increase the initiation interval (II), which nullifies the benefits of software pipelining. This article aims at tightly controlling the post-pass loop unrolling necessary to generate code. We study the potential of live range splitting to reduce kernel loop unrolling, introducing additional move instructions without increasing the II. We provide a complete formalisation of the problem, an algorithm, and extensive experiments. Our algorithm yields low unrolling degrees in most cases, with no increase of the II.

  • Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
    2011
    Co-Authors: Mounira Bachir, Sid Touati, Albert Cohen
    Abstract:

    This article studies an important open problem in backend compilation regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, variables can stay alive across more than one kernel iteration, which is challenging for code generation. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori (13; 12). However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations. However, inserting those operations may increase the initiation interval (II) and nullify the benefits of software pipelining itself. We propose in this article a new technique to minimise the loop unrolling degree generated after periodic register allocation. This technique consists in decomposing the generated meeting graph circuits by inserting move instructions without compromising the throughput benefits of software pipelining. Our experiments show that the execution time is acceptable and that good results can be produced when many functional units can execute move operations.

  • HPCS - Loop Unrolling minimisation in the presence of multiple register types: A viable alternative to modulo variable expansion
    2011 International Conference on High Performance Computing & Simulation, 2011
    Co-Authors: Mounira Bachir, Sid-ahmed-ali Touati, Frederic Brault, Albert Cohen
    Abstract:

    Modulo Variable Expansion (MVE) [1] used with software pipelining (SWP) may sacrifice the register optimality (MAXLIVE) and in general may lead to unnecessary spills or move operations negating the benefits of SWP. In contrast, bigger loop unrolling can be performed to meet the MAXLIVE registers requirement [2, 3]. However, the degree of unrolling should be minimised to control code size and hence I-cache performance. In our previous work, we designed a post-pass unrolling algorithm which minimises the unrolling degree while adjusting the length of reuse circuits through the usage of additional (free) registers [4]. In this paper, we complete our study with an improved algorithm for minimising kernel loop unrolling resulting from cyclic register allocation in the presence of multiple register types, showing that considering all register types in conjunction provides a lower unrolling degree than considering each register type in isolation. In addition, we integrate our solution within a real-world embedded system compiler, st200cc for the ST2xx family of VLIW embedded processors, and compare it to MVE. Our large set of experiments on both high-performance and embedded benchmarks (SPEC2000, SPEC2006, MEDIABENCH and FFMPEG) demonstrates the practical applicability and the benefits of our approach.

  • LCPC - Using the meeting graph framework to minimise kernel Loop Unrolling for scheduled Loops
    Languages and Compilers for Parallel Computing, 2010
    Co-Authors: Mounira Bachir, David Gregg, Sid-ahmed-ali Touati
    Abstract:

    This paper improves our previous research effort [1] by providing an efficient method for kernel loop unrolling minimisation in the case of already scheduled loops, where circular lifetime intervals are known. When loops are software pipelined, the number of values simultaneously alive becomes exactly known, giving better opportunities for kernel loop unrolling. Furthermore, fixing circular lifetime intervals allows us to reduce the algorithmic complexity of our method compared to [1] by computing a new search space for minimal kernel loop unrolling. The meeting graph (MG) is one of the frameworks proposed in the literature [3] which models loop unrolling and register allocation together in a common formal framework for software pipelined loops. Although MG significantly improves loop register allocation, the computed loop unrolling may lead to impractical code growth. This work proposes to minimise the loop unrolling degree in the meeting graph by adapting the approach described in [1]. We explain how to reduce the search space for minimal kernel loop unrolling in the context of MG, yielding a reduced algorithmic complexity. Furthermore, our experiments on SPEC2000, SPEC2006, MEDIABENCH and FFMPEG show that in concrete cases the loop unrolling minimisation is very fast and the minimal loop unrolling degree for 75% of the optimised loops is equal to 1 (i.e. no unrolling), while it is equal to 7 when the software pipelining (SWP) schedule is not fixed.

Yoshiaki Fukazawa - One of the best experts on this subject based on the ideXlab platform.

  • A method for applying loop unrolling and software pipelining to instruction level parallel architectures
    Systems and Computers in Japan, 1998
    Co-Authors: Nobuhiro Kondo, Akira Koseki, Hideaki Komatsu, Yoshiaki Fukazawa
    Abstract:

    A considerable part of program execution time is consumed by loops, so loop optimization is highly effective, especially for the innermost loops of a program. Software pipelining and loop unrolling are known methods for loop optimization. Software pipelining is advantageous in that the code becomes only slightly longer. This method, however, is difficult to apply if the loop includes branching when the parallelism is limited. On the other hand, loop unrolling, while being free of such limitations, suffers from a number of drawbacks. In particular, the code size grows substantially and it is difficult to determine the optimal number of body replications. In order to solve these problems, it seems important to combine software pipelining with loop unrolling so as to utilize the advantages of both techniques while paying due regard to the properties of the programs under consideration and to the machine resources available. This paper describes a method for applying optimal loop unrolling and effective software pipelining to achieve this goal. Program characteristics obtained by means of an extended PDG (program dependence graph) are taken into consideration as well as machine resources. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 62–73, 1998

  • ISPAN - A method for estimating optimal Unrolling times for nested Loops
    Proceedings of the 1997 International Symposium on Parallel Architectures Algorithms and Networks (I-SPAN'97), 1
    Co-Authors: Akira Koseki, H. Komatsu, Yoshiaki Fukazawa
    Abstract:

    Loop unrolling is one of the most promising parallelization techniques, because the nature of programs causes most of the processing time to be spent in their loops. Unrolling not only the innermost loop but also outer loops greatly expands the scope for reusing data and parallelizing instructions. Nested-loop unrolling is therefore a very effective way of obtaining a higher degree of parallelism. However, we need a method for measuring the efficiency of loop unrolling that takes account of both the reuse of data and the parallelism between instructions. This paper describes a heuristic algorithm for deciding the number of times and the directions in which loops should be unrolled, through the use of information such as dependence, reuse, and machine resources. Our method is evaluated on benchmark tests.
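
To make the data-reuse argument concrete, the sketch below unrolls the outer loop of a matrix-vector product by 2 and fuses the two inner loops (often called unroll-and-jam), so that each `x[j]` is read once and reused for two rows. This is a generic illustration under assumed names, not the heuristic from the paper, which decides unrolling counts and directions automatically.

```python
def matvec(a, x):
    """Reference nested loop: y[i] = sum over j of a[i][j] * x[j]."""
    y = [0] * len(a)
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
    return y


def matvec_outer_unrolled(a, x):
    """Outer loop unrolled by 2, with the two inner loops fused."""
    rows = len(a)
    y = [0] * rows
    paired_rows = rows - rows % 2
    for i in range(0, paired_rows, 2):
        acc0 = acc1 = 0
        for j in range(len(x)):
            xj = x[j]              # read once, reused by both rows
            acc0 += a[i][j] * xj   # the two updates are independent and
            acc1 += a[i + 1][j] * xj  # could execute in parallel
        y[i], y[i + 1] = acc0, acc1
    # Remainder: a final unpaired row when the row count is odd.
    for i in range(paired_rows, rows):
        acc = 0
        for j in range(len(x)):
            acc += a[i][j] * x[j]
        y[i] = acc
    return y
```

Unrolling only the inner loop would expose parallelism within one row but would not reduce the number of reads of `x`; unrolling the outer direction is what creates the reuse here.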