Loop Transformation

The Experts below are selected from a list of 19146 Experts worldwide ranked by ideXlab platform

Enhong Chen - One of the best experts on this subject based on the ideXlab platform.

  • Loop fusion and reordering for register file optimization on stream processors
    Journal of Systems and Software, 2012
    Co-Authors: Wanyong Tian, Minming Li, Enhong Chen
    Abstract:

    Stream processors are gaining popularity and are being deployed in many multimedia and scientific applications. The stream register file (SRF) is a non-bypassing software-managed on-chip memory. Unlike with conventional register files, all input data must be stored in the SRF while a program is being executed. It is a critical resource in stream processors. When loading a program from the off-chip memory into the SRF for execution, the storage consumption and the data transfer time are two key factors which affect the performance. This work applies Loop Transformation to programs for SRF optimization. We consider two objectives: minimizing the storage consumption and minimizing the data transfer time. Previous techniques concentrate on the utilization of the SRF only. This is the first paper considering both factors. We present a cost evaluation function in this paper and apply Loop fusion and reordering to improve the performance of stream processors. The experimental results show significant performance improvement.

  • Loop fusion and reordering for register file optimization on stream processors
    Proceedings of the 2011 ACM Symposium on Applied Computing - SAC '11, 2011
    Co-Authors: Wanyong Tian, Minming Li, Enhong Chen
    Abstract:

    Stream processors are gaining popularity and are being deployed in many multimedia and scientific applications. The Stream Register File (SRF) is a non-bypassing software-managed on-chip memory. It is a critical resource in stream processors. When loading a program from the off-chip memory into the SRF for execution, the storage consumption and the data transfer time are two key factors which affect the performance. This work applies Loop Transformation to programs for SRF optimization. We consider two objectives: minimizing the storage consumption and minimizing the data transfer time. Previous techniques concentrate on the utilization of the SRF only. This is the first paper considering both factors. We present a cost evaluation function in this paper and apply Loop fusion and reordering to improve the performance of stream processors. The experimental results show significant performance improvement.
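
The abstracts above apply Loop fusion to reduce the storage a program needs for intermediate data. The following is a minimal Python sketch of that general idea only, not the authors' cost evaluation function or their reordering step. The function names and the "peak intermediate values" measure are assumptions made for illustration.

```python
def unfused(xs):
    """Two separate loops: the whole intermediate stream is stored."""
    tmp = [x * 2 for x in xs]        # loop 1 writes every intermediate value
    out = [t + 1 for t in tmp]       # loop 2 reads them back
    peak_intermediate = len(tmp)     # all intermediates are live at once
    return out, peak_intermediate


def fused(xs):
    """One fused loop: each intermediate is consumed as soon as it is produced."""
    out = []
    for x in xs:
        t = x * 2                    # only one intermediate value is live
        out.append(t + 1)
    peak_intermediate = 1 if xs else 0
    return out, peak_intermediate


data = list(range(8))
result_a, live_a = unfused(data)
result_b, live_b = fused(data)
```

Both versions compute the same output; the fused loop simply never materialises the intermediate stream, which is the kind of storage saving that fusion targets.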

Zili Shao - One of the best experts on this subject based on the ideXlab platform.

  • Optimizing parallelism for nested Loops with iterational and instructional retiming
    Journal of Embedded Computing, 2009
    Co-Authors: Zili Shao
    Abstract:

    Embedded systems have strict timing and code size requirements. Retiming is one of the most important optimization techniques to improve the execution time of Loops by increasing the parallelism among successive Loop iterations. Traditionally, retiming has been applied at the instruction level to reduce the cycle period for single Loops. While multi-dimensional (MD) retiming can explore the outer Loop parallelism, it introduces large overheads in Loop index generation and code size due to Loop Transformation. In this paper, we propose a novel approach that combines iterational retiming with instructional retiming to satisfy any given timing constraint by achieving full parallelism for iterations in a partition with minimal code size. The experimental results show that, by combining iterational retiming and instructional retiming, we can achieve a 37% code size reduction compared to applying iterational retiming alone.

  • Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping
    The Journal of VLSI Signal Processing Systems for Signal Image and Video Technology, 2007
    Co-Authors: Zili Shao
    Abstract:

    The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism for these nested Loops. Most of the existing Loop Transformation techniques either cannot achieve maximum parallelism, or can achieve maximum parallelism but with complicated Loop bounds and Loop index calculations. This paper proposes a new technique, Loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where all iterations in a stripe are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for Loop striping Transformations. The experimental results show that Loop striping always achieves a better iteration period than software pipelining and Loop unfolding, improving the average iteration period by 50% and 54% respectively.

  • Loop striping: maximize parallelism for nested Loops
    Embedded and Ubiquitous Computing, 2006
    Co-Authors: Zili Shao
    Abstract:

    The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism for these nested Loops. Most of the existing Loop Transformation techniques either cannot achieve maximum parallelism, or can achieve maximum parallelism but with complicated Loop bounds and Loop index calculations. This paper proposes a new technique, Loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where a stripe is a group of iterations in which all iterations are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for Loop striping Transformations. The experimental results show that Loop striping always achieves a better iteration period than software pipelining and Loop unfolding, improving the average iteration period by 50% and 54% respectively.

  • Loop distribution and fusion with timing and code size optimization for embedded DSPs
    Embedded and Ubiquitous Computing, 2005
    Co-Authors: Qingfeng Zhuge, Zili Shao
    Abstract:

    Loop distribution and Loop fusion are two effective Loop Transformation techniques to optimize the execution of programs in DSP applications. In this paper, we propose a new technique combining Loop distribution with direct Loop fusion, which improves the timing performance without jeopardizing the code size. We first develop the Loop distribution theorems that state the legality conditions of Loop distribution for multi-level nested Loops. We show that if the summation of the edge weights of the dependence cycle satisfies a certain condition, then the statements involved in the dependence cycle can be distributed; otherwise, they should be put in the same Loop after Loop distribution. Then, we propose the technique of maximum Loop distribution with direct Loop fusion. The experimental results show that the execution time of the Loops transformed by our technique is reduced by 21.0% compared to the original Loops, and the code size of the transformed Loops is reduced by 7.0% on average compared to the original Loops.
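
The last abstract above ties the legality of Loop distribution to dependence cycles between statements. As a rough illustration of why cycles matter, the Python sketch below distributes one loop whose statements have no dependence cycle and one whose statements do. It does not implement the paper's edge-weight condition; the statements, array names and inputs are hypothetical.

```python
def acyclic_original(b):
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = b[i] + 1          # S1
        c[i] = a[i - 1] * 2      # S2 reads what S1 wrote one iteration earlier
    return a, c


def acyclic_distributed(b):
    # Only S1 -> S2 exists, so running all of S1 first preserves the result.
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = b[i] + 1
    for i in range(1, n):
        c[i] = a[i - 1] * 2
    return a, c


def cyclic_original(b):
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = c[i - 1] + b[i]   # S1 reads S2's result from the previous iteration
        c[i] = a[i] * 2          # S2 reads S1's result from this iteration
    return a, c


def cyclic_distributed_naive(b):
    # S1 -> S2 -> S1 forms a cycle; splitting the loop naively changes the values.
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = c[i - 1] + b[i]
    for i in range(1, n):
        c[i] = a[i] * 2
    return a, c


b = [1, 2, 3, 4]
```

Calling `acyclic_original(b)` and `acyclic_distributed(b)` gives identical arrays, while the two cyclic variants disagree, which is why statements on such a cycle are kept in the same loop in this sketch.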

Francky Catthoor - One of the best experts on this subject based on the ideXlab platform.

  • Incremental hierarchical memory size estimation for steering of Loop Transformations
    ACM Transactions on Design Automation of Electronic Systems, 2007
    Co-Authors: Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Francky Catthoor
    Abstract:

    Modern embedded multimedia and telecommunications systems need to store and access huge amounts of data. This becomes a critical factor for the overall energy consumption, area, and performance of the systems. Loop Transformations are essential to improve the data access locality and regularity in order to optimally design or utilize a memory hierarchy. However, due to abstract high-level cost functions, current Loop Transformation steering techniques do not take the memory platform sufficiently into account. They usually also result in only one final Transformation solution. On the other hand, the Loop Transformation search space for real-life applications is huge, especially if the memory platform is still not fully fixed. Use of existing Loop Transformation techniques will therefore typically lead to suboptimal end-products. It is critical to find all interesting Loop Transformation instances. This can only be achieved by performing an evaluation of the effect of later design stages at the early Loop Transformation stage. This article presents a fast incremental hierarchical memory-size requirement estimation technique. It estimates the influence of any given sequence of Loop Transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high-level design stage, a platform-independent estimation is introduced with a Pareto curve output for each Loop Transformation instance. Comparison among the Pareto curves helps the designer, or a steering tool, to find all interesting Loop Transformation instances that might later lead to low-power data mapping for any of the many possible memory hierarchy instances. Initially, the source code is used as input for estimation. However, performing the estimation repeatedly from the source code is too slow for large search space exploration. An incremental approach, based on local updating of the previous result, is therefore used to handle sequences of different Loop Transformations. Experiments show that the initial approach takes a few seconds, which is two orders of magnitude faster than state-of-the-art solutions but still too costly to be performed interactively many times. The incremental approach typically takes just a few milliseconds, which is another two orders of magnitude faster than the initial approach. This huge speedup allows us for the first time to handle real-life industrial-size applications and get realistic feedback during Loop Transformation exploration.

  • Hierarchical memory size estimation for Loop fusion and Loop shifting in data-dominated applications
    Asia and South Pacific Design Automation Conference, 2006
    Co-Authors: Qubo Hu, Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Eric Brockmeyer, Francky Catthoor
    Abstract:

    Loop fusion and Loop shifting are important Transformations for improving data locality to reduce the number of costly accesses to off-chip memories. Since exploring the exact platform mapping for all the Loop Transformation alternatives is a time-consuming process, heuristics steered by improved data locality are generally used. However, pure locality estimates do not sufficiently take into account the hierarchy of the memory platform. This paper presents a fast, incremental technique for hierarchical memory size requirement estimation for Loop fusion and Loop shifting at the early Loop Transformations design stage. As the exact memory platform is often not yet defined at this stage, we propose a platform-independent approach which reports the Pareto-optimal trade-off points for scratch-pad memory size and off-chip memory accesses. The estimation comes very close to the actual platform mapping, as experiments on realistic test vehicles confirm. It helps the designer or a tool to find the interesting Loop Transformations that should then be investigated in more depth afterward.

  • Loop Transformation Methodologies for Array-Oriented Memory Management
    IEEE 17th International Conference on Application-specific Systems Architectures and Processors (ASAP'06), 2006
    Co-Authors: Florin Balasa, Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Francky Catthoor
    Abstract:

    The storage requirements in data-dominant signal processing systems, whose behavior is described by array-based, Loop-organized algorithmic specifications, have an important impact on the overall energy consumption, data access latency, and chip area. Applying different Loop Transformations on the specification code can significantly enhance the memory management of such VLSI systems, improving all the major parameters of the design space: power, area, and performance. This paper gives a global view on existing and recently proposed memory size evaluation approaches for procedural and non-procedural specifications. Moreover, it discusses typical memory management trade-offs taken into account during the exploration of system specifications by Loop Transformations that can exploit these early size evaluations.

  • Global Loop Transformation Steering
    Data Access and Storage Management for Embedded Programmable Processors, 2002
    Co-Authors: Francky Catthoor, Per Gunnar Kjeldsberg, Koen Danckaert, Chidamber Kulkarni, Erik Brockmeyer, Tanja Van Achteren, Thierry Omnes
    Abstract:

    As motivated, the reorganisation of the Loop structure and the global control flow across the entire application is a crucial initial step in the DTSE (data transfer and storage exploration) flow. Experiments have shown that this is extremely difficult to decide manually due to the many conflicting goals and trade-offs that exist in modern real-life multi-media applications. So an interactive Transformation environment would help but is not sufficient. Therefore, we have devoted a major research effort since 1989 to derive automatic steering techniques in the DTSE context where both “global” access locality and access regularity are crucial.
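
The memory size estimation entries above report Pareto-optimal trade-off points between scratch-pad memory size and off-chip memory accesses. The Python sketch below shows one simple way to filter such points, assuming that lower is better on both axes. The function name and the candidate numbers are hypothetical and are not taken from the papers.

```python
def pareto_front(points):
    """Keep the (size, accesses) points that no other point dominates.

    A point is dominated when another point is no worse on both axes
    and differs from it. Lower is better for both size and accesses.
    """
    front = []
    best_accesses = float("inf")
    # Sort by size, then by accesses, so a point survives only if it
    # needs strictly fewer accesses than every smaller-or-equal size.
    for size, accesses in sorted(set(points)):
        if accesses < best_accesses:
            front.append((size, accesses))
            best_accesses = accesses
    return front


# Hypothetical estimates for several transformation alternatives:
# (scratch-pad size in bytes, off-chip accesses).
candidates = [(64, 900), (128, 400), (128, 500), (256, 400), (512, 150), (1024, 150)]
front = pareto_front(candidates)
```

A designer or steering tool would compare such fronts across transformation alternatives rather than single points, since the final memory platform is not yet fixed at this stage.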

Keshav Pingali - One of the best experts on this subject based on the ideXlab platform.

  • A singular Loop Transformation framework based on non-singular matrices
    International Journal of Parallel Programming, 1994
    Co-Authors: Keshav Pingali
    Abstract:

    In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called Λ-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.

  • A Singular Loop Transformation Framework Based on Non-Singular Matrices
    Languages and Compilers for Parallel Computing, 1993
    Co-Authors: Keshav Pingali
    Abstract:

    In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called Λ-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than the existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.

  • A singular Loop Transformation framework based on non-singular matrices
    Languages and Compilers for Parallel Computing, 1992
    Co-Authors: Keshav Pingali
    Abstract:

    In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called Λ-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than the existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.
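
The abstracts above describe Transformations as integer non-singular matrices that must satisfy all dependences. A common way to check this in matrix-based frameworks is to require that every transformed dependence distance vector stays lexicographically positive. The Python sketch below applies that check to 2x2 matrices. It is an illustration under that assumption only and does not reproduce the papers' completion algorithm or code generation; the matrices and the dependence vector are made-up examples.

```python
def mat_vec(T, d):
    """Multiply matrix T (list of rows) by vector d."""
    return [sum(t * x for t, x in zip(row, d)) for row in T]


def lex_positive(v):
    """True when the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def is_legal(T, deps):
    """A transformation is accepted if every dependence stays lex-positive."""
    return all(lex_positive(mat_vec(T, d)) for d in deps)


def det2(T):
    """Determinant of a 2x2 matrix; non-zero means non-singular."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


deps = [(1, -1)]                   # e.g. a[i][j] depends on a[i-1][j+1]
interchange = [[0, 1], [1, 0]]     # permutation of the two loops
skew = [[1, 0], [1, 1]]            # skew the inner loop by the outer one
skew_then_interchange = [[1, 1], [1, 0]]
scale = [[2, 0], [0, 1]]           # determinant 2: non-singular, not unimodular
```

Here plain interchange is rejected because it maps the dependence to (-1, 1), while skewing first makes interchange acceptable. The scaling matrix has determinant 2, so it falls outside unimodular frameworks but is still a non-singular matrix of the kind the abstracts refer to.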

Yonghong Song - One of the best experts on this subject based on the ideXlab platform.

  • New tiling techniques to improve cache temporal locality
    Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation - PLDI '99, 1999
    Co-Authors: Yonghong Song
    Abstract:

    Tiling is a well-known Loop Transformation to improve temporal locality of nested Loops. Current compiler algorithms for tiling are limited to Loops which are perfectly nested or can be transformed, in trivial ways, into a perfect nest. This paper presents a number of program Transformations to enable tiling for a class of nontrivial imperfectly-nested Loops such that cache locality is improved. We define a program model for such Loops and develop compiler algorithms for their tiling. We propose to adopt odd-even variable duplication to break anti- and output dependences without unduly increasing the working-set size, and to adopt speculative execution to enable tiling of Loops which may terminate prematurely due to, e.g. convergence tests in iterative algorithms. We have implemented these techniques in a research compiler, Panorama. Initial experiments with several benchmark programs are performed on SGI workstations based on MIPS R5K and R10K processors. Overall, the transformed programs run faster by 9% to 164%.
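
The abstract above treats tiling as a Transformation for improving temporal locality. The Python sketch below tiles a perfectly nested matrix multiplication, which is the simple case that the abstract says existing compiler algorithms already handle. It does not cover the imperfectly nested Loops, odd-even variable duplication or speculative execution that the paper addresses. The function names and the default tile size are assumptions, and square integer matrices are assumed.

```python
def matmul(A, B):
    """Reference triple loop over square matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, visited tile by tile to reuse data while it is hot."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):              # loops over tile origins
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # min() handles tiles cut off at the matrix border
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


n = 5
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
```

Python itself will not show a cache benefit here; the point of the sketch is only the changed iteration order, which leaves the result identical to the untiled loop nest.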