Loop Transformation

The Experts below are selected from a list of 19146 Experts worldwide ranked by ideXlab platform

Enhong Chen - One of the best experts on this subject based on the ideXlab platform.

Loop fusion and reordering for register file optimization on stream processors

Journal of Systems and Software, 2012

Co-Authors: Wanyong Tian, Minming Li, Enhong Chen

Abstract:

Stream processors are gaining popularity and are being deployed in many multimedia and scientific applications. The stream register file (SRF) is a non-bypassing, software-managed on-chip memory. Unlike conventional register files, the SRF must hold all input data while a program is being executed. It is a critical resource in stream processors. When a program is loaded from off-chip memory into the SRF for execution, storage consumption and data transfer time are the two key factors that affect performance. This work applies Loop Transformation to programs for SRF optimization. We consider two objectives: minimizing the storage consumption and minimizing the data transfer time. Previous techniques concentrate on the utilization of the SRF only. This is the first paper to consider both factors. We present a cost evaluation function and apply Loop fusion and reordering to improve the performance of stream processors. The experimental results show significant performance improvement.
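
As a rough illustration of why Loop fusion can reduce storage, the sketch below fuses a producer loop with its consumer so that the intermediate array is no longer needed. It is a generic textbook example, not the paper's cost function or reordering method; the function names `unfused` and `fused` and the toy arithmetic are assumptions made for illustration only.

```python
def unfused(a):
    """Two separate loops: the first fills a full intermediate array."""
    n = len(a)
    tmp = [0] * n              # intermediate storage of size n
    for i in range(n):
        tmp[i] = a[i] * 2      # producer loop
    out = [0] * n
    for i in range(n):
        out[i] = tmp[i] + 1    # consumer loop
    return out


def fused(a):
    """One fused loop: the intermediate value lives in a scalar."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        t = a[i] * 2           # produced and consumed in the same iteration
        out[i] = t + 1
    return out
```

Both versions compute the same result; the fused one simply never materializes `tmp`, which is the kind of storage saving that matters when on-chip memory is scarce.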

Loop fusion and reordering for register file optimization on stream processors

ACM Symposium on Applied Computing, 2011

Co-Authors: Wanyong Tian, Minming Li, Enhong Chen

Abstract:

Stream processors are gaining popularity and are being deployed in many multimedia and scientific applications. The Stream Register File (SRF) is a non-bypassing, software-managed on-chip memory. It is a critical resource in stream processors. When a program is loaded from off-chip memory into the SRF for execution, storage consumption and data transfer time are the two key factors that affect performance. This work applies Loop Transformation to programs for SRF optimization. We consider two objectives: minimizing the storage consumption and minimizing the data transfer time. Previous techniques concentrate on the utilization of the SRF only. This is the first paper to consider both factors. We present a cost evaluation function and apply Loop fusion and reordering to improve the performance of stream processors. The experimental results show significant performance improvement.

Zili Shao - One of the best experts on this subject based on the ideXlab platform.

Optimizing parallelism for nested Loops with iterational and instructional retiming

Journal of Embedded Computing, 2009

Co-Authors: Zili Shao

Abstract:

Embedded systems have strict timing and code size requirements. Retiming is one of the most important optimization techniques for improving the execution time of Loops by increasing the parallelism among successive Loop iterations. Traditionally, retiming has been applied at the instruction level to reduce the cycle period of single Loops. While multi-dimensional (MD) retiming can explore outer Loop parallelism, it introduces large overheads in Loop index generation and code size due to Loop Transformation. In this paper, we propose a novel approach that combines iterational retiming with instructional retiming to satisfy any given timing constraint by achieving full parallelism for iterations in a partition with minimal code size. The experimental results show that, by combining iterational retiming and instructional retiming, we can achieve a 37% code size reduction compared with applying iterational retiming alone.

Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping

The Journal of VLSI Signal Processing Systems for Signal Image and Video Technology, 2007

Co-Authors: Zili Shao

Abstract:

The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism for these nested Loops. Most of the existing Loop Transformation techniques either cannot achieve maximum parallelism, or can achieve maximum parallelism but only with complicated Loop bound and Loop index calculations. This paper proposes a new technique, Loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where all iterations in a stripe are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for Loop striping Transformations. The experimental results show that Loop striping always achieves a better iteration period than software pipelining and Loop unfolding, improving the average iteration period by 50% and 54%, respectively.

EUC - Loop striping: maximize parallelism for nested Loops

Embedded and Ubiquitous Computing, 2006

Co-Authors: Zili Shao

Abstract:

The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase parallelism for these nested Loops. Most of the existing Loop Transformation techniques either cannot achieve maximum parallelism, or can achieve maximum parallelism but only with complicated Loop bound and Loop index calculations. This paper proposes a new technique, Loop striping, that can maximize parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where a stripe is a group of iterations in which all iterations are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for Loop striping Transformations. The experimental results show that Loop striping always achieves a better iteration period than software pipelining and Loop unfolding, improving the average iteration period by 50% and 54%, respectively.

Loop distribution and fusion with timing and code size optimization for embedded DSPs

Embedded and Ubiquitous Computing, 2005

Co-Authors: Qingfeng Zhuge, Zili Shao

Abstract:

Loop distribution and Loop fusion are two effective Loop Transformation techniques to optimize the execution of the programs in DSP applications. In this paper, we propose a new technique combining Loop distribution with direct Loop fusion, which will improve the timing performance without jeopardizing the code size. We first develop the Loop distribution theorems that state the legality conditions of Loop distribution for multi-level nested Loops. We show that if the summation of the edge weights of the dependence cycle satisfies a certain condition, then the statements involved in the dependence cycle can be distributed; otherwise, they should be put in the same Loop after Loop distribution. Then, we propose the technique of maximum Loop distribution with direct Loop fusion. The experimental results show that the execution time of the transformed Loops by our technique is reduced by 21.0% compared to the original Loops and the code size of the transformed Loops is reduced by 7.0% on average compared to the original Loops.
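
The legality question for Loop distribution can be illustrated with a small single-loop sketch. It shows only the classic textbook rule (statements on a dependence cycle must stay in the same loop), not the paper's theorems for multi-level nested Loops or its edge-weight condition. All function names and the toy statements S1 and S2 are hypothetical.

```python
def acyclic_original(b):
    """S1 feeds S2 across iterations, but S2 never feeds S1: no cycle."""
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = b[i] + 1          # S1
        c[i] = a[i - 1] * 2      # S2 reads what S1 wrote one iteration earlier
    return a, c


def acyclic_distributed(b):
    """Legal distribution: all of S1 runs first, then all of S2."""
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = b[i] + 1          # S1
    for i in range(1, n):
        c[i] = a[i - 1] * 2      # S2
    return a, c


def cyclic_original(b):
    """S1 -> S2 (same iteration) and S2 -> S1 (next iteration): a cycle."""
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = c[i - 1] + b[i]   # S1 needs c from the previous iteration
        c[i] = a[i] * 2          # S2 needs a from this iteration
    return a, c


def cyclic_naively_distributed(b):
    """Illegal distribution: S1 reads c values that S2 has not produced yet."""
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(1, n):
        a[i] = c[i - 1] + b[i]
    for i in range(1, n):
        c[i] = a[i] * 2
    return a, c
```

In the acyclic case both versions agree, while splitting the cyclic loop changes the result, which is why such statements are kept in one loop.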

Francky Catthoor - One of the best experts on this subject based on the ideXlab platform.

Incremental hierarchical memory size estimation for steering of Loop Transformations

ACM Transactions on Design Automation of Electronic Systems, 2007

Co-Authors: Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Francky Catthoor

Abstract:

Modern embedded multimedia and telecommunications systems need to store and access huge amounts of data. This becomes a critical factor for the overall energy consumption, area, and performance of the systems. Loop Transformations are essential to improve the data access locality and regularity in order to optimally design or utilize a memory hierarchy. However, due to abstract high-level cost functions, current Loop Transformation steering techniques do not take the memory platform sufficiently into account. They usually also result in only one final Transformation solution. On the other hand, the Loop Transformation search space for real-life applications is huge, especially if the memory platform is still not fully fixed. Use of existing Loop Transformation techniques will therefore typically lead to suboptimal end-products. It is critical to find all interesting Loop Transformation instances. This can only be achieved by performing an evaluation of the effect of later design stages at the early Loop Transformation stage. This article presents a fast incremental hierarchical memory-size requirement estimation technique. It estimates the influence of any given sequence of Loop Transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high-level design stage, a platform-independent estimation is introduced with a Pareto curve output for each Loop Transformation instance. Comparison among the Pareto curves helps the designer, or a steering tool, to find all interesting Loop Transformation instances that might later lead to low-power data mapping for any of the many possible memory hierarchy instances. Initially, the source code is used as input for estimation. However, performing the estimation repeatedly from the source code is too slow for large search space exploration. 
An incremental approach, based on local updating of the previous result, is therefore used to handle sequences of different Loop Transformations. Experiments show that the initial approach takes a few seconds, which is two orders of magnitude faster than state-of-the-art solutions but still too costly to be performed interactively many times. The incremental approach typically takes just a few milliseconds, which is another two orders of magnitude faster than the initial approach. This huge speedup allows us for the first time to handle real-life industrial-size applications and get realistic feedback during Loop Transformation exploration.

Hierarchical memory size estimation for Loop fusion and Loop shifting in data dominated applications

Asia and South Pacific Design Automation Conference, 2006

Co-Authors: Qubo Hu, Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Eric Brockmeyer, Francky Catthoor

Abstract:

Loop fusion and Loop shifting are important Transformations for improving data locality to reduce the number of costly accesses to off-chip memories. Since exploring the exact platform mapping for all the Loop Transformation alternatives is a time consuming process, heuristics steered by improved data locality are generally used. However, pure locality estimates do not sufficiently take into account the hierarchy of the memory platform. This paper presents a fast, incremental technique for hierarchical memory size requirement estimation for Loop fusion and Loop shifting at the early Loop Transformations design stage. As the exact memory platform is often not yet defined at this stage, we propose a platform-independent approach which reports the Pareto-optimal trade-off points for scratch-pad memory size and off-chip memory accesses. The estimation comes very close to the actual platform mapping. Experiments on realistic test-vehicles confirm that. It helps the designer or a tool to find the interesting Loop Transformations that should then be investigated in more depth afterward.
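
To make the notion of Pareto-optimal trade-off points concrete, the sketch below filters a set of candidate (scratch-pad size, off-chip accesses) pairs down to those not dominated by any other candidate, with both quantities minimized. It is not the paper's estimation technique; the function name `pareto_front` and the candidate numbers in the usage example are invented for illustration.

```python
def pareto_front(points):
    """Return the (size, accesses) pairs not dominated by any other pair.

    A pair is dominated when another pair is no worse in both coordinates
    and strictly better in at least one. Both coordinates are minimized.
    """
    front = []
    best_accesses = float("inf")
    # After sorting by size (then accesses), a point is on the front only
    # if it strictly improves on the fewest accesses seen so far.
    for size, accesses in sorted(set(points)):
        if accesses < best_accesses:
            front.append((size, accesses))
            best_accesses = accesses
    return front


# Hypothetical candidates: (scratch-pad size in bytes, off-chip accesses).
candidates = [(64, 900), (128, 400), (128, 700), (256, 400), (512, 150)]
print(pareto_front(candidates))
```

Here `(128, 700)` and `(256, 400)` are dropped because `(128, 400)` is at least as good in both dimensions.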

ASAP - Loop Transformation Methodologies for Array-Oriented Memory Management

IEEE 17th International Conference on Application-specific Systems Architectures and Processors (ASAP'06), 2006

Co-Authors: Florin Balasa, Per Gunnar Kjeldsberg, Arnout Vandecappelle, Martin Palkovic, Francky Catthoor

Abstract:

The storage requirements in data-dominant signal processing systems, whose behavior is described by array-based, Loop-organized algorithmic specifications, have an important impact on the overall energy consumption, data access latency, and chip area. Applying different Loop Transformations on the specification code can significantly enhance the memory management of such VLSI systems, improving all the major parameters of the design space - power, area, and performance. This paper gives a global view on existing and recently proposed memory size evaluation approaches for procedural and non-procedural specifications. Moreover, it discusses typical memory management trade-offs taken into account during the exploration of system specifications by Loop Transformations, that can exploit these early size evaluations.

Global Loop Transformation Steering

Data Access and Storage Management for Embedded Programmable Processors, 2002

Co-Authors: Francky Catthoor, Per Gunnar Kjeldsberg, Koen Danckaert, Chidamber Kulkarni, Erik Brockmeyer, Tanja Van Achteren, Thierry Omnes

Abstract:

As motivated, the reorganisation of the Loop structure and the global control flow across the entire application is a crucial initial step in the DTSE flow. Experiments have shown that this is extremely difficult to decide manually due to the many conflicting goals and trade-offs that exist in modern real-life multi-media applications. So an interactive Transformation environment would help but is not sufficient. Therefore, we have devoted a major research effort since 1989 to derive automatic steering techniques in the DTSE context where both “global” access locality and access regularity are crucial.

Keshav Pingali - One of the best experts on this subject based on the ideXlab platform.

A singular Loop Transformation framework based on non-singular matrices

International Journal of Parallel Programming, 1994

Co-Authors: Keshav Pingali

Abstract:

In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called Λ-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.
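
A small sketch may help show what it means for a Transformation matrix to satisfy all dependences. It applies a 2x2 integer matrix to dependence distance vectors and uses the standard criterion that every transformed vector must remain lexicographically positive. This is not the paper's completion algorithm or its lattice-based code generation; the helper names and the example dependence set are assumptions.

```python
def mat_vec(t, v):
    """Multiply an integer matrix (list of rows) by a vector."""
    return tuple(sum(row[k] * v[k] for k in range(len(v))) for row in t)


def lex_positive(v):
    """True if the first non-zero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


def det2(t):
    """Determinant of a 2x2 matrix."""
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def is_legal(t, deps):
    """t must be non-singular and keep every dependence lexicographically positive."""
    return det2(t) != 0 and all(lex_positive(mat_vec(t, d)) for d in deps)


# Hypothetical dependence distance vectors for a doubly nested loop (i, j).
deps = [(1, -1), (0, 1)]

interchange = [[0, 1], [1, 0]]   # permutation
skew = [[1, 0], [1, 1]]          # skew the inner loop by the outer index
scale = [[2, 0], [0, 1]]         # scaling: determinant 2, so not unimodular

print(is_legal(interchange, deps), is_legal(skew, deps), is_legal(scale, deps))
```

Interchange is rejected because it maps `(1, -1)` to `(-1, 1)`, whereas skewing and scaling keep both dependences positive. The scaling matrix is the case that a purely unimodular framework would not cover.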

LCPC - A Singular Loop Transformation Framework Based on Non-Singular Matrices

Languages and Compilers for Parallel Computing, 1993

Co-Authors: Keshav Pingali

Abstract:

In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called $\Lambda$-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than the existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.

A singular Loop Transformation framework based on non-singular matrices

Languages and Compilers for Parallel Computing, 1992

Co-Authors: Keshav Pingali

Abstract:

In this paper, we discuss a Loop Transformation framework that is based on integer non-singular matrices. The Transformations included in this framework are called $\Lambda$-Transformations and include permutation, skewing and reversal, as well as a Transformation called Loop scaling. This framework is more general than the existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial Transformation matrix, produces a full Transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.

Yonghong Song - One of the best experts on this subject based on the ideXlab platform.

New tiling techniques to improve cache temporal locality

Programming Language Design and Implementation, 1999

Co-Authors: Yonghong Song

Abstract:

Tiling is a well-known Loop Transformation to improve temporal locality of nested Loops. Current compiler algorithms for tiling are limited to Loops which are perfectly nested or can be transformed, in trivial ways, into a perfect nest. This paper presents a number of program Transformations to enable tiling for a class of nontrivial imperfectly-nested Loops such that cache locality is improved. We define a program model for such Loops and develop compiler algorithms for their tiling. We propose to adopt odd-even variable duplication to break anti- and output dependences without unduly increasing the working-set size, and to adopt speculative execution to enable tiling of Loops which may terminate prematurely due to, e.g. convergence tests in iterative algorithms. We have implemented these techniques in a research compiler, Panorama. Initial experiments with several benchmark programs are performed on SGI workstations based on MIPS R5K and R10K processors. Overall, the transformed programs run faster by 9% to 164%.
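
For readers unfamiliar with tiling, the sketch below shows the basic Transformation on a perfectly nested matrix multiplication. It covers only the well-known perfect-nest case that the abstract takes as its starting point, not the paper's techniques for imperfectly nested Loops, odd-even variable duplication, or speculative execution. The function names and the default tile size are assumptions.

```python
def matmul(a, b, n):
    """Reference triple loop: c = a * b for n x n integer matrices."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=2):
    """Same computation, visiting the iteration space tile by tile.

    The three outer loops step over tile origins; the three inner loops
    stay inside one tile, so the same small blocks of a, b and c are
    reused before moving on.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size. In Python the cache benefit is not observable; the point is only the reordered iteration structure.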
