Loop Fusion

The Experts below are selected from a list of 837 Experts worldwide ranked by ideXlab platform

Ken Kennedy - One of the best experts on this subject based on the ideXlab platform.

  • Model-guided empirical tuning of Loop Fusion
    International Journal of High Performance Systems Architecture, 2008
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion is recognised as an effective transformation for improving memory hierarchy performance. However, unconstrained Loop Fusion can lead to poor performance because of increased register pressure and cache conflict misses. In this paper, we present a cache-conscious analytical model for profitable Loop Fusion. We use this model to tune Fusion parameters for different architectures through empirical search. Experiments on four different platforms for a set of applications show significant speedup over fully optimised code generated by state-of-the-art commercial compilers.

  • Array syntax compilation and performance tuning
    2007
    Co-Authors: Ken Kennedy, Yuan Zhao
    Abstract:

    Array syntax adds expressive power to a language by providing operations on and assignments to array sections, allowing programmers to write clear and concise code. However, state-of-the-art vendor compilers fail to map array statements efficiently to underlying architectures for high performance. The inefficiency stems from three technical problems that are not solved effectively: (1) reducing the size of allocated temporary arrays; (2) extending solutions to evolving architectures; (3) applying Loop Fusion to multiple array statements. Finding solutions to these problems is important because otherwise array syntax, though a high-level language feature, may not be widely used by application developers. To address these problems, this research first develops a novel strategy that minimizes the allocated temporary arrays using Loop alignment and Loop skewing on scalar processors, thereby reducing memory traffic and improving cache utilization. It then extends the minimization strategy to exploit the increasing on-chip parallelism on evolving architectures that offer vector (e.g., SSE and AltiVec) and multi-core (e.g., CELL) capabilities. In addition, new techniques boost performance by improving data alignment and managing data movement, both of which are important on these new architectures. Last, this dissertation parameterizes Loop Fusion for performance tuning and explores the properties of the space of all possible Loop Fusion configurations, to expedite performance tuning of Loop Fusion for increasing data reuse across multiple array statements. These transformations and optimizations are implemented in a source-to-source research compiler with extensions to target short vector processors and the CELL processor. Experiments show that array statements compiled with our strategy run as much as two times faster than those compiled directly by vendor compilers. Our exploration of the Loop Fusion parameter space identifies good candidates for heuristic searching and space pruning, which are essential to make the performance tuning process practical. In summary, this dissertation demonstrates that advanced compilation techniques can significantly improve the performance of programs written in array syntax over current state-of-the-art implementations across a variety of architectures, including the latest multi-core processors with vector capabilities.

  • ICS - Profitable Loop Fusion and tiling using model-driven empirical search
    Proceedings of the 20th annual international conference on Supercomputing - ICS '06, 2006
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion and tiling are both recognized as effective transformations for improving memory performance of scientific applications. However, because of their sensitivity to the underlying cache architecture and their interaction with each other it is difficult to determine a good heuristic for applying these transformations profitably across architectures. In this paper, we present a model-guided empirical tuning strategy for profitable application of Loop Fusion and tiling. Our strategy consists of a detailed cost model that characterizes the interaction between the two transformations at different levels of the memory hierarchy. The novelty of our approach is in exposing key architectural parameters within the model for automatic tuning through empirical search. Preliminary experiments with a set of applications on four different platforms show that our strategy achieves significant performance improvement over fully optimized code generated by state-of-the-art commercial compilers. The time spent in searching for the best parameters is considerably less than with other search strategies.

  • A cache-conscious profitability model for empirical tuning of Loop Fusion
    Lecture Notes in Computer Science, 2006
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion is recognized as an effective program transformation for improving memory hierarchy performance. However, unconstrained Loop Fusion can lead to poor performance because of increased register pressure and cache conflict misses. The complex interaction between different levels of the memory hierarchy with the input program makes it very difficult to always make the right choice in fusing Loops. In this paper, we present a cache-conscious analytical model for profitable Loop Fusion to be used with a constrained weighted Fusion algorithm. We then extend the model to show its effectiveness in the context of an empirical tuning framework. A preliminary evaluation of the model is presented using hand experiments on four applications.
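
The abstracts above describe Loop Fusion as a way to improve memory hierarchy performance by reusing data while it is still in a register or cache. A minimal sketch of the transformation itself, written in Python only for readability; the function names `two_loops` and `one_fused_loop` are illustrative and do not come from the cited papers, and Python does not expose the cache effects the papers measure:

```python
def two_loops(a, b):
    """Unfused version: the array `a` is traversed twice."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):      # first loop reads a[i]
        c[i] = a[i] + b[i]
    for i in range(n):      # second loop reads a[i] again
        d[i] = a[i] * 2.0
    return c, d


def one_fused_loop(a, b):
    """Fused version: a[i] is loaded once and reused in the same iteration."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        x = a[i]            # single load, two uses
        c[i] = x + b[i]
        d[i] = x * 2.0
    return c, d
```

Both functions compute the same result; the fused body is larger, which is the source of the register-pressure concern the abstracts mention.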

P Sadayappan - One of the best experts on this subject based on the ideXlab platform.

  • IPDPS - Memory minimization for tensor contractions using integer linear programming
    Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
    Co-Authors: Abdul Allam, Jeyakumar Ramanujam, Gerald Baumgartner, P Sadayappan
    Abstract:

    This paper presents a technique for memory optimization for a class of computations that arises in the field of correlated electronic structure methods such as coupled cluster and configuration interaction methods in quantum chemistry. In this class of computations, Loop computations perform a multi-dimensional sum of products of input arrays. There are many different ways to get the same final results that differ in the number of arithmetic operations required. In addition, for a given number of arithmetic operations, different expressions of the Loop have different memory requirements. Loop Fusion is a plausible solution for reducing memory usage. By fusing Loops between a producer Loop nest and a consumer Loop nest, the storage required for the intermediate array is reduced by the range of the fused Loop. Because resultant Loops have to be legal after Fusion, some Loops cannot be fused at the same time. In this paper, we have developed a novel integer linear programming (ILP) formulation that is shown to be highly effective on a number of test cases, producing the optimal solutions using very small execution times. The main idea in the ILP formulation is the encoding of legality rules for Loop Fusion of a special class of Loops using logical constraints over binary decision variables and a highly effective approximation of memory usage.

  • Memory-constrained communication minimization for a class of array computations
    Lecture Notes in Computer Science, 2005
    Co-Authors: Daniel Cociorva, P Sadayappan, Gerald Baumgartner, Jeyakumar Ramanujam
    Abstract:

    The accurate modeling of the electronic structure of atoms and molecules requires computationally intensive tensor contractions over large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate Loop Fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between Loop Fusion and memory usage, we develop an approach to identify the best combination of Loop Fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.

  • Memory-constrained data locality optimization for tensor contractions
    Lecture Notes in Computer Science, 2004
    Co-Authors: Alina Bibireata, Jeyakumar Ramanujam, P Sadayappan, Daniel Cociorva, Gerald Baumgartner, Sandhya Krishnan, David E. Bernholdt, Venkatesh Choppella
    Abstract:

    The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions over large multi-dimensional arrays. Efficient computation of these contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, requiring their storage on disk. However, the intermediates can often be generated and used in batches through appropriate Loop Fusion transformations. To optimize the performance of such computations a combination of Loop Fusion and Loop tiling is required, so that the cost of disk I/O is minimized. In this paper, we address the memory-constrained data-locality optimization problem in the context of this class of computations. We develop an optimization framework to search among a space of Fusion and tiling choices to minimize the data movement overhead. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.

  • HiPC - Data locality optimization for synthesis of efficient out-of-core algorithms
    High Performance Computing - HiPC 2003, 2003
    Co-Authors: Sandhya Krishnan, Jeyakumar Ramanujam, P Sadayappan, Daniel Cociorva, Gerald Baumgartner, David E. Bernholdt, Sriram Krishnamoorthy, Venkatesh Choppella
    Abstract:

    This paper describes an approach to synthesis of efficient out-of-core code for a class of imperfectly nested Loops that represent tensor contraction computations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach combines Loop Fusion with Loop tiling and uses a performance-model driven approach to Loop tiling for the generation of out-of-core code. Experimental measurements are provided that show a good match with model-based predictions and demonstrate the effectiveness of the proposed algorithm.
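
The tensor-contraction abstracts above note that fusing a producer Loop nest with a consumer Loop nest shrinks the intermediate array. A small sketch under stated assumptions: the two-step contraction (T = A·B followed by R = T·C) and the names `contract_unfused` and `contract_fused` are hypothetical examples chosen for illustration, not the computations used in the cited papers.

```python
def contract_unfused(A, B, C):
    """Producer fills the full n x m intermediate T; consumer reads it afterwards."""
    n, p, m = len(A), len(B), len(B[0])
    T = [[0.0] * m for _ in range(n)]       # n * m elements of temporary storage
    for i in range(n):                      # producer nest: T = A . B
        for j in range(m):
            for k in range(p):
                T[i][j] += A[i][k] * B[k][j]
    R = [0.0] * n
    for i in range(n):                      # consumer nest: R = T . C
        for j in range(m):
            R[i] += T[i][j] * C[j]
    return R


def contract_fused(A, B, C):
    """Fusing the i and j loops lets the intermediate shrink to one scalar."""
    n, p, m = len(A), len(B), len(B[0])
    R = [0.0] * n
    for i in range(n):
        for j in range(m):
            t = 0.0                         # replaces T[i][j]
            for k in range(p):
                t += A[i][k] * B[k][j]
            R[i] += t * C[j]                # consumed immediately
    return R
```

Fusing only the `i` loop would leave a one-dimensional temporary of length `m`; fusing both `i` and `j`, as here, removes both dimensions of `T`.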

Francky Catthoor - One of the best experts on this subject based on the ideXlab platform.

  • ASP-DAC - Hierarchical memory size estimation for Loop Fusion and Loop shifting in data-dominated applications
    Proceedings of the 2006 conference on Asia South Pacific design automation - ASP-DAC '06, 2006
    Co-Authors: Qubo Hu, Martin Palkovic, Per Gunnar Kjeldsberg, Arnout Vandecappelle, E Brockmeyer, Francky Catthoor
    Abstract:

    Loop Fusion and Loop shifting are important transformations for improving data locality to reduce the number of costly accesses to off-chip memories. Since exploring the exact platform mapping for all the Loop transformation alternatives is a time-consuming process, heuristics steered by improved data locality are generally used. However, pure locality estimates do not sufficiently take into account the hierarchy of the memory platform. This paper presents a fast, incremental technique for hierarchical memory size requirement estimation for Loop Fusion and Loop shifting at the early Loop transformations design stage. As the exact memory platform is often not yet defined at this stage, we propose a platform-independent approach which reports the Pareto-optimal trade-off points for scratch-pad memory size and off-chip memory accesses. The estimation comes very close to the actual platform mapping, as experiments on realistic test vehicles confirm. It helps the designer or a tool to find the interesting Loop transformations that should then be investigated in more depth.

  • CODES+ISSS - Optimizing the memory bandwidth with Loop Fusion
    Proceedings of the 2nd IEEE ACM IFIP international conference on Hardware software codesign and system synthesis - CODES+ISSS '04, 2004
    Co-Authors: Paul Marchal, J I Gomez, Francky Catthoor
    Abstract:

    The memory bandwidth largely determines the performance and energy cost of embedded systems. At the compiler level, several techniques improve the memory bandwidth at the scope of a basic block, but often fail to exploit all. We propose a technique to optimize the memory bandwidth across the boundaries of a basic block. Our technique incrementally fuses Loops to better use the available bandwidth. The resulting performance depends on how the data is assigned to the memories of the memory layer. At the same time, the assignment also strongly influences the energy cost. Therefore, our approach combines the Fusion and assignment decisions. Designers can use our output to trade off the energy cost against the system's performance.
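
Loop shifting, mentioned alongside Loop Fusion in the abstracts above, delays one loop's iterations relative to another so that Fusion becomes legal. A minimal sketch, assuming a hypothetical two-loop stencil in which the consumer reads `b[i + 1]`; the names `stencil_unfused` and `stencil_shifted_fused` are illustrative only:

```python
def stencil_unfused(a):
    """The consumer reads b[i + 1], so fusing the loops directly would be illegal."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * max(n - 1, 0)
    for i in range(n):
        b[i] = 2.0 * a[i]
    for i in range(n - 1):
        c[i] = b[i] + b[i + 1]
    return c


def stencil_shifted_fused(a):
    """Shift the consumer by one iteration: c[i - 1] runs once b[i] exists."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * max(n - 1, 0)
    for i in range(n):
        b[i] = 2.0 * a[i]
        if i >= 1:                       # shifted consumer iteration
            c[i - 1] = b[i - 1] + b[i]
    return c
```

After the shift, only `b[i - 1]` and `b[i]` are live at any point, which is the kind of reduction in buffer size that a memory size estimate would have to capture.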

Jeyakumar Ramanujam - One of the best experts on this subject based on the ideXlab platform.

  • IPDPS - Memory minimization for tensor contractions using integer linear programming
    Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
    Co-Authors: Abdul Allam, Jeyakumar Ramanujam, Gerald Baumgartner, P Sadayappan
    Abstract:

    This paper presents a technique for memory optimization for a class of computations that arises in the field of correlated electronic structure methods such as coupled cluster and configuration interaction methods in quantum chemistry. In this class of computations, Loop computations perform a multi-dimensional sum of products of input arrays. There are many different ways to get the same final results that differ in the number of arithmetic operations required. In addition, for a given number of arithmetic operations, different expressions of the Loop have different memory requirements. Loop Fusion is a plausible solution for reducing memory usage. By fusing Loops between a producer Loop nest and a consumer Loop nest, the storage required for the intermediate array is reduced by the range of the fused Loop. Because resultant Loops have to be legal after Fusion, some Loops cannot be fused at the same time. In this paper, we have developed a novel integer linear programming (ILP) formulation that is shown to be highly effective on a number of test cases, producing the optimal solutions using very small execution times. The main idea in the ILP formulation is the encoding of legality rules for Loop Fusion of a special class of Loops using logical constraints over binary decision variables and a highly effective approximation of memory usage.

  • Memory-constrained communication minimization for a class of array computations
    Lecture Notes in Computer Science, 2005
    Co-Authors: Daniel Cociorva, P Sadayappan, Gerald Baumgartner, Jeyakumar Ramanujam
    Abstract:

    The accurate modeling of the electronic structure of atoms and molecules requires computationally intensive tensor contractions over large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate Loop Fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between Loop Fusion and memory usage, we develop an approach to identify the best combination of Loop Fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.

  • Memory-constrained data locality optimization for tensor contractions
    Lecture Notes in Computer Science, 2004
    Co-Authors: Alina Bibireata, Jeyakumar Ramanujam, P Sadayappan, Daniel Cociorva, Gerald Baumgartner, Sandhya Krishnan, David E. Bernholdt, Venkatesh Choppella
    Abstract:

    The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions over large multi-dimensional arrays. Efficient computation of these contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, requiring their storage on disk. However, the intermediates can often be generated and used in batches through appropriate Loop Fusion transformations. To optimize the performance of such computations a combination of Loop Fusion and Loop tiling is required, so that the cost of disk I/O is minimized. In this paper, we address the memory-constrained data-locality optimization problem in the context of this class of computations. We develop an optimization framework to search among a space of Fusion and tiling choices to minimize the data movement overhead. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.

  • HiPC - Data locality optimization for synthesis of efficient out-of-core algorithms
    High Performance Computing - HiPC 2003, 2003
    Co-Authors: Sandhya Krishnan, Jeyakumar Ramanujam, P Sadayappan, Daniel Cociorva, Gerald Baumgartner, David E. Bernholdt, Sriram Krishnamoorthy, Venkatesh Choppella
    Abstract:

    This paper describes an approach to synthesis of efficient out-of-core code for a class of imperfectly nested Loops that represent tensor contraction computations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach combines Loop Fusion with Loop tiling and uses a performance-model driven approach to Loop tiling for the generation of out-of-core code. Experimental measurements are provided that show a good match with model-based predictions and demonstrate the effectiveness of the proposed algorithm.
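
Several abstracts above combine Loop Fusion with Loop tiling so that a large intermediate is generated and consumed in batches. A rough sketch of that idea, assuming the same kind of hypothetical two-step contraction (T = A·B, then R = T·C); the function name `contract_tiled` and the `tile` parameter are illustrative and not taken from the cited papers:

```python
def contract_tiled(A, B, C, tile=2):
    """Produce and consume the intermediate T one batch of `tile` rows at a time."""
    n, p, m = len(A), len(B), len(B[0])
    R = [0.0] * n
    for i0 in range(0, n, tile):                 # tiled, fused outer loop
        rows = min(tile, n - i0)
        T = [[0.0] * m for _ in range(rows)]     # batch-sized buffer, not n x m
        for ii in range(rows):                   # producer for this batch
            for j in range(m):
                for k in range(p):
                    T[ii][j] += A[i0 + ii][k] * B[k][j]
        for ii in range(rows):                   # consumer for this batch
            for j in range(m):
                R[i0 + ii] += T[ii][j] * C[j]
    return R
```

The tile size controls the trade-off: a larger `tile` needs more buffer space, while a smaller one means more passes, which in an out-of-core setting would translate into more I/O operations.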

Apan Qasem - One of the best experts on this subject based on the ideXlab platform.

  • Model-guided empirical tuning of Loop Fusion
    International Journal of High Performance Systems Architecture, 2008
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion is recognised as an effective transformation for improving memory hierarchy performance. However, unconstrained Loop Fusion can lead to poor performance because of increased register pressure and cache conflict misses. In this paper, we present a cache-conscious analytical model for profitable Loop Fusion. We use this model to tune Fusion parameters for different architectures through empirical search. Experiments on four different platforms for a set of applications show significant speedup over fully optimised code generated by state-of-the-art commercial compilers.

  • ICS - Profitable Loop Fusion and tiling using model-driven empirical search
    Proceedings of the 20th annual international conference on Supercomputing - ICS '06, 2006
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion and tiling are both recognized as effective transformations for improving memory performance of scientific applications. However, because of their sensitivity to the underlying cache architecture and their interaction with each other it is difficult to determine a good heuristic for applying these transformations profitably across architectures. In this paper, we present a model-guided empirical tuning strategy for profitable application of Loop Fusion and tiling. Our strategy consists of a detailed cost model that characterizes the interaction between the two transformations at different levels of the memory hierarchy. The novelty of our approach is in exposing key architectural parameters within the model for automatic tuning through empirical search. Preliminary experiments with a set of applications on four different platforms show that our strategy achieves significant performance improvement over fully optimized code generated by state-of-the-art commercial compilers. The time spent in searching for the best parameters is considerably less than with other search strategies.

  • A cache-conscious profitability model for empirical tuning of Loop Fusion
    Lecture Notes in Computer Science, 2006
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion is recognized as an effective program transformation for improving memory hierarchy performance. However, unconstrained Loop Fusion can lead to poor performance because of increased register pressure and cache conflict misses. The complex interaction between different levels of the memory hierarchy with the input program makes it very difficult to always make the right choice in fusing Loops. In this paper, we present a cache-conscious analytical model for profitable Loop Fusion to be used with a constrained weighted Fusion algorithm. We then extend the model to show its effectiveness in the context of an empirical tuning framework. A preliminary evaluation of the model is presented using hand experiments on four applications.

  • LCPC - A cache-conscious profitability model for empirical tuning of Loop Fusion
    Languages and Compilers for Parallel Computing, 2005
    Co-Authors: Apan Qasem, Ken Kennedy
    Abstract:

    Loop Fusion is recognized as an effective program transformation for improving memory hierarchy performance. However, unconstrained Loop Fusion can lead to poor performance because of increased register pressure and cache conflict misses. The complex interaction between different levels of the memory hierarchy with the input program makes it very difficult to always make the right choice in fusing Loops. In this paper, we present a cache-conscious analytical model for profitable Loop Fusion to be used with a constrained weighted Fusion algorithm. We then extend the model to show its effectiveness in the context of an empirical tuning framework. A preliminary evaluation of the model is presented using hand experiments on four applications.
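
The abstracts above pair an analytical model with empirical search: the model narrows the parameter space and timing runs pick among the survivors. The following is only a generic sketch of that workflow, not the authors' cost model. The footprint formula, the cache size, the toy `kernel`, and all function names are assumptions made for illustration.

```python
import time


def footprint_bytes(tile, arrays=3, elem_bytes=8):
    """Hypothetical model: one tile touches `arrays` arrays of `tile` elements."""
    return tile * arrays * elem_bytes


def prune(candidates, cache_bytes):
    """Model-guided step: keep only tile sizes whose footprint fits in cache."""
    return [t for t in candidates if footprint_bytes(t) <= cache_bytes]


def kernel(n, tile):
    """Toy tiled loop standing in for a fused, tiled loop nest."""
    a = list(range(n))
    total = 0
    for i0 in range(0, n, tile):
        for i in range(i0, min(i0 + tile, n)):
            total += a[i] * 2
    return total


def empirical_search(candidates, n=10_000, repeats=3):
    """Empirical step: time each surviving candidate and return the fastest."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        elapsed = float("inf")
        for _ in range(repeats):                 # keep the best of a few runs
            start = time.perf_counter()
            kernel(n, tile)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```

A typical use would be `empirical_search(prune([16, 64, 256, 4096], cache_bytes=8192))`; the pruning step is what keeps the number of timed runs small.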