Loop Interchange

The experts below are selected from a list of 993 experts worldwide, as ranked by the ideXlab platform.

Stamatis Vassiliadis - One of the best experts on this subject based on the ideXlab platform.

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    2009
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    IEEE Transactions on Multimedia, 2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.

  • Implementing the 2-D wavelet transform on SIMD-enhanced general-purpose processors
    2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD
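
The loop interchange mentioned in the abstracts above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the plain list-of-rows image layout, and the absence of downsampling, 64K-aliasing avoidance, and SIMD are all assumptions made for illustration. Python is used only for readability; the cache benefit of the reordering would show up in a compiled language with row-major arrays.

```python
def vertical_filter_column_outer(img, taps):
    """Straightforward order (j, i, k): the inner loops walk down one column,
    so consecutive accesses are a full row apart in a row-major layout."""
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0] * cols for _ in range(out_rows)]
    for j in range(cols):
        for i in range(out_rows):
            acc = 0
            for k, tap in enumerate(taps):
                acc += tap * img[i + k][j]
            out[i][j] = acc
    return out


def vertical_filter_row_outer(img, taps):
    """Interchanged order (i, k, j): the innermost loop now sweeps along a row,
    so consecutive accesses touch adjacent elements."""
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0] * cols for _ in range(out_rows)]
    for i in range(out_rows):
        out_row = out[i]
        for k, tap in enumerate(taps):
            src_row = img[i + k]
            for j in range(cols):
                out_row[j] += tap * src_row[j]
    return out
```

Both functions accumulate the taps in the same order for every output element, so they return identical results; only the traversal order of memory differs.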

Asadollah Shahbahrami - One of the best experts on this subject based on the ideXlab platform.

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    2009
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    IEEE Transactions on Multimedia, 2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.

  • Implementing the 2-D wavelet transform on SIMD-enhanced general-purpose processors
    2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

Ken Kennedy - One of the best experts on this subject based on the ideXlab platform.

  • Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion
    2016
    Co-Authors: Ken Kennedy
    Abstract:

    Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing Loops that access similar sets of data. Typically, it is applied to Loops at the same level after Loop Interchange, which first attains the best nesting order for each local Loop nest. However, since Loop Interchange cannot foresee the overall optimization effect, it often selects the wrong Loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested Loops, we present a novel transformation, dependence hoisting, that effectively combines Interchange and fusion for arbitrarily nested Loops. We present techniques to simultaneously Interchange and fuse Loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible by previous techniques, which apply Interchange and fusion separately. Key words: Memory hierarchy performance, compiler optimizations, Loop transformations, Loop Interchange, Loop fusion

  • Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion
    IEEE International Conference on High Performance Computing Data and Analytics, 2004
    Co-Authors: Qing Yi, Ken Kennedy
    Abstract:

    Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing Loops that access similar sets of data. Typically, it is applied to Loops at the same level after Loop Interchange, which first attains the best nesting order for each local Loop nest. However, since Loop Interchange cannot foresee the overall optimization effect, it often selects the wrong Loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested Loops, we present a novel transformation, dependence hoisting, that effectively combines Interchange and fusion for arbitrarily nested Loops. We present techniques to simultaneously Interchange and fuse Loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible by previous techniques, which apply Interchange and fusion separately.
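
The conventional sequence described in the abstracts above (interchange each nest first, then fuse nests at the same level) can be sketched on two tiny perfectly nested loops. This is a minimal illustration only, not the dependence hoisting transformation the paper proposes; the function names and the toy computation are assumptions.

```python
def separate_nests(a):
    """Two separate nests; the second is written column-outer, so it cannot be
    fused with the first without reordering its loops."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    c = [[0] * m for _ in range(n)]
    for i in range(n):          # nest 1: row-outer
        for j in range(m):
            b[i][j] = 2 * a[i][j]
    for j in range(m):          # nest 2: column-outer
        for i in range(n):
            c[i][j] = b[i][j] + 1
    return b, c


def interchanged_and_fused(a):
    """Nest 2 is interchanged to row-outer order, then fused with nest 1.
    This is legal here because c[i][j] depends only on b[i][j], which is
    produced earlier in the same fused iteration."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            b[i][j] = 2 * a[i][j]
            c[i][j] = b[i][j] + 1   # b[i][j] is reused while still "hot"
    return b, c
```

After fusion each element of b is consumed immediately after it is produced, which is the locality gain that fusion is meant to deliver.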

Ben Juurlink - One of the best experts on this subject based on the ideXlab platform.

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    2009
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

  • Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
    IEEE Transactions on Multimedia, 2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.

  • Implementing the 2-D wavelet transform on SIMD-enhanced general-purpose processors
    2008
    Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
    Abstract:

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

Ling Shao - One of the best experts on this subject based on the ideXlab platform.

  • DMATiler: Revisiting Loop Tiling for Direct Memory Access
    International Conference on Parallel Architectures and Compilation Techniques, 2010
    Co-Authors: Haibo Lin, Tao Liu, Tong Chen, Lakshminarayanan Renganarayana, John Kevin Patrick O'Brien, Ling Shao
    Abstract:

    In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache-model-based Loop tiling optimizations, the compiler approximates runtime cache misses as the number of distinct cache lines touched by a Loop nest. In contrast, DMATiler has full control of the addresses, sizes, and sequences of data transfers. DMATiler uses a simplified DMA performance model to formulate the cost model for DMA-tiled Loop nests, then solves it using a custom gradient descent algorithm with heuristics guided by DMA characteristics. Given a Loop nest, DMATiler uses Loop Interchange to make the Loop order friendlier to data movement. Moreover, DMATiler applies a compressed data buffer and advanced DMA commands to further optimize data transfers. We have implemented DMATiler in IBM XL C/C++ for Multi-core Acceleration for Linux, and have conducted experiments with a set of Loop nest benchmarks. The results show DMATiler is much more efficient than a software-controlled cache (average speedup of 9.8x) and single-level Loop blocking (average speedup of 6.2x) on the Cell BE processor.
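
For readers unfamiliar with the Loop tiling that the abstract above builds on, the following sketch shows conventional tiling (blocking) of a matrix multiplication. It does not model DMATiler itself: there are no DMA transfers, no cost model, and no gradient descent search here, and the function names and the fixed `tile` parameter are assumptions chosen for illustration.

```python
def naive_matmul(a, b):
    """Reference triple loop: C = A * B for lists of rows."""
    n, p, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for k in range(p):
            for j in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def tiled_matmul(a, b, tile):
    """Same computation, with each loop split into a tile loop and an
    intra-tile loop so that one tile of A, B, and C is reused while it is
    resident in fast memory."""
    n, p, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for ii in range(0, n, tile):              # loops over tile origins
        for kk in range(0, p, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):      # loops inside one tile
                    for k in range(kk, min(kk + tile, p)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, m)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The `min(...)` bounds handle matrix sizes that are not multiples of the tile size. In a DMA-based setting such as the one the abstract describes, the tile size would be chosen by a cost model rather than passed in by hand.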