Loop Interchange - Explore the Science & Experts

The Experts below are selected from a list of 993 Experts worldwide ranked by ideXlab platform

Stamatis Vassiliadis - One of the best experts on this subject based on the ideXlab platform.

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

2009

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

IEEE Transactions on Multimedia, 2008

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.
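
The cache remark about vertical filtering can be illustrated with a small sketch. It is not the authors' implementation: the function names `vfilter_column_order` and `vfilter_row_order`, the row-major `float` image layout, and the lack of boundary handling are all assumptions made for illustration. The first function walks down each column, so consecutive reads are `cols` elements apart; the second is the same computation after loop interchange, with the column index innermost so that reads and writes are unit-stride.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertical FIR filter over a row-major image.
   Column order: the inner loops walk down a column, so successive
   accesses to `in` are `cols` floats apart (poor spatial locality). */
void vfilter_column_order(const float *in, float *out, size_t rows,
                          size_t cols, const float *h, size_t taps)
{
    assert(taps >= 1 && taps <= rows);
    size_t out_rows = rows - taps + 1;
    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < out_rows; i++) {
            float acc = 0.0f;
            for (size_t k = 0; k < taps; k++)
                acc += h[k] * in[(i + k) * cols + j];
            out[i * cols + j] = acc;
        }
    }
}

/* Same computation after loop interchange: the column index j is now
   innermost, so every inner loop sweeps one row with unit stride. */
void vfilter_row_order(const float *in, float *out, size_t rows,
                       size_t cols, const float *h, size_t taps)
{
    assert(taps >= 1 && taps <= rows);
    size_t out_rows = rows - taps + 1;
    for (size_t i = 0; i < out_rows; i++) {
        for (size_t j = 0; j < cols; j++)
            out[i * cols + j] = 0.0f;
        for (size_t k = 0; k < taps; k++)
            for (size_t j = 0; j < cols; j++)
                out[i * cols + j] += h[k] * in[(i + k) * cols + j];
    }
}
```

Even in the interchanged version, each output row still reads `taps` different input rows. If those rows map to the same cache sets, this is plausibly where the conflict misses mentioned in the abstract would remain once the filter length exceeds the associativity.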


Asadollah Shahbahrami - One of the best experts on this subject based on the ideXlab platform.

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

2009

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

IEEE Transactions on Multimedia, 2008

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.


Ken Kennedy - One of the best experts on this subject based on the ideXlab platform.

Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion

2016

Co-Authors: Ken Kennedy

Abstract:

Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing Loops that access similar sets of data. Typically, it is applied to Loops at the same level after Loop Interchange, which first attains the best nesting order for each local Loop nest. However, since Loop Interchange cannot foresee the overall optimization effect, it often selects the wrong Loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested Loops, we present a novel transformation, dependence hoisting, that effectively combines interchange and fusion for arbitrarily nested Loops. We present techniques to simultaneously Interchange and fuse Loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible by previous techniques, which apply Interchange and fusion separately. Key words: Memory hierarchy performance, compiler optimizations, Loop transformations, Loop Interchange, Loop fusion

Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion

IEEE International Conference on High Performance Computing Data and Analytics, 2004

Co-Authors: Qing Yi, Ken Kennedy

Abstract:

Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing Loops that access similar sets of data. Typically, it is applied to Loops at the same level after Loop Interchange, which first attains the best nesting order for each local Loop nest. However, since Loop Interchange cannot foresee the overall optimization effect, it often selects the wrong Loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested Loops, we present a novel transformation, dependence hoisting, that effectively combines Interchange and fusion for arbitrarily nested Loops. We present techniques to simultaneously Interchange and fuse Loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible by previous techniques, which apply Interchange and fusion separately.
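
As a rough illustration of why interchange and fusion interact, the sketch below shows two loop nests over the same arrays. It does not implement dependence hoisting; it only shows the conventional two-step case the abstract contrasts against. The function names and the row-major `n` by `m` layout are assumptions. In `separate_nests` the first nest runs in column order and the second in row order, so they cannot be fused as written; in `interchanged_then_fused` the first nest has been interchanged to row order, after which both bodies share one loop nest and `b` is reused while its value is still in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate nests with mismatched loop orders (hypothetical example). */
void separate_nests(const double *a, double *b, double *c,
                    size_t n, size_t m)
{
    assert(a && b && c);
    for (size_t j = 0; j < m; j++)          /* column order: stride m */
        for (size_t i = 0; i < n; i++)
            b[i * m + j] = 2.0 * a[i * m + j];
    for (size_t i = 0; i < n; i++)          /* row order: unit stride */
        for (size_t j = 0; j < m; j++)
            c[i * m + j] = b[i * m + j] + 1.0;
}

/* First nest interchanged to row order, then both nests fused. */
void interchanged_then_fused(const double *a, double *b, double *c,
                             size_t n, size_t m)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            double t = 2.0 * a[i * m + j];
            b[i * m + j] = t;               /* producer */
            c[i * m + j] = t + 1.0;         /* consumer reuses t directly */
        }
    }
}
```

Both functions compute the same `b` and `c`; the transformation is legal here because each element of `c` depends only on the element of `b` at the same index.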


Ben Juurlink - One of the best experts on this subject based on the ideXlab platform.

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

2009

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively. Index Terms: Cache, Discrete Wavelet Transform, memory hierarchy, multimedia extensions, SIMD

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

IEEE Transactions on Multimedia, 2008

Co-Authors: Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis

Abstract:

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying Loop Interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.


Ling Shao - One of the best experts on this subject based on the ideXlab platform.

DMATiler: Revisiting Loop Tiling for Direct Memory Access

International Conference on Parallel Architectures and Compilation Techniques, 2010

Co-Authors: Haibo Lin, Tao Liu, Tong Chen, Lakshminarayanan Renganarayana, John Kevin Patrick Obrien, Ling Shao

Abstract:

In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache-model-based Loop tiling optimizations, the compiler approximates runtime cache misses as the number of distinct cache lines touched by a Loop nest. In contrast, DMATiler has full control of the addresses, sizes, and sequences of data transfers. DMATiler uses a simplified DMA performance model to formulate the cost model for DMA-tiled Loop nests, then solves it using a custom gradient descent algorithm with heuristics guided by DMA characteristics. Given a Loop nest, DMATiler uses Loop Interchange to make the Loop order friendlier for data movements. Moreover, DMATiler applies compressed data buffers and advanced DMA commands to further optimize data transfers. We have implemented DMATiler in the IBM XL C/C++ for Multi-core Acceleration for Linux, and have conducted experiments with a set of Loop nest benchmarks. The results show DMATiler is much more efficient than software-controlled cache (average speedup of 9.8x) and single-level Loop blocking (average speedup of 6.2x) on the Cell BE processor.
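
The general idea of tiling a loop nest around explicit data transfers can be sketched as follows. This is not DMATiler's algorithm or cost model: the function name `scale_tiled`, the fixed tile edge `TILE`, and the use of `memcpy` as a stand-in for DMA get and put commands are all assumptions for illustration. Each tile is copied into a small local buffer one row segment at a time, processed there, and copied back. The tile loops are ordered so that every transfer covers a contiguous run of a row, which is the kind of transfer-friendly loop order the abstract attributes to Loop Interchange.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TILE = 4 };  /* hypothetical tile edge, chosen arbitrarily */

/* Multiply every element of a row-major rows x cols matrix by s,
   staging each tile through a small local buffer. */
void scale_tiled(const float *in, float *out, size_t rows, size_t cols,
                 float s)
{
    assert(in && out);
    float local[TILE * TILE];  /* stands in for local store memory */

    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        size_t th = rows - i0 < TILE ? rows - i0 : TILE;  /* tile height */
        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            size_t tw = cols - j0 < TILE ? cols - j0 : TILE;  /* width */

            /* "get": one contiguous transfer per tile row */
            for (size_t i = 0; i < th; i++)
                memcpy(&local[i * TILE], &in[(i0 + i) * cols + j0],
                       tw * sizeof(float));

            /* compute entirely on the local buffer */
            for (size_t i = 0; i < th; i++)
                for (size_t j = 0; j < tw; j++)
                    local[i * TILE + j] *= s;

            /* "put": write the finished tile back */
            for (size_t i = 0; i < th; i++)
                memcpy(&out[(i0 + i) * cols + j0], &local[i * TILE],
                       tw * sizeof(float));
        }
    }
}
```

In a real DMA setting the tile size would be chosen from a cost model rather than fixed, and transfers would typically be overlapped with computation; neither is attempted here.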

