The Experts below are selected from a list of 8823 Experts worldwide ranked by the ideXlab platform

Edwin H M Sha - One of the best experts on this subject based on the ideXlab platform.

  • Optimized Address Assignment With Array and Loop Transformations for Minimizing Schedule Length
    IEEE Transactions on Circuits and Systems I: Regular Papers, 2008
    Co-Authors: Chun Jason Xue, Zili Shao, Zhiping Jia, Meng Wang, Edwin H M Sha
    Abstract:

    Reducing address arithmetic operations by optimizing address offset assignment greatly improves the performance of digital signal processor (DSP) applications. However, minimizing address operations alone may not directly reduce code size and schedule length for DSPs with multiple functional units, and little research has been conducted on loop optimization combined with the address offset assignment problem for such architectures. In this paper, we combine loop scheduling, array interleaving, and address assignment to minimize the schedule length and the number of address operations for loops on DSP architectures with multiple functional units. Array interleaving is applied to optimize address assignment for arrays during the loop scheduling process. An algorithm, address operation reduction rotation scheduling (AORRS), is proposed that minimizes both schedule length and the number of address operations. Compared with list scheduling, AORRS shows an average reduction of 38.4% in schedule length and 31.7% in the number of address operations; compared with rotation scheduling, it shows average reductions of 15.9% in schedule length and 33.6% in the number of address operations.
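    To make the underlying offset-assignment idea concrete, here is a minimal sketch of our own (not the paper's AORRS algorithm), assuming a single address register whose +/-1 auto-increment/decrement updates are free while any larger move costs one explicit address-arithmetic instruction:

```python
# Toy model of address offset assignment: with auto-increment/decrement
# addressing, moving the address register to an adjacent memory word is
# free, while any larger jump needs an explicit address instruction.

def address_ops(layout, access_seq):
    """Count explicit address instructions for a variable->offset layout."""
    pos = {v: i for i, v in enumerate(layout)}
    ops = 0
    cur = pos[access_seq[0]]          # initial load of the address register
    for v in access_seq[1:]:
        nxt = pos[v]
        if abs(nxt - cur) > 1:        # outside auto-inc/dec range: extra op
            ops += 1
        cur = nxt
    return ops

seq = ["a", "b", "c", "a", "b", "d", "a"]
print(address_ops(["a", "b", "c", "d"], seq))  # declaration order -> 3
print(address_ops(["c", "b", "a", "d"], seq))  # sequence-aware layout -> 2
```

    The access-sequence-aware layout needs fewer explicit address instructions; this is the effect that offset-assignment optimization exploits.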

  • ICASSP (5) - Optimizing DSP scheduling via Address assignment with array and loop transformation
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005, vol. 1
    Co-Authors: Chun Xue, Zili Shao, Ying Chen, Edwin H M Sha
    Abstract:

    Reducing address arithmetic instructions by optimizing address offset assignment greatly improves the performance of DSP applications. However, minimizing address operations alone may not directly reduce code size and schedule length for DSPs with multiple functional units. In this paper, we exploit address assignment and scheduling for applications with loops on DSPs with multiple functional units. Array transformation is used in our approach to leverage the indirect addressing modes provided by most DSP architectures. An algorithm, address instruction reduction loop scheduling (AIRLS), is proposed. The algorithm combines rotation scheduling, address assignment, and array transformation to minimize both address instructions and schedule length. Compared with list scheduling, AIRLS shows an average reduction of 35.4% in schedule length and 38.3% in the number of address instructions. Compared with rotation scheduling, AIRLS shows an average reduction of 19.2% in schedule length and 39.5% in the number of address instructions.
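    The array-transformation idea can be illustrated with a small sketch of our own (a toy model, not AIRLS itself): assuming only +/-1 auto-increment/decrement addressing is free, interleaving two arrays that a loop accesses alternately removes the long hops between them:

```python
# Toy model of array interleaving: when a loop alternates between a[i]
# and b[i], interleaving the two arrays makes every consecutive access
# land on the next memory word, so auto-increment addressing needs no
# extra address instructions.

def address_ops(addresses):
    """Extra address instructions when only +/-1 auto-inc/dec is free."""
    return sum(1 for prev, nxt in zip(addresses, addresses[1:])
               if abs(nxt - prev) != 1)

n = 4
# Separate arrays: a at words 0..3, b at words 4..7; loop reads a[i], b[i].
separate = [addr for i in range(n) for addr in (i, n + i)]
# Interleaved: a[i] at word 2*i, b[i] at word 2*i + 1.
interleaved = [addr for i in range(n) for addr in (2 * i, 2 * i + 1)]

print(address_ops(separate))     # every a[i] -> b[i] hop costs an op -> 7
print(address_ops(interleaved))  # all hops are +1 -> 0
```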

Huiyang Zhou - One of the best experts on this subject based on the ideXlab platform.

  • A GPGPU Compiler for Memory Optimization and Parallelism Management
    Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), 2010
    Co-Authors: Yi Yang, Jingfei Kong, Ping Xiang, Huiyang Zhou
    Abstract:

    This paper presents a novel optimizing compiler for general-purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high-performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naive GPU kernel function that is functionally correct but written without any consideration for performance. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. The optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread-block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that the optimized code achieves very high performance, superior or very close to the highly tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128x over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
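    As a rough illustration of the partition-camping problem that address-offset insertion targets (a hypothetical model of our own; the partition count, interleaving granularity, and row stride below are assumptions, not values from the paper):

```python
# Toy model of partition camping: if every thread block starts its row
# accesses at column 0, all blocks initially hit the same DRAM partition;
# inserting a block-dependent address offset spreads the first accesses
# across partitions.

NUM_PARTITIONS = 8         # assumed number of DRAM partitions
WORDS_PER_PARTITION = 64   # assumed partition interleaving granularity

def partition_of(addr):
    """Which partition a word address maps to under round-robin interleaving."""
    return (addr // WORDS_PER_PARTITION) % NUM_PARTITIONS

row_stride = 512  # assumed words per matrix row

# Naive: block b reads element (b, 0) first -> address b * row_stride.
naive_first = [partition_of(b * row_stride) for b in range(8)]
# With address-offset insertion: block b starts one partition further along.
offset_first = [partition_of(b * row_stride + b * WORDS_PER_PARTITION)
                for b in range(8)]

print(naive_first)   # all blocks camp on partition 0
print(offset_first)  # first accesses spread across all 8 partitions
```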

Zili Shao - One of the best experts on this subject based on the ideXlab platform.

  • Optimized Address Assignment With Array and Loop Transformations for Minimizing Schedule Length
    IEEE Transactions on Circuits and Systems I: Regular Papers, 2008
    Co-Authors: Chun Jason Xue, Zili Shao, Zhiping Jia, Meng Wang, Edwin H M Sha
    Abstract:

    Reducing address arithmetic operations by optimizing address offset assignment greatly improves the performance of digital signal processor (DSP) applications. However, minimizing address operations alone may not directly reduce code size and schedule length for DSPs with multiple functional units, and little research has been conducted on loop optimization combined with the address offset assignment problem for such architectures. In this paper, we combine loop scheduling, array interleaving, and address assignment to minimize the schedule length and the number of address operations for loops on DSP architectures with multiple functional units. Array interleaving is applied to optimize address assignment for arrays during the loop scheduling process. An algorithm, address operation reduction rotation scheduling (AORRS), is proposed that minimizes both schedule length and the number of address operations. Compared with list scheduling, AORRS shows an average reduction of 38.4% in schedule length and 31.7% in the number of address operations; compared with rotation scheduling, it shows average reductions of 15.9% in schedule length and 33.6% in the number of address operations.

  • ICASSP (5) - Optimizing DSP scheduling via Address assignment with array and loop transformation
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005, vol. 1
    Co-Authors: Chun Xue, Zili Shao, Ying Chen, Edwin H M Sha
    Abstract:

    Reducing address arithmetic instructions by optimizing address offset assignment greatly improves the performance of DSP applications. However, minimizing address operations alone may not directly reduce code size and schedule length for DSPs with multiple functional units. In this paper, we exploit address assignment and scheduling for applications with loops on DSPs with multiple functional units. Array transformation is used in our approach to leverage the indirect addressing modes provided by most DSP architectures. An algorithm, address instruction reduction loop scheduling (AIRLS), is proposed. The algorithm combines rotation scheduling, address assignment, and array transformation to minimize both address instructions and schedule length. Compared with list scheduling, AIRLS shows an average reduction of 35.4% in schedule length and 38.3% in the number of address instructions. Compared with rotation scheduling, AIRLS shows an average reduction of 19.2% in schedule length and 39.5% in the number of address instructions.

Jun Yang - One of the best experts on this subject based on the ideXlab platform.

  • Procedural Level Address Offset Assignment of DSP Applications with Loops
    International Conference on Parallel Processing, 2003
    Co-Authors: Youtao Zhang, Jun Yang
    Abstract:

    Automatic optimization of address offset assignment for DSP applications, which reduces the number of address arithmetic instructions to meet tight memory size restrictions and performance requirements, has received a lot of attention in recent years. However, most current research focuses on the basic block level and does not distinguish different program structures, especially loops. Moreover, the effectiveness of the modify register (MR) is not fully exploited, since it is used only in a post-optimization step. A novel address offset assignment approach is proposed at the procedural level. The MR is used effectively in the address assignment for loop structures. By taking advantage of the MR, variables accessed in sequence within a loop are assigned to memory words at equal distances. Both static and dynamic addressing instruction counts are greatly reduced. For the DSPSTONE benchmarks, average improvements of 9.9%, 17.1%, and 21.8% are achieved over address offset assignment [R. Leupers et al., 1996] together with MR optimization when there are 1, 2, and 4 address registers, respectively.
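    The modify-register idea can be sketched as follows (our simplified model, not the paper's algorithm): assume +/-1 auto-increment/decrement is free, adding the current MR value to the address register is also free, and any other address update costs one instruction, which for simplicity we treat as also setting the MR. Placing the variables a loop touches in sequence at equal distances then lets one MR value cover every hop:

```python
# Toy model of modify-register (MR) addressing: a hop of +/-1 is free,
# a hop equal to the current MR value is free, and any other hop costs
# one explicit address instruction (which we assume also loads the MR).

def address_ops_with_mr(layout, seq, iterations):
    """Explicit address instructions for a loop body repeated `iterations` times."""
    pos = {v: i for i, v in enumerate(layout)}
    accesses = [pos[v] for v in seq] * iterations
    ops, mr = 0, None
    for prev, nxt in zip(accesses, accesses[1:]):
        step = nxt - prev
        if abs(step) != 1 and step != mr:
            ops += 1          # explicit address op; MR now holds this step
            mr = step
    return ops

# A loop that touches a, c, e each iteration, run 100 times.
seq = ["a", "c", "e"]
print(address_ops_with_mr(["a", "b", "c", "d", "e"], seq, 100))  # -> 199
print(address_ops_with_mr(["a", "c", "e", "b", "d"], seq, 100))  # -> 1
```

    In the second layout the loop's variables sit at equal distances, so after one MR load every hop in every iteration is free, which mirrors the large dynamic-count reductions the paper reports.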

Yi Yang - One of the best experts on this subject based on the ideXlab platform.

  • A GPGPU Compiler for Memory Optimization and Parallelism Management
    Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), 2010
    Co-Authors: Yi Yang, Jingfei Kong, Ping Xiang, Huiyang Zhou
    Abstract:

    This paper presents a novel optimizing compiler for general-purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high-performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naive GPU kernel function that is functionally correct but written without any consideration for performance. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. The optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread-block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that the optimized code achieves very high performance, superior or very close to the highly tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128x over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
