Loop-Carried Dependence

The experts below are selected from a list of 121,965 experts worldwide, ranked by the ideXlab platform.

David I August - One of the best experts on this subject based on the ideXlab platform.

  • decoupled software pipelining with the synchronization array
    International Conference on Parallel Architectures and Compilation Techniques, 2004
    Co-Authors: Ram Rangan, Neil Vachharajani, Manish Vachharajani, David I August
    Abstract:

    Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamental serialization and variable latency of the loop-carried dependence through a pointer-chasing load. To address these and other situations, we introduce decoupled software pipelining (DSWP), a technique that statically splits a single-threaded sequential loop into multiple non-speculative threads, each of which performs useful computation essential for overall program correctness. The resulting threads execute on thread-parallel architectures such as simultaneous multithreaded (SMT) cores or chip multiprocessors (CMP), expose additional instruction-level parallelism, and tolerate latency better than the original single-threaded RDS loop. To reduce overhead, these threads communicate using a synchronization array, a dedicated hardware structure for pipelined inter-thread communication. DSWP used in conjunction with the synchronization array achieves an 11% to 76% speedup in the optimized functions on both statically and dynamically scheduled processors.
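
    The abstract describes the loop split at a high level. As a minimal sketch only, and not the authors' implementation, the C++ code below separates a linked-list traversal loop into a pointer-chasing producer thread and a per-node work consumer thread, with a software blocking queue standing in for the paper's hardware synchronization array; the Node type, the SyncQueue class, and the per-node work are assumptions introduced for this example.

    ```cpp
    // Hypothetical sketch: a linked-list loop split DSWP-style into a
    // traversal thread and a work thread. A bounded software queue stands
    // in for the paper's dedicated hardware synchronization array.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Node { int value; Node* next; };

    // Simple blocking queue; the real synchronization array is a hardware
    // structure with far lower communication overhead.
    class SyncQueue {
    public:
        void push(Node* n) {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(n);
            cv_.notify_one();
        }
        Node* pop() {  // blocks until an element is available
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            Node* n = q_.front();
            q_.pop();
            return n;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Node*> q_;
    };

    int main() {
        // Build a small list: 0 -> 1 -> 2 -> 3 -> 4.
        std::vector<Node> nodes(5);
        for (int i = 0; i < 5; ++i) {
            nodes[i].value = i;
            nodes[i].next = (i + 1 < 5) ? &nodes[i + 1] : nullptr;
        }

        SyncQueue chan;

        // Producer thread: only the critical path of the loop-carried
        // dependence (the pointer-chasing loads).
        std::thread traverse([&] {
            for (Node* p = &nodes[0]; p != nullptr; p = p->next)
                chan.push(p);
            chan.push(nullptr);  // end-of-stream marker
        });

        // Consumer thread: the off-critical-path work on each node.
        std::thread work([&] {
            long long sum = 0;
            while (Node* p = chan.pop())
                sum += p->value * p->value;  // stand-in for real per-node work
            std::printf("sum = %lld\n", sum);
        });

        traverse.join();
        work.join();
        return 0;
    }
    ```

    In this decomposition the producer never waits on the consumer's computation, so the variable-latency pointer-chasing loads and the per-node work overlap in pipeline fashion, which is the effect the paper obtains with hardware support.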

Ram Rangan - One of the best experts on this subject based on the ideXlab platform.

  • decoupled software pipelining with the synchronization array
    International Conference on Parallel Architectures and Compilation Techniques, 2004
    Co-Authors: Ram Rangan, Neil Vachharajani, Manish Vachharajani, David I August

Jing Wang - One of the best experts on this subject based on the ideXlab platform.

  • Loop-Carried Dependence and the general URPR software pipelining approach (unrolling, pipelining and rerolling)
    Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, 1991
    Co-Authors: Bogong Su, Jing Wang
    Abstract:

    This paper first theoretically analyzes the influence of loop-carried dependence on software pipelining. It then defines two loop categories, restrictable and unrestrictable loops, and puts forward and proves a necessary and sufficient condition for distinguishing between them. This condition is related to the number of operation pairs with loop-carried dependence, the execution time of operations, and other loop parameters. Next, the paper proves that any unrestrictable loop can be transformed into a semantically equivalent restrictable loop by unrolling it K times, where K is determined by the number of operation pairs with loop-carried dependence in the original unrestrictable loop. Finally, the paper presents a general URPR software pipelining approach consisting of a pre-processing algorithm, a new compaction algorithm for the loop body, and a URPR algorithm. Preliminary experiments show that the general URPR approach guarantees a time-optimal result for any loop in the absence of resource constraints while retaining good space efficiency and low complexity. A small illustration of the terms involved follows below.
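
    As a self-contained illustration of the terminology, and not code from the paper, the C++ snippet below shows a loop in which iteration i reads values stored two iterations earlier (a loop-carried dependence of distance 2) and the same loop unrolled by a factor of 2, the kind of unrolling step that precedes the pipelining and rerolling phases of URPR. The arrays and the unroll factor K = 2 are arbitrary choices for this example, not the K derived from the paper's condition.

    ```cpp
    // Hypothetical illustration of a loop-carried dependence and of the
    // unrolling step that software pipelining builds on; K = 2 here is an
    // arbitrary example factor.
    #include <cstdio>

    int main() {
        const int N = 8;
        int a[N + 2] = {1, 2};  // a[0], a[1] seeded; the rest start at zero

        // Original loop: a[i+2] depends on values produced one and two
        // iterations earlier, so iteration i cannot begin until those
        // stores complete (a loop-carried dependence of distance 2).
        for (int i = 0; i < N; ++i)
            a[i + 2] = a[i] + a[i + 1];

        // The same loop unrolled K = 2 times. The dependence is unchanged,
        // but the larger body gives a software pipeliner (URPR's
        // unroll-pipeline-reroll steps) more operations to overlap.
        int b[N + 2] = {1, 2};
        for (int i = 0; i < N; i += 2) {
            b[i + 2] = b[i] + b[i + 1];
            b[i + 3] = b[i + 1] + b[i + 2];
        }

        // The unrolled loop is semantically equivalent to the original.
        bool same = true;
        for (int i = 0; i < N + 2; ++i)
            same = same && (a[i] == b[i]);
        std::printf("a == b after unrolling: %s\n", same ? "yes" : "no");
        return 0;
    }
    ```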

Neil Vachharajani - One of the best experts on this subject based on the ideXlab platform.

  • decoupled software pipelining with the synchronization array
    International Conference on Parallel Architectures and Compilation Techniques, 2004
    Co-Authors: Ram Rangan, Neil Vachharajani, Manish Vachharajani, David I August

Manish Vachharajani - One of the best experts on this subject based on the ideXlab platform.

  • decoupled software pipelining with the synchronization array
    International Conference on Parallel Architectures and Compilation Techniques, 2004
    Co-Authors: Ram Rangan, Neil Vachharajani, Manish Vachharajani, David I August