Iteration Space

The Experts below are selected from a list of 26832 Experts worldwide ranked by ideXlab platform

Wlodzimierz Bielecki - One of the best experts on this subject based on the ideXlab platform.

  • Generation of parallel synchronization-free tiled code
    Computing, 2018
    Co-Authors: Wlodzimierz Bielecki, Marek Palkowski, Piotr Skotnicki
    Abstract:

    A novel approach to generation of parallel synchronization-free tiled code for the loop nest is presented. It is derived via a combination of the Polyhedral and Iteration Space Slicing frameworks. It uses the transitive closure of loop nest dependence graphs to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target (corrected) tiles. Then parallel synchronization-free tiled code is generated on the basis of valid (corrected) tiles applying the transitive closure of dependence graphs. The main contribution of the paper is demonstrating that the presented technique is able to generate parallel synchronization-free tiled code, provided that the exact transitive closure of a dependence graph can be calculated and there exist synchronization-free slices on the statement instance level in the loop nest. We show that the presented approach extracts such a parallelism when well-known techniques fail to extract it. Enlarging the scope of loop nests, for which synchronization-free tiled code can be generated, is achieved by means of applying the intersection of extracted slices and generated valid tiles, in contrast to forming slices of valid tiles as suggested in previously published techniques based on the transitive closure of a dependence graph. The presented approach is implemented in the publicly available TC optimizing compiler. Results of experiments demonstrating the effectiveness of the approach and the efficiency of parallel programs generated by means of it are discussed.
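
    To make the notion of synchronization-free slices concrete, here is a minimal sketch. It does not use the polyhedral machinery or the TC compiler from the paper: it enumerates a small rectangular iteration space explicitly and groups iterations that are linked by a chain of dependences, using a union-find structure. The function name `sync_free_slices` and the uniform dependence distance vectors are assumptions made for illustration.

```python
from itertools import product

def sync_free_slices(n, m, distances):
    """Group an n x m iteration space into synchronization-free slices.

    Two iterations share a slice when a chain of dependences (ignoring
    direction) connects them, i.e. they lie in the same weakly connected
    component of the dependence graph. `distances` is a list of uniform
    dependence distance vectors (an illustrative assumption).
    """
    points = list(product(range(n), range(m)))
    parent = {pt: pt for pt in points}

    def find(pt):
        # Union-find lookup with path halving.
        while parent[pt] != pt:
            parent[pt] = parent[parent[pt]]
            pt = parent[pt]
        return pt

    for (i, j) in points:
        for (di, dj) in distances:
            target = (i + di, j + dj)
            if target in parent:  # the dependence stays inside the space
                parent[find(target)] = find((i, j))

    slices = {}
    for pt in points:  # points are visited in lexicographic order
        slices.setdefault(find(pt), []).append(pt)
    return sorted(slices.values())

# Loop nest with a single dependence (i, j) -> (i + 1, j + 1):
slices = sync_free_slices(4, 5, [(1, 1)])
```

    Each returned slice can run on its own thread without synchronization, provided the iterations inside a slice keep their lexicographic order.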

  • Tuning iteration space slicing based tiled multi-core code implementing Nussinov’s RNA folding
    BMC Bioinformatics, 2018
    Co-Authors: Marek Palkowski, Wlodzimierz Bielecki
    Abstract:

    RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithm is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov’s recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov’s RNA folding. Such techniques are within the iteration space slicing framework: the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem in generating parallel tiled code is defining a proper tile size and tile dimension, which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). For this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance.
    For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov’s RNA folding. Experimental results, obtained on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is greater than 2500.
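
    For orientation, the affine loop nest that such work starts from can be sketched as below. This is a plain, untiled, sequential version of Nussinov's recurrence, not the generated code from the paper. The function name `nussinov`, the restriction to Watson-Crick pairs, and the absence of a minimum loop length are simplifying assumptions.

```python
def nussinov(seq):
    """Maximum number of nested base pairs in `seq`.

    Assumptions for this sketch: only Watson-Crick pairs count, and no
    minimum hairpin loop length is enforced.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    # Triangular iteration space: 0 <= i < j < n and i <= k < j.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            # Case 1: bases i and j pair up (adds 1 when they can pair).
            best = S[i + 1][j - 1] + ((seq[i], seq[j]) in pairs)
            # Case 2: split the interval [i, j] at position k.
            for k in range(i, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]
```

    All loop bounds and array subscripts are affine in `i`, `j`, and `k`, which is the property that lets the polyhedral model describe the iteration space.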

  • Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing
    BMC Bioinformatics, 2017
    Co-Authors: Marek Palkowski, Wlodzimierz Bielecki
    Abstract:

    Background: RNA secondary structure prediction is a compute-intensive task that lies at the core of several search algorithms in bioinformatics. Fortunately, RNA folding approaches, such as Nussinov base pair maximization, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. Polyhedral compilation techniques have proven to be a powerful tool for optimization of dense array codes. However, classical affine loop nest transformations used with these techniques do not effectively optimize dynamic programming codes for RNA structure prediction.
    Results: The purpose of this paper is to present a novel approach allowing for generation of a parallel tiled Nussinov RNA loop nest exposing significantly higher performance than that of known related code. This effect is achieved by improving code locality and parallelizing calculations. In order to improve code locality, we apply our previously published technique of automatic loop nest tiling to all three loops of the Nussinov loop nest. This approach first forms original rectangular 3D tiles and then corrects them to establish their validity by means of applying the transitive closure of a dependence graph. To produce parallel code, we apply the loop skewing technique to a tiled Nussinov loop nest.
    Conclusions: The technique is implemented as a part of the publicly available polyhedral source-to-source TRACO compiler. Generated code was run on modern Intel multi-core processors and coprocessors. We present the speed-up factor of generated Nussinov RNA parallel code and demonstrate that it is considerably faster than related codes in which only the two outer loops of the Nussinov loop nest are tiled.
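
    The effect of skewing can be illustrated at the level of single table cells rather than tiles. The sketch below is not the TRACO output. It assumes the usual Nussinov dependences, groups cells by the skewed coordinate d = j - i, and checks that every cell only reads cells from earlier wavefronts, so the cells within one wavefront could run in parallel. The helper names are hypothetical.

```python
def dependences(i, j):
    """Cells read when computing cell (i, j), i < j, of the Nussinov table."""
    reads = [(i + 1, j - 1)]
    for k in range(i, j):
        reads.append((i, k))
        reads.append((k + 1, j))
    return reads

def wavefronts(n):
    """Group cells (i, j), i < j, by the skewed coordinate d = j - i."""
    return [[(i, i + d) for i in range(n - d)] for d in range(1, n)]

def is_valid_schedule(n):
    """True if each cell reads only cells computed in earlier wavefronts."""
    done = set()
    for front in wavefronts(n):
        for (i, j) in front:
            for (r, c) in dependences(i, j):
                # Cells with r >= c are initial values, never computed.
                if r < c and (r, c) not in done:
                    return False
        done.update(front)
    return True
```

    For a table of size 4, `wavefronts(4)` yields three fronts of 3, 2, and 1 cells; the outer loop over `d` stays sequential while each front is a parallel loop.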

Marek Palkowski - One of the best experts on this subject based on the ideXlab platform.

  • Generation of parallel synchronization-free tiled code
    Computing, 2018
    Co-Authors: Wlodzimierz Bielecki, Marek Palkowski, Piotr Skotnicki
    Abstract:

    A novel approach to generation of parallel synchronization-free tiled code for the loop nest is presented. It is derived via a combination of the Polyhedral and Iteration Space Slicing frameworks. It uses the transitive closure of loop nest dependence graphs to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target (corrected) tiles. Then parallel synchronization-free tiled code is generated on the basis of valid (corrected) tiles applying the transitive closure of dependence graphs. The main contribution of the paper is demonstrating that the presented technique is able to generate parallel synchronization-free tiled code, provided that the exact transitive closure of a dependence graph can be calculated and there exist synchronization-free slices on the statement instance level in the loop nest. We show that the presented approach extracts such a parallelism when well-known techniques fail to extract it. Enlarging the scope of loop nests, for which synchronization-free tiled code can be generated, is achieved by means of applying the intersection of extracted slices and generated valid tiles, in contrast to forming slices of valid tiles as suggested in previously published techniques based on the transitive closure of a dependence graph. The presented approach is implemented in the publicly available TC optimizing compiler. Results of experiments demonstrating the effectiveness of the approach and the efficiency of parallel programs generated by means of it are discussed.

  • Tuning iteration space slicing based tiled multi-core code implementing Nussinov’s RNA folding
    BMC Bioinformatics, 2018
    Co-Authors: Marek Palkowski, Wlodzimierz Bielecki
    Abstract:

    RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithm is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov’s recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov’s RNA folding. Such techniques are within the iteration space slicing framework: the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem in generating parallel tiled code is defining a proper tile size and tile dimension, which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). For this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance.
    For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov’s RNA folding. Experimental results, obtained on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is greater than 2500.

  • Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing
    BMC Bioinformatics, 2017
    Co-Authors: Marek Palkowski, Wlodzimierz Bielecki
    Abstract:

    Background: RNA secondary structure prediction is a compute-intensive task that lies at the core of several search algorithms in bioinformatics. Fortunately, RNA folding approaches, such as Nussinov base pair maximization, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. Polyhedral compilation techniques have proven to be a powerful tool for optimization of dense array codes. However, classical affine loop nest transformations used with these techniques do not effectively optimize dynamic programming codes for RNA structure prediction.
    Results: The purpose of this paper is to present a novel approach allowing for generation of a parallel tiled Nussinov RNA loop nest exposing significantly higher performance than that of known related code. This effect is achieved by improving code locality and parallelizing calculations. In order to improve code locality, we apply our previously published technique of automatic loop nest tiling to all three loops of the Nussinov loop nest. This approach first forms original rectangular 3D tiles and then corrects them to establish their validity by means of applying the transitive closure of a dependence graph. To produce parallel code, we apply the loop skewing technique to a tiled Nussinov loop nest.
    Conclusions: The technique is implemented as a part of the publicly available polyhedral source-to-source TRACO compiler. Generated code was run on modern Intel multi-core processors and coprocessors. We present the speed-up factor of generated Nussinov RNA parallel code and demonstrate that it is considerably faster than related codes in which only the two outer loops of the Nussinov loop nest are tiled.

Guang R. Gao - One of the best experts on this subject based on the ideXlab platform.

  • Register allocation for software pipelined multi-dimensional loops
    Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation - PLDI '05, 2005
    Co-Authors: Hongbo Rong, Alban Douillet, Guang R. Gao
    Abstract:

    Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop iterations to explore instruction-level parallelism from the entire n-dimensional iteration space. This paper investigates register allocation for software pipelined multi-dimensional loops. For single loop software pipelining, the lifetime instances of a loop variant in successive iterations of the loop form a repetitive pattern. An effective register allocation method is to represent the pattern as a vector of lifetimes (or a vector lifetime using Rau's terminology) and map it to rotating registers. Unfortunately, the software pipelined schedule of a multi-dimensional loop is considerably more complex, and so are the vector lifetimes in it. In this paper, we develop a way to normalize and represent vector lifetimes in multi-dimensional loop software pipelining, which capture their complexity, while exposing their regularity that enables us to develop a simple, yet powerful solution. Our algorithm is based on the development of a metric, called distance, that quantitatively determines the degree of potential overlapping (conflicts) between two vector lifetimes. We show how to calculate and use the distance, conservatively or aggressively, to guide the register allocation of the vector lifetimes under a bin-packing algorithm framework. The classical register allocation for software pipelined single loops is subsumed by our method as a special case. The method has been implemented in the ORC compiler and produced code for the Itanium architecture. We report the effectiveness of our method on 134 loop nests with 348 loop levels. Several strategies for register allocation are compared and analyzed.

Constantine D Polychronopoulos - One of the best experts on this subject based on the ideXlab platform.

  • Cache-aware iteration space partitioning
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008
    Co-Authors: Arun Kejariwal, Alexandru Nicolau, Utpal Banerjee, Alexander V Veidenbaum, Constantine D Polychronopoulos
    Abstract:

    The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails program parallelization and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores which in turn has a direct impact on performance. In light of the fact that parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware partitioning of iteration spaces of parallel loops. We present a case study using a kernel from the industry-standard SPEC CPU benchmark suite.

  • Enhanced loop coalescing: a compiler technique for transforming non-uniform iteration spaces
    IEEE International Conference on High Performance Computing Data and Analytics, 2005
    Co-Authors: Arun Kejariwal, Alexandru Nicolau, Constantine D Polychronopoulos
    Abstract:

    Parallel nested loops are the largest potential source of parallelism in numerical and scientific applications. Therefore, executing parallel loops with low run-time overhead is very important for achieving high performance on parallel computers. Guided self-scheduling (GSS) has long been used for dynamic scheduling of parallel loops on shared memory parallel machines and for efficient utilization of dynamically allocated processors. In order to minimize the synchronization (or scheduling) overhead in GSS, loop coalescing has been proposed as a restructuring technique to transform nested loops into a single loop. In other words, coalescing "flattens" the iteration space in lexicographic order of the indices of the original loop. Although coalescing helps reduce the run-time scheduling overhead, it does not necessarily minimize the makespan, i.e., the maximum finishing time, especially in situations where the execution time (workload) of iterations is not uniform as is often the case in practice, e.g., in control intensive applications. This can be attributed to the fact that the makespan is directly dependent on the workload distribution across the flattened iteration space. The latter in itself depends on the order of coalescing of the loop indices. We show that coalescing (as proposed) can potentially result in large makespans. In this paper, we present a loop permutation-based approach to loop coalescing, referred to as enhanced loop coalescing, to achieve near-optimal schedules. Several examples are presented and the general technique is discussed in detail.
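
    The flattening step, and the way the coalescing order changes the makespan, can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithm from the paper: the schedule is a static split into contiguous blocks rather than GSS, and the cost model (cost of iteration (i, j) equals i + 1) is made up. The names `coalesce` and `block_makespan` are hypothetical.

```python
from itertools import product

def coalesce(extents):
    """Yield index tuples of a rectangular loop nest from one flat loop,
    in lexicographic order of the original indices."""
    total = 1
    for e in extents:
        total *= e
    for t in range(total):
        rem, idx = t, []
        for e in reversed(extents):  # recover the innermost index first
            rem, r = divmod(rem, e)
            idx.append(r)
        yield tuple(reversed(idx))

def block_makespan(costs, p):
    """Split the flat cost list into p contiguous blocks (static schedule,
    an assumption of this sketch) and return the largest block sum."""
    size = -(-len(costs) // p)  # ceiling division
    return max(sum(costs[s:s + size]) for s in range(0, len(costs), size))

n, m, p = 4, 4, 4
# Assumed non-uniform workload: iteration (i, j) costs i + 1 time units.
costs_ij = [i + 1 for (i, j) in coalesce((n, m))]  # i outermost
costs_ji = [i + 1 for (j, i) in coalesce((m, n))]  # loops permuted first
ms_ij = block_makespan(costs_ij, p)  # 16: the heavy rows land in one block
ms_ji = block_makespan(costs_ji, p)  # 10: every block gets one of each row
```

    Permuting the loops before coalescing spreads the expensive iterations across all blocks, which is the effect the abstract attributes to the order of coalescing.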

  • A novel approach for partitioning iteration spaces with variable densities
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005
    Co-Authors: Arun Kejariwal, Alexandru Nicolau, Utpal Banerjee, Constantine D Polychronopoulos
    Abstract:

    Efficient partitioning of parallel loops plays a critical role in high performance and efficient use of multiprocessor systems. Although a significant amount of work has been done in partitioning and scheduling of loops with rectangular iteration spaces, the problem of partitioning non-rectangular iteration spaces (e.g., triangular or trapezoidal iteration spaces) with variable densities has not been addressed so far to the best of our knowledge. In this paper, we present a mathematical model for partitioning N-dimensional non-rectangular iteration spaces with variable densities. We present a unimodular loop transformation and a geometric approach for partitioning an iteration space along an axis corresponding to the outermost loop across a given number of processors to achieve near-optimal performance, i.e., to achieve near-optimal load balance across different processors. We present a case study to illustrate the effectiveness of our approach.
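
    As a small illustration of partitioning along the outermost loop, the sketch below balances a triangular iteration space across processors. It assumes uniform density (every iteration costs the same), which is a simplification of the variable-density model in the paper, and it is not the authors' algorithm. The names `partition_triangular` and `load` are hypothetical.

```python
def partition_triangular(n, p):
    """Split rows 0..n-1 of the triangle {(i, j): 0 <= j <= i < n} into p
    contiguous half-open row ranges with nearly equal iteration counts.

    Assumption: uniform density, i.e. every iteration has the same cost.
    """
    total = n * (n + 1) // 2
    bounds, row = [0], 0
    for k in range(1, p):
        target = k * total / p
        # Advance until the rows before `row` hold at least `target` iterations.
        while row < n and row * (row + 1) / 2 < target:
            row += 1
        bounds.append(row)
    bounds.append(n)
    return [(bounds[k], bounds[k + 1]) for k in range(p)]

def load(lo, hi):
    """Number of iterations in rows lo..hi-1 (row i holds i + 1 iterations)."""
    return hi * (hi + 1) // 2 - lo * (lo + 1) // 2

parts = partition_triangular(100, 4)
loads = [load(lo, hi) for (lo, hi) in parts]
```

    With n = 100 and 4 processors the ranges get narrower toward the wide end of the triangle, and the largest load is 1281 iterations, compared with 2200 for an equal split of 25 rows each.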

Hongbo Rong - One of the best experts on this subject based on the ideXlab platform.

  • Productively Expressing High-performance Spatial Designs of Givens Rotation-based QR Decomposition Algorithm
    arXiv: Programming Languages, 2018
    Co-Authors: Hongbo Rong
    Abstract:

    QR decomposition is used prevalently in wireless communication. In this paper, we express the Givens-rotation-based QR decomposition algorithm on a spatial architecture using T2S (Temporal To Spatial), a high-productivity spatial programming methodology for expressing high-performance spatial designs. There are interesting challenges: the loop iteration space is not rectangular, and it is not obvious how the imperative algorithm can be expressed in a functional notation, the starting point of T2S. Using QR decomposition as an example, this paper elucidates some general principles and demystifies high-performance spatial programming. The paper also serves as a tutorial on spatial programming for programmers who are not mathematicians, not expert programmers, and not experts on spatial architectures, but still hope to intuitively identify a high-performance design and map it to spatial architectures efficiently.
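
    The non-rectangular iteration space is easy to see in an imperative version of the algorithm. The sketch below is ordinary sequential Python, not T2S and not the spatial design from the paper. It assumes a square input matrix given as a list of rows and returns only the triangular factor R. The function name `givens_qr` is hypothetical.

```python
import math

def givens_qr(a):
    """Return R from the QR decomposition of a square matrix `a`
    (list of rows), computed with Givens rotations."""
    n = len(a)
    r = [row[:] for row in a]  # work on a copy
    for j in range(n):                 # column being cleared
        for i in range(n - 1, j, -1):  # rows below the diagonal, bottom up
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue               # nothing to eliminate
            c, s = x / h, y / h
            # Rotate rows i-1 and i; columns before j are already zero here.
            for k in range(j, n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
    return r
```

    The loop bounds `j < i < n` and `j <= k < n` depend on the outer index `j`, so the iteration space is triangular rather than a box.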

  • Register allocation for software pipelined multi-dimensional loops
    2005
    Co-Authors: Hongbo Rong, Alban Douillet, Guang R. Gao
    Abstract:

    Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop Iterations to explore instruction-level parallelism from the entire n-dimensional Iteration Space. This paper investigates register allocation for software pipelined multi-dimensional loops. For single loop software pipelining, the lifetime instances of a loop variant in successive Iterations of the loop form a repetitive pattern. An effective register allocation method is to represent the pattern as a vector of lifetimes (or a vector lifetime using Rau’s terminology) and map it to rotating registers. Unfortunately, the software pipelined schedule of a multi-dimensional loop is considerably more complex, and so are the vector lifetimes in it. In this paper, we develop a way to normalize and represent vector lifetimes in multi-dimensional loop software pipelining, which capture their complexity, while exposing their regularity that enables us to develop a simple, yet powerful solution. Our algorithm is based on the development of a metric, called distance, that quantitatively determines the degree of potential overlapping (conflicts) between two vector lifetimes. We show how to calculate and use the distance, conservatively or aggressively, to guide the register allocation of the vector lifetimes under a bin-packing algorithm framework. The classical register allocation for software pipelined single loops is subsumed by our method as a special case. The method has been implemented in the ORC compiler and produced code for the Itanium architecture. We report the effectiveness of our method on 134 loop nests with 348 loop levels. Several strategies for register allocation are compared and analyzed.

  • PLDI - Register allocation for software pipelined multi-dimensional loops
    Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation - PLDI '05, 2005
    Co-Authors: Hongbo Rong, Alban Douillet, Guang R. Gao
    Abstract:

    Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop Iterations to explore instruction-level parallelism from the entire n-dimensional Iteration Space. This paper investigates register allocation for software pipelined multi-dimensional loops. For single loop software pipelining, the lifetime instances of a loop variant in successive Iterations of the loop form a repetitive pattern. An effective register allocation method is to represent the pattern as a vector of lifetimes (or a vector lifetime using Rau's terminology) and map it to rotating registers. Unfortunately, the software pipelined schedule of a multi-dimensional loop is considerably more complex, and so are the vector lifetimes in it. In this paper, we develop a way to normalize and represent vector lifetimes in multi-dimensional loop software pipelining, which capture their complexity, while exposing their regularity that enables us to develop a simple, yet powerful solution. Our algorithm is based on the development of a metric, called distance, that quantitatively determines the degree of potential overlapping (conflicts) between two vector lifetimes. We show how to calculate and use the distance, conservatively or aggressively, to guide the register allocation of the vector lifetimes under a bin-packing algorithm framework. The classical register allocation for software pipelined single loops is subsumed by our method as a special case. The method has been implemented in the ORC compiler and produced code for the Itanium architecture. We report the effectiveness of our method on 134 loop nests with 348 loop levels. Several strategies for register allocation are compared and analyzed.