Vector Parallelism

The experts below are selected from a list of 183 experts worldwide, ranked by the ideXlab platform.

Albert Cohen - One of the best experts on this subject based on the ideXlab platform.

  • Software Pipelining in Nested Loops with Prolog-Epilog Merging
    2009
    Co-Authors: Mohammed Fellahi, Albert Cohen
    Abstract:

    Software pipelining (or modulo scheduling) is a powerful back-end optimization to exploit instruction and Vector Parallelism. Software pipelining is particularly popular for embedded devices as it improves the computation throughput without increasing the size of the inner loop kernel (unlike loop unrolling), a desirable property to minimize the amount of code in local memories or caches. Unfortunately, common media and signal processing codes exhibit series of low-trip-count inner loops. In this situation, software pipelining is often not an option: it incurs severe fill/drain time overheads and code size expansion due to nested prologs and epilogs. We propose a method to pipeline series of inner loops without increasing the size of the loop nest, apart from an outermost prolog and epilog. Our method achieves significant code size savings and allows pipelining of low-trip-count loops. These benefits come at the cost of additional scheduling constraints, leading to a linear optimization problem to trade memory usage for pipelining opportunities.
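
    To make the prolog/epilog overhead concrete, here is a minimal C sketch of software pipelining applied to a single inner loop; the three-stage split and the loop body are illustrative assumptions, not the paper's algorithm.

      /* Software pipelining of: for (i = 0; i < n; i++) c[i] = a[i]*b[i] + k;
       * The multiply of iteration i+2 is started while the add/store of
       * iteration i completes, so the kernel overlaps independent work.
       * The prolog fills the pipeline and the epilog drains it; a series
       * of such loops nests these prologs/epilogs, which is the code-size
       * overhead that prolog-epilog merging removes. Assumes n >= 2.     */
      void madd_pipelined(float *c, const float *a, const float *b,
                          float k, int n) {
          float t0 = a[0] * b[0];           /* prolog: start iteration 0 */
          float t1 = a[1] * b[1];           /* prolog: start iteration 1 */
          for (int i = 0; i + 2 < n; i++) { /* kernel: steady state      */
              c[i] = t0 + k;                /* finish iteration i        */
              t0 = t1;
              t1 = a[i + 2] * b[i + 2];     /* start iteration i + 2     */
          }
          c[n - 2] = t0 + k;                /* epilog: drain the last    */
          c[n - 1] = t1 + k;                /* two in-flight iterations  */
      }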

  • HiPEAC - Software Pipelining in Nested Loops with Prolog-Epilog Merging
    High Performance Embedded Architectures and Compilers, 2009
    Co-Authors: Mohammed Fellahi, Albert Cohen
    Abstract:

    Software pipelining (or modulo scheduling) is a powerful back-end optimization to exploit instruction and Vector Parallelism. Software pipelining is particularly popular for embedded devices as it improves the computation throughput without increasing the size of the inner loop kernel (unlike loop unrolling), a desirable property to minimize the amount of code in local memories or caches. Unfortunately, common media and signal processing codes exhibit series of low-trip-count inner loops. In this situation, software pipelining is often not an option: it incurs severe fill/drain time overheads and code size expansion due to nested prologs and epilogs. We propose a method to pipeline series of inner loops without increasing the size of the loop nest, apart from an outermost prolog and epilog. Our method achieves significant code size savings and allows pipelining of low-trip-count loops. These benefits come at the cost of additional scheduling constraints, leading to a linear optimization problem to trade memory usage for pipelining opportunities.

  • Deep Jam: Conversion of Coarse-Grain Parallelism to Fine-Grain and Vector Parallelism
    Journal of Instruction-level Parallelism, 2007
    Co-Authors: Patrick Carribault, Stephane Zuckerman, Albert Cohen, William Jalby
    Abstract:

    A number of computational applications lack instruction-level Parallelism. This loss is particularly acute in sequences of dependent instructions on wide-issue or deeply pipelined architectures. We consider four real applications from computational biology, cryptanalysis, and data compression. These applications are characterized by long sequences of dependent instructions, irregular control-flow and intricate scalar and memory dependence patterns. While these benchmarks exhibit good memory locality and branch predictability, state-of-the-art compiler optimizations fail to exploit much instruction-level Parallelism. This paper shows that major performance gains are possible on such applications, through a loop transformation called deep jam. This transformation reshapes the control-flow of a program to facilitate the extraction of independent computations through classical back-end techniques. Deep jam combines accurate dependence analysis and control speculation with a generalized form of recursive, multi-variant unroll-and-jam; it brings together independent instructions across irregular control structures, removing memory-based dependences through scalar and array renaming. This optimization contributes to the extraction of fine-grain Parallelism in irregular applications. We propose a feedback-directed deep jam algorithm, selecting a jamming strategy as a function of the architecture and application characteristics.
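
    A minimal illustrative sketch of the jamming step at the heart of deep jam, shown only on the regular base case (no control speculation): two outer iterations are unrolled and their inner loops jammed, with the accumulator renamed to break the false scalar dependence. The reduction kernel is an assumed example, not taken from the paper.

      /* Unroll-and-jam of a row-wise reduction over an n-by-m array.
       * The two jammed chains (s0, s1) are independent, so a wide-issue
       * core can overlap them; renaming the accumulator removes the
       * memory-based dependence a single shared scalar would create.  */
      float reduce_jammed(const float *x, int n, int m) {
          float total = 0.0f;
          int j;
          for (j = 0; j + 1 < n; j += 2) {   /* unroll outer loop by 2 */
              float s0 = 0.0f, s1 = 0.0f;    /* scalar renaming        */
              for (int i = 0; i < m; i++) {  /* jammed inner loops     */
                  s0 += x[(j    ) * m + i];  /* chain of iteration j   */
                  s1 += x[(j + 1) * m + i];  /* chain of iteration j+1 */
              }
              total += s0 + s1;
          }
          if (j < n)                         /* leftover odd row       */
              for (int i = 0; i < m; i++)
                  total += x[j * m + i];
          return total;
      }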

  • Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications
    International Conference on Parallel Architectures and Compilation Techniques, 2005
    Co-Authors: Patrick Carribault, Albert Cohen, William Jalby
    Abstract:

    A number of compute-intensive applications suffer from performance loss due to the lack of instruction-level Parallelism in sequences of dependent instructions. This is particularly acute on wide-issue architectures with large register banks, when the memory hierarchy (locality and bandwidth) is not the dominant bottleneck. We consider two real applications from computational biology and from cryptanalysis, characterized by long sequences of dependent instructions, irregular control-flow and intricate scalar and array dependence patterns. Although these applications exhibit excellent memory locality and branch-prediction behavior, state-of-the-art loop transformations and back-end optimizations are unable to exploit much instruction-level Parallelism. We show that good speedups can be achieved through deep jam, a new transformation of the program control- and data-flow. Deep jam combines scalar and array renaming with a generalized form of recursive unroll-and-jam; it brings together independent instructions across irregular control structures, removing memory-based dependences. This optimization contributes to the extraction of fine-grain Parallelism in irregular applications. We propose a feedback-directed deep jam algorithm, selecting a jamming strategy as a function of the architecture and application characteristics.

  • IEEE PACT - Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications
    14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), 2005
    Co-Authors: Patrick Carribault, Albert Cohen, William Jalby
    Abstract:

    A number of compute-intensive applications suffer from performance loss due to the lack of instruction-level Parallelism in sequences of dependent instructions. This is particularly acute on wide-issue architectures with large register banks, when the memory hierarchy (locality and bandwidth) is not the dominant bottleneck. We consider two real applications from computational biology and from cryptanalysis, characterized by long sequences of dependent instructions, irregular control-flow and intricate scalar and array dependence patterns. Although these applications exhibit excellent memory locality and branch-prediction behavior, state-of-the-art loop transformations and back-end optimizations are unable to exploit much instruction-level Parallelism. We show that good speedups can be achieved through deep jam, a new transformation of the program control- and data-flow. Deep jam combines scalar and array renaming with a generalized form of recursive unroll-and-jam; it brings together independent instructions across irregular control structures, removing memory-based dependences. This optimization contributes to the extraction of fine-grain Parallelism in irregular applications. We propose a feedback-directed deep jam algorithm, selecting a jamming strategy as a function of the architecture and application characteristics.

Saman Amarasinghe - One of the best experts on this subject based on the ideXlab platform.

  • LCTES - Compiler 2.0: Using Machine Learning to Modernize Compiler Technology
    The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 2020
    Co-Authors: Saman Amarasinghe
    Abstract:

    Modern compilers are still built using technology that existed decades ago. This includes basic algorithms and techniques for lexing, parsing, data-flow analysis, data dependence analysis, Vectorization, register allocation, instruction selection, and instruction scheduling. It is high time that we modernize our compiler toolchain. In this talk, I will show the path to the modernization of one important compiler technique -- Vectorization. Vectorization was first introduced in the era of Cray Vector processors in the 1980s. In modernizing Vectorization, I will first show how to use new techniques that better target modern hardware. While Vector supercomputers need large Vectors, which are only available by parallelizing loops, modern SIMD instructions efficiently work on short Vectors. Thus, in 2000, we introduced Superword Level Parallelism (SLP) based Vectorization. SLP finds short Vector instructions within basic blocks, and by loop unrolling we can convert loop-level Vector Parallelism to SLP. Next, I will show how we can take advantage of the power of modern computers for compilation, by using more accurate but expensive techniques to improve SLP Vectorization. Due to the hardware resource constraints of the era, like many other compiler optimizations, the original SLP implementation was a greedy algorithm. In 2018, we introduced goSLP, which uses integer linear programming to find an optimal instruction packing strategy, achieving a 7.58% geomean performance improvement over LLVM's SLP implementation on the SPEC2017fp C/C++ programs. Finally, I will show how to truly modernize a compiler by automatically learning the necessary components of the compiler with Ithemal and Vemal. goSLP is optimal only under LLVM's simple per-instruction additive cost model, which fits within the integer linear programming framework. However, the actual cost of execution on a modern out-of-order, pipelined, superscalar processor is much more complex. Manually building such cost models, like manually developing compiler optimizations, is costly, tedious, and error-prone, and makes it hard to keep up with architectural changes. Ithemal is the first learnt cost model for predicting the throughput of x86 basic blocks. It not only significantly outperforms state-of-the-art hand-written analytical tools such as llvm-mca (more than halving the error), but is also learnt from data with minimal human effort. Vemal is a learnt policy for end-to-end Vectorization, as opposed to a tuned heuristic, and it outperforms LLVM's SLP Vectorizer. These data-driven techniques can help achieve state-of-the-art results while also reducing the development and maintenance burden on the compiler developer.
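
    To make the SLP idea concrete, here is a minimal sketch of the packing an SLP Vectorizer performs, written by hand with SSE intrinsics; the function and the 4-wide target are illustrative assumptions, not goSLP's output.

      /* Four isomorphic scalar statements in one basic block:
       *   c[0]=a[0]+b[0]; c[1]=a[1]+b[1]; c[2]=a[2]+b[2]; c[3]=a[3]+b[3];
       * SLP packs the four loads, adds, and stores into single 4-wide
       * vector instructions. goSLP's contribution is choosing which
       * statements to pack via integer linear programming instead of
       * greedily; here only one packing is possible.                  */
      #include <immintrin.h>

      void add4(float *c, const float *a, const float *b) {
          __m128 va = _mm_loadu_ps(a);            /* packed a[0..3] loads */
          __m128 vb = _mm_loadu_ps(b);            /* packed b[0..3] loads */
          _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* one add, one store   */
      }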

  • Exploiting Vector Parallelism in Software Pipelined Loops
    International Symposium on Microarchitecture, 2005
    Co-Authors: Samuel Larsen, Rodric Rabbah, Saman Amarasinghe
    Abstract:

    An emerging trend in processor design is the addition of short Vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional Vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting Vector Parallelism in software pipelined loops. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where Vectorization decisions are more amenable to cost analysis. This is unique in that traditional Vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction- and data-level Parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38x, with an average of 1.11x.
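
    A back-of-the-envelope illustration (assumed machine parameters, not the paper's data) of why Vectorizing the kernel of a software-pipelined loop shortens the effective initiation interval (II): the resource-constrained minimum II stays the same per kernel iteration, but each iteration now retires a whole Vector of elements.

      /* ResMII = max over resources of ceil(uses per iteration / units).
       * Assumed machine: one load/store port, one FP unit. A scalar
       * iteration issuing 3 memory ops and 1 FP op has ResMII = 3; a
       * 4-wide vector kernel issues the same op counts but covers 4
       * elements, so the per-element II drops from 3 to 0.75.         */
      #include <stdio.h>

      static int ceil_div(int a, int b) { return (a + b - 1) / b; }

      int main(void) {
          int mem_ops = 3, mem_units = 1;      /* loads+stores, ports  */
          int fp_ops  = 1, fp_units  = 1;      /* multiply-adds, units */
          int mem_ii  = ceil_div(mem_ops, mem_units);
          int fp_ii   = ceil_div(fp_ops, fp_units);
          int res_mii = mem_ii > fp_ii ? mem_ii : fp_ii;
          int width   = 4;                     /* vector width         */
          printf("scalar kernel: II = %d cycles/element\n", res_mii);
          printf("vector kernel: II = %d cycles per %d elements = %.2f "
                 "cycles/element\n", res_mii, width,
                 (double)res_mii / width);
          return 0;
      }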

  • MICRO - Exploiting Vector Parallelism in Software Pipelined Loops
    38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005
    Co-Authors: Samuel Larsen, Rodric Rabbah, Saman Amarasinghe
    Abstract:

    An emerging trend in processor design is the addition of short Vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional Vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting Vector Parallelism in software pipelined loops. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where Vectorization decisions are more amenable to cost analysis. This is unique in that traditional Vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction- and data-level Parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38x, with an average of 1.11x.

David K. Poulsen - One of the best experts on this subject based on the ideXlab platform.

  • Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors
    Journal of Parallel and Distributed Computing, 1996
    Co-Authors: David K. Poulsen, Pen-chung Yew
    Abstract:

    This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished using a multiprocessor software-pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied using a parallelizing compiler and evaluated via execution-driven simulations of large, optimized numerical application codes with loop-level and Vector Parallelism.
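
    A minimal sketch of the two mechanisms from the consumer's and the producer's point of view. __builtin_prefetch is the real GCC/Clang builtin; forwarding_write() and compute() are hypothetical stand-ins, since the paper's Forwarding Write operation is a hardware/ISA mechanism with no standard C interface.

      /* Prefetching pulls data toward the consumer shortly before use;
       * forwarding pushes data from the producer into the consumer's
       * cache as soon as it is written, between two parallel loops.   */
      #define DIST 8                     /* assumed prefetch distance  */

      extern float compute(int i);                     /* hypothetical */
      extern void forwarding_write(const float *addr,  /* hypothetical */
                                   int consumer_cpu);  /* Forwarding
                                                          Write stand-in */

      /* consumer-initiated: software prefetching */
      void consume_with_prefetch(float *y, const float *x, int n) {
          for (int i = 0; i < n; i++) {
              if (i + DIST < n)       /* read (0), low temporal locality (1) */
                  __builtin_prefetch(&x[i + DIST], 0, 1);
              y[i] = 2.0f * x[i];
          }
      }

      /* producer-initiated: data forwarding */
      void produce_with_forwarding(float *x, int n, int consumer_cpu) {
          for (int i = 0; i < n; i++) {
              x[i] = compute(i);
              forwarding_write(&x[i], consumer_cpu);  /* push to consumer */
          }
      }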

  • Memory Latency Reduction via Data Prefetching and Data Forwarding in Shared Memory Multiprocessors
    1994
    Co-Authors: David K. Poulsen
    Abstract:

    This dissertation considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. The benefits of prefetching and forwarding are considered for large, numerical application codes with loop-level and Vector Parallelism. Data prefetching is applied to these applications using two different multiprocessor prefetching algorithms implemented within a parallelizing compiler. Data forwarding considers array references involved in communication-related accesses between successive parallel loops, rather than within a single loop nest. A hybrid prefetching and forwarding scheme and a compiler algorithm for data forwarding are also presented. EPG-sim, a system of execution-driven simulation tools for studying parallel architectures, algorithms, and applications, was developed as a prerequisite for this work. EPG-sim performs execution-driven simulation and critical path simulation within a single, integrated environment. EPG-sim provides an extremely wide range of cost/accuracy trade-offs and a number of novel features compared to existing execution-driven systems. The Parallelism and communication behavior of numerical application codes are studied via EPG-sim critical path simulation, which establishes the potential performance of prefetching and forwarding for these codes. The evaluation of prefetching and forwarding is accomplished via detailed EPG-sim execution-driven simulations of optimized, parallel versions of these application codes. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked Vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. Data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. A new hybrid prefetching and forwarding scheme is presented that provides increased performance stability by adapting to varying application characteristics and architectural parameters. The hybrid scheme is shown to be effective in improving the performance of forwarding in reduced cache sizes. A compiler algorithm for data forwarding is presented that implements point-to-point forwarding, hybrid prefetching and forwarding, and selective forwarding. Software and hardware support for prefetching and forwarding are also discussed.

  • ICPP (2) - Data Prefetching and Data Forwarding in Shared Memory Multiprocessors
    1994 International Conference on Parallel Processing (ICPP'94), 1994
    Co-Authors: David K. Poulsen, Pen-chung Yew
    Abstract:

    This paper studies and compares the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked Vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. The use of data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. Algorithms for data prefetching and data forwarding are implemented in a parallelizing compiler. Evaluation of the proposed schemes and algorithms is accomplished via execution-driven simulation of large, optimized, parallel numerical application codes with loop-level and Vector Parallelism. More data, discussion, and experiment details can be found in [1].

Pen-chung Yew - One of the best experts on this subject based on the ideXlab platform.

  • Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors
    Journal of Parallel and Distributed Computing, 1996
    Co-Authors: David K. Poulsen, Pen-chung Yew
    Abstract:

    This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished using a multiprocessor software-pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied using a parallelizing compiler and evaluated via execution-driven simulations of large, optimized numerical application codes with loop-level and Vector Parallelism.

  • ICPP (2) - Data Prefetching and Data Forwarding in Shared Memory Multiprocessors
    1994 International Conference on Parallel Processing (ICPP'94), 1994
    Co-Authors: David K. Poulsen, Pen-chung Yew
    Abstract:

    This paper studies and compares the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked Vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. The use of data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. Algorithms for data prefetching and data forwarding are implemented in a parallelizing compiler. Evaluation of the proposed schemes and algorithms is accomplished via execution-driven simulation of large, optimized, parallel numerical application codes with loop-level and Vector Parallelism. More data, discussion, and experiment details can be found in [1].

Jidong Zhai - One of the best experts on this subject based on the ideXlab platform.

  • BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU
    International Parallel and Distributed Processing Symposium, 2018
    Co-Authors: Jidong Zhai, Yifan Gong, Yuhao Zhu, Wei Liu, Jiangming Jin
    Abstract:

    Deep learning has revolutionized computer vision and other fields since its big bang in 2012. However, it is challenging to deploy Deep Neural Networks (DNNs) in real-world applications due to their high computational complexity. Binary Neural Networks (BNNs) dramatically reduce computational complexity by replacing most arithmetic operations with bitwise operations. Existing implementations of BNNs have focused on GPUs or FPGAs and use the conventional image-to-column method, which performs poorly for binary convolution due to its low arithmetic intensity and a data layout that is unfriendly to bitwise operations. We propose BitFlow, a gemm-operator-network three-level optimization framework for fully exploiting the computing power of BNNs on CPUs. BitFlow features a new class of algorithms named PressedConv for efficient binary convolution using a locality-aware layout and Vector Parallelism. We evaluate BitFlow with the VGG network. On a single core of an Intel Xeon Phi, BitFlow obtains a 1.8x speedup over unoptimized BNN implementations, and an 11.5x speedup over counterpart full-precision DNNs. On 64 cores, BitFlow enables BNNs to run 1.1x faster than counterpart full-precision DNNs on a GPU (GTX 1080).
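
    For background on why bitwise operations suffice, here is a minimal sketch of the standard XNOR-popcount identity underlying binary convolution; this is the generic kernel, not BitFlow's PressedConv layout, which additionally reorganizes data for locality and SIMD. __builtin_popcountll is the GCC/Clang builtin.

      /* With weights/activations in {-1,+1} bit-packed as +1 -> bit 1 and
       * -1 -> bit 0, a 64-element dot product reduces to XNOR + popcount:
       * positions with equal bits contribute +1, the rest -1, so
       * dot = matches - (64 - matches) = 2*matches - 64.               */
      #include <stdint.h>

      static inline int binary_dot64(uint64_t w, uint64_t x) {
          int matches = __builtin_popcountll(~(w ^ x)); /* equal-bit count */
          return 2 * matches - 64;
      }

      /* dot product over a filter bit-packed into n 64-bit words */
      static int binary_dot(const uint64_t *w, const uint64_t *x, int n) {
          int acc = 0;
          for (int i = 0; i < n; i++)
              acc += binary_dot64(w[i], x[i]);
          return acc;
      }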

  • IPDPS - BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU
    2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018
    Co-Authors: Jidong Zhai, Yifan Gong, Yuhao Zhu, Wei Liu, Lei Su, Jiangming Jin
    Abstract:

    Deep learning has revolutionized computer vision and other fields since its big bang in 2012. However, it is challenging to deploy Deep Neural Networks (DNNs) in real-world applications due to their high computational complexity. Binary Neural Networks (BNNs) dramatically reduce computational complexity by replacing most arithmetic operations with bitwise operations. Existing implementations of BNNs have focused on GPUs or FPGAs and use the conventional image-to-column method, which performs poorly for binary convolution due to its low arithmetic intensity and a data layout that is unfriendly to bitwise operations. We propose BitFlow, a gemm-operator-network three-level optimization framework for fully exploiting the computing power of BNNs on CPUs. BitFlow features a new class of algorithms named PressedConv for efficient binary convolution using a locality-aware layout and Vector Parallelism. We evaluate BitFlow with the VGG network. On a single core of an Intel Xeon Phi, BitFlow obtains a 1.8x speedup over unoptimized BNN implementations, and an 11.5x speedup over counterpart full-precision DNNs. On 64 cores, BitFlow enables BNNs to run 1.1x faster than counterpart full-precision DNNs on a GPU (GTX 1080).