Vector Instruction

The experts below are selected from a list of 3852 experts worldwide, ranked by the ideXlab platform.

Amarasinghe Saman - One of the best experts on this subject based on the ideXlab platform.

  • Revec: Program Rejuvenation through Revectorization
    'Association for Computing Machinery (ACM)', 2019
    Co-Authors: Mendis Charith, Jain Ajay, Jain Paras, Amarasinghe Saman
    Abstract:

    Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD), or vector, instructions, which expose data-level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in their code. However, once a program is written using the vector intrinsics of a specific instruction set, the code becomes non-portable: modern compilers are unable to analyze and retarget it to newer vector instruction sets. Programmers therefore have to rewrite the same code by hand, using the intrinsics of a newer generation, to exploit the higher data widths and capabilities of new instruction sets. This process is tedious and error-prone, and it requires maintaining multiple code bases. We propose Revec, a compiler optimization pass that revectorizes already-vectorized code by retargeting it to use the vector instructions of newer generations. The transformation is transparent, happening at the compiler intermediate-representation level, and enables performance portability of hand-vectorized code. Revec achieves performance improvements in real-world, performance-critical kernels: geometric-mean speedups of 1.160× and 1.430× on fast integer unpacking kernels, and speedups of 1.145× and 1.195× on hand-vectorized x265 media codec kernels, when retargeting their SSE-series implementations to AVX2 and AVX-512 vector instructions respectively. We also extensively test Revec's impact on 216 intrinsic-rich implementations of image processing and stencil kernels relative to hand-retargeting.
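
    The portability problem Revec addresses is easiest to see side by side. Below is a hedged C sketch of a toy array-addition kernel hand-vectorized with SSE2 intrinsics, next to the manual AVX2 retarget a programmer would otherwise write by hand; the kernel and the function names are illustrative and do not come from the Revec paper.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <immintrin.h>   /* AVX2 intrinsics */

/* Hand-vectorized with SSE2: 4 x int32 per iteration (scalar tail omitted).
   Once written this way, the code is tied to 128-bit vectors. */
void add_sse2(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}

/* The manual AVX2 retarget: identical logic at 8 x int32 per iteration,
   but every vector type, intrinsic, and stride changes. Revec performs
   this kind of rewrite automatically at the compiler IR level. */
void add_avx2(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
}
```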

  • goSLP: Globally Optimized Superword Level Parallelism Framework
    'Association for Computing Machinery (ACM)', 2018
    Co-Authors: Mendis Charith, Amarasinghe Saman
    Abstract:

    Modern microprocessors are equipped with single instruction multiple data (SIMD), or vector, instruction sets, which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile and local, and they typically present only one vectorization strategy, which is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework that solves the statement packing problem in a pairwise-optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving geometric-mean speedups of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp, and 4.07% on the NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.
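
    For readers unfamiliar with superword level parallelism, the hedged C sketch below shows the statement packing transformation itself on a toy kernel: four isomorphic scalar statements become one 4-wide SIMD operation. goSLP's contribution is deciding which statements to pack (via ILP) and which permutations to insert (via dynamic programming), not the packing transformation as such; the kernel and names here are illustrative, not from the paper.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Four isomorphic, adjacent scalar statements: a classic SLP packing
   candidate. */
void inc4_scalar(int *p, const int *q) {
    p[0] = q[0] + 1;
    p[1] = q[1] + 1;
    p[2] = q[2] + 1;
    p[3] = q[3] + 1;
}

/* The packed form an SLP auto-vectorizer aims for: one vector load, one
   vector add, and one vector store replace four scalar statements. */
void inc4_packed(int *p, const int *q) {
    __m128i vq  = _mm_loadu_si128((const __m128i *)q);
    __m128i one = _mm_set1_epi32(1);
    _mm_storeu_si128((__m128i *)p, _mm_add_epi32(vq, one));
}
```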

Markus Püschel - One of the best experts on this subject based on the ideXlab platform.

  • High-Performance Sparse Fast Fourier Transforms
    Signal Processing Systems, 2014
    Co-Authors: Jörn Schumacher, Markus Püschel
    Abstract:

    The Sparse Fast Fourier Transform (SFFT) is a recent algorithm developed by Hassanieh et al. at MIT for computing Discrete Fourier Transforms of signals with a sparse frequency domain. A reference implementation of the algorithm exists and shows that the SFFT can be faster than modern FFT libraries. However, the reference implementation does not take advantage of modern hardware features such as vector instruction sets or multithreading. In this master's thesis, the performance of the reference implementation is analyzed and evaluated, several optimizations are proposed, and the optimizations are implemented in a high-performance SFFT library. The optimized code is evaluated for performance and compared to both the reference implementation and the FFTW library. The main result is that, depending on the input parameters, the optimized SFFT library is two to five times faster than the reference implementation.

  • Computer Generation of General-Size Linear Transform Libraries
    Symposium on Code Generation and Optimization, 2009
    Co-Authors: Yevgen Voronenko, Frédéric de Mesmay, Markus Püschel
    Abstract:

    The development of high-performance libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. Often, the library has to be reimplemented and reoptimized when a new platform is released. In this paper, we show how to automatically generate general input-size libraries for the domain of linear transforms. The input to our generator is a formal specification of the transform and of the recursive algorithms the library should use; the output is a library that supports general input sizes, is vectorized and multithreaded, provides an adaptation mechanism for the memory hierarchy, and has excellent performance, comparable to or better than the best human-written libraries. Further, we show that our library generator enables various customizations; one example is the generation of Java libraries.
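
    The "formal specification of the recursive algorithms" mentioned above refers to breakdown rules written in tensor-product notation. As a standard example from this line of work (not quoted verbatim from the paper), the Cooley-Tukey FFT can be specified as the factorization

\[
  \mathrm{DFT}_{km} = (\mathrm{DFT}_k \otimes I_m)\; T^{km}_m\; (I_k \otimes \mathrm{DFT}_m)\; L^{km}_k ,
\]

    where $I_n$ is the identity, $T^{km}_m$ a diagonal matrix of twiddle factors, and $L^{km}_k$ a stride permutation. The generator recursively expands such rules and compiles the result into vectorized, multithreaded code.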

Prieto-matias Manuel - One of the best experts on this subject based on the ideXlab platform.

  • SWIMM 2.0: Enhanced Smith–Waterman on Intel's Multicore and Manycore Architectures Based on AVX-512 Vector Extensions
    'Springer Science and Business Media LLC', 2019
    Co-Authors: Rucci Enzo, García Sánchez Carlos, Botella Guillermo, De Giusti Armando Eduardo, Naiouf Marcelo, Prieto-matias Manuel
    Abstract:

    The well-known Smith–Waterman (SW) algorithm is the most commonly used method for local sequence alignment, but its acceptance is limited by the computational requirements of large protein databases. Although the acceleration of SW has been studied on many parallel platforms, there are hardly any studies that take advantage of the latest Intel architectures based on AVX-512 vector extensions. This SIMD set is currently supported by Intel's Knights Landing (KNL) accelerator and Intel's Skylake (SKL) general-purpose processors. In this paper, we present SWIMM 2.0, an SW version optimized for both architectures. The novelty of this vector instruction set requires revising previous programming and optimization techniques. SWIMM 2.0 is based on massive multithreading and SIMD exploitation. It is competitive with other state-of-the-art implementations, reaching 511 GCUPS on a single KNL node and 734 GCUPS on a server equipped with dual SKL processors. Moreover, these performance rates make SWIMM 2.0 the implementation with the most efficient energy footprint in this study, achieving 2.94 GCUPS/Watt on the SKL processor.
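
    As a rough illustration of how AVX-512 maps onto the SW recurrence, the hedged C sketch below computes the core cell update H = max(0, Hdiag + sub, E, F) for sixteen 32-bit alignment cells at once. It assumes an inter-sequence or anti-diagonal layout and omits the E/F gap updates and score-profile construction; it is not SWIMM 2.0's actual kernel.

```c
#include <immintrin.h>   /* AVX-512F intrinsics */

/* Smith-Waterman cell update, 16 x int32 lanes per call:
   H = max(0, H_diag + substitution_score, E, F). */
static inline __m512i sw_cell_update(__m512i h_diag, __m512i sub,
                                     __m512i e, __m512i f) {
    __m512i h = _mm512_add_epi32(h_diag, sub);          /* match/mismatch  */
    h = _mm512_max_epi32(h, e);                         /* gap in query    */
    h = _mm512_max_epi32(h, f);                         /* gap in database */
    return _mm512_max_epi32(h, _mm512_setzero_si512()); /* local-alignment
                                                           floor at zero   */
}
```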

Mendis Charith - One of the best experts on this subject based on the ideXlab platform.

  • Revec: Program Rejuvenation through Revectorization
    'Association for Computing Machinery (ACM)', 2019
    Co-Authors: Mendis Charith, Jain Ajay, Jain Paras, Amarasinghe Saman
    Abstract: identical to the entry under Amarasinghe Saman above.

  • goSLP: Globally Optimized Superword Level Parallelism Framework
    'Association for Computing Machinery (ACM)', 2018
    Co-Authors: Mendis Charith, Amarasinghe Saman
    Abstract: identical to the entry under Amarasinghe Saman above.

Rucci Enzo - One of the best experts on this subject based on the ideXlab platform.

  • SWIMM 2.0: Enhanced Smith–Waterman on Intel's Multicore and Manycore Architectures Based on AVX-512 Vector Extensions
    'Springer Science and Business Media LLC', 2019
    Co-Authors: Rucci Enzo, García Sánchez Carlos, Botella Guillermo, De Giusti Armando Eduardo, Naiouf Marcelo, Prieto-matias Manuel
    Abstract: identical to the entry under Prieto-matias Manuel above.