Vector Register

The Experts below are selected from a list of 2622 Experts worldwide, ranked by the ideXlab platform.

Mateo Valero - One of the best experts on this subject based on the ideXlab platform.

  • Using Arm’s scalable Vector extension on stencil codes
    The Journal of Supercomputing, 2019
    Co-Authors: Adrià Armejach, Rubén Langarita, Rekai González-alberquilla, Chris Adeniyi-jones, Juan M. Cebrian, Helena Caminal, Marc Casas, Mateo Valero, Miquel Moretó
    Abstract:

    Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is vector-length agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and provide insight useful for compiler optimizers.
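
    To make the VLA idea concrete, here is a minimal sketch (not from the paper) of a 3-point stencil written with Arm's ACLE SVE intrinsics. The predicate-driven loop compiles once and runs unchanged on any hardware vector length from 128 to 2048 bits; the function name and the 1/3 weighting are illustrative.

```c
/* Minimal VLA 3-point stencil sketch using Arm ACLE SVE intrinsics.
   Build with an SVE-capable toolchain, e.g. gcc -O2 -march=armv8-a+sve. */
#include <arm_sve.h>
#include <stddef.h>

/* out[i] = (in[i-1] + in[i] + in[i+1]) / 3 for the interior points. */
void stencil_1d(const double *in, double *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i += svcntd()) {
        /* Predicate covers only the lanes still inside [1, n-2], so no
           scalar remainder loop is needed at any vector length. */
        svbool_t pg = svwhilelt_b64(i, n - 1);
        svfloat64_t left   = svld1_f64(pg, &in[i - 1]);
        svfloat64_t center = svld1_f64(pg, &in[i]);
        svfloat64_t right  = svld1_f64(pg, &in[i + 1]);
        svfloat64_t sum = svadd_f64_x(pg, svadd_f64_x(pg, left, center), right);
        svst1_f64(pg, &out[i], svmul_n_f64_x(pg, sum, 1.0 / 3.0));
    }
}
```

    Because svcntd() reports the lane count of the machine the binary actually runs on, the same executable strip-mines in 2-element chunks on 128-bit hardware and 32-element chunks on 2048-bit hardware.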

  • Efficiency analysis of modern Vector architectures: Vector ALU sizes, core counts and clock frequencies
    The Journal of Supercomputing, 2019
    Co-Authors: Adrian Barredo, Juan M. Cebrian, Marc Casas, Mateo Valero, Miquel Moretó
    Abstract:

    Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is arriving at an impasse. Optimizing the usage of the available transistors within the thermal dissipation capabilities of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism, operating on several elements per instruction and thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we perform an analysis of different resource optimization strategies for vector architectures. In particular, we expose the need to break down voltage and frequency domains for the LLC, ALUs and vector ALUs if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime.
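
    The paper argues for hardware-level reconfiguration of the vector length; as a point of reference, Linux already exposes a per-thread software knob for SVE via prctl(2). The sketch below is an illustration of that existing mechanism only, not the paper's proposal; a runtime could use it to request shorter vectors for low-arithmetic-intensity phases.

```c
/* Sketch: adjusting the effective SVE vector length per thread on Linux. */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SVE_SET_VL
#define PR_SVE_SET_VL 50   /* fallback for older kernel headers */
#define PR_SVE_GET_VL 51
#endif

/* Request a vector length in bytes; the kernel clamps the request to the
   nearest supported length. Returns the resulting length, or -1 on error. */
int set_sve_vl_bytes(int vl_bytes) {
    if (prctl(PR_SVE_SET_VL, vl_bytes) < 0) return -1;
    return prctl(PR_SVE_GET_VL) & 0xffff;  /* low 16 bits encode the VL */
}

int main(void) {
    int vl = set_sve_vl_bytes(16);  /* ask for 128-bit vector registers */
    if (vl > 0) printf("running with %d-byte vector registers\n", vl);
    return 0;
}
```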

  • An Integrated Vector-Scalar Design on an In-Order ARM Core
    ACM Transactions on Architecture and Code Optimization, 2017
    Co-Authors: Milan Stanic, Oscar Palomar, Timothy Hayes, Ivan Ratkovic, Adrian Cristal, Osman Unsal, Mateo Valero
    Abstract:

    In the low-end mobile processor market, power, energy, and area budgets are significantly lower than in the server/desktop/laptop/high-end mobile markets. It has been shown that vector processors are a highly energy-efficient way to increase performance; however, adding support for them incurs area and power overheads that would not be acceptable for low-end mobile processors. In this work, we propose an integrated vector-scalar design for the ARM architecture that mostly reuses scalar hardware to support the execution of vector instructions. The key element of the design is our proposed block-based model of execution that groups vector computational instructions together to execute them in a coordinated manner. We implement a classic vector unit and compare its results against our integrated design. Our integrated design improves the performance (by more than 6×) and the energy consumption (by up to 5×) of a scalar in-order core with negligible area overhead (only 4.7% when using a vector register with 32 elements). In contrast, the area overhead of the classic vector unit can be significant (around 44%) if a dedicated vector floating-point unit is incorporated. Our block-based vector execution outperforms the classic vector unit for all kernels with floating-point data and also consumes less energy. We complement the integrated design with three energy- and performance-efficient techniques that further reduce power and increase performance. The first covers the design and implementation of chaining logic optimized to work with the cache hierarchy through vector memory instructions, the second reduces the number of reads/writes from/to the vector register file, and the third optimizes complex memory access patterns with the memory shape instruction and unified indexed vector load.
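
    As a rough illustration of the hardware-reuse idea (a behavioral sketch only, not the authors' microarchitecture), the C model below shows why executing one vector instruction over a 32-element register on the existing scalar FPU still saves energy: a single fetch/decode is amortized over many arithmetic operations.

```c
/* Behavioral model: one "vector add" drives the scalar FPU 32 times.
   MVL matches the 32-element vector register cited in the abstract. */
#define MVL 32
typedef struct { float e[MVL]; } vreg_t;

/* One decoded vector instruction retires up to MVL scalar FPU operations,
   so fetch/decode energy per useful operation drops by up to 32x. */
static void vadd(vreg_t *dst, const vreg_t *a, const vreg_t *b, int vl) {
    for (int i = 0; i < vl; i++)      /* one scalar FPU operation per step */
        dst->e[i] = a->e[i] + b->e[i];
}
```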

  • VECPAR - Registers Size Influence on Vector Architectures
    Vector and Parallel Processing – VECPAR’98, 1999
    Co-Authors: Luisa F. Villa, Roger Espasa, Mateo Valero
    Abstract:

    In this work, we studied the influence of vector register size on two different concepts of vector architecture. Long vector registers play an important role in a conventional vector architecture; however, even in highly vectorisable codes, only a small fraction of those large vector registers is used. Reducing the vector register size on a conventional vector architecture results in severe performance degradation, with slowdowns in the range of 1.8 to 3.8. When out-of-order execution is added to a vector architecture, the need for long vector registers is reduced. We used a trace-driven approach to simulate a selection of the Perfect Club and SPECfp92 programs. The simulations showed that reducing the register size on an out-of-order vector architecture led to slowdowns in the range of only 1.04 to 1.9, which compare favourably with the values found for a conventional vector machine. Even when reducing the register size to 1/4 of the original on an out-of-order machine, the slowdown was between 1.04 and 1.5, still better than on a conventional vector machine. Finally, when comparing both architectures with the same register file size (8 KB), we found that the gains in performance from out-of-order execution were between 1.13 and 1.40.
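
    A back-of-the-envelope sketch helps frame the trade-off: under a fixed 8 KB register file budget, a machine can have a few long registers (the conventional design) or many short ones (friendlier to renaming in an out-of-order core). The layouts below are assumptions for illustration, not the configurations evaluated in the paper.

```c
/* Enumerate register-count/length splits of a fixed 8 KB register file. */
#include <stdio.h>

int main(void) {
    const int elem_bytes = 8;          /* 64-bit elements assumed */
    const int budget = 8 * 1024;       /* 8 KB vector register file */
    for (int nregs = 8; nregs <= 64; nregs *= 2) {
        int vl = budget / (nregs * elem_bytes);  /* elements per register */
        printf("%2d registers x %3d elements\n", nregs, vl);
    }
    return 0;
}
```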

Wei-chung Hsu - One of the best experts on this subject based on the ideXlab platform.

  • CGO - Translating traditional SIMD instructions to Vector length agnostic architectures
    2019 IEEE ACM International Symposium on Code Generation and Optimization (CGO), 2019
    Co-Authors: Wei-chung Hsu
    Abstract:

    One interesting trend in SIMD architecture is towards Vector Length Agnostic (VLA) designs. For example, ARM's new vector ISA, the Scalable Vector Extension (SVE), and the RISC-V vector extension are adopting this direction. VLA decouples the vector register length from the compiled binary, so that the same executable can run on implementations with different vector lengths. However, in the current application world, the majority of SIMD code is fixed-length based, such as ARM NEON, x86 SSE, x86 AVX, and traditional vector machines. Therefore, migrating legacy SIMD code to VLA architectures is an interesting and important technical challenge.
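
    The sketch below illustrates the migration gap described above (illustrative code, not the paper's translator): a fixed-width ARM NEON loop with a hard-coded 4-lane width and a scalar tail, next to a vector-length-agnostic SVE counterpart in which the governing predicate absorbs the remainder.

```c
/* Fixed-width NEON vs. vector-length-agnostic SVE for c[i] = a[i] + b[i]. */
#include <arm_neon.h>
#include <arm_sve.h>
#include <stddef.h>

/* Legacy fixed-width SIMD: 4 floats per iteration plus a scalar tail. */
void add_neon(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vst1q_f32(&c[i], vaddq_f32(vld1q_f32(&a[i]), vld1q_f32(&b[i])));
    for (; i < n; i++)               /* tail elements, one at a time */
        c[i] = a[i] + b[i];
}

/* VLA rewrite: no hard-coded width and no tail loop; the predicate
   deactivates out-of-range lanes at whatever length the hardware has. */
void add_sve(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svst1_f32(pg, &c[i], svadd_f32_x(pg, svld1_f32(pg, &a[i]),
                                             svld1_f32(pg, &b[i])));
    }
}
```

    A translator for legacy binaries has to bridge exactly this gap: hard-coded widths, tail loops, and lane-indexed shuffles all assume a register size that VLA hardware no longer guarantees.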

Moreto Planas Miquel - One of the best experts on this subject based on the ideXlab platform.

  • Efficiency analysis of modern Vector architectures: Vector ALU sizes, core counts and clock frequencies
    2020
    Co-Authors: Barredo Ferreira Adrián, Cebrián González, Juan Manuel, Valero Cortés Mateo, Casas Guix Marc, Moreto Planas Miquel
    Abstract:

    Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is arriving at an impasse. Optimizing the usage of the available transistors within the thermal dissipation capabilities of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism, operating on several elements per instruction and thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we perform an analysis of different resource optimization strategies for vector architectures. In particular, we expose the need to break down voltage and frequency domains for the LLC, ALUs and vector ALUs if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime. Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047), and Marie Curie (Grant No. 2013 BP_B 00243).

  • Using Arm’s scalable Vector extension on stencil codes
    'Springer Science and Business Media LLC', 2020
    Co-Authors: Armejach Sanosa Adrià, Cebrián González, Juan Manuel, Valero Cortés Mateo, Casas Guix Marc, Caminal Pallarés Helena, Langarita Rubén, González-alberquilla Rekai, Adeniyi-jones Chris, Moreto Planas Miquel
    Abstract:

    Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is vector-length agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and provide insight useful for compiler optimizers.

  • Stencil codes on a Vector length agnostic architecture
    'Association for Computing Machinery (ACM)', 2018
    Co-Authors: Armejach Sanosa Adrià, Cebrián González, Juan Manuel, Valero Cortés Mateo, Caminal Pallarés Helena, González-alberquilla Rekai, Adeniyi-jones Chris, Casas Marc, Moreto Planas Miquel
    Abstract:

    Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length. In this paper we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers. This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreements no. 671697 and no. 779877. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.

B Hanounik - One of the best experts on this subject based on the ideXlab platform.

  • Linear-time matrix transpose algorithms using vector register file with diagonal registers
    International Parallel and Distributed Processing Symposium, 2001
    Co-Authors: B Hanounik
    Abstract:

    The matrix transpose operation (MT) is used frequently in many multimedia and high-performance applications; a faster MT operation therefore shortens the execution time of these applications. In this paper we propose two new MT algorithms. The algorithms exploit diagonal register properties to achieve linear-time execution of the MT operation on a vector processor that supports diagonal registers. We present the algorithms along with proofs, examples, and various enhancements. A performance evaluation shows that the proposed algorithms are at least twice as fast as one of the leading MT algorithms, such as an implementation based on Motorola's AltiVec architecture (n ≥ 16). We believe that our work opens new doors to improving the execution time of many two-dimensional operations such as DCT, DFT, and Shearsort.
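
    The diagonal-register idea can be mimicked in plain C (a software simulation only; the paper assumes hardware diagonal access that makes each diagonal read or write a single register operation). With register i holding row i, reading the offset-k diagonal gathers A[i][(i+k) mod n] across all registers, and writing it back along the matching output diagonal yields the transpose in n such operations, hence the linear-time claim.

```c
/* Scalar simulation of a diagonal-register matrix transpose. Each inner
   loop stands for one hardware diagonal access that would be O(1). */
#include <stdio.h>
#define N 4

void transpose_diagonal(const int a[N][N], int t[N][N]) {
    for (int k = 0; k < N; k++) {          /* n diagonal operations total */
        int d[N];
        for (int i = 0; i < N; i++)        /* "diagonal read" of offset k */
            d[i] = a[i][(i + k) % N];
        for (int i = 0; i < N; i++)        /* matching "diagonal write":   */
            t[(i + k) % N][i] = d[i];      /* t[j][i] = a[i][j], j=(i+k)%N */
    }
}

int main(void) {
    int a[N][N], t[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) a[i][j] = i * N + j;
    transpose_diagonal(a, t);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (t[i][j] != a[j][i]) { puts("mismatch"); return 1; }
    puts("transpose ok");
    return 0;
}
```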
