Systolic Arrays

The Experts below are selected from a list of 11,226 Experts worldwide ranked by the ideXlab platform.

Songchun Zhu - One of the best experts on this subject based on the ideXlab platform.

  • Sparse Winograd convolutional neural networks on small-scale Systolic Arrays
    Field Programmable Gate Arrays, 2019
    Co-Authors: Feng Shi, Yuhe Gao, Benjamin Kuschner, Songchun Zhu
    Abstract:

    The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom consider the balance between high computation throughput and the ability of the memory subsystem to sustain it. In this paper, we implement a framework on FPGA that combines sparse Winograd convolution, clusters of small-scale Systolic Arrays, and a tailored recursive Z-Morton memory layout. We also provide an analytical model of the general Winograd convolution algorithm as a design reference. Experimental results on various CNN models show that the framework achieves very high utilization of computation resources, 20x~30x higher energy efficiency, and more than 5x speedup compared with the dense implementation.

  • Sparse Winograd convolutional neural networks on small-scale Systolic Arrays
    arXiv: Distributed Parallel and Cluster Computing, 2018
    Co-Authors: Feng Shi, Yuhe Gao, Benjamin Kuschner, Songchun Zhu
    Abstract:

    The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom consider the balance between high computation throughput and the ability of the memory subsystem to sustain it. In this paper, we implement an accelerator on FPGA that combines sparse Winograd convolution, clusters of small-scale Systolic Arrays, and a tailored memory layout. We also provide an analytical model of the general Winograd convolution algorithm as a design reference. Experimental results on VGG16 show that it achieves very high utilization of computational resources, 20x~30x higher energy efficiency, and more than 5x speedup compared with the dense implementation.
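
    Winograd convolution, the kernel accelerated in both entries above, trades multiplications for additions by transforming input tiles and filters before a cheap element-wise product. As a reference point only (not the authors' FPGA design, which tiles this element-wise stage onto clusters of small-scale Systolic Arrays), here is a minimal NumPy sketch of the 1-D F(2,3) case using the standard transform matrices:

        import numpy as np

        # Standard Winograd F(2,3) transforms: two outputs of a 3-tap filter
        # from a 4-sample input tile, using 4 multiplications instead of 6.
        BT = np.array([[1,  0, -1,  0],
                       [0,  1,  1,  0],
                       [0, -1,  1,  0],
                       [0,  1,  0, -1]], dtype=float)   # input transform
        G  = np.array([[1.0,  0.0, 0.0],
                       [0.5,  0.5, 0.5],
                       [0.5, -0.5, 0.5],
                       [0.0,  0.0, 1.0]])               # filter transform
        AT = np.array([[1, 1,  1,  0],
                       [0, 1, -1, -1]], dtype=float)    # output transform

        def winograd_f23(d, g):
            """Compute y[i] = sum_k d[i+k] * g[k] for i in {0, 1} via F(2,3)."""
            U = G @ g            # transformed filter (4 values)
            V = BT @ d           # transformed input tile (4 values)
            return AT @ (U * V)  # element-wise product, then output transform

        d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
        g = np.array([0.5, -1.0, 2.0])       # filter taps
        direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
        assert np.allclose(winograd_f23(d, g), direct)
        print(winograd_f23(d, g))            # -> [4.5, 6.0]

    Sparsity enters because many entries of the transformed filter U can be pruned to zero, so the corresponding element-wise multiplications can be skipped; broadly, that is the opportunity a sparse Winograd accelerator exploits.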

Mile K. Stojcev - One of the best experts on this subject based on the ideXlab platform.

  • Design of linear Systolic Arrays for matrix multiplication
    Advances in Electrical and Computer Engineering, 2014
    Co-Authors: Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev, Tatjana R Nikolic
    Abstract:

    This paper presents an architecture for matrix multiplication optimized to be integrated as an accelerator unit into a host computer. Two linear Systolic Arrays with unidirectional data flow ...

  • Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
    Microelectronics Reliability, 2011
    Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
    Abstract:

    A systematic approach to designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All of the Arrays can tolerate single transient errors and, with high probability, the majority of multiple errors. In order to provide high bandwidth in data access, a special hardware block, the address generator unit, was designed. Hardware complexity and the performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units, the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, and the time overhead is 50%. In addition, by using the hardware-implemented address generation unit, the total execution time of the algorithm is reduced almost five times compared with software address calculation.

  • Hexagonal Systolic Arrays for matrix multiplication
    2001
    Co-Authors: M. P. Bekakos, Emina I. Milovanovic, I. Ž. Milovanović, T. I. Tokić, Mile K. Stojcev
    Abstract:

    We consider the problem of matrix multiplication on hexagonal Systolic Arrays (SA). We begin with a description of the procedure for Systolic array design, which is based on data dependencies and space-time mapping of nested loop algorithms. We then introduce some performance measures that are used throughout the chapter to compare various SAs. We proceed with a modification of the standard design procedure that enables the synthesis of Systolic Arrays with the optimal number of processing elements (PEs) for a given problem size and minimal execution time for a given number of PEs. We then analyse and compare different hexagonal Arrays. Further, we show how the execution time of the matrix multiplication algorithm can be reduced if the number of PEs is increased beyond the optimal one. Finally, we address the problem of fault-tolerant matrix multiplication on hexagonal Arrays.

  • Two-level pipelined Systolic Arrays for matrix-vector multiplication
    Journal of Systems Architecture, 1998
    Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Milorad Tosic, Mile K. Stojcev
    Abstract:

    Novel two-level pipelined linear Systolic Arrays for matrix-vector multiplication are proposed. The number of processing elements in the proposed Arrays is reduced to half of the number of processing elements in the existing Arrays. An area-time (AT) criterion is used to compare the proposed Arrays with the fastest existing one.

  • The Design of Optimal Planar Systolic Arrays for Matrix Multiplication
    Computers & Mathematics with Applications, 1997
    Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev
    Abstract:

    The objective of this paper is to provide a systematic methodology for the design of space-time optimal, purely planar Systolic Arrays for matrix multiplication. The procedure is based on the data dependence approach. By the described procedure, we obtain ten different Systolic Arrays, denoted S1 to S10, classified into three classes according to the interconnection patterns between the processing elements. Common properties of all the Systolic array designs are: each Systolic array consists of n² processing elements, near-neighbour communications, and an active execution time of 3n − 2 time units. Compared to designs found in the literature, our procedure always leads to Systolic Arrays with an optimal number of processing elements. The improvement in the space domain is not achieved at the cost of execution time or PE complexity. We present a mathematically rigorous procedure which gives the exact ordering of the input matrix elements at the beginning of the computation. Examples illustrating the methodology are shown.
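
    The active execution time of 3n − 2 quoted in the last abstract is easy to sanity-check with a simple timing model. The sketch below assumes a generic output-stationary n × n array in which PE (i, j) accumulates C[i][j] and sees the operand pair (A[i][k], B[k][j]) at step i + j + k; it is an illustrative model of planar systolic matrix multiplication in general, not a reconstruction of the specific arrays S1 to S10.

        import numpy as np

        def systolic_matmul_timing(A, B):
            """Timing model of an output-stationary n x n systolic array.

            PE (i, j) accumulates C[i][j]; with row i of A and column j of B
            skewed by i and j steps respectively, the operand pair
            (A[i][k], B[k][j]) reaches PE (i, j) at step t = i + j + k.
            """
            n = A.shape[0]
            C = np.zeros((n, n))
            active_steps = 0
            for t in range(3 * n - 2):            # steps 0 .. 3n-3
                busy = False
                for i in range(n):
                    for j in range(n):
                        k = t - i - j
                        if 0 <= k < n:            # operands present this step
                            C[i, j] += A[i, k] * B[k, j]
                            busy = True
                active_steps += busy
            return C, active_steps

        n = 5
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)
        C, steps = systolic_matmul_timing(A, B)
        assert np.allclose(C, A @ B)
        print(steps, 3 * n - 2)   # both 13: the array is active for 3n - 2 steps

    With row i of A skewed by i steps and column j of B skewed by j steps, the last product reaches PE (n − 1, n − 1) at step 3n − 3, so the array is busy for 3n − 2 steps in total.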

M. P. Bekakos - One of the best experts on this subject based on the ideXlab platform.

  • Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
    Microelectronics Reliability, 2011
    Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
    Abstract:

    A systematic approach to designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All of the Arrays can tolerate single transient errors and, with high probability, the majority of multiple errors. In order to provide high bandwidth in data access, a special hardware block, the address generator unit, was designed. Hardware complexity and the performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units, the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, and the time overhead is 50%. In addition, by using the hardware-implemented address generation unit, the total execution time of the algorithm is reduced almost five times compared with software address calculation.

  • Hexagonal Systolic Arrays for matrix multiplication
    2001
    Co-Authors: M. P. Bekakos, Emina I. Milovanovic, I. Ž. Milovanović, T. I. Tokić, Mile K. Stojcev
    Abstract:

    We consider the problem of matrix multiplication on hexagonal Systolic Arrays (SA). We begin with a description of the procedure for Systolic array design, which is based on data dependencies and space-time mapping of nested loop algorithms. We then introduce some performance measures that are used throughout the chapter to compare various SAs. We proceed with a modification of the standard design procedure that enables the synthesis of Systolic Arrays with the optimal number of processing elements (PEs) for a given problem size and minimal execution time for a given number of PEs. We then analyse and compare different hexagonal Arrays. Further, we show how the execution time of the matrix multiplication algorithm can be reduced if the number of PEs is increased beyond the optimal one. Finally, we address the problem of fault-tolerant matrix multiplication on hexagonal Arrays.

  • VHDL Code Automatic Generator for Systolic Arrays
    2006 2nd International Conference on Information & Communication Technologies
    Co-Authors: I. N. Tselepis, M. P. Bekakos
    Abstract:

    Systolic Arrays speed up scientific computations through inherent parallelization, exploiting massive data-pipeline parallelism. In addition, they offer short, problem-size-independent signal paths, predictable performance, scalability, and simple design and test. In this paper, a server-based software tool for the automatic generation of VHDL code describing Systolic Array topologies is presented. The input parameters of the tool are the factors essential to the architectural description of a Systolic Array (SA): the interconnection topology of the Systolic array (linear, mesh, or hex-connected), the size of the Systolic array (the number of processing elements (PEs) in each dimension), the function of the PE (the relation between the output and input ports of every PE), and the bit length of the PE ports (the data word size of every port).
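
    As a toy illustration of the kind of parameterization the abstract above describes (interconnection topology, array size, PE function, port bit width), the sketch below emits a plain-text structural description of a linear array. It is a hypothetical stand-in, not the server-based tool itself, and it prints a simple netlist-style listing rather than VHDL.

        def describe_linear_array(num_pes: int, port_width: int, pe_function: str = "mac"):
            """Emit a plain-text structural description of a linear systolic array.

            Hypothetical stand-in for a netlist/VHDL generator: each PE gets an
            instance name, a function, a port width, and a link to its right
            neighbour, forming the array's unidirectional data path.
            """
            lines = [f"-- linear systolic array: {num_pes} PEs, "
                     f"{port_width}-bit ports, PE function = {pe_function}"]
            for i in range(num_pes):
                right = f"pe_{i + 1}.x_in" if i + 1 < num_pes else "array.y_out"
                lines.append(f"pe_{i}: function={pe_function}, width={port_width}, "
                             f"x_out -> {right}")
            return "\n".join(lines)

        print(describe_linear_array(num_pes=4, port_width=16))

    A real generator would map the same handful of parameters onto VHDL entity declarations and port maps; the point here is only that those parameters fully determine the regular structure of the array.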

Igor Z. Milovanovic - One of the best experts on this subject based on the ideXlab platform.

  • Design of linear Systolic Arrays for matrix multiplication
    Advances in Electrical and Computer Engineering, 2014
    Co-Authors: Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev, Tatjana R Nikolic
    Abstract:

    This paper presents an architecture for matrix multiplication optimized to be integrated as an accelerator unit into a host computer. Two linear Systolic Arrays with unidirectional data flow ...

  • Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
    Microelectronics Reliability, 2011
    Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
    Abstract:

    A systematic approach to designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All of the Arrays can tolerate single transient errors and, with high probability, the majority of multiple errors. In order to provide high bandwidth in data access, a special hardware block, the address generator unit, was designed. Hardware complexity and the performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units, the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, and the time overhead is 50%. In addition, by using the hardware-implemented address generation unit, the total execution time of the algorithm is reduced almost five times compared with software address calculation.

  • Two-level pipelined Systolic Arrays for matrix-vector multiplication
    Journal of Systems Architecture, 1998
    Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Milorad Tosic, Mile K. Stojcev
    Abstract:

    Novel two-level pipelined linear Systolic Arrays for matrix-vector multiplication are proposed. The number of processing elements in the proposed Arrays is reduced to half of the number of processing elements in the existing Arrays. An area-time (AT) criterion is used to compare the proposed Arrays with the fastest existing one.

  • The Design of Optimal Planar Systolic Arrays for Matrix Multiplication
    Computers & Mathematics with Applications, 1997
    Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev
    Abstract:

    The objective of this paper is to provide a systematic methodology for the design of space-time optimal, purely planar Systolic Arrays for matrix multiplication. The procedure is based on the data dependence approach. By the described procedure, we obtain ten different Systolic Arrays, denoted S1 to S10, classified into three classes according to the interconnection patterns between the processing elements. Common properties of all the Systolic array designs are: each Systolic array consists of n² processing elements, near-neighbour communications, and an active execution time of 3n − 2 time units. Compared to designs found in the literature, our procedure always leads to Systolic Arrays with an optimal number of processing elements. The improvement in the space domain is not achieved at the cost of execution time or PE complexity. We present a mathematically rigorous procedure which gives the exact ordering of the input matrix elements at the beginning of the computation. Examples illustrating the methodology are shown.

  • Matrix multiplication on non-planar Systolic Arrays
    4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS'99 (Cat. No.99EX365)
    Co-Authors: T. I. Tokić, Emina I. Milovanovic, Igor Z. Milovanovic, N. M. Novakovic, M. K. Stojcev
    Abstract:

    A modification of the standard design procedure for mapping nested loop algorithms onto Systolic Arrays is described in this article. This modification enables the authors to obtain non-planar Systolic Arrays for matrix multiplication with an optimal number of processing elements for a given problem size. The modification is based on the composition of two linear mappings.
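
    The mapping machinery referred to above follows the standard systolic-synthesis recipe: every iteration point (i, j, k) of the matrix-multiplication loop nest is assigned a time step by a linear schedule and a processing element by a linear allocation (projection). The brute-force check below verifies that a chosen schedule/allocation pair is conflict-free, i.e. that no two iterations land on the same PE at the same step; it illustrates the general technique with the classic planar choice, not the composition of two linear mappings used in the paper to obtain non-planar Arrays.

        from itertools import product

        def spacetime_map(n, schedule, allocate):
            """Map every iteration (i, j, k) of C[i][j] += A[i][k] * B[k][j]
            to a (time, pe) pair and check that the mapping is conflict-free."""
            seen = {}
            for i, j, k in product(range(n), repeat=3):
                t, pe = schedule(i, j, k), allocate(i, j, k)
                assert (t, pe) not in seen, f"conflict at {(t, pe)}"
                seen[(t, pe)] = (i, j, k)
            times = [t for t, _ in seen]
            pes = {pe for _, pe in seen}
            return len(pes), max(times) - min(times) + 1

        # Classic planar choice: linear schedule t = i + j + k and projection
        # along k, so that PE (i, j) keeps C[i][j].
        n = 4
        pes, steps = spacetime_map(n,
                                   schedule=lambda i, j, k: i + j + k,
                                   allocate=lambda i, j, k: (i, j))
        print(pes, steps)   # 16 PEs, 3n - 2 = 10 time steps

    Choosing a different allocation (projection direction) is exactly where the different array shapes come from: planar, hexagonal, and non-planar arrays correspond to different projections of the same iteration space.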

Raehong Park - One of the best experts on this subject based on the ideXlab platform.

  • Unified Systolic Arrays for computation of the DCT/DST/DHT
    IEEE Transactions on Circuits and Systems for Video Technology, 1997
    Co-Authors: Sung Bum Pan, Raehong Park
    Abstract:

    We propose unified Systolic Arrays for computation of the one-dimensional (1-D) and two-dimensional (2-D) discrete cosine transform/discrete sine transform/discrete Hartley transform (DCT/DST/DHT). By decomposing the transforms into even- and odd-numbered frequency samples, the proposed architecture computes the 1-D DCT/DST/DHT. Compared to conventional methods, the proposed Systolic Arrays exhibit advantages in terms of the number of PEs and latency. We generalize the proposed structure for computation of the 2-D DCT/DST/DHT. The unified Systolic Arrays can also be employed for computation of the inverse DCT/DST/DHT (IDCT/IDST/IDHT).

  • VLSI architectures for block matching algorithms using Systolic Arrays
    IEEE Transactions on Circuits and Systems for Video Technology, 1996
    Co-Authors: Sung Bum Pan, Seung Soo Chae, Raehong Park
    Abstract:

    We investigate hardware implementations of block matching algorithms (BMAs) for motion estimation in moving sequences. Using Systolic Arrays, we propose VLSI architectures for the two-stage BMA and the full search (FS) BMA. The two-stage BMA, which uses integral projections, greatly reduces the computational complexity while delivering performance comparable to that of the FS BMA. The proposed hardware architectures for the two-stage BMA and the FS BMA are faster than conventional hardware architectures and have lower hardware complexity. The proposed architecture for the first stage of the two-stage BMA is also modeled in VHDL and simulated. Simulation results confirm the functional validity of the proposed architecture.
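
    The last abstract above pairs a cheap first stage based on integral projections (row and column sums of a block) with full-search SAD matching. The sketch below is a plain software rendering of that two-stage idea, assuming a simple flow in which candidates are ranked by projection SAD and full SAD is computed only for a shortlist; it illustrates the kind of algorithm the VLSI architectures accelerate, not the architectures themselves, and the function names are hypothetical.

        import numpy as np

        def sad(a, b):
            """Sum of absolute differences."""
            return np.abs(a.astype(int) - b.astype(int)).sum()

        def projections(block):
            """Integral projections of a block: column sums and row sums."""
            return np.concatenate([block.sum(axis=0), block.sum(axis=1)])

        def two_stage_bma(cur_block, ref, top_left, search_range=4, keep=8):
            """Stage 1: rank candidates by projection SAD (cheap).
            Stage 2: full SAD only on the 'keep' best candidates."""
            n = cur_block.shape[0]
            y0, x0 = top_left
            cur_proj = projections(cur_block)
            candidates = []
            for dy in range(-search_range, search_range + 1):
                for dx in range(-search_range, search_range + 1):
                    y, x = y0 + dy, x0 + dx
                    if 0 <= y and 0 <= x and y + n <= ref.shape[0] and x + n <= ref.shape[1]:
                        cand = ref[y:y + n, x:x + n]
                        candidates.append((sad(cur_proj, projections(cand)), (dy, dx), cand))
            candidates.sort(key=lambda c: c[0])             # stage 1
            best = min(candidates[:keep],
                       key=lambda c: sad(cur_block, c[2]))  # stage 2
            return best[1]                                   # motion vector (dy, dx)

        rng = np.random.default_rng(0)
        ref = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
        cur_block = ref[10:18, 12:20]                # true motion vector (2, 4)
        print(two_stage_bma(cur_block, ref, top_left=(8, 8)))   # expected (2, 4)

    Because each projection of an N × N block has only 2N values, the first-stage comparison is far cheaper than a full N² SAD, which is, broadly, the source of the complexity reduction the abstract mentions.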