The Experts below are selected from a list of 11226 Experts worldwide ranked by the ideXlab platform
Songchun Zhu - One of the best experts on this subject based on the ideXlab platform.
-
sparse winograd convolutional neural networks on small scale Systolic Arrays
Field Programmable Gate Arrays, 2019
Co-Authors: Feng Shi, Yuhe Gao, Benjamin Kuschner, Songchun Zhu
Abstract: The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom balance the high throughput of the compute fabric against the ability of the memory subsystem to sustain it. In this paper, we implement a framework on FPGA that combines sparse Winograd convolution, clusters of small-scale Systolic Arrays, and a tailored recursive Z-Morton memory layout. We also provide an analytical model of the general Winograd convolution algorithm as a design reference. Experimental results on various CNN models show very high computation resource utilization, 20x~30x energy efficiency, and more than 5x speedup compared with the dense implementation.
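The Winograd minimal filtering algorithm the abstract builds on can be illustrated with the smallest 1-D case, F(2,3): two convolution outputs from a 3-tap filter using 4 multiplications instead of the direct method's 6. The transform matrices below are the standard published ones (Lavin and Gray's formulation), not the paper's FPGA datapath; this is a numerical sketch only.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices: Y = A^T [(G g) . (B^T d)],
# where d is a 4-element input tile and g a 3-tap filter.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of the 1-D convolution of tile d (length 4) with filter g (length 3),
    using only 4 elementwise multiplies."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
# Direct sliding-window reference: y[i] = sum_j g[j] * d[i + j]
direct = np.array([sum(g[j] * d[0 + j] for j in range(3)),
                   sum(g[j] * d[1 + j] for j in range(3))])
```

Sparsity enters by skipping zero entries of the transformed filter `G @ g`, which is where the paper's sparse variant saves work.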
-
sparse winograd convolutional neural networks on small scale Systolic Arrays
arXiv: Distributed Parallel and Cluster Computing, 2018
Co-Authors: Feng Shi, Yuhe Gao, Benjamin Kuschner, Songchun Zhu
Abstract: The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom balance the high throughput of the compute fabric against the ability of the memory subsystem to sustain it. In this paper, we implement an accelerator on FPGA that combines sparse Winograd convolution, clusters of small-scale Systolic Arrays, and a tailored memory layout design. We also provide an analytical model of the general Winograd convolution algorithm as a design reference. Experimental results on VGG16 show very high computational resource utilization, 20x~30x energy efficiency, and more than 5x speedup compared with the dense implementation.
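The "tailored memory layout" here is named recursive Z-Morton in the journal version of this work. The idea of a Z-order (Morton) layout is to interleave the bits of a tile's 2-D coordinates to form its linear address, so that 2-D-adjacent tiles stay close in memory. A minimal sketch of the index computation (a generic illustration, not the accelerator's address generator):

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order (Morton) index.

    Z-Morton layouts keep 2-D-adjacent tiles close in linear memory,
    improving locality for blocked access patterns.
    """
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)        # x bits occupy even positions
        idx |= ((y >> b) & 1) << (2 * b + 1)    # y bits occupy odd positions
    return idx
```

Walking (x, y) in Morton order traces the recursive Z-shaped curve that gives the layout its name.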
Mile K. Stojcev - One of the best experts on this subject based on the ideXlab platform.
-
design of linear Systolic Arrays for matrix multiplication
Advances in Electrical and Computer Engineering, 2014
Co-Authors: Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev, Tatjana R. Nikolic
Abstract: This paper presents an architecture for matrix multiplication optimized to be integrated as an accelerator unit into a host computer. Two linear Systolic Arrays with unidirectional data flow ...
-
Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
Microelectronics Reliability, 2011
Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
Abstract: A systematic approach for designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All the Arrays can tolerate single transient errors, and the majority of multiple errors with high probability. To provide high bandwidth in data access, special hardware called an address generator unit was designed. Hardware complexity and performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units; the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, while the time overhead is 50%. In addition, the hardware-implemented address generation unit reduces the total execution time of the algorithm almost five times compared with software address calculations.
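The space-time redundancy principle the abstract relies on can be sketched in a few lines: repeat a computation in successive time slots and majority-vote the results, so a single transient error in any one slot is masked. This is only the generic idea, not the OUSA array's actual redundancy scheme; `redundant_mac` and its `fault_at` parameter are hypothetical names for illustration.

```python
def vote3(a, b, c):
    # Majority vote: with at most one faulty value, the majority is correct.
    return a if a == b or a == c else b

def redundant_mac(a, b, acc, fault_at=None):
    """Compute acc + a*b three times in successive time slots (time redundancy).

    fault_at injects a transient error into one of the three slots; the
    voter masks it, so the returned result is still correct.
    """
    results = []
    for slot in range(3):
        r = acc + a * b
        if slot == fault_at:
            r += 1  # model a transient bit-flip style error
        results.append(r)
    return vote3(*results)
```

The paper's 50% time overhead corresponds to this kind of trade: extra time slots buy error masking without triplicating the hardware.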
-
Hexagonal Systolic Arrays for matrix multiplication
2001
Co-Authors: M. P. Bekakos, Emina I. Milovanovic, I. Ž. Milovanović, T. I. Tokić, Mile K. Stojcev
Abstract: We consider the problem of matrix multiplication on hexagonal Systolic Arrays (SAs). We begin with a description of the procedure for Systolic array design, which is based on data dependency and space-time mapping of nested loop algorithms. We then introduce some performance measures that are used throughout the chapter to compare various SAs. We proceed with a modification of the standard design procedure that enables the synthesis of Systolic Arrays with the optimal number of processing elements (PEs) for a given problem size and minimal execution time for a given number of PEs. We then analyse and compare different hexagonal Arrays. Further, we show how the execution time of the matrix multiplication algorithm can be reduced if the number of PEs is increased beyond the optimal one. Finally, we address the problem of fault-tolerant matrix multiplication on hexagonal Arrays.
-
two level pipelined Systolic Arrays for matrix vector multiplication
Journal of Systems Architecture, 1998
Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Milorad Tosic, Mile K. Stojcev
Abstract: Novel two-level pipelined linear Systolic Arrays for matrix-vector multiplication are proposed. The number of processing elements in the proposed Arrays is reduced to half of that in the existing Arrays. An area-time (AT) criterion is used to compare the proposed Arrays with the fastest existing one.
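As background for the abstract above, a basic linear systolic array for y = A·x can be simulated cycle by cycle: PE j holds x[j], the rows of A stream in skewed by one cycle per PE, and partial sums ripple through the array. This models the conventional baseline design, not the paper's half-PE two-level pipelined arrays.

```python
def systolic_matvec(A, x):
    """Cycle-level model of a linear systolic array computing y = A x.

    PE j stores x[j]; at cycle t it sees matrix element A[t - j][j] and adds
    A[t - j][j] * x[j] to the partial sum arriving from PE j - 1. Results
    emerge from the last PE; the pipeline drains in m + n - 1 cycles.
    """
    m, n = len(A), len(x)
    partial = [0.0] * n          # partial[j]: sum currently held at PE j
    y = []
    for t in range(m + n - 1):
        # shift partial sums one PE to the right, inject a fresh 0 at PE 0
        partial = [0.0] + partial[:-1]
        for j in range(n):
            i = t - j            # row index whose element reaches PE j now
            if 0 <= i < m:
                partial[j] += A[i][j] * x[j]
        i_out = t - (n - 1)      # row whose result leaves the last PE
        if 0 <= i_out < m:
            y.append(partial[n - 1])
    return y
```

For example, `systolic_matvec([[1, 2], [3, 4]], [5, 6])` yields [17, 39], matching the ordinary matrix-vector product.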
-
The Design of Optimal Planar Systolic Arrays for Matrix Multiplication
Computers & Mathematics with Applications, 1997
Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev
Abstract: The objective of this paper is to provide a systematic methodology for the design of space-time optimal pure planar Systolic Arrays for matrix multiplication. The procedure is based on the data dependence approach. By the described procedure, we obtain ten different Systolic Arrays, denoted S1 to S10, classified into three classes according to the interconnection patterns between the processing elements. Common properties of all the Systolic array designs are: each Systolic array consists of n² processing elements, near-neighbour communications, and an active execution time of 3n − 2 time units. Compared with designs found in the literature, our procedure always leads to Systolic Arrays with the optimal number of processing elements. The improvement in the space domain is not achieved at the cost of execution time or PE complexity. We present a mathematically rigorous procedure that gives the exact ordering of the input matrix elements at the beginning of the computation. Examples illustrating the methodology are shown.
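The n² PEs / 3n − 2 cycles figures can be reproduced with a behavioral model of a classic output-stationary planar array (this is the textbook design, not necessarily any particular one of the paper's S1 to S10): A's rows enter from the left and B's columns from the top, each skewed by its index, so PE (i, j) sees operand pair k = t − i − j at cycle t.

```python
def systolic_matmul(A, B):
    """Cycle-level model of an n x n output-stationary systolic mesh.

    C[i][j] accumulates in place at PE (i, j). The last PE (n-1, n-1)
    consumes its final operand pair when t - i - j = n - 1, i.e. at
    t = 3n - 3, so the whole product takes 3n - 2 active cycles.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):           # active execution time: 3n - 2 cycles
        for i in range(n):
            for j in range(n):
                k = t - i - j            # operand index reaching PE (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The skewed start times (PE (i, j) idles for its first i + j cycles) are exactly the "exact ordering of the input matrix elements" the abstract refers to.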
M. P. Bekakos - One of the best experts on this subject based on the ideXlab platform.
-
Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
Microelectronics Reliability, 2011
Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
Abstract: A systematic approach for designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All the Arrays can tolerate single transient errors, and the majority of multiple errors with high probability. To provide high bandwidth in data access, special hardware called an address generator unit was designed. Hardware complexity and performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units; the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, while the time overhead is 50%. In addition, the hardware-implemented address generation unit reduces the total execution time of the algorithm almost five times compared with software address calculations.
-
Hexagonal Systolic Arrays for matrix multiplication
2001
Co-Authors: M. P. Bekakos, Emina I. Milovanovic, I. Ž. Milovanović, T. I. Tokić, Mile K. Stojcev
Abstract: We consider the problem of matrix multiplication on hexagonal Systolic Arrays (SAs). We begin with a description of the procedure for Systolic array design, which is based on data dependency and space-time mapping of nested loop algorithms. We then introduce some performance measures that are used throughout the chapter to compare various SAs. We proceed with a modification of the standard design procedure that enables the synthesis of Systolic Arrays with the optimal number of processing elements (PEs) for a given problem size and minimal execution time for a given number of PEs. We then analyse and compare different hexagonal Arrays. Further, we show how the execution time of the matrix multiplication algorithm can be reduced if the number of PEs is increased beyond the optimal one. Finally, we address the problem of fault-tolerant matrix multiplication on hexagonal Arrays.
-
VHDL Code Automatic Generator for Systolic Arrays
2006 2nd International Conference on Information & Communication Technologies
Co-Authors: I. N. Tselepis, M. P. Bekakos
Abstract: Systolic Arrays speed up scientific computations with inherent parallelization, by exploiting massive data pipeline parallelism. In addition, they offer short and problem-size-independent signal paths, predictable performance, scalability, and simple design and test. In this paper, a server-based software tool for the automatic generation of VHDL code describing Systolic Array topologies is presented. The input parameters of the tool are several essential factors for the architectural description of Systolic Arrays (SAs): the interconnection topology of the Systolic array (linear, mesh, or hex-connected), the size of the Systolic array (the number of processing elements (PEs) in each dimension), the function of each PE (the relation between its output and input ports), and the bit length of the PE ports (the data word size of every port).
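The tool itself is not public, but the core idea of parametric HDL generation can be sketched as string templating: port widths, names, and topology sizes become template parameters. The entity name and function below are hypothetical; this only illustrates the bit-length parameter from the abstract, not the tool's actual output.

```python
# Minimal sketch of parametric VHDL generation: a PE entity whose port
# width is driven by a "bitlength" input parameter, as in the abstract.
PE_ENTITY_TEMPLATE = """\
entity {name} is
  port (
    a_in  : in  std_logic_vector({msb} downto 0);
    b_in  : in  std_logic_vector({msb} downto 0);
    a_out : out std_logic_vector({msb} downto 0);
    b_out : out std_logic_vector({msb} downto 0)
  );
end {name};
"""

def generate_pe_entity(name="systolic_pe", bitlength=8):
    # Emit a VHDL entity declaration for one PE with the requested port width.
    return PE_ENTITY_TEMPLATE.format(name=name, msb=bitlength - 1)
```

A full generator would similarly template the PE architecture body (the PE function) and instantiate PEs in a linear, mesh, or hexagonal interconnect.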
Igor Z. Milovanovic - One of the best experts on this subject based on the ideXlab platform.
-
design of linear Systolic Arrays for matrix multiplication
Advances in Electrical and Computer Engineering, 2014
Co-Authors: Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev, Tatjana R. Nikolic
Abstract: This paper presents an architecture for matrix multiplication optimized to be integrated as an accelerator unit into a host computer. Two linear Systolic Arrays with unidirectional data flow ...
-
Orthogonal fault-tolerant Systolic Arrays for matrix multiplication
Microelectronics Reliability, 2011
Co-Authors: Igor Z. Milovanovic, Emina I. Milovanovic, Mile K. Stojcev, M. P. Bekakos
Abstract: A systematic approach for designing one class of fault-tolerant Systolic Arrays (FTSAs) with orthogonal interconnects and unidirectional data flow, the Orthogonal Unidirectional Systolic Array (OUSA), for multiplication of rectangular matrices is presented in this paper. The method employs space-time redundancy to achieve fault tolerance. By applying the proposed systematic design procedure, four different Systolic Arrays of the OUSA type are obtained. All the Arrays can tolerate single transient errors, and the majority of multiple errors with high probability. To provide high bandwidth in data access, special hardware called an address generator unit was designed. Hardware complexity and performance gains achieved at the higher (system, algorithm, and architecture) design levels were analyzed. The obtained results show that with n² + 2n processing elements the total execution time of the fault-tolerant algorithm is 6n + 3 time units; the hardware overhead due to fault tolerance ranges from 6.25% down to 0.8%, while the time overhead is 50%. In addition, the hardware-implemented address generation unit reduces the total execution time of the algorithm almost five times compared with software address calculations.
-
two level pipelined Systolic Arrays for matrix vector multiplication
Journal of Systems Architecture, 1998
Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Milorad Tosic, Mile K. Stojcev
Abstract: Novel two-level pipelined linear Systolic Arrays for matrix-vector multiplication are proposed. The number of processing elements in the proposed Arrays is reduced to half of that in the existing Arrays. An area-time (AT) criterion is used to compare the proposed Arrays with the fastest existing one.
-
The Design of Optimal Planar Systolic Arrays for Matrix Multiplication
Computers & Mathematics with Applications, 1997
Co-Authors: Ivan Milentijevic, Emina I. Milovanovic, Igor Z. Milovanovic, Mile K. Stojcev
Abstract: The objective of this paper is to provide a systematic methodology for the design of space-time optimal pure planar Systolic Arrays for matrix multiplication. The procedure is based on the data dependence approach. By the described procedure, we obtain ten different Systolic Arrays, denoted S1 to S10, classified into three classes according to the interconnection patterns between the processing elements. Common properties of all the Systolic array designs are: each Systolic array consists of n² processing elements, near-neighbour communications, and an active execution time of 3n − 2 time units. Compared with designs found in the literature, our procedure always leads to Systolic Arrays with the optimal number of processing elements. The improvement in the space domain is not achieved at the cost of execution time or PE complexity. We present a mathematically rigorous procedure that gives the exact ordering of the input matrix elements at the beginning of the computation. Examples illustrating the methodology are shown.
-
Matrix multiplication on non-planar Systolic Arrays
4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS'99 (Cat. No.99EX365)
Co-Authors: T. I. Tokić, Emina I. Milovanovic, Igor Z. Milovanovic, N. M. Novakovic, M. K. Stojcev
Abstract: A modification of the standard design procedure for mapping nested loop algorithms onto Systolic Arrays is described in this article. This modification enables the authors to obtain non-planar Systolic Arrays for matrix multiplication with an optimal number of processing elements for a given problem size. The modification is based on the composition of two linear mappings.
Raehong Park - One of the best experts on this subject based on the ideXlab platform.
-
unified Systolic Arrays for computation of the dct dst dht
IEEE Transactions on Circuits and Systems for Video Technology, 1997
Co-Authors: Sung Bum Pan, Raehong Park
Abstract: We propose unified Systolic Arrays for the computation of the one-dimensional (1-D) and two-dimensional (2-D) discrete cosine transform/discrete sine transform/discrete Hartley transform (DCT/DST/DHT). By decomposing the transforms into even- and odd-numbered frequency samples, the proposed architecture computes the 1-D DCT/DST/DHT. Compared with conventional methods, the proposed Systolic Arrays exhibit advantages in terms of the number of PEs and latency. We generalize the proposed structure for the computation of the 2-D DCT/DST/DHT. The unified Systolic Arrays can also be employed for computation of the inverse transforms (IDCT/IDST/IDHT).
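The even/odd frequency decomposition the abstract mentions can be seen concretely for the DCT-II case: the even-indexed coefficients of a length-N DCT-II equal the length-N/2 DCT-II of the folded sequence x[n] + x[N−1−n]. The sketch below verifies this identity numerically with an unnormalized direct DCT; it illustrates the decomposition only, not the paper's systolic architecture.

```python
import math

def dct2(x):
    """Direct (unnormalized) DCT-II: X[k] = sum_n x[n] cos(pi (2n+1) k / (2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct2_even_via_fold(x):
    """Even-indexed DCT-II coefficients of x, obtained from a half-length DCT.

    Because cos(pi (2(N-1-n)+1) k / N) = cos(pi (2n+1) k / N), pairing n with
    N-1-n folds the signal, so X[2k] of x equals the DCT-II of the fold.
    """
    N = len(x)
    folded = [x[n] + x[N - 1 - n] for n in range(N // 2)]
    return dct2(folded)
```

Decompositions like this halve the transform length per branch, which is what lets the unified array cut the PE count and latency.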
-
vlsi architectures for block matching algorithms using Systolic Arrays
IEEE Transactions on Circuits and Systems for Video Technology, 1996
Co-Authors: Sung Bum Pan, Seung Soo Chae, Raehong Park
Abstract: We investigate the hardware implementation of block matching algorithms (BMAs) for motion estimation in moving sequences. Using Systolic Arrays, we propose VLSI architectures for the two-stage BMA and the full-search (FS) BMA. The two-stage BMA, using integral projections, greatly reduces the computational complexity while achieving performance comparable to that of the FS BMA. The proposed hardware architectures for the two-stage BMA and FS BMA are faster than conventional hardware architectures, with lower hardware complexity. Also, the proposed architecture for the first stage of the two-stage BMA is modeled in VHDL and simulated. Simulation results confirm the functional validity of the proposed architecture.
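The computation that the FS BMA hardware accelerates is a straightforward SAD minimization, sketched below as a scalar software reference (the paper's contribution is the systolic implementation of this search, not this model):

```python
def full_search_bma(ref, cur, bx, by, bsize, srange):
    """Exhaustive block matching over a square search window.

    Finds the motion vector (dy, dx) minimizing the sum of absolute
    differences (SAD) between the current frame's block at (bx, by) and
    displaced candidate blocks in the reference frame.
    """
    H, W = len(ref), len(ref[0])
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y0, x0 = by + dy, bx + dx
            if y0 < 0 or x0 < 0 or y0 + bsize > H or x0 + bsize > W:
                continue  # candidate block falls outside the reference frame
            sad = sum(abs(cur[by + i][bx + j] - ref[y0 + i][x0 + j])
                      for i in range(bsize) for j in range(bsize))
            if sad < best_sad:
                best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad

# Synthetic example: the current frame is the reference shifted by (1, 2).
ref = [[16 * y + x for x in range(12)] for y in range(12)]
cur = [[ref[y + 1][x + 2] if y + 1 < 12 and x + 2 < 12 else 0
        for x in range(12)] for y in range(12)]
```

The two-stage BMA prunes this search by first matching row/column sums (integral projections) before computing full SADs on surviving candidates.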