The experts below are selected from a list of 309 experts worldwide, ranked by the ideXlab platform.
Jose E Roman – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
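The key idea — each process materializes its local block of the Cauchy-like matrix directly from the generator vectors, so no matrix entries travel over the network — can be sketched as follows. This is a simplified rank-one illustration with made-up names (`u`, `v`, `x`, `y` for the generators and node sets), not the PSMMA implementation itself.

```python
import numpy as np

def cauchy_like_block(u, v, x, y, rows, cols):
    """Local block of the Cauchy-like matrix C[i, j] = u[i] * v[j] / (x[i] - y[j]),
    built directly from the generator vectors: no matrix entries are
    received from other processes."""
    i = np.asarray(rows)[:, None]
    j = np.asarray(cols)[None, :]
    return (u[i] * v[j]) / (x[i] - y[j])

# Each process holds only the O(n) generator data and builds the block it
# needs for its local part of the matrix-matrix multiplication.
n = 6
rng = np.random.default_rng(0)
u, v = rng.standard_normal(n), rng.standard_normal(n)
x = np.arange(n, dtype=float)        # the two node sets must be disjoint
y = np.arange(n, dtype=float) + 0.5
block = cauchy_like_block(u, v, x, y, rows=range(0, 3), cols=range(3, 6))
print(block.shape)  # (3, 3)
```

In the paper's setting the generators have a small rank rather than rank one, and the low-rank structure is further exploited to compress the multiplication, but the communication-free construction above is the essential point.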

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Xia Liao – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Marc Moreno Maza – One of the best experts on this subject based on the ideXlab platform.

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
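As a concrete reference point, here is a minimal sketch of the plain (schoolbook) algorithm alongside Karatsuba's method, the simplest instance of the Toom-Cook family. Both use only ring operations, so they work over any coefficient ring; this is illustrative code, not the authors' Cilk++ implementation.

```python
from itertools import zip_longest

def plain_mul(a, b):
    """Schoolbook product of coefficient lists (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def add_poly(a, b):
    return [x + y for x, y in zip_longest(a, b, fillvalue=0)]

def karatsuba_mul(a, b):
    """Karatsuba (the Toom-2 case of Toom-Cook): three half-size products
    instead of four, again using only ring operations."""
    n = max(len(a), len(b))
    k = n // 2
    if n <= 4 or len(a) <= k or len(b) <= k:
        return plain_mul(a, b)          # base case / very unbalanced inputs
    a0, a1 = a[:k], a[k:]               # a = a0 + a1 * x^k
    b0, b1 = b[:k], b[k:]
    low = karatsuba_mul(a0, b0)         # a0 * b0
    high = karatsuba_mul(a1, b1)        # a1 * b1
    mid = karatsuba_mul(add_poly(a0, a1), add_poly(b0, b1))
    # (a0 + a1)(b0 + b1) - a0*b0 - a1*b1 = a0*b1 + a1*b0
    mid = [m - l - h for m, l, h in zip_longest(mid, low, high, fillvalue=0)]
    out = [0] * (len(a) + len(b) - 1)   # low + mid * x^k + high * x^(2k)
    for i, v in enumerate(low):
        out[i] += v
    for i, v in enumerate(mid):
        out[i + k] += v
    for i, v in enumerate(high):
        out[i + 2 * k] += v
    return out

print(plain_mul([1, 2], [3, 4]))  # [3, 10, 8]
```

The cache behavior the paper analyzes comes from how these recursions traverse the coefficient arrays, which the asymptotic operation count alone does not capture.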

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

PASCO – Cache-friendly sparse matrix-vector multiplication
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation – PASCO '10, 2010. Co-authors: Sardar Anisul Haque, Shahadat Hossain, Marc Moreno Maza.
Abstract: Sparse matrix-vector multiplication, or SpMXV, is an important kernel in scientific computing. For example, the conjugate gradient method (CG) is an iterative linear system solver in which multiplication of the coefficient matrix A with a dense vector x is the main computational step, accounting for as much as 90% of the overall running time. Though the total number of arithmetic operations (involving nonzero entries only) to compute Ax is fixed, reducing the probability of cache misses per operation is still a challenging area of research. This preprocessing of the matrix is done once, and its cost is amortized over repeated multiplications. Computers that employ cache memory to improve the speed of data access rely on reuse of the data brought into the cache. The challenge is to exploit data locality, especially for unstructured problems: modeling data locality in this context is hard.
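For reference, a minimal sketch of the SpMXV kernel in the common compressed sparse row (CSR) layout. The cache-miss behavior studied in the paper comes from the irregular, column-index-driven accesses to x; this is generic illustrative code, not the authors' implementation.

```python
def spmxv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR form: the nonzeros of row i live in
    values[row_ptr[i]:row_ptr[i+1]], with their columns in col_idx."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]   # irregular access into x
        y[i] = s
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmxv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 5.0]
```

Reordering the matrix so that nearby rows touch nearby entries of x is exactly the kind of one-time preprocessing whose cost is amortized over the many multiplications of an iterative solver such as CG.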
Éric Schost – One of the best experts on this subject based on the ideXlab platform.

A simple and fast online power series multiplication and its analysis
Journal of Symbolic Computation, 2016. Co-authors: Romain Lebreton, Éric Schost.
Abstract: This paper focuses on online (or relaxed) algorithms for the multiplication of power series over a field, and on their analysis. We propose a new online multiplication algorithm that uses middle and short products of polynomials as building blocks, and we give the first precise analysis of the arithmetic complexity of various online multiplications. Our algorithm is faster than Fischer and Stockmeyer's by a constant factor; this is confirmed by our experimental results.
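The defining property of an online (relaxed) multiplication is that the n-th output coefficient is available as soon as the first n+1 coefficients of each input have been read. The quadratic sketch below shows only that interface; the paper's algorithm achieves the same online property at a much lower cost via middle and short products.

```python
def online_mul(a_stream, b_stream, n_terms):
    """Naive online power series product: coefficient c_n is emitted as
    soon as a_0..a_n and b_0..b_n have been read.  This direct convolution
    costs O(n^2) overall; it only illustrates the online interface."""
    a, b, c = [], [], []
    for n in range(n_terms):
        a.append(next(a_stream))     # read exactly one new coefficient
        b.append(next(b_stream))     # of each input per step
        c.append(sum(a[i] * b[n - i] for i in range(n + 1)))
    return c

# (1/(1-x))^2 = (1 + x + x^2 + ...)^2 has coefficients 1, 2, 3, ...
print(online_mul(iter([1] * 8), iter([1] * 8), 5))  # [1, 2, 3, 4, 5]
```

An offline algorithm (such as a single FFT-based product) needs all input coefficients up front; the online constraint is what makes these algorithms suitable for power series computations whose inputs depend on earlier outputs.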

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
Tomoko Yonemura – One of the best experts on this subject based on the ideXlab platform.

RNS Montgomery reduction algorithms using quadratic residuosity
Journal of Cryptographic Engineering, 2019. Co-authors: Shinichi Kawamura, Yuichi Komano, Hideo Shimizu, Tomoko Yonemura.
Abstract: The residue number system (RNS) is a method for representing an integer as an n-tuple of its residues with respect to a given base. Since RNS has inherent parallelism, it is actively researched as a way to implement faster processing systems for public-key cryptography. This paper proposes new RNS Montgomery reduction algorithms, QRNSs, whose main part is two matrix multiplications. Letting n be the size of the base set, the number of unit modular multiplications in the proposed algorithms is evaluated as 2n^2 + n. This is achieved by placing a new restriction on the RNS base, namely that its elements should have a certain quadratic residuosity. This makes it possible to remove some multiplication steps from conventional algorithms, so the new algorithms are simpler and have higher regularity than conventional ones. Our experiments confirm that there are sufficient candidates for RNS bases meeting the quadratic residuosity requirements.
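The representation underlying the paper can be sketched as follows: with a base of pairwise-coprime moduli, multiplication splits into independent per-channel modular multiplications (the source of RNS parallelism), and the integer is recovered via the Chinese remainder theorem. The base below is a toy example; the paper's quadratic-residuosity restriction on base elements and the Montgomery reduction itself are not modeled here.

```python
from math import prod

def to_rns(x, base):
    """Represent x as the tuple of its residues modulo the (pairwise-coprime) base."""
    return tuple(x % m for m in base)

def rns_mul(xs, ys, base):
    """Multiplication is channel-wise: each residue channel works independently,
    with no carries between channels -- the inherent parallelism of RNS."""
    return tuple((xi * yi) % m for xi, yi, m in zip(xs, ys, base))

def from_rns(res, base):
    """Chinese-remainder reconstruction; valid while the value stays below prod(base)."""
    M = prod(base)
    return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, base)) % M

base = (13, 17, 19)   # toy base; real RNS bases use machine-word-sized moduli
a, b = 100, 41
assert from_rns(rns_mul(to_rns(a, base), to_rns(b, base), base), base) == a * b
```

Montgomery reduction enters because keeping intermediate values below the product of the base requires reducing after each multiplication, and doing that reduction entirely in RNS is the expensive step the paper's algorithms streamline.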