The experts below are selected from a list of 309 experts worldwide, ranked by the ideXlab platform.
Jose E Roman – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
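The key idea — each process materializes its local block of the Cauchy-like matrix directly from the generator vectors, so no matrix entries travel over the network — can be sketched as follows. This is a simplified rank-one illustration with made-up names (`u`, `v`, `x`, `y` for the generators and node sets), not the PSMMA implementation itself.

```python
import numpy as np

def cauchy_like_block(u, v, x, y, rows, cols):
    """Local block of the Cauchy-like matrix C[i, j] = u[i] * v[j] / (x[i] - y[j]),
    built directly from the generator vectors: no matrix entries are
    received from other processes."""
    i = np.asarray(rows)[:, None]
    j = np.asarray(cols)[None, :]
    return (u[i] * v[j]) / (x[i] - y[j])

# Each process holds only the O(n) generator data and builds the block it
# needs for its local part of the matrix-matrix multiplication.
n = 6
rng = np.random.default_rng(0)
u, v = rng.standard_normal(n), rng.standard_normal(n)
x = np.arange(n, dtype=float)        # the two node sets must be disjoint
y = np.arange(n, dtype=float) + 0.5
block = cauchy_like_block(u, v, x, y, rows=range(0, 3), cols=range(3, 6))
print(block.shape)  # (3, 3)
```

In the paper's setting the generators have a small rank rather than rank one, and the low-rank structure is further exploited to compress the multiplication, but the communication-free construction above is the essential point.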

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Xia Liao – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Marc Moreno Maza – One of the best experts on this subject based on the ideXlab platform.

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
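As a concrete reference point, here is a minimal sketch of the plain (schoolbook) algorithm alongside Karatsuba's method, the simplest instance of the Toom-Cook family. Both use only ring operations, so they work over any coefficient ring; this is illustrative code, not the authors' Cilk++ implementation.

```python
from itertools import zip_longest

def plain_mul(a, b):
    """Schoolbook product of coefficient lists (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def add_poly(a, b):
    return [x + y for x, y in zip_longest(a, b, fillvalue=0)]

def karatsuba_mul(a, b):
    """Karatsuba (the Toom-2 case of Toom-Cook): three half-size products
    instead of four, again using only ring operations."""
    n = max(len(a), len(b))
    k = n // 2
    if n <= 4 or len(a) <= k or len(b) <= k:
        return plain_mul(a, b)          # base case / very unbalanced inputs
    a0, a1 = a[:k], a[k:]               # a = a0 + a1 * x^k
    b0, b1 = b[:k], b[k:]
    low = karatsuba_mul(a0, b0)         # a0 * b0
    high = karatsuba_mul(a1, b1)        # a1 * b1
    mid = karatsuba_mul(add_poly(a0, a1), add_poly(b0, b1))
    # (a0 + a1)(b0 + b1) - a0*b0 - a1*b1 = a0*b1 + a1*b0
    mid = [m - l - h for m, l, h in zip_longest(mid, low, high, fillvalue=0)]
    out = [0] * (len(a) + len(b) - 1)   # low + mid * x^k + high * x^(2k)
    for i, v in enumerate(low):
        out[i] += v
    for i, v in enumerate(mid):
        out[i + k] += v
    for i, v in enumerate(high):
        out[i + 2 * k] += v
    return out

print(plain_mul([1, 2], [3, 4]))  # [3, 10, 8]
```

The cache behavior the paper analyzes comes from how these recursions traverse the coefficient arrays, which the asymptotic operation count alone does not capture.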

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

PASCO – Cache-friendly sparse matrix-vector multiplication
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation – PASCO '10, 2010. Co-authors: Sardar Anisul Haque, Shahadat Hossain, Marc Moreno Maza.
Abstract: Sparse matrix-vector multiplication, or SpMXV, is an important kernel in scientific computing. For example, the conjugate gradient method (CG) is an iterative linear system solver in which multiplication of the coefficient matrix A with a dense vector x is the main computational step, accounting for as much as 90% of the overall running time. Though the total number of arithmetic operations (involving nonzero entries only) to compute Ax is fixed, reducing the probability of cache misses per operation is still a challenging area of research. This preprocessing of the matrix is done once, and its cost is amortized over repeated multiplications. Computers that employ cache memory to improve the speed of data access rely on reuse of the data brought into the cache. The challenge is to exploit data locality, especially for unstructured problems: modeling data locality in this context is hard.
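For reference, a minimal sketch of the SpMXV kernel in the common compressed sparse row (CSR) layout. The cache-miss behavior studied in the paper comes from the irregular, column-index-driven accesses to x; this is generic illustrative code, not the authors' implementation.

```python
def spmxv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR form: the nonzeros of row i live in
    values[row_ptr[i]:row_ptr[i+1]], with their columns in col_idx."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]   # irregular access into x
        y[i] = s
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmxv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 5.0]
```

Reordering the matrix so that nearby rows touch nearby entries of x is exactly the kind of one-time preprocessing whose cost is amortized over the many multiplications of an iterative solver such as CG.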
Éric Schost – One of the best experts on this subject based on the ideXlab platform.

A simple and fast online power series multiplication and its analysis
Journal of Symbolic Computation, 2016. Co-authors: Romain Lebreton, Éric Schost.
Abstract: This paper focuses on online (or relaxed) algorithms for the multiplication of power series over a field, and on their analysis. We propose a new online multiplication algorithm that uses middle and short products of polynomials as building blocks, and we give the first precise analysis of the arithmetic complexity of various online multiplications. Our algorithm is faster than Fischer and Stockmeyer's by a constant factor; this is confirmed by our experimental results.
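The defining property of an online (relaxed) multiplication is that the n-th output coefficient is available as soon as the first n+1 coefficients of each input have been read. The quadratic sketch below shows only that interface; the paper's algorithm achieves the same online property at a much lower cost via middle and short products.

```python
def online_mul(a_stream, b_stream, n_terms):
    """Naive online power series product: coefficient c_n is emitted as
    soon as a_0..a_n and b_0..b_n have been read.  This direct convolution
    costs O(n^2) overall; it only illustrates the online interface."""
    a, b, c = [], [], []
    for n in range(n_terms):
        a.append(next(a_stream))     # read exactly one new coefficient
        b.append(next(b_stream))     # of each input per step
        c.append(sum(a[i] * b[n - i] for i in range(n + 1)))
    return c

# (1/(1-x))^2 = (1 + x + x^2 + ...)^2 has coefficients 1, 2, 3, ...
print(online_mul(iter([1] * 8), iter([1] * 8), 5))  # [1, 2, 3, 4, 5]
```

An offline algorithm (such as a single FFT-based product) needs all input coefficients up front; the online constraint is what makes these algorithms suitable for power series computations whose inputs depend on earlier outputs.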

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
Tomoko Yonemura – One of the best experts on this subject based on the ideXlab platform.

RNS Montgomery reduction algorithms using quadratic residuosity
Journal of Cryptographic Engineering, 2019. Co-authors: Shinichi Kawamura, Yuichi Komano, Hideo Shimizu, Tomoko Yonemura.
Abstract: The residue number system (RNS) is a method for representing an integer as an n-tuple of its residues with respect to a given base. Since RNS has inherent parallelism, it is actively researched as a way to implement faster processing systems for public-key cryptography. This paper proposes new RNS Montgomery reduction algorithms, QRNSs, whose main part is two matrix multiplications. Letting n be the size of the base set, the number of unit modular multiplications in the proposed algorithms is evaluated as 2n^2 + n. This is achieved by placing a new restriction on the RNS base, namely that its elements should have a certain quadratic residuosity. This makes it possible to remove some multiplication steps from conventional algorithms, so the new algorithms are simpler and have higher regularity than conventional ones. Our experiments confirm that there are sufficient candidates for RNS bases meeting the quadratic residuosity requirements.
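The representation underlying the paper can be sketched as follows: with a base of pairwise-coprime moduli, multiplication splits into independent per-channel modular multiplications (the source of RNS parallelism), and the integer is recovered via the Chinese remainder theorem. The base below is a toy example; the paper's quadratic-residuosity restriction on base elements and the Montgomery reduction itself are not modeled here.

```python
from math import prod

def to_rns(x, base):
    """Represent x as the tuple of its residues modulo the (pairwise-coprime) base."""
    return tuple(x % m for m in base)

def rns_mul(xs, ys, base):
    """Multiplication is channel-wise: each residue channel works independently,
    with no carries between channels -- the inherent parallelism of RNS."""
    return tuple((xi * yi) % m for xi, yi, m in zip(xs, ys, base))

def from_rns(res, base):
    """Chinese-remainder reconstruction; valid while the value stays below prod(base)."""
    M = prod(base)
    return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, base)) % M

base = (13, 17, 19)   # toy base; real RNS bases use machine-word-sized moduli
a, b = 100, 41
assert from_rns(rns_mul(to_rns(a, base), to_rns(b, base), base), base) == a * b
```

Montgomery reduction enters because keeping intermediate values below the product of the base requires reducing after each multiplication, and doing that reduction entirely in RNS is the expensive step the paper's algorithms streamline.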