The experts below are selected from a list of 309 experts worldwide, ranked by the ideXlab platform.
Jose E Roman – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
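The key idea — each process materializes its local block of the Cauchy-like matrix directly from the generator vectors, so no matrix entries travel over the network — can be sketched as follows. This is a simplified rank-one illustration with made-up names (`u`, `v`, `x`, `y` for the generators and node sets), not the PSMMA implementation itself.

```python
import numpy as np

def cauchy_like_block(u, v, x, y, rows, cols):
    """Local block of the Cauchy-like matrix C[i, j] = u[i] * v[j] / (x[i] - y[j]),
    built directly from the generator vectors: no matrix entries are
    received from other processes."""
    i = np.asarray(rows)[:, None]
    j = np.asarray(cols)[None, :]
    return (u[i] * v[j]) / (x[i] - y[j])

# Each process holds only the O(n) generator data and builds the block it
# needs for its local part of the matrix-matrix multiplication.
n = 6
rng = np.random.default_rng(0)
u, v = rng.standard_normal(n), rng.standard_normal(n)
x = np.arange(n, dtype=float)        # the two node sets must be disjoint
y = np.arange(n, dtype=float) + 0.5
block = cauchy_like_block(u, v, x, y, rows=range(0, 3), cols=range(3, 6))
print(block.shape)  # (3, 3)
```

In the paper's setting the generators have a small rank rather than rank one, and the low-rank structure is further exploited to compress the multiplication, but the communication-free construction above is the essential point.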

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Xia Liao – One of the best experts on this subject based on the ideXlab platform.

A Parallel Structured Divide-and-Conquer Algorithm for Symmetric Tridiagonal Eigenvalue Problems
IEEE Transactions on Parallel and Distributed Systems, 2021. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this article, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [16], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.

A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
arXiv: Mathematical Software, 2020. Co-authors: Xia Liao, Jose E Roman.
Abstract: In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and a parallel structured matrix multiplication algorithm called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in these multiplications is a rank-structured Cauchy-like matrix. By exploiting this property, PSMMA constructs the local matrices from the generators of the Cauchy-like matrices without any communication, and further reduces the computation cost by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable, scaling to at least 4096 processes. PSDC has better scalability than PHDC, proposed in [J. Comput. Appl. Math. 344 (2018) 512–520], which only scaled to 300 processes for the same matrices. Compared with PDSTEDC in ScaLAPACK, PSDC is always faster and achieves 1.4x–1.6x speedup for some matrices with few deflations. PSDC is also comparable with ELPA: PSDC is faster than ELPA when using few processes and a little slower when using many processes.
Marc Moreno Maza – One of the best experts on this subject based on the ideXlab platform.

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
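As a concrete reference point, here is a minimal sketch of the plain (schoolbook) algorithm alongside Karatsuba's method, the simplest instance of the Toom-Cook family. Both use only ring operations, so they work over any coefficient ring; this is illustrative code, not the authors' Cilk++ implementation.

```python
from itertools import zip_longest

def plain_mul(a, b):
    """Schoolbook product of coefficient lists (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def add_poly(a, b):
    return [x + y for x, y in zip_longest(a, b, fillvalue=0)]

def karatsuba_mul(a, b):
    """Karatsuba (the Toom-2 case of Toom-Cook): three half-size products
    instead of four, again using only ring operations."""
    n = max(len(a), len(b))
    k = n // 2
    if n <= 4 or len(a) <= k or len(b) <= k:
        return plain_mul(a, b)          # base case / very unbalanced inputs
    a0, a1 = a[:k], a[k:]               # a = a0 + a1 * x^k
    b0, b1 = b[:k], b[k:]
    low = karatsuba_mul(a0, b0)         # a0 * b0
    high = karatsuba_mul(a1, b1)        # a1 * b1
    mid = karatsuba_mul(add_poly(a0, a1), add_poly(b0, b1))
    # (a0 + a1)(b0 + b1) - a0*b0 - a1*b1 = a0*b1 + a1*b0
    mid = [m - l - h for m, l, h in zip_longest(mid, low, high, fillvalue=0)]
    out = [0] * (len(a) + len(b) - 1)   # low + mid * x^k + high * x^(2k)
    for i, v in enumerate(low):
        out[i] += v
    for i, v in enumerate(mid):
        out[i + k] += v
    for i, v in enumerate(high):
        out[i + 2 * k] += v
    return out

print(plain_mul([1, 2], [3, 4]))  # [3, 10, 8]
```

The cache behavior the paper analyzes comes from how these recursions traverse the coefficient arrays, which the asymptotic operation count alone does not capture.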

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

PASCO – Cache-friendly sparse matrix-vector multiplication
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation – PASCO '10, 2010. Co-authors: Sardar Anisul Haque, Shahadat Hossain, Marc Moreno Maza.
Abstract: Sparse matrix-vector multiplication, or SpMXV, is an important kernel in scientific computing. For example, the conjugate gradient method (CG) is an iterative linear system solver in which multiplication of the coefficient matrix A with a dense vector x is the main computational step, accounting for as much as 90% of the overall running time. Though the total number of arithmetic operations (involving nonzero entries only) to compute Ax is fixed, reducing the probability of cache misses per operation is still a challenging area of research. This preprocessing of the matrix is done once, and its cost is amortized over repeated multiplications. Computers that employ cache memory to improve the speed of data access rely on reuse of the data brought into the cache. The challenge is to exploit data locality, especially for unstructured problems: modeling data locality in this context is hard.
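For reference, a minimal sketch of the SpMXV kernel in the common compressed sparse row (CSR) layout. The cache-miss behavior studied in the paper comes from the irregular, column-index-driven accesses to x; this is generic illustrative code, not the authors' implementation.

```python
def spmxv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR form: the nonzeros of row i live in
    values[row_ptr[i]:row_ptr[i+1]], with their columns in col_idx."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]   # irregular access into x
        y[i] = s
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmxv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 5.0]
```

Reordering the matrix so that nearby rows touch nearby entries of x is exactly the kind of one-time preprocessing whose cost is amortized over the many multiplications of an iterative solver such as CG.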
Éric Schost – One of the best experts on this subject based on the ideXlab platform.

A simple and fast online power series multiplication and its analysis
Journal of Symbolic Computation, 2016. Co-authors: Romain Lebreton, Éric Schost.
Abstract: This paper focuses on online (or relaxed) algorithms for the multiplication of power series over a field, and on their analysis. We propose a new online multiplication algorithm that uses middle and short products of polynomials as building blocks, and we give the first precise analysis of the arithmetic complexity of various online multiplications. Our algorithm is faster than Fischer and Stockmeyer's by a constant factor; this is confirmed by our experimental results.
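The defining property of an online (relaxed) multiplication is that the n-th output coefficient is available as soon as the first n+1 coefficients of each input have been read. The quadratic sketch below shows only that interface; the paper's algorithm achieves the same online property at a much lower cost via middle and short products.

```python
def online_mul(a_stream, b_stream, n_terms):
    """Naive online power series product: coefficient c_n is emitted as
    soon as a_0..a_n and b_0..b_n have been read.  This direct convolution
    costs O(n^2) overall; it only illustrates the online interface."""
    a, b, c = [], [], []
    for n in range(n_terms):
        a.append(next(a_stream))     # read exactly one new coefficient
        b.append(next(b_stream))     # of each input per step
        c.append(sum(a[i] * b[n - i] for i in range(n + 1)))
    return c

# (1/(1-x))^2 = (1 + x + x^2 + ...)^2 has coefficients 1, 2, 3, ...
print(online_mul(iter([1] * 8), iter([1] * 8), 5))  # [1, 2, 3, 4, 5]
```

An offline algorithm (such as a single FFT-based product) needs all input coefficients up front; the online constraint is what makes these algorithms suitable for power series computations whose inputs depend on earlier outputs.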

Complexity and performance results for non-FFT-based univariate polynomial multiplication
Advances in Mathematical and Computational Methods: Addressing Modern Challenges of Science, Technology and Society, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].

Complexity and performance results for non-FFT-based univariate polynomial multiplication
ACM Communications in Computer Algebra, 2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++.

Complexity and Performance Results for Non-FFT-Based Univariate Polynomial Multiplication
2011. Co-authors: Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, Éric Schost.
Abstract: Today's parallel hardware architectures and computer memory hierarchies make it necessary to revisit fundamental algorithms that were often designed with algebraic complexity as the main complexity measure and sequential running time as the main performance counter. This study is devoted to two algorithms for univariate polynomial multiplication that are independent of the coefficient ring: the plain and Toom-Cook univariate multiplications. We analyze their cache complexity and report on their parallel implementations in Cilk++ [1].
Tomoko Yonemura – One of the best experts on this subject based on the ideXlab platform.

RNS Montgomery reduction algorithms using quadratic residuosity
Journal of Cryptographic Engineering, 2019. Co-authors: Shinichi Kawamura, Yuichi Komano, Hideo Shimizu, Tomoko Yonemura.
Abstract: The residue number system (RNS) is a method for representing an integer as an n-tuple of its residues with respect to a given base. Since RNS has inherent parallelism, it is actively researched as a way to implement faster processing systems for public-key cryptography. This paper proposes new RNS Montgomery reduction algorithms, QRNSs, whose main part is two matrix multiplications. Letting n be the size of the base set, the number of unit modular multiplications in the proposed algorithms is evaluated as 2n^2 + n. This is achieved by placing a new restriction on the RNS base, namely that its elements should have a certain quadratic residuosity. This makes it possible to remove some multiplication steps from conventional algorithms, so the new algorithms are simpler and have higher regularity than conventional ones. Our experiments confirm that there are sufficient candidates for RNS bases meeting the quadratic residuosity requirements.
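The representation underlying the paper can be sketched as follows: with a base of pairwise-coprime moduli, multiplication splits into independent per-channel modular multiplications (the source of RNS parallelism), and the integer is recovered via the Chinese remainder theorem. The base below is a toy example; the paper's quadratic-residuosity restriction on base elements and the Montgomery reduction itself are not modeled here.

```python
from math import prod

def to_rns(x, base):
    """Represent x as the tuple of its residues modulo the (pairwise-coprime) base."""
    return tuple(x % m for m in base)

def rns_mul(xs, ys, base):
    """Multiplication is channel-wise: each residue channel works independently,
    with no carries between channels -- the inherent parallelism of RNS."""
    return tuple((xi * yi) % m for xi, yi, m in zip(xs, ys, base))

def from_rns(res, base):
    """Chinese-remainder reconstruction; valid while the value stays below prod(base)."""
    M = prod(base)
    return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, base)) % M

base = (13, 17, 19)   # toy base; real RNS bases use machine-word-sized moduli
a, b = 100, 41
assert from_rns(rns_mul(to_rns(a, base), to_rns(b, base), base), base) == a * b
```

Montgomery reduction enters because keeping intermediate values below the product of the base requires reducing after each multiplication, and doing that reduction entirely in RNS is the expensive step the paper's algorithms streamline.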