Distributed Memory

The Experts below are selected from a list of 98,319 Experts worldwide, ranked by the ideXlab platform.

D.w. Walker - One of the best experts on this subject based on the ideXlab platform.

  • Parallel matrix transpose algorithms on Distributed Memory concurrent computers
    1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    This paper describes parallel matrix transpose algorithms on Distributed Memory concurrent processors. We assume that the matrix is Distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
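For illustration, here is a minimal sketch of the non-blocking point-to-point pattern the abstract relies on, written with today's MPI calls rather than the communication layer used in the paper; the ring neighbors and buffer size are assumptions made only for the example.

```c
/* Minimal sketch of non-blocking point-to-point communication: each
 * process posts its receive and send up front, can do other work while
 * messages to different processors are in flight, then waits for
 * completion. Buffer size and ring-neighbor pattern are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1024;                        /* illustrative block size */
    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

    int right = (rank + 1) % nprocs;           /* destination rank */
    int left  = (rank - 1 + nprocs) % nprocs;  /* source rank */

    MPI_Request reqs[2];
    /* Post the receive and the send without blocking. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Local work on other blocks could be overlapped here. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```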

  • The design of scalable software libraries for Distributed Memory concurrent computers
    Proceedings of 8th International Parallel Processing Symposium, 1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    Describes the design of ScaLAPACK, a scalable software library for performing dense and banded linear algebra computations on Distributed Memory concurrent computers. The specification of the data distribution has important consequences for interprocessor communication and load balance, and hence is a major factor in determining performance and scalability of the library routines. The block cyclic data distribution is adopted as a simple, yet general purpose, way of decomposing block-partitioned matrices. Distributed Memory versions of the Level 3 BLAS provide an easy and convenient way of implementing the ScaLAPACK routines.
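As an aside on the block cyclic distribution mentioned above, the index mapping it induces along one matrix dimension can be sketched as follows; the block size and process count are arbitrary example values, and the 2-D distribution simply applies the same formula to rows and columns independently.

```c
/* Sketch of the block-cyclic mapping used by ScaLAPACK-style libraries:
 * global index -> (owning process, local index) along one dimension.
 * nb is the block size, p the number of processes in that dimension. */
#include <stdio.h>

typedef struct { int proc; int local; } Map;

static Map block_cyclic(int g, int nb, int p) {
    Map m;
    int block = g / nb;                  /* which global block the index falls in */
    m.proc  = block % p;                 /* blocks are dealt out cyclically       */
    m.local = (block / p) * nb + g % nb; /* position within the owner's storage   */
    return m;
}

int main(void) {
    int nb = 4, p = 3;                   /* illustrative block size and process count */
    for (int g = 0; g < 16; g++) {
        Map m = block_cyclic(g, nb, p);
        printf("global %2d -> proc %d, local %2d\n", g, m.proc, m.local);
    }
    return 0;
}
```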

  • The design of a standard message passing interface for Distributed Memory concurrent computers
    Parallel Computing, 1994
    Co-Authors: D.w. Walker
    Abstract:

    This paper presents an overview of MPI, a proposed standard message passing interface for MIMD Distributed Memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.
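For illustration, a minimal program exercising the two communication classes named in the abstract, point-to-point and collective, written against the finished MPI standard; the payload values are arbitrary.

```c
/* Minimal sketch of the core MPI feature classes described above:
 * point-to-point messages between two ranks and a collective operation
 * over the whole communicator (which also serves as the communication
 * context; subgroups would get their own communicators). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: rank 0 sends one integer to rank 1, if it exists. */
    if (rank == 0 && size > 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    /* Collective: sum a value contributed by every process. */
    int local = rank, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks across %d processes: %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```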

  • ScaLAPACK: a scalable linear algebra library for Distributed Memory concurrent computers
    Symposium on Frontiers of Massively Parallel Computation, 1992
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, Roldan Pozo, D.w. Walker
    Abstract:

    The authors describe ScaLAPACK, a Distributed Memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of Distributed versions of the Level 3 BLAS as building blocks, and an object-oriented interface to the library routines. The square block scattered decomposition is described. The implementation of a Distributed Memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.
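For readers unfamiliar with the right-looking variant, the sketch below shows the sequential algorithm with partial pivoting that the Distributed version parallelizes; the column-major layout and the 3 × 3 test matrix are assumptions for the example, not code from the library.

```c
/* Sketch of right-looking LU factorization with partial pivoting on an
 * n x n column-major matrix A (leading dimension n). After each pivot
 * column is scaled, the whole trailing submatrix is updated at once --
 * the "right-looking" order that maps well onto Distributed Memory. */
#include <math.h>
#include <stdio.h>

void lu_right_looking(int n, double *A, int *piv) {
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: largest entry in column k at or below row k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i + k * n]) > fabs(A[p + k * n])) p = i;
        piv[k] = p;
        if (p != k)                      /* swap rows k and p across all columns */
            for (int j = 0; j < n; j++) {
                double t = A[k + j * n];
                A[k + j * n] = A[p + j * n];
                A[p + j * n] = t;
            }

        /* Scale the pivot column to form the multipliers L(k+1:n, k). */
        for (int i = k + 1; i < n; i++)
            A[i + k * n] /= A[k + k * n];

        /* Right-looking step: rank-1 update of the trailing submatrix. */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                A[i + j * n] -= A[i + k * n] * A[k + j * n];
    }
}

int main(void) {
    /* Tiny example: factor a 3x3 matrix in place and print the L\U factors. */
    double A[9] = { 2, 4, 8,   1, 3, 7,   1, 5, 9 };   /* column-major */
    int piv[3];
    lu_right_looking(3, A, piv);
    for (int i = 0; i < 3; i++)
        printf("%8.4f %8.4f %8.4f\n", A[i], A[i + 3], A[i + 6]);
    return 0;
}
```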

  • Parallel matrix transpose algorithms on Distributed Memory concurrent computers
    Proceedings of the Scalable Parallel Libraries Conference, 1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    This paper describes parallel matrix transpose algorithms on Distributed Memory concurrent processors. We assume that the matrix is Distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

J.j. Dongarra - One of the best experts on this subject based on the ideXlab platform.

  • Linear systems solvers for Distributed Memory machines with GPU accelerators
    European Conference on Parallel Processing, 2019
    Co-Authors: Jakub Kurzak, Asim Yarkhan, Mark Gates, Ali Charara, Ichitaro Yamazaki, J.j. Dongarra
    Abstract:

    This work presents two implementations of linear solvers for Distributed-Memory machines with GPU accelerators—one based on the Cholesky factorization and one based on the LU factorization with partial pivoting. The routines are developed as part of the Software for Linear Algebra Targeting Exascale (SLATE) package, which represents a sharp departure from the traditional conventions established by legacy packages, such as LAPACK and ScaLAPACK. The article lays out the principles of the new approach, discusses the implementation details, and presents the performance results.
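As background, the sketch below is an unblocked, sequential Cholesky factorization of the kind one of the two solvers builds on; it is not SLATE code, which works on Distributed tiles and offloads the updates to GPUs. The example matrix is arbitrary.

```c
/* Sketch of unblocked Cholesky factorization A = L * L^T for a symmetric
 * positive definite matrix stored column-major with leading dimension n.
 * L overwrites the lower triangle of A. Sequential and illustrative only. */
#include <math.h>
#include <stdio.h>

int cholesky_lower(int n, double *A) {
    for (int k = 0; k < n; k++) {
        double d = A[k + k * n];
        for (int j = 0; j < k; j++)
            d -= A[k + j * n] * A[k + j * n];
        if (d <= 0.0) return k + 1;            /* not positive definite */
        A[k + k * n] = sqrt(d);
        for (int i = k + 1; i < n; i++) {
            double s = A[i + k * n];
            for (int j = 0; j < k; j++)
                s -= A[i + j * n] * A[k + j * n];
            A[i + k * n] = s / A[k + k * n];   /* L(i,k) */
        }
    }
    return 0;                                  /* success */
}

int main(void) {
    /* Tiny SPD example: factor in place and print L. */
    double A[9] = { 4, 2, 2,   2, 5, 3,   2, 3, 6 };   /* column-major, SPD */
    if (cholesky_lower(3, A) == 0)
        for (int i = 0; i < 3; i++)
            printf("%8.4f %8.4f %8.4f\n",
                   A[i], (i >= 1) ? A[i + 3] : 0.0, (i >= 2) ? A[i + 6] : 0.0);
    return 0;
}
```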

  • Dynamic task scheduling for linear algebra algorithms on Distributed Memory multicore systems
    IEEE International Conference on High Performance Computing Data and Analytics, 2009
    Co-Authors: Fengguang Song, Asim Yarkhan, J.j. Dongarra
    Abstract:

    This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms on multicore systems (either shared-Memory or Distributed-Memory). We use a task-based library to replace existing linear algebra subroutines such as PBLAS, transparently providing the same interface and computational functionality as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. The runtime system design focuses primarily on performance scalability. We propose a Distributed algorithm that resolves data dependences without process cooperation. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky, LU, and QR factorizations. Our experiments on both shared-Memory machines (16 and 32 cores) and Distributed-Memory machines (1024 cores) demonstrate that the runtime system achieves good scalability. Furthermore, we provide an analytical model that explains why the tiled algorithms are scalable and estimates their expected execution time.
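The task structure such a runtime schedules can be made concrete with tiled Cholesky; in the sketch below the kernel functions are print-only stubs (hypothetical names standing in for POTRF/TRSM/SYRK/GEMM tile tasks), used only to expose the tasks and the tiles whose reads and writes define their dependences.

```c
/* Sketch of the tiled Cholesky task graph that a dynamic runtime would
 * schedule out of order once dependences are satisfied. Each stub call is
 * one task; the comments name the tiles it reads and writes. */
#include <stdio.h>

static void potrf(int k)               { printf("POTRF(%d,%d)\n", k, k); }
static void trsm (int i, int k)        { printf("TRSM (%d,%d) <- (%d,%d)\n", i, k, k, k); }
static void syrk (int i, int k)        { printf("SYRK (%d,%d) <- (%d,%d)\n", i, i, i, k); }
static void gemm (int i, int j, int k) { printf("GEMM (%d,%d) <- (%d,%d),(%d,%d)\n",
                                                i, j, i, k, j, k); }

int main(void) {
    int T = 4;                          /* number of tile rows/columns (illustrative) */
    for (int k = 0; k < T; k++) {
        potrf(k);                       /* factor diagonal tile (k,k)                 */
        for (int i = k + 1; i < T; i++)
            trsm(i, k);                 /* solve panel tile (i,k) against (k,k)       */
        for (int i = k + 1; i < T; i++) {
            syrk(i, k);                 /* update diagonal tile (i,i) with (i,k)      */
            for (int j = k + 1; j < i; j++)
                gemm(i, j, k);          /* update off-diagonal tile (i,j)             */
        }
    }
    return 0;
}
```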

  • Parallel matrix transpose algorithms on Distributed Memory concurrent computers
    1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    This paper describes parallel matrix transpose algorithms on Distributed Memory concurrent processors. We assume that the matrix is Distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  • The design of scalable software libraries for Distributed Memory concurrent computers
    Proceedings of 8th International Parallel Processing Symposium, 1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    Describes the design of ScaLAPACK, a scalable software library for performing dense and banded linear algebra computations on Distributed Memory concurrent computers. The specification of the data distribution has important consequences for interprocessor communication and load balance, and hence is a major factor in determining performance and scalability of the library routines. The block cyclic data distribution is adopted as a simple, yet general purpose, way of decomposing block-partitioned matrices. Distributed Memory versions of the Level 3 BLAS provide an easy and convenient way of implementing the ScaLAPACK routines.

  • ScaLAPACK: a scalable linear algebra library for Distributed Memory concurrent computers
    Symposium on Frontiers of Massively Parallel Computation, 1992
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, Roldan Pozo, D.w. Walker
    Abstract:

    The authors describe ScaLAPACK, a Distributed Memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of Distributed versions of the Level 3 BLAS as building blocks, and an object-oriented interface to the library routines. The square block scattered decomposition is described. The implementation of a Distributed Memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.

Jaeyoung Choi - One of the best experts on this subject based on the ideXlab platform.

  • Parallel matrix transpose algorithms on Distributed Memory concurrent computers
    1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    This paper describes parallel matrix transpose algorithms on Distributed Memory concurrent processors. We assume that the matrix is Distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  • The design of scalable software libraries for Distributed Memory concurrent computers
    Proceedings of 8th International Parallel Processing Symposium, 1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    Describes the design of ScaLAPACK, a scalable software library for performing dense and banded linear algebra computations on Distributed Memory concurrent computers. The specification of the data distribution has important consequences for interprocessor communication and load balance, and hence is a major factor in determining performance and scalability of the library routines. The block cyclic data distribution is adopted as a simple, yet general purpose, way of decomposing block-partitioned matrices. Distributed Memory versions of the Level 3 BLAS provide an easy and convenient way of implementing the ScaLAPACK routines.

  • ScaLAPACK: a scalable linear algebra library for Distributed Memory concurrent computers
    Symposium on Frontiers of Massively Parallel Computation, 1992
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, Roldan Pozo, D.w. Walker
    Abstract:

    The authors describe ScaLAPACK, a Distributed Memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of Distributed versions of the Level 3 BLAS as building blocks, and an object-oriented interface to the library routines. The square block scattered decomposition is described. The implementation of a Distributed Memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.

  • Parallel matrix transpose algorithms on Distributed Memory concurrent computers
    Proceedings of the Scalable Parallel Libraries Conference, 1994
    Co-Authors: Jaeyoung Choi, J.j. Dongarra, D.w. Walker
    Abstract:

    This paper describes parallel matrix transpose algorithms on Distributed Memory concurrent processors. We assume that the matrix is Distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

Anshul Gupta - One of the best experts on this subject based on the ideXlab platform.

  • A Shared- and Distributed-Memory parallel general sparse direct solver
    Applicable Algebra in Engineering, Communication and Computing, 2007
    Co-Authors: Anshul Gupta
    Abstract:

    An important recent development in the area of solution of general sparse systems of linear equations has been the introduction of new algorithms that allow complete decoupling of symbolic and numerical phases of sparse Gaussian elimination with partial pivoting. This enables efficient solution of a series of sparse systems with the same nonzero pattern but different coefficient values, which is a fairly common situation in practical applications. This paper reports on a shared- and Distributed-Memory parallel general sparse solver based on these new symbolic and unsymmetric-pattern multifrontal algorithms.

  • A shared- and Distributed-Memory parallel sparse direct solver
    Lecture Notes in Computer Science, 2006
    Co-Authors: Anshul Gupta
    Abstract:

    In this paper, we describe a parallel direct solver for general sparse systems of linear equations that has recently been included in the Watson Sparse Matrix Package (WSMP) [7]. This solver utilizes both shared- and Distributed-Memory parallelism in the same program and is designed for a hierarchical parallel computer with network-interconnected SMP nodes. We compare the WSMP solver with two similar, well-known solvers: MUMPS [2] and SuperLU_DIST [10]. We show that the WSMP solver achieves significantly better performance than both of these solvers, which are based on traditional algorithms, and is more numerically robust than SuperLU_DIST. We had earlier shown [8] that MUMPS and SuperLU_DIST are amongst the fastest Distributed-Memory general sparse solvers available.
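The "shared- and Distributed-Memory parallelism in the same program" execution model described here is commonly expressed as hybrid MPI + OpenMP code: one MPI process per SMP node, with threads inside each process. The sketch below illustrates that model only and is not WSMP code; the per-thread work is a placeholder.

```c
/* Minimal hybrid MPI + OpenMP sketch: MPI ranks across network-connected
 * SMP nodes, OpenMP threads within each rank. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-Memory parallelism inside one MPI process. */
    double local_sum = 0.0;
    #pragma omp parallel reduction(+:local_sum)
    {
        local_sum += omp_get_thread_num() + 1;   /* stand-in for real per-thread work */
    }

    /* Distributed-Memory parallelism across processes. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("combined result from all processes and threads: %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```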

P Banerjee - One of the best experts on this subject based on the ideXlab platform.

  • A framework for exploiting task and data parallelism on Distributed Memory multicomputers
    IEEE Transactions on Parallel and Distributed Systems, 1997
    Co-Authors: Shankar Ramaswamy, Sachin S Sapatnekar, P Banerjee
    Abstract:

    Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared Memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for Distributed Memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data-parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster.
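The central mechanism, running several data-parallel tasks concurrently on disjoint processor subsets, can be sketched with MPI communicators; the two-way split and the toy per-task reduction below are assumptions made for the example, not PARADIGM-generated code.

```c
/* Sketch of combined task and data parallelism: processes are split into
 * two disjoint groups, and each group runs a different "task" in a
 * data-parallel way over its own communicator. Group sizes and the toy
 * per-task work are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Task parallelism: assign roughly half the processes to each task. */
    int task = (rank < size / 2) ? 0 : 1;
    MPI_Comm task_comm;
    MPI_Comm_split(MPI_COMM_WORLD, task, rank, &task_comm);

    int tsize, trank;
    MPI_Comm_size(task_comm, &tsize);
    MPI_Comm_rank(task_comm, &trank);

    /* Data parallelism within the task: a toy reduction over the subset. */
    int local = trank + 1, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, task_comm);

    printf("task %d: %d processes, data-parallel result %d (world rank %d)\n",
           task, tsize, total, rank);

    MPI_Comm_free(&task_comm);
    MPI_Finalize();
    return 0;
}
```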

  • Automatic selection of dynamic data partitioning schemes for Distributed Memory multicomputers
    Languages and Compilers for Parallel Computing, 1995
    Co-Authors: Daniel J Palermo, P Banerjee
    Abstract:

    For Distributed-Memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into account the added overhead of performing redistribution. This system is being built as part of the PARADIGM (PARAllelizing compiler for Distributed-Memory General-purpose Multicomputers) project at the University of Illinois. The complete system will provide a fully automated means to parallelize programs written in a serial programming model obtaining high performance on a wide range of Distributed-Memory multicomputers.
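The trade-off the system automates can be reduced to a small cost comparison: pay the redistribution overhead only if the per-phase savings exceed it. The numbers below are placeholders standing in for the compiler's machine-model estimates, not measurements from the paper.

```c
/* Sketch of the decision the paper automates: keep one data distribution
 * across two program phases, or redistribute between them. All costs are
 * illustrative placeholders for machine-model estimates. */
#include <stdio.h>

int main(void) {
    /* Estimated execution time (seconds) of each phase under each distribution. */
    double phase1_rowwise = 2.0, phase1_colwise = 5.0;
    double phase2_rowwise = 6.0, phase2_colwise = 1.5;
    double redistribute_cost = 1.0;   /* cost of changing distribution between phases */

    /* Static choices: one distribution kept for the whole program. */
    double static_row = phase1_rowwise + phase2_rowwise;
    double static_col = phase1_colwise + phase2_colwise;

    /* Dynamic choice: best distribution per phase plus redistribution overhead. */
    double dynamic = phase1_rowwise + redistribute_cost + phase2_colwise;

    printf("static row-wise: %.1f  static column-wise: %.1f  dynamic: %.1f\n",
           static_row, static_col, dynamic);
    printf("%s\n", (dynamic < static_row && dynamic < static_col)
                   ? "redistribution pays off"
                   : "keep a static distribution");
    return 0;
}
```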

  • Communication optimizations used in the PARADIGM compiler for Distributed Memory multicomputers
    International Conference on Parallel Processing, 1994
    Co-Authors: Daniel J Palermo, John A Chandy, P Banerjee
    Abstract:

    The PARADIGM (PARAllelizing compiler for Distributed-Memory General-purpose Multicomputers) project at the University of Illinois provides a fully automated means to parallelize programs, written in a serial programming model, for execution on Distributed-Memory multicomputers. To provide efficient execution, PARADIGM automatically performs various optimizations to reduce the overhead and idle time caused by interprocessor communication. Optimizations studied in this paper include message coalescing, message vectorization, message aggregation, and coarse grain pipelining. To separate the optimization algorithms from machine-specific details, parameterized models are used to estimate communication and computation costs for a given machine. The models are also used in coarse grain pipelining to automatically select a task granularity that balances the available parallelism with the costs of communication. To determine the applicability of the optimizations on different machines, we analyzed their performance on an Intel iPSC/860, an Intel iPSC/2, and a Thinking Machines CM-5.
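Of the optimizations listed, message vectorization is the easiest to show in isolation: a strided array section is described once and shipped as a single message instead of element-by-element sends. The sketch below uses an MPI derived datatype for concreteness (not PARADIGM-generated code); the matrix shape and two-process pattern are arbitrary.

```c
/* Sketch of message vectorization: one column of a row-major matrix is
 * described by a derived datatype and sent as a single message rather
 * than one message per element. Illustrative only; needs two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs at least two processes */

    enum { ROWS = 4, COLS = 8 };
    double a[ROWS][COLS];

    /* One column of a row-major matrix: ROWS blocks of 1 element, stride COLS. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = i * COLS + j;
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);   /* column 2, one message */
    } else if (rank == 1) {
        double col[ROWS];
        MPI_Recv(col, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; i++)
            printf("received %.0f\n", col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```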