Parallel Kernel

The Experts below are selected from a list of 14,553 Experts worldwide, ranked by the ideXlab platform

Scott Mahlke - One of the best experts on this subject based on the ideXlab platform.

  • VAST: The Illusion of a Large Memory Space for GPUs
    International Conference on Parallel Architectures and Compilation Techniques, 2014
    Co-Authors: Mehrzad Samadi, Scott Mahlke
    Abstract:

    Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges the working set into contiguous memory space, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation within the constraints of the actual physical memory and improves the retargetability of the OpenCL code with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute data sets of any size without program changes, achieving a 2.6× speedup over CPU execution and making it a realistic alternative for large-data computation.
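
    As a rough, host-side illustration of the chunking that VAST automates, the sketch below (our own example in PyOpenCL, not part of VAST; the kernel name `scale` and the chunk size are arbitrary) partitions an input array that may exceed device memory into fixed-size chunks, copies each chunk to the device, runs an elementwise kernel, and copies the partial result back. VAST performs this partitioning, working-set extraction, and kernel transformation automatically, so no such manual loop is needed.

      import numpy as np
      import pyopencl as cl

      # Trivial elementwise kernel used only to demonstrate chunked execution.
      KERNEL_SRC = """
      __kernel void scale(__global const float *x, __global float *y) {
          int i = get_global_id(0);
          y[i] = 2.0f * x[i];
      }
      """

      ctx = cl.create_some_context()
      queue = cl.CommandQueue(ctx)
      prg = cl.Program(ctx, KERNEL_SRC).build()
      mf = cl.mem_flags

      def run_in_chunks(x, chunk_elems):
          # Process x in pieces small enough to fit in device memory,
          # mimicking by hand what VAST does through kernel transformation.
          y = np.empty_like(x)
          for start in range(0, x.size, chunk_elems):
              part = np.ascontiguousarray(x[start:start + chunk_elems])
              d_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=part)
              d_out = cl.Buffer(ctx, mf.WRITE_ONLY, part.nbytes)
              prg.scale(queue, (part.size,), None, d_in, d_out)
              cl.enqueue_copy(queue, y[start:start + part.size], d_out)
          return y

      x = np.arange(16_000_000, dtype=np.float32)
      y = run_in_chunks(x, chunk_elems=1_000_000)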

  • Paraprox: Pattern-Based Approximation for Data Parallel Applications
    Architectural Support for Programming Languages and Operating Systems, 2014
    Co-Authors: Mehrzad Samadi, Davoud Anoushe Jamshidi, Scott Mahlke
    Abstract:

    Approximate computing is an approach in which reduced accuracy of results is traded for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there is a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results, and where small errors may not even be noticeable to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge in approximate computing is transparency: insulating both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware systems. Paraprox starts with a data-parallel kernel implemented in OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., Map, Scatter/Gather, Reduction, Scan, Stencil, and Partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7× on an NVIDIA GTX 560 GPU and 2.5× on an Intel Core i7 quad-core processor compared to accurate execution on each platform.
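
    To make the runtime-tuning loop concrete, the CPU-only toy below (our own sketch; the function names and the strided-replication approximation are illustrative, not Paraprox's actual code generator) replaces an exact elementwise "map" computation with progressively coarser approximations and picks the most aggressive one whose measured quality still meets the user's target output quality (TOQ).

      import numpy as np

      def exact_map(x):
          # Exact "map" idiom: an elementwise function applied to every input.
          return np.sin(x) * np.exp(-x * x)

      def approx_map(x, stride):
          # One simple approximation: compute every `stride`-th element and
          # replicate it to its neighbours, trading accuracy for speed.
          y = exact_map(x[::stride])
          return np.repeat(y, stride)[:x.size]

      def quality(exact, approx):
          # Output quality as 1 minus the relative L2 error (1.0 == exact).
          return 1.0 - np.linalg.norm(exact - approx) / np.linalg.norm(exact)

      def tune(x, toq=0.90, strides=(8, 4, 2, 1)):
          # Runtime tuning: choose the most aggressive stride that still
          # meets the target output quality on a calibration input.
          reference = exact_map(x)
          for s in strides:
              if quality(reference, approx_map(x, s)) >= toq:
                  return s
          return 1

      calibration = np.linspace(-3.0, 3.0, 1 << 20)
      best_stride = tune(calibration, toq=0.90)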

Evgeny Savelev - One of the best experts on this subject based on the ideXlab platform.

  • Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
    Annals of the Institute of Statistical Mathematics, 2019
    Co-Authors: Alexey Miroshnikov, Evgeny Savelev
    Abstract:

    In this article, we perform an asymptotic analysis of the Bayesian parallel kernel density estimators introduced by Neiswanger et al. (in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 623–632, 2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
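
    For background (our notation, not reproduced from the paper), the estimator analyzed here builds on the embarrassingly parallel factorization of Neiswanger et al.: the data x are split into M subsets, each subposterior is estimated from its own MCMC samples with a kernel density estimator of bandwidth h, and the full-data posterior estimate is the renormalized product of the subposterior estimates:

      p(\theta \mid x) \;\propto\; \prod_{i=1}^{M} p_i(\theta \mid x_i),
      \qquad
      p_i(\theta \mid x_i) \;\propto\; p(\theta)^{1/M}\, p(x_i \mid \theta),

      \hat{p}_i(\theta) \;=\; \frac{1}{T} \sum_{t=1}^{T} K_h\!\left(\theta - \theta_{i,t}\right),
      \qquad
      \hat{p}(\theta \mid x) \;\propto\; \prod_{i=1}^{M} \hat{p}_i(\theta),

    where \theta_{i,t} denotes the t-th sample from the i-th subposterior. The paper's contribution is the asymptotic expansion of the mean integrated squared error of \hat{p}(\theta \mid x) and the resulting characterization of the asymptotically optimal bandwidth.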

  • Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
    arXiv: Statistics Theory, 2016
    Co-Authors: Alexey Miroshnikov, Evgeny Savelev
    Abstract:

    In this article, we perform an asymptotic analysis of the Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing (2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
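
    A minimal numerical sketch of the same construction (ours; SciPy's Gaussian KDE and the toy normal samples stand in for the paper's kernel estimator and real MCMC output): fit a kernel density estimate to each subset's posterior samples and renormalize the pointwise product on a grid.

      import numpy as np
      from scipy.stats import gaussian_kde

      def combined_posterior_density(sub_samples, grid):
          # sub_samples: list of 1-D arrays of posterior samples, one per data subset.
          # Returns the renormalized product of the per-subset KDEs on `grid`.
          log_prod = np.zeros_like(grid)
          for samples in sub_samples:
              log_prod += np.log(gaussian_kde(samples)(grid))
          dens = np.exp(log_prod - log_prod.max())
          return dens / np.trapz(dens, grid)

      # Toy usage: three "subposteriors" whose product is a tighter posterior.
      rng = np.random.default_rng(0)
      sub_samples = [rng.normal(loc=0.1 * i, scale=1.0, size=2000) for i in range(3)]
      grid = np.linspace(-4.0, 4.0, 400)
      density = combined_posterior_density(sub_samples, grid)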

Mehrzad Samadi - One of the best experts on this subject based on the ideXlab platform.

  • VAST: The Illusion of a Large Memory Space for GPUs
    International Conference on Parallel Architectures and Compilation Techniques, 2014
    Co-Authors: Mehrzad Samadi, Scott Mahlke
    Abstract:

    Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges the working set into contiguous memory space, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation within the constraints of the actual physical memory and improves the retargetability of the OpenCL code with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute data sets of any size without program changes, achieving a 2.6× speedup over CPU execution and making it a realistic alternative for large-data computation.

  • Paraprox: Pattern-Based Approximation for Data Parallel Applications
    Architectural Support for Programming Languages and Operating Systems, 2014
    Co-Authors: Mehrzad Samadi, Davoud Anoushe Jamshidi, Scott Mahlke
    Abstract:

    Approximate computing is an approach in which reduced accuracy of results is traded for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there is a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results, and where small errors may not even be noticeable to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge in approximate computing is transparency: insulating both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware systems. Paraprox starts with a data-parallel kernel implemented in OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., Map, Scatter/Gather, Reduction, Scan, Stencil, and Partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7× on an NVIDIA GTX 560 GPU and 2.5× on an Intel Core i7 quad-core processor compared to accurate execution on each platform.

George Biros - One of the best experts on this subject based on the ideXlab platform.

  • PVFMM: A Parallel Kernel Independent FMM for Particle and Volume Potentials
    Communications in Computational Physics, 2015
    Co-Authors: Dhairya Malhotra, George Biros
    Abstract:

    We describe our implementation of a parallel fast multipole method (FMM) for evaluating potentials for discrete and continuous source distributions. The first problem requires summation over the source points, and the second requires integration over a continuous source density. Both problems have O(N²) complexity when computed directly but can be accelerated to O(N) time using the FMM. In our PVFMM software library, we use the kernel-independent FMM, which allows us to compute potentials for a wide range of elliptic kernels. Our method is high-order, adaptive, and scalable. In this paper, we discuss several algorithmic improvements and performance optimizations, including cache locality, vectorization, shared-memory parallelism, and the use of coprocessors. Our distributed-memory implementation uses a space-filling curve for partitioning data and a hypercube communication scheme. We present convergence results for the Laplace, Stokes, and Helmholtz (low-wavenumber) kernels for both particle and volume FMM. We measure the efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels. We also demonstrate the scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.
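
    For reference, the "first problem" (particle potentials) in its direct O(N²) form looks like the NumPy sketch below (our own illustration; the PVFMM library replaces this quadratic summation with an O(N) kernel-independent FMM and also handles the continuous-source case).

      import numpy as np

      def laplace_kernel(r):
          # Free-space 3-D Laplace (single-layer) kernel: 1 / (4 * pi * r).
          with np.errstate(divide="ignore"):
              return 1.0 / (4.0 * np.pi * r)

      def direct_potential(targets, sources, densities):
          # Direct O(N^2) summation: u(t_j) = sum_i K(|t_j - s_i|) * q_i.
          diff = targets[:, None, :] - sources[None, :, :]   # shape (M, N, 3)
          r = np.linalg.norm(diff, axis=-1)                  # shape (M, N)
          k = laplace_kernel(r)
          k[~np.isfinite(k)] = 0.0                           # drop self-interactions
          return k @ densities

      rng = np.random.default_rng(1)
      points = rng.random((1000, 3))
      charges = rng.random(1000)
      potentials = direct_potential(points, points, charges)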

  • A New Parallel Kernel-Independent Fast Multipole Method
    Conference on High Performance Computing (Supercomputing), 2003
    Co-Authors: Lexing Ying, George Biros, Denis Zorin, Harper M Langston
    Abstract:

    We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernel-independent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions but only utilizes kernel evaluations. The new method provides the enabling technology for many important problems in computational science and engineering. Examples include viscous flows, fracture mechanics, and screened Coulombic interactions. Our MPI-based parallel implementation logically separates the computation and communication phases to avoid synchronization in the upward and downward computation passes, and thus allows us to fully overlap computation and communication. We measure isogranular and fixed-size scalability for a variety of kernels on the Pittsburgh Supercomputing Center's TCS-1 AlphaServer on up to 3,000 processors. We have solved viscous flow problems with up to 2.1 billion unknowns, achieving 1.6 Tflops/s peak performance and 1.13 Tflops/s sustained performance.
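
    The kernel-independence can be illustrated with the following sketch (ours, not the paper's implementation): the far field of a group of sources is represented by densities on an "equivalent" surface, chosen so that they reproduce the sources' potential at a set of "check" points using nothing but kernel evaluations. The actual method uses a regularized inverse rather than the plain least-squares solve shown here, and swapping in a different kernel routine (e.g., a screened Coulomb kernel) requires no new analytic expansions.

      import numpy as np

      def laplace(r):
          # 3-D Laplace kernel evaluated pointwise: 1 / (4 * pi * r).
          with np.errstate(divide="ignore"):
              return 1.0 / (4.0 * np.pi * r)

      def kernel_matrix(targets, sources, kernel=laplace):
          # K[j, i] = kernel(|t_j - s_i|), built from point evaluations only.
          r = np.linalg.norm(targets[:, None, :] - sources[None, :, :], axis=-1)
          return kernel(r)

      def equivalent_densities(src_pts, src_q, equiv_pts, check_pts, kernel=laplace):
          # Pick densities on the equivalent surface that reproduce the sources'
          # potential at the check points; no kernel-specific expansion is needed.
          rhs = kernel_matrix(check_pts, src_pts, kernel) @ src_q
          A = kernel_matrix(check_pts, equiv_pts, kernel)
          q_equiv, *_ = np.linalg.lstsq(A, rhs, rcond=None)
          return q_equiv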

Alexey Miroshnikov - One of the best experts on this subject based on the ideXlab platform.

  • Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
    Annals of the Institute of Statistical Mathematics, 2019
    Co-Authors: Alexey Miroshnikov, Evgeny Savelev
    Abstract:

    In this article, we perform an asymptotic analysis of the Bayesian parallel kernel density estimators introduced by Neiswanger et al. (in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 623–632, 2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.

  • Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
    arXiv: Statistics Theory, 2016
    Co-Authors: Alexey Miroshnikov, Evgeny Savelev
    Abstract:

    In this article, we perform an asymptotic analysis of the Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing (2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.