The experts below are selected from a list of 14,553 experts worldwide, ranked by the ideXlab platform.
Scott Mahlke - One of the best experts on this subject based on the ideXlab platform.
-
VAST: The Illusion of a Large Memory Space for GPUs
International Conference on Parallel Architectures and Compilation Techniques, 2014
Co-Authors: Mehrzad Samadi, Scott Mahlke
Abstract: Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks; efficiently extracts the precise working set required for the divided workload; rearranges the working set in contiguous memory space; and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation subject to the constraints of the actual physical memory and improves the retargetability of OpenCL code with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute on data of any size without program changes, achieving a 2.6× speedup over CPU execution, which makes it a realistic alternative for large-data computation.
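VAST itself rewrites OpenCL kernels automatically; the underlying idea — splitting a workload whose data exceeds device memory into chunks that fit, then running the kernel per chunk — can be sketched host-side in plain Python. This is a minimal illustration, not VAST's actual code generation: NumPy stands in for the device, and all names are illustrative.

```python
import numpy as np

def chunked_map(kernel, data, device_mem_bytes):
    """Apply `kernel` to `data` in pieces small enough to fit in the
    (simulated) device memory, mimicking VAST's workload partitioning."""
    rows_per_chunk = max(1, device_mem_bytes // data[0].nbytes)
    out = np.empty_like(data)
    for start in range(0, len(data), rows_per_chunk):
        chunk = data[start:start + rows_per_chunk]        # working set for this chunk
        out[start:start + rows_per_chunk] = kernel(chunk)  # stand-in for the GPU offload
    return out

# An array far larger than the "device" memory, processed ~64 KB at a time.
x = np.arange(1_000_000, dtype=np.float64)
y = chunked_map(lambda c: c * 2.0, x, device_mem_bytes=64 * 1024)
```

The point of the sketch is that the kernel body never changes; only the driver loop deals with the memory limit, which is the separation of concerns VAST automates.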
-
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Architectural Support for Programming Languages and Operating Systems, 2014
Co-Authors: Mehrzad Samadi, Davoud Anoushe Jamshidi, Scott Mahlke
Abstract: Approximate computing is an approach in which reduced accuracy of results is traded for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there is a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results, or even noticeable differences, to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency: insulating both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware. Paraprox starts with a data-parallel kernel implemented in OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., map, scatter/gather, reduction, scan, stencil, and partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7× on an NVIDIA GTX 560 GPU and 2.5× on an Intel Core i7 quad-core processor compared to accurate execution on each platform.
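Paraprox's substitutions are GPU-kernel transformations, but the flavor of one of them — approximating the reduction idiom by processing only a sampled subset of the input and scaling the result — can be sketched in a few lines of Python. The stride plays the role of the runtime-tuned knob; this is an illustrative sketch, not Paraprox's implementation.

```python
import numpy as np

def approx_sum(data, sample_stride=4):
    """Approximate a reduction by summing every `sample_stride`-th element
    and scaling up, trading a little accuracy for proportionally less work."""
    sample = data[::sample_stride]
    return sample.sum() * (len(data) / len(sample))

rng = np.random.default_rng(0)
x = rng.random(1_000_000)
exact = x.sum()
approx = approx_sum(x, sample_stride=4)
rel_err = abs(approx - exact) / exact  # small for well-mixed inputs
```

A runtime tuner in the style of the paper would raise the stride until the measured quality loss approaches the user's TOQ, then stop.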
Evgeny Savelev - One of the best experts on this subject based on the ideXlab platform.
-
Asymptotic properties of Parallel Bayesian Kernel density estimators
Annals of the Institute of Statistical Mathematics, 2019
Co-Authors: Alexey Miroshnikov, Evgeny Savelev
Abstract: In this article, we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger et al. (in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 623–632, 2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
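The estimators under study build kernel density estimates independently on subsets of the data and then combine them. A minimal sketch of the setting — Gaussian KDEs on data partitions, each with a rule-of-thumb bandwidth, combined by simple averaging — is below. It is purely illustrative (the paper's point is precisely that subset bandwidths require a more careful joint choice, and the Neiswanger et al. combination is multiplicative rather than an average); all function names are ours.

```python
import numpy as np

def gaussian_kde(data, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a callable."""
    def density(x):
        u = (x[:, None] - data[None, :]) / bandwidth
        return np.exp(-0.5 * u**2).sum(axis=1) / (
            len(data) * bandwidth * np.sqrt(2 * np.pi))
    return density

def silverman_bandwidth(data):
    # Rule-of-thumb bandwidth ~ n^(-1/5). On a subset of size n/M this
    # inflates relative to the full-data choice, which is why naive
    # per-subset bandwidths are suboptimal.
    return 1.06 * data.std() * len(data) ** (-1 / 5)

rng = np.random.default_rng(1)
full = rng.normal(size=4000)
subsets = np.array_split(full, 4)                 # partition across 4 workers
kdes = [gaussian_kde(s, silverman_bandwidth(s)) for s in subsets]
grid = np.linspace(-3.0, 3.0, 61)
combined = np.mean([k(grid) for k in kdes], axis=0)  # averaged subset estimates
```

Each subset sees only a quarter of the data, so its rule-of-thumb bandwidth is larger than the full-data one — the bandwidth/partition interaction the asymptotic analysis quantifies.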
-
Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
arXiv: Statistics Theory, 2016
Co-Authors: Alexey Miroshnikov, Evgeny Savelev
Abstract: In this article we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing (2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
Mehrzad Samadi - One of the best experts on this subject based on the ideXlab platform.
-
VAST: The Illusion of a Large Memory Space for GPUs
International Conference on Parallel Architectures and Compilation Techniques, 2014
Co-Authors: Mehrzad Samadi, Scott Mahlke
Abstract: Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks; efficiently extracts the precise working set required for the divided workload; rearranges the working set in contiguous memory space; and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation subject to the constraints of the actual physical memory and improves the retargetability of OpenCL code with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute on data of any size without program changes, achieving a 2.6× speedup over CPU execution, which makes it a realistic alternative for large-data computation.
-
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Architectural Support for Programming Languages and Operating Systems, 2014
Co-Authors: Mehrzad Samadi, Davoud Anoushe Jamshidi, Scott Mahlke
Abstract: Approximate computing is an approach in which reduced accuracy of results is traded for increased speed, throughput, or both. Loss of accuracy is not permissible in all computing domains, but there is a growing number of data-intensive domains where the output of programs need not be perfectly correct to provide useful results, or even noticeable differences, to the end user. These soft domains include multimedia processing, machine learning, and data mining/analysis. An important challenge with approximate computing is transparency: insulating both software and hardware developers from the time, cost, and difficulty of using approximation. This paper proposes a software-only system, Paraprox, for realizing transparent approximation of data-parallel programs that operates on commodity hardware. Paraprox starts with a data-parallel kernel implemented in OpenCL or CUDA and creates a parameterized approximate kernel that is tuned at runtime to maximize performance subject to a target output quality (TOQ) supplied by the user. Approximate kernels are created by recognizing common computation idioms found in data-parallel programs (e.g., map, scatter/gather, reduction, scan, stencil, and partition) and substituting approximate implementations in their place. Across a set of 13 soft data-parallel applications with at most 10% quality degradation, Paraprox yields an average performance gain of 2.7× on an NVIDIA GTX 560 GPU and 2.5× on an Intel Core i7 quad-core processor compared to accurate execution on each platform.
George Biros - One of the best experts on this subject based on the ideXlab platform.
-
PVFMM: A Parallel Kernel-Independent FMM for Particle and Volume Potentials
Communications in Computational Physics, 2015
Co-Authors: Dhairya Malhotra, George Biros
Abstract: We describe our implementation of a parallel fast multipole method (FMM) for evaluating potentials for discrete and continuous source distributions. The first requires summation over the source points, and the second requires integration over a continuous source density. Both problems require O(N²) work when computed directly; however, they can be accelerated to O(N) time using FMM. Our PVFMM software library uses the kernel-independent FMM, which allows us to compute potentials for a wide range of elliptic kernels. Our method is high order, adaptive, and scalable. In this paper, we discuss several algorithmic improvements and performance optimizations, including cache locality, vectorization, shared-memory parallelism, and use of coprocessors. Our distributed-memory implementation uses a space-filling curve for partitioning data and a hypercube communication scheme. We present convergence results for the Laplace, Stokes, and Helmholtz (low-wavenumber) kernels for both particle and volume FMM. We measure the efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels. We also demonstrate scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.
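The O(N²) baseline that PVFMM accelerates is the direct evaluation of u(xᵢ) = Σⱼ K(xᵢ, yⱼ) qⱼ over all source–target pairs. A minimal direct-summation sketch, with the free-space Laplace kernel as an example (illustrative code, not part of PVFMM):

```python
import numpy as np

def direct_potential(kernel, targets, sources, charges):
    """Direct O(N^2) evaluation of u(x_i) = sum_j K(x_i, y_j) q_j,
    the summation an FMM reduces to O(N)."""
    r = np.linalg.norm(targets[:, None, :] - sources[None, :, :], axis=2)
    return kernel(r) @ charges

def laplace3d(r):
    # Free-space Laplace kernel 1/(4*pi*r), with self-interactions zeroed.
    with np.errstate(divide="ignore"):
        K = 1.0 / (4.0 * np.pi * r)
    K[~np.isfinite(K)] = 0.0
    return K

rng = np.random.default_rng(2)
src = rng.random((200, 3))     # 200 point sources in the unit cube
q = rng.standard_normal(200)   # their charges
u = direct_potential(laplace3d, src, src, q)
```

For 200 points the dense pairwise matrix is trivial; at millions of points the quadratic cost is prohibitive, which is the regime where the O(N) FMM pays off.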
-
A New Parallel Kernel-Independent Fast Multipole Method
Conference on High Performance Computing (Supercomputing), 2003
Co-Authors: Lexing Ying, George Biros, Denis Zorin, Harper M. Langston
Abstract: We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernel-independent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions, but only utilizes kernel evaluations. The new method provides the enabling technology for many important problems in computational science and engineering; examples include viscous flows, fracture mechanics, and screened Coulombic interactions. Our MPI-based parallel implementation logically separates the computation and communication phases to avoid synchronization in the upward and downward computation passes, and thus allows us to fully overlap computation and communication. We measure isogranular and fixed-size scalability for a variety of kernels on the Pittsburgh Supercomputing Center's TCS-1 AlphaServer on up to 3000 processors. We have solved viscous flow problems with up to 2.1 billion unknowns, achieving 1.6 Tflops/s peak performance and 1.13 Tflops/s sustained performance.
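Kernel independence means the algorithm only ever calls the kernel as a black-box function of the pairwise distance, so swapping, say, the Laplace kernel for a screened Coulomb (Yukawa) kernel requires no new expansions. The sketch below shows this interface on the direct (non-accelerated) form; the evaluator, kernel names, and parameters are illustrative.

```python
import numpy as np

def pairwise_sum(kernel, points, charges):
    """Evaluate all pairwise interactions using only kernel evaluations —
    the black-box interface a kernel-independent FMM relies on
    (shown here in its direct O(N^2) form)."""
    r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(r, np.inf)          # exclude self-interaction
    return kernel(r) @ charges

laplace = lambda r: 1.0 / (4.0 * np.pi * r)
yukawa = lambda r: np.exp(-0.5 * r) / (4.0 * np.pi * r)  # screened Coulomb, k = 0.5

rng = np.random.default_rng(3)
pts = rng.random((100, 3))
q = np.ones(100)
u_lap = pairwise_sum(laplace, pts, q)
u_yuk = pairwise_sum(yukawa, pts, q)   # same evaluator, different kernel
```

An expansion-based FMM would need a new multipole theory for each kernel; here the only change is the callable passed in.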
Alexey Miroshnikov - One of the best experts on this subject based on the ideXlab platform.
-
Asymptotic properties of Parallel Bayesian Kernel density estimators
Annals of the Institute of Statistical Mathematics, 2019Co-Authors: Alexey Miroshnikov, Evgeny SavelevAbstract:In this article, we perform an asymptotic analysis of Bayesian Parallel Kernel density estimators introduced by Neiswanger et al. (in: Proceedings of the thirtieth conference on uncertainty in artificial intelligence, AUAI Press, pp 623–632, 2014 ). We derive the asymptotic expansion of the mean integrated squared error for the full data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
-
Asymptotic Properties of Parallel Bayesian Kernel Density Estimators
arXiv: Statistics Theory, 2016
Co-Authors: Alexey Miroshnikov, Evgeny Savelev
Abstract: In this article we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing (2014). We derive the asymptotic expansion of the mean integrated squared error for the full-data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.