Kernel Execution Time

The experts below are selected from a list of 156 experts worldwide, ranked by the ideXlab platform.

Xu Wang - One of the best experts on this subject based on the ideXlab platform.

  • Computing prestack Kirchhoff time migration on general purpose GPU
    Computers & Geosciences, 2011
    Co-Authors: Xiaohua Shi, Shihu Wang, Xu Wang
    Abstract:

    This paper describes how to optimize a practical prestack Kirchhoff time migration program with the Compute Unified Device Architecture (CUDA) on a general-purpose GPU (GPGPU). Several useful GPGPU optimization methods are demonstrated, such as how to increase the number of kernel threads on the GPU cores and how to use memory streams to overlap GPU kernel execution time with data transfers. The floating-point errors on CUDA and NVIDIA's GPUs are discussed in detail, and effective methods for reducing these errors are introduced. The images generated by the practical prestack Kirchhoff time migration programs on CPU and GPU for the same real-world seismic data inputs are compared. The final GPGPU version on an NVIDIA GTX 260 is more than 17 times faster than the original CPU version on a 3.0 GHz Intel Pentium 4.
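
    The stream-based overlap mentioned above is a standard CUDA pattern: the input is split into chunks, and asynchronous copies and kernel launches are issued on separate streams so transfers for one chunk hide behind computation on another. Below is a minimal sketch of that pattern; the kernel body, chunking scheme, and names such as migrate_chunk are illustrative assumptions, not the authors' code.

    #include <cuda_runtime.h>

    // Illustrative stand-in for one chunk of migration work (not the paper's kernel).
    __global__ void migrate_chunk(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];  // placeholder arithmetic
    }

    // Cycle chunks through a small pool of CUDA streams so host-to-device
    // copies, kernel execution, and device-to-host copies overlap.
    void run_overlapped(const float* h_in, float* h_out, int total, int chunk)
    {
        const int kStreams = 4;
        cudaStream_t streams[kStreams];
        float *d_in[kStreams], *d_out[kStreams];
        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&d_in[s], chunk * sizeof(float));
            cudaMalloc(&d_out[s], chunk * sizeof(float));
        }
        for (int off = 0, s = 0; off < total; off += chunk, s = (s + 1) % kStreams) {
            int n = (total - off < chunk) ? (total - off) : chunk;
            // h_in/h_out must be pinned (cudaMallocHost) for truly asynchronous copies.
            cudaMemcpyAsync(d_in[s], h_in + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            migrate_chunk<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
            cudaMemcpyAsync(h_out + off, d_out[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < kStreams; ++s) {
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
            cudaFree(d_in[s]);
            cudaFree(d_out[s]);
        }
    }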

  • APPT - A Practical Approach of Curved Ray Prestack Kirchhoff Time Migration on GPGPU
    Lecture Notes in Computer Science, 2009
    Co-Authors: Xiaohua Shi, Xu Wang
    Abstract:

    We introduce four prototypes of general-purpose GPU solutions, built with the Compute Unified Device Architecture (CUDA) on the NVIDIA GeForce 8800GT and Tesla C870, for a practical curved ray prestack Kirchhoff time migration program, one of the most widely adopted imaging methods in the seismic data processing industry. We show how to redesign and reimplement the original CPU code as efficient GPU code step by step. We demonstrate optimization methods that improve runtime performance on GPUs, such as how to reduce the overhead of memory transfers over the PCI-E bus, how to significantly increase the number of kernel threads on the GPU cores, how to buffer the inputs and outputs of CUDA kernel modules, and how to use memory streams to overlap GPU kernel execution time with data transfers. We analyze the floating-point errors between CPUs and GPUs, and we present the images generated by the CPU and GPU programs for the same real-world seismic data inputs. Our final approach, Prototype-IV on the NVIDIA GeForce 8800GT, is 16.3 times faster than its CPU version on a 3.0 GHz Intel Pentium 4.
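
    Two of the techniques listed above, buffering kernel inputs/outputs and reducing PCI-E transfer overhead, are commonly combined by staging data in page-locked (pinned) host buffers: one large copy from pinned memory is much faster than many small pageable copies, and it can proceed asynchronously. The sketch below illustrates the idea; the structure and names are hypothetical, not taken from the paper.

    #include <cuda_runtime.h>
    #include <cstring>

    // Stage many small inputs in a pinned host buffer, then ship them to the
    // device in one large asynchronous copy over the PCI-E bus.
    struct TraceBuffer {
        float* pinned;     // page-locked host staging area
        float* device;     // device-side destination
        size_t capacity;   // number of floats the buffer holds

        void init(size_t n) {
            capacity = n;
            cudaMallocHost(&pinned, n * sizeof(float));  // pinned allocation
            cudaMalloc(&device, n * sizeof(float));
        }
        void upload(const float* src, size_t n, cudaStream_t s) {
            std::memcpy(pinned, src, n * sizeof(float));   // cheap host-side copy
            cudaMemcpyAsync(device, pinned, n * sizeof(float),
                            cudaMemcpyHostToDevice, s);    // fast async transfer
        }
        void destroy() {
            cudaFreeHost(pinned);
            cudaFree(device);
        }
    };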

Xiaohua Shi - One of the best experts on this subject based on the ideXlab platform.

  • Computing prestack Kirchhoff time migration on general purpose GPU (Computers & Geosciences, 2011)

  • APPT - A Practical Approach of Curved Ray Prestack Kirchhoff Time Migration on GPGPU (Lecture Notes in Computer Science, 2009)

    Both entries are co-authored with Xu Wang and listed with their abstracts under Xu Wang above.

Anthony H. Aletras - One of the best experts on this subject based on the ideXlab platform.

  • Quantitative measurements using the modified version of the MOLLI pulse sequence and the SQUAREMR method.
    2019
    Co-Authors: Christos G. Xanthis, Anthony H. Aletras
    Abstract:

    T1 and T2 maps of a healthy volunteer extracted with the modified version of the MOLLI pulse sequence and the SQUAREMR method [15]. The kernel execution time on a p2.xlarge instance for generating the database of simulated MR signals was 154 s.
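
    The captions in this group each report a kernel execution time on an AWS p2 GPU instance. For context, the sketch below shows one common way such a figure is measured with CUDA events; it is a generic timing pattern, not the simulator's actual instrumentation, and simulate_signals is a hypothetical placeholder kernel.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical placeholder for the simulation workload being timed.
    __global__ void simulate_signals(float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = __sinf(0.001f * i);  // placeholder computation
    }

    int main()
    {
        const int n = 1 << 20;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));

        // CUDA events bracket the kernel and measure device-side elapsed time.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        simulate_signals<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel execution time: %.1f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_out);
        return 0;
    }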

  • Multiple-element receiver coil.
    2019
    Co-Authors: Christos G. Xanthis, Anthony H. Aletras
    Abstract:

    A–H) Noiseless GRE images per coil element after the application of the 512x512 GRE pulse sequence on the entire volume of the McGill human brain phantom. I) The final noiseless reconstructed GRE image after combining all the coil elements. The images were reconstructed on Gadgetron. The kernel execution time on a p2.16xlarge instance was 1177 s.

  • Noiseless GRE images obtained with the simulator after the application of the 512x512 GRE pulse sequence.
    2019
    Co-Authors: Christos G. Xanthis, Anthony H. Aletras
    Abstract:

    A) Only on the spins located at the z-position of the isocenter, B) only on the spins located ±3 mm (slice thickness 6 mm) away from the z-position of the isocenter, and C) on the entire volume of the anatomical model. The kernel execution time on a p2.xlarge instance was 292 s for case A, 451 s for case B, and 4,801 s for case C. Note the different appearance of the simulated MR images for the different volumes of spins of the same anatomical model.

  • Simulated MR images after the application of GRE pulse sequences on the entire volume of the McGill anatomical model of the human brain and reconstructed on Gadgetron.
    2019
    Co-Authors: Christos G. Xanthis, Anthony H. Aletras
    Abstract:

    The left column presents the simulated MR images from a 128x128 GRE pulse sequence, the center column from a 256x256 GRE pulse sequence, and the right column from a 512x512 GRE pulse sequence. The first row shows the GRE images with no noise, the second row shows the same images with zero-mean Gaussian noise of standard deviation 20, and the third row with zero-mean Gaussian noise of standard deviation 40. The contrast of each image was adjusted to fit the image data range, with a 2% padding at the upper and lower bounds. The kernel execution time on a p2.16xlarge instance was 99 s for the 128x128 sequence, 209 s for the 256x256 sequence, and 457 s for the 512x512 sequence.
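
    Adding zero-mean Gaussian noise of a chosen standard deviation to a simulated image, as described above, maps naturally to one GPU thread per pixel. A minimal sketch using cuRAND's device API follows; the kernel name and per-pixel seeding scheme are illustrative assumptions, not the simulator's implementation.

    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    // Add zero-mean Gaussian noise with standard deviation `sigma` to each pixel.
    __global__ void add_gaussian_noise(float* img, int n, float sigma,
                                       unsigned long long seed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandState st;
        curand_init(seed, i, 0, &st);          // independent RNG stream per pixel
        img[i] += sigma * curand_normal(&st);  // sample from N(0, sigma^2)
    }

    // Example launch for a 512x512 image with sigma = 20, as in the second row:
    // add_gaussian_noise<<<(512*512 + 255) / 256, 256>>>(d_img, 512*512, 20.0f, 1234ULL);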

  • Magnetization transfer.
    2019
    Co-Authors: Christos G. Xanthis, Anthony H. Aletras
    Abstract:

    (Left) Signal intensity curves for the three agar computer phantoms: agar 2% on the left, agar 4% in the center, and agar 8% on the right. The blue curve shows the signal intensity without simulation of the MT model, whereas the orange curve shows the corresponding signal intensity with simulation of the MT model. Note the change in the shapes of the inversion recovery curves due to MT. (Right) Magnitude images of the computer model of three cylinders after the application of a MOLLI pulse sequence for inversion times of 1200 ms and 2300 ms. The bottom left cylinder represents an agar phantom with 2% macromolecular content, the bottom right cylinder one with 4% macromolecular content, and the top cylinder one with 8% macromolecular content. The kernel execution time on a p2.xlarge instance was 105 s without simulation of the MT model and 137 s with it.

Theodoros Christoudias - One of the best experts on this subject based on the ideXlab platform.

  • GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)
    Geoscientific Model Development, 2017
    Co-Authors: Michail Alvanos, Theodoros Christoudias
    Abstract:

    Abstract. This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate–chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported. Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4.5× and 20.4×, respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1.75× speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance, including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparison with the CPU-only code of the application; the median relative difference is found to be less than 0.000000001 % when comparing the output of the accelerated kernel to the CPU-only code. The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.
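
    The computational workload division mentioned above follows the natural parallelism of chemical kinetics: grid cells are independent, so each GPU thread can integrate the chemistry of one cell. The sketch below illustrates that mapping with a deliberately simplified explicit Euler step standing in for the KPP Rosenbrock solver; the species count, rate expressions, and kernel name are assumptions for illustration only.

    #include <cuda_runtime.h>

    #define NSPEC 4  // assumed number of chemical species (illustrative)

    // Illustrative ODE right-hand side; a real KPP-generated mechanism is far
    // larger and is emitted by the source-to-source compiler.
    __device__ void rhs(const float c[NSPEC], float dcdt[NSPEC])
    {
        dcdt[0] = -0.10f * c[0];
        dcdt[1] =  0.10f * c[0] - 0.05f * c[1];
        dcdt[2] =  0.05f * c[1];
        dcdt[3] =  0.0f;
    }

    // One thread integrates the chemistry of one grid cell. A simple explicit
    // Euler loop stands in here for the Rosenbrock solvers used in the paper.
    __global__ void integrate_cells(float* conc, int ncells, float dt, int nsteps)
    {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= ncells) return;

        float c[NSPEC], dcdt[NSPEC];
        for (int s = 0; s < NSPEC; ++s)
            c[s] = conc[cell * NSPEC + s];  // load this cell's concentrations

        for (int step = 0; step < nsteps; ++step) {
            rhs(c, dcdt);
            for (int s = 0; s < NSPEC; ++s)
                c[s] += dt * dcdt[s];
        }

        for (int s = 0; s < NSPEC; ++s)
            conc[cell * NSPEC + s] = c[s];  // write the result back
    }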

  • GPU accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)
    2017
    Co-Authors: Michail Alvanos, Theodoros Christoudias
    Abstract:

    Abstract. This paper presents an application of GPU accelerators in Earth system modelling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate-chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic Pre-Processor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported. Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4.5× and 22.4×, respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1.75× speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance, including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparison with the CPU-only version of the application. The relative difference is found to be less than 0.00005 % when comparing the output of the accelerated kernel to the CPU-only code, within the target level of relative accuracy (relative error tolerance) of 0.1 %. The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.
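
    Both versions of this abstract validate the accelerated kernel by comparing its output against the CPU-only run under a relative error tolerance. A host-side sketch of such a check is shown below; it is a generic verification pattern with assumed array names, not the project's actual test harness.

    #include <cmath>
    #include <cstdio>

    // Maximum relative difference between CPU and GPU result arrays, guarding
    // against division by near-zero reference values.
    double max_relative_difference(const double* cpu, const double* gpu, int n)
    {
        double worst = 0.0;
        for (int i = 0; i < n; ++i) {
            double denom = std::fabs(cpu[i]);
            if (denom < 1e-30) continue;  // skip near-zero reference values
            double rel = std::fabs(gpu[i] - cpu[i]) / denom;
            if (rel > worst) worst = rel;
        }
        return worst;
    }

    // Usage with assumed names: flag outputs outside the 0.1 % tolerance.
    // if (max_relative_difference(cpu_out, gpu_out, n) > 1e-3)
    //     printf("accelerated kernel output outside tolerance\n");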

Christos G. Xanthis - One of the best experts on this subject based on the ideXlab platform.

  • Quantitative measurements using the modified version of the MOLLI pulse sequence and the SQUAREMR method. (2019)

  • Multiple-element receiver coil. (2019)

  • Noiseless GRE images obtained with the simulator after the application of the 512x512 GRE pulse sequence. (2019)

  • Simulated MR images after the application of GRE pulse sequences on the entire volume of the McGill anatomical model of the human brain and reconstructed on Gadgetron. (2019)

  • Magnetization transfer. (2019)

    All five entries are co-authored with Anthony H. Aletras; their abstracts appear verbatim under Anthony H. Aletras above.