GPU

The experts below are selected from a list of 99,630 experts worldwide, ranked by the ideXlab platform.

Jeremy Sweezy - One of the best experts on this subject based on the ideXlab platform.

  • A Monte Carlo Volumetric-Ray-Casting Estimator for Global Fluence Tallies on GPUs
    Journal of Computational Physics, 2018
    Co-Authors: Jeremy Sweezy
    Abstract:

    A Monte Carlo fluence estimator has been designed to take advantage of the computational power of graphics processing units (GPUs). This new estimator, termed the volumetric-ray-casting estimator, is an extension of the expectation estimator. It can be used as a replacement for the track-length estimator for the estimation of global fluence. Calculations for this estimator are performed on the GPU while the Monte Carlo random walk is performed on the central processing unit (CPU). This method lowers the implementation cost of GPU acceleration for existing Monte Carlo particle transport codes, as there is little modification of the particle-history logic flow. Three test problems have been evaluated to assess the performance of the volumetric-ray-casting estimator for neutron transport on GPU hardware against the standard track-length estimator on CPU hardware. Evaluation of neutron transport through air in a criticality-accident scenario showed that the volumetric-ray-casting estimator achieved 23 times the performance of the track-length estimator using a single-core CPU paired with a GPU, and 15 times using an eight-core CPU paired with a GPU. Simulation of a pressurized-water-reactor fuel assembly showed performance improvements of 6 times within the fuel and 7 times within the control rods using an eight-core CPU paired with a single GPU.
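
The division of labor described above, random walk on the CPU and ray casting on the GPU, can be sketched as a CUDA kernel. This is a minimal illustration rather than the paper's implementation: it assumes a uniform, single-energy-group voxel grid and uses fixed-step ray marching instead of exact voxel traversal, and all names are invented.

```cuda
// Sketch: one thread per (event, ray) pair marches through a voxel grid,
// accumulating an attenuated track-length contribution into each voxel.
#include <cuda_runtime.h>
#include <math.h>

struct Event {
    float x, y, z;     // collision/source site (from the CPU random walk)
    float ux, uy, uz;  // sampled ray direction (unit vector)
    float w;           // particle weight
};

__global__ void rayCastTally(const Event* ev, int nEvents,
                             const float* sigmaT,  // total cross section per voxel
                             float* fluence,       // global fluence tally per voxel
                             int nx, int ny, int nz, float h /* voxel width */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nEvents) return;
    Event e = ev[i];
    float tau = 0.0f;                // accumulated optical depth along the ray
    const float ds = 0.5f * h;       // fixed marching step (coarse approximation)
    for (int s = 0; s < 4 * (nx + ny + nz); ++s) {
        float x = e.x + s * ds * e.ux;
        float y = e.y + s * ds * e.uy;
        float z = e.z + s * ds * e.uz;
        if (x < 0.f || y < 0.f || z < 0.f) break;          // left the mesh
        int ix = (int)(x / h), iy = (int)(y / h), iz = (int)(z / h);
        if (ix >= nx || iy >= ny || iz >= nz) break;       // left the mesh
        int v = (iz * ny + iy) * nx + ix;
        tau += sigmaT[v] * ds;
        // Expectation-estimator style contribution: step length attenuated
        // by the probability exp(-tau) of reaching this voxel uncollided.
        atomicAdd(&fluence[v], e.w * expf(-tau) * ds);
    }
}
```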

Onur Mutlu - One of the best experts on this subject based on the ideXlab platform.

  • Managing GPU Concurrency in Heterogeneous Architectures
    47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014
    Co-Authors: Onur Kayiran, Nachiappan Chidambaram Nachiappan, Rachata Ausavarungnirun, Mahmut T. Kandemir, Onur Mutlu
    Abstract:

    Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.
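
The TLP-modulation idea lends itself to a small control-loop sketch. The following is illustrative logic in the spirit of CM-CPU/CM-BAL, not the paper's algorithm; the metrics, thresholds, and step sizes are all placeholders.

```cuda
// Illustrative epoch-based controller: raise or lower the number of
// schedulable warps per GPU core from congestion and stall statistics.
#include <algorithm>

struct EpochStats {
    float memCongestion;  // e.g., fraction of cycles memory queues are full
    float netCongestion;  // e.g., fraction of cycles the reply network stalls
    float gpuStallRatio;  // fraction of cycles GPU cores cannot issue
};

int updateActiveWarps(int activeWarps, int minWarps, int maxWarps,
                      const EpochStats& s, bool balancedMode /* CM-BAL-like */)
{
    const float HIGH = 0.7f, LOW = 0.3f;  // placeholder thresholds
    if (s.memCongestion > HIGH || s.netCongestion > HIGH) {
        if (balancedMode && s.gpuStallRatio > HIGH) {
            // Balanced check: GPU cores can no longer hide latency, so give
            // some TLP back even though the shared resources are congested.
            return std::min(maxWarps, activeWarps + 1);
        }
        // System-wide congestion: throttle GPU TLP so CPU requests get through.
        return std::max(minWarps, activeWarps - 2);
    }
    if (s.memCongestion < LOW && s.netCongestion < LOW)
        return std::min(maxWarps, activeWarps + 1);  // headroom: restore throughput
    return activeWarps;
}
```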

  • Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
    International Symposium on Computer Architecture, 2012
    Co-Authors: Rachata Ausavarungnirun, Kevin K Chang, Lavanya Subramanian, Gabriel H Loh, Onur Mutlu
    Abstract:

    When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
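
The three-stage decoupling can be sketched structurally. The following C++ model is illustrative only (the real SMS is a hardware design); the field names and the batching rule are simplified from the description above.

```cuda
// Structural sketch of the three SMS stages as a software model:
// 1) per-source batch formation by row-buffer locality,
// 2) a batch scheduler that drains one source's row-hit batch at a time,
// 3) simple in-order per-bank FIFOs with no further reordering.
#include <deque>
#include <vector>

struct Req { int source; long row; int bank; };

struct StagedMemoryScheduler {
    std::vector<std::deque<Req>> batchQueues;  // stage 1: one per source (CPU core or GPU)
    std::vector<std::deque<Req>> bankFifos;    // stage 3: one per DRAM bank
    float cpuPriorityProb;  // stage-2 knob: probability of favoring CPU sources
                            // over the GPU (the picking policy itself not shown)

    // Stage 1: a batch is the run of same-row requests at a source queue's head.
    bool batchReady(int src) const { return !batchQueues[src].empty(); }

    // Stage 2: once a source is picked, drain its current row-hit batch into
    // the bank FIFOs, preserving the row-buffer locality the batch captured.
    void scheduleOneBatch(int src) {
        long row = batchQueues[src].front().row;
        while (!batchQueues[src].empty() && batchQueues[src].front().row == row) {
            Req r = batchQueues[src].front();
            batchQueues[src].pop_front();
            bankFifos[r.bank].push_back(r);  // stage 3: FIFO, in-order per bank
        }
    }
};
```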

Ziliang Zong - One of the best experts on this subject based on the ideXlab platform.

  • Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU
    International Conference on Parallel Processing, 2013
    Co-Authors: Rong Ge, Ryan Vogt, Jahangir A Majumder, Arif Alam, Martin Burtscher, Ziliang Zong
    Abstract:

    Improving energy efficiency is an ongoing challenge in HPC because of the ever-increasing need for performance coupled with power and economic constraints. Though GPU-accelerated heterogeneous computing systems are capable of delivering impressive performance, it is necessary to explore all available power-aware technologies to meet the inevitable energy-efficiency challenge. In this paper, we experimentally study the impacts of DVFS on application performance and energy efficiency for GPU computing and compare them with those of DVFS for CPU computing. Based on a power-aware heterogeneous system that includes dual Intel Sandy Bridge CPUs and the latest Nvidia K20c Kepler GPU, the study provides numerous new insights into the general trends and exceptions of DVFS for GPU computing. In general, the effects of DVFS on a GPU differ from those of DVFS on a CPU. For example, on a GPU running compute-bound high-performance and high-throughput workloads, the system performance and the power consumption are approximately proportional to the GPU frequency. Hence, within a permissible power limit, increasing the GPU frequency leads to better performance without incurring a noticeable increase in energy. This paper further provides detailed analytical explanations of the causes of the observed trends and exceptions. The findings presented in this paper have the potential to impact future CPU and GPU architectures to achieve better energy efficiency, and they point out directions for designing effective DVFS schedulers for heterogeneous systems.
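
On NVML-capable GPUs such as the K20c, application clocks can be swept from software, which is one plausible way to drive such a DVFS study; the paper's exact methodology is not shown here. A minimal sketch (error checking omitted; setting application clocks typically requires administrative privileges and a GPU that supports it):

```cuda
// Sweep GPU core frequencies at a fixed memory clock via NVML and read
// instantaneous board power at each setting.
#include <nvml.h>
#include <stdio.h>

void sweepGpuClocks(unsigned int memMHz, const unsigned int* coreMHz, int n)
{
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < n; ++i) {
        nvmlDeviceSetApplicationsClocks(dev, memMHz, coreMHz[i]);
        // ... launch and time the workload here, e.g. with CUDA events ...
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);  // board power in milliwatts
        printf("core %u MHz: %.1f W\n", coreMHz[i], mw / 1000.0);
    }
    nvmlDeviceResetApplicationsClocks(dev);  // restore default clocks
    nvmlShutdown();
}
```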

J. Kanzaki - One of the best experts on this subject based on the ideXlab platform.

  • Monte Carlo Integration on GPU
    European Physical Journal C, 2011
    Co-Authors: J. Kanzaki
    Abstract:

    We use a graphics processing unit (GPU) for fast computations of Monte Carlo integrations. Two widely used Monte Carlo integration programs, VEGAS and BASES, are parallelized to run on a GPU. Using W plus multi-gluon production processes at the LHC, we compare the integrated cross sections and execution times of programs written in FORTRAN running on the CPU with those running on a GPU. The integrated results agree with each other within statistical errors. The programs run about 50 times faster on the GPU than on the CPU.
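
The thread-per-sample mapping behind such speedups can be shown with a plain Monte Carlo kernel. This sketch uses uniform sampling in one dimension; VEGAS layers adaptive importance sampling on top of this and is not reproduced here. The integrand and all names are illustrative.

```cuda
// Each thread draws its own samples, accumulates a partial sum of the
// integrand, and adds it to a global accumulator.
#include <curand_kernel.h>

__device__ float f(float x) { return expf(-x * x); }  // example integrand

__global__ void mcIntegrate(float a, float b, int samplesPerThread,
                            unsigned long long seed, float* sum)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);  // independent stream per thread
    float acc = 0.0f;
    for (int i = 0; i < samplesPerThread; ++i) {
        float x = a + (b - a) * curand_uniform(&st);
        acc += f(x);
    }
    atomicAdd(sum, acc);  // combine partial sums across all threads
}
// Host side: integral estimate = (b - a) * sum / (totalThreads * samplesPerThread)
```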

Berry François - One of the best experts on this subject based on the ideXlab platform.

  • Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?
    2021
    Co-Authors: Carballo-hernández Walther, Pelcat Maxime, Berry François
    Abstract:

    Graphics Processing Units (GPUs) are currently the dominant programmable architecture for Deep Learning (DL) accelerators. The adoption of Field Programmable Gate Arrays (FPGAs) in DL accelerators is, however, gaining momentum. In this paper, we demonstrate that Direct Hardware Mapping (DHM) of a Convolutional Neural Network (CNN) on an embedded FPGA substantially outperforms a GPU implementation in terms of energy efficiency and execution time. However, DHM is highly resource-intensive and cannot fully substitute for the GPU when implementing a state-of-the-art CNN. We thus propose a hybrid FPGA-GPU DL acceleration method and demonstrate that heterogeneous acceleration outperforms GPU acceleration even when communication overheads are included. Experiments are conducted on a heterogeneous multi-platform setup embedding an Nvidia(R) Jetson TX2 CPU-GPU board and an Intel(R) Cyclone10GX FPGA board, using the mobile-oriented CNNs SqueezeNet, MobileNetv2, and ShuffleNetv2. We show that heterogeneous FPGA-GPU acceleration outperforms GPU acceleration for the classification inference task on MobileNetv2 (12%-30% energy reduction, 4%-26% latency reduction), SqueezeNet (21%-28% energy reduction, same latency), and ShuffleNetv2 (25% energy reduction, 21% latency reduction). Presented at the DATE Friday Workshop on System-level Design Methods for Deep Learning on Heterogeneous Architectures (SLOHA 2021), arXiv:2102.00818.
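
The layer-partitioning trade-off can be illustrated with a toy cost model: run the first layers on the FPGA via DHM, the rest on the GPU, and pay one transfer at the boundary. The paper measures real hardware rather than using such a model; this sketch and all of its cost terms are assumptions for illustration.

```cuda
// Toy model: exhaustively choose the split point that minimizes total
// per-inference cost (latency or energy) including one boundary transfer.
#include <vector>
#include <cfloat>

struct LayerCost { float fpga; float gpu; };  // per-layer cost on each device

// Layers [0, split) run on the FPGA via direct hardware mapping, layers
// [split, n) on the GPU; xferAtBoundary[split] is the cost of moving the
// intermediate feature map between devices at that boundary.
int bestSplit(const std::vector<LayerCost>& layers,
              const std::vector<float>& xferAtBoundary)
{
    int best = 0;
    float bestCost = FLT_MAX;
    for (size_t split = 0; split <= layers.size(); ++split) {
        float cost = (split > 0 && split < layers.size())
                         ? xferAtBoundary[split] : 0.0f;  // no transfer if single-device
        for (size_t i = 0; i < layers.size(); ++i)
            cost += (i < split) ? layers[i].fpga : layers[i].gpu;
        if (cost < bestCost) { bestCost = cost; best = (int)split; }
    }
    return best;  // 0 = GPU only, layers.size() = FPGA only
}
```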
