Kernel Execution

The experts below are selected from a list of 10,290 experts worldwide, ranked by the ideXlab platform.

Huiyang Zhou - One of the best experts on this subject based on the ideXlab platform.

  • Fair and cache blocking aware warp scheduling for concurrent Kernel Execution on GPU
    Future Generation Computer Systems, 2020
    Co-Authors: Chen Zhao, Fei Wang, Wu Gao, Feiping Nie, Huiyang Zhou
    Abstract:

    With graphics processing units (GPUs) being widely adopted in data centers to provide computing power, efficient support for GPU multitasking has attracted significant attention. Prior GPU multitasking work includes spatial multitasking and simultaneous multitasking (SMK). Spatial multitasking allocates GPU resources at the granularity of a streaming multiprocessor (SM) and is therefore coarse-grained, whereas SMK runs concurrent kernels on the same SM and is fine-grained. SMK improves GPU resource utilization, especially when concurrent kernels have complementary characteristics. The main challenge for SMK, however, is interference among the kernels, especially contention for the data cache. In this paper, we propose a fair and cache blocking aware warp scheduling (FCBWS) approach to reduce data-cache contention and improve SMK on GPUs. FCBWS gives each kernel an equal opportunity to issue instructions and avoids memory pipeline stalls by predicting cache blocking. Kernels extracted from various applications are combined to construct concurrent kernel execution benchmarks. Simulation results show that FCBWS outperforms previous multitasking methods; even compared to the state-of-the-art SMK method, FCBWS improves system throughput (STP) by 10% on average and reduces average normalized turnaround time (ANTT) by 41% on average.
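
    To illustrate the scheduling idea, here is a minimal software model under our own assumptions (the real FCBWS is a hardware warp scheduler, and its cache-blocking predictor is more elaborate than the per-kernel flag used here): issue opportunities rotate round-robin across kernels, and a warp whose next instruction is a memory access predicted to block the cache is skipped for that cycle.

        #include <cstdio>
        #include <vector>

        struct Warp {
            int  kernel_id;
            bool next_is_mem;   // next instruction is a memory access
        };

        // Placeholder for the paper's cache-blocking predictor: a real design
        // would derive this from MSHR occupancy and miss behavior (assumption).
        bool predicted_cache_blocked(int kernel_id, const std::vector<bool> &blocked) {
            return blocked[kernel_id];
        }

        // One scheduling cycle: rotate fairly over kernels, then over that
        // kernel's warps, skipping warps expected to stall the memory pipeline.
        int pick_warp(const std::vector<Warp> &warps, int num_kernels,
                      int &rr_kernel, const std::vector<bool> &blocked) {
            for (int k = 0; k < num_kernels; ++k) {
                int kid = (rr_kernel + k) % num_kernels;
                for (int w = 0; w < (int)warps.size(); ++w) {
                    const Warp &wp = warps[w];
                    if (wp.kernel_id != kid) continue;
                    if (wp.next_is_mem && predicted_cache_blocked(kid, blocked))
                        continue;                      // avoid a predicted stall
                    rr_kernel = (kid + 1) % num_kernels;  // advance fairness pointer
                    return w;
                }
            }
            return -1;                                 // nothing issuable this cycle
        }

        int main() {
            std::vector<Warp> warps = { {0, true}, {0, false}, {1, true}, {1, false} };
            std::vector<bool> blocked = { true, false };  // kernel 0 predicted blocked
            int rr = 0;
            int w = pick_warp(warps, 2, rr, blocked);
            printf("issued warp %d (kernel %d)\n", w, warps[w].kernel_id);
            return 0;
        }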

  • Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution
    ACM Transactions on Architecture and Code Optimization, 2019
    Co-Authors: Zhen Lin, Hongwen Dai, Michael Mantor, Huiyang Zhou
    Abstract:

    Contemporary GPUs allow multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel execution (CKE) improves both resource utilization and computational throughput. Most prior work focuses on partitioning GPU resources at the cooperative thread array (CTA) level or the warp scheduler level to improve CKE. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. The reason is that bandwidth over-subscription by bandwidth-intensive kernels greatly aggravates memory access latency, which is highly detrimental to latency-sensitive kernels. Even among bandwidth-intensive kernels, the more intensive ones may unfairly consume much more bandwidth than the less intensive ones. In this article, we first make the case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. We then propose a coordinated approach to CTA combination and bandwidth partitioning. Our approach dynamically classifies co-running kernels as latency-sensitive or bandwidth-intensive. As both DRAM bandwidth and L2-to-L1 network-on-chip (NoC) bandwidth can be the critical resource, our approach partitions both bandwidth resources in a coordinated manner while selecting proper CTA combinations. The key objective is to allocate more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth to NoC-/DRAM-intensive kernels. We achieve this using a variation of dominant resource fairness (DRF). Compared with two state-of-the-art CKE optimization schemes, SMK [52] and WS [55], our approach improves the average harmonic speedup by 78% and 39%, respectively. Even compared to the best possible CTA combinations, obtained from an exhaustive search over all CTA combinations, our approach improves the harmonic speedup by up to 51% and by 11% on average.
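
    As a worked illustration of the dominant resource fairness idea mentioned above (a minimal sketch with invented per-kernel bandwidth demands, not the paper's allocator), the progressive-filling loop below repeatedly grants a unit of work to whichever co-running kernel currently has the smallest dominant share of NoC or DRAM bandwidth, which is the classic DRF allocation rule.

        #include <cstdio>

        // Per-kernel demand for one unit of work: fractions of total NoC and
        // DRAM bandwidth consumed per scheduled CTA (illustrative numbers).
        struct Demand { double noc, dram; };

        int main() {
            Demand k[2] = { {0.04, 0.01},    // kernel A: NoC-intensive
                            {0.01, 0.05} };  // kernel B: DRAM-intensive
            double noc_used = 0, dram_used = 0;
            double dom_share[2] = {0, 0};    // each kernel's dominant share
            int units[2] = {0, 0};

            // Progressive filling: grant the next unit to the kernel with the
            // smallest dominant share, until one resource would be exhausted.
            while (true) {
                int i = (dom_share[0] <= dom_share[1]) ? 0 : 1;
                if (noc_used + k[i].noc > 1.0 || dram_used + k[i].dram > 1.0) break;
                noc_used  += k[i].noc;
                dram_used += k[i].dram;
                ++units[i];
                double s_noc  = units[i] * k[i].noc;
                double s_dram = units[i] * k[i].dram;
                dom_share[i] = (s_noc > s_dram) ? s_noc : s_dram;
            }
            printf("kernel A: %d units (dominant share %.2f)\n", units[0], dom_share[0]);
            printf("kernel B: %d units (dominant share %.2f)\n", units[1], dom_share[1]);
            return 0;
        }

    With these demands, the NoC-intensive kernel ends up holding most of the NoC bandwidth and the DRAM-intensive kernel most of the DRAM bandwidth, with roughly equal dominant shares.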

  • Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
    2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018
    Co-Authors: Hongwen Dai, Zhen Lin, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou
    Abstract:

    Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; however, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Intra-SM sharing has therefore been proposed to issue thread blocks from different kernels to the same SM. However, as shown in this study, overall performance may be undermined in intra-SM sharing schemes due to severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, will affect the other kernels, further hurting overall performance. In this study, we investigate approaches to overcome these problems in intra-SM sharing. We first highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: balancing the memory accesses of concurrent kernels, and limiting the number of in-flight memory instructions issued by individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
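
    The second approach can be illustrated with a small token-counter sketch (our own simplification with an invented quota; the paper's mechanism lives in the SM's scheduling hardware): a kernel may issue a memory instruction only while its number of in-flight memory accesses is below a per-kernel cap, so one memory-intensive kernel cannot monopolize the memory pipeline.

        #include <cstdio>

        // Per-kernel cap on outstanding memory instructions (invented value).
        const int kMaxInflight = 4;

        struct KernelState { int inflight = 0; };

        // Called when the scheduler considers a memory instruction from `k`:
        // deny the issue if the kernel already saturates its quota.
        bool try_issue_mem(KernelState &k) {
            if (k.inflight >= kMaxInflight) return false;
            ++k.inflight;
            return true;
        }

        // Called when a memory access from `k` completes and frees its slot.
        void complete_mem(KernelState &k) { --k.inflight; }

        int main() {
            KernelState mem_heavy;
            int issued = 0, denied = 0;
            for (int i = 0; i < 10; ++i) {   // kernel tries 10 back-to-back loads
                if (try_issue_mem(mem_heavy)) ++issued; else ++denied;
            }
            printf("issued %d, throttled %d\n", issued, denied);
            complete_mem(mem_heavy);          // one completion frees one slot
            printf("after completion, issue allowed: %d\n", try_issue_mem(mem_heavy));
            return 0;
        }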

  • POSTER: Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
    International Conference on Parallel Architectures and Compilation Techniques, 2017
    Co-Authors: Hongwen Dai, Zhen Lin, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou
    Abstract:

    In this study, we demonstrate that performance may be undermined in state-of-the-art intra-SM sharing schemes for concurrent kernel execution (CKE) on GPUs due to interference among concurrent kernels. We highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose to balance memory accesses and limit the number of in-flight memory instructions issued by concurrent kernels to reduce memory pipeline stalls. Our proposed schemes significantly improve the performance of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK.

Frank Cordes - One of the best experts on this subject based on the ideXlab platform.

  • Concurrent Kernel Execution on Xeon Phi within Parallel Heterogeneous Workloads
    European Conference on Parallel Processing, 2014
    Co-Authors: Florian Wende, Thomas Steinke, Frank Cordes
    Abstract:

    Computations with a sufficient amount of parallelism and a sufficient workload size can take advantage of many-core coprocessors, whereas small-scale workloads usually suffer from poor utilization of the coprocessor resources. For parallel applications with many small computational kernels, concurrent processing on a shared coprocessor may be a viable solution. We evaluate the Xeon Phi offload models Intel LEO and OpenMP4 within multi-threaded and multi-process host applications with concurrent coprocessor offloading. Limitations of OpenMP4 regarding data persistence across function calls, e.g., when used within libraries, can slow down the application. We propose an offload-proxy approach for OpenMP4 that recovers the performance in these cases. For concurrent kernel execution, we demonstrate the performance of the different offload models and of our offload-proxy using synthetic kernels and a parallel hybrid CPU/Xeon Phi molecular simulation application.

  • Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture
    2014
    Co-Authors: Florian Wende, Thomas Steinke, Frank Cordes
    Abstract:

    Small-scale computations usually cannot fully utilize the compute capabilities of modern GPGPUs. With the Fermi GPU architecture, Nvidia introduced the concurrent kernel execution feature, which allows up to 16 GPU kernels to execute simultaneously on a shared GPU device for better utilization of its resources. Insufficient scheduling capabilities, however, can keep the achieved concurrency well below that theoretical level. With the Kepler GPU architecture, Nvidia addresses this issue by introducing the Hyper-Q feature with 32 hardware-managed work queues for concurrent kernel execution. We investigate the Hyper-Q feature within heterogeneous workloads with multiple concurrent host threads or processes, each offloading computations to the GPU. Using a synthetic benchmark kernel and a hybrid parallel CPU-GPU real-world application, we evaluate the performance obtained with Hyper-Q and compare it against a kernel reordering mechanism introduced by the authors for the Fermi architecture.
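
    To make the stream-based CKE mechanism concrete, the following minimal CUDA sketch (our illustration, not code from the paper; busy_kernel, the buffer sizes, and the kernel count are invented) launches several small kernels into separate CUDA streams, which is the software-side prerequisite for Fermi's 16-way and Kepler's 32-queue Hyper-Q concurrency.

        #include <cuda_runtime.h>

        // Illustrative dummy kernel: each launch does independent work.
        __global__ void busy_kernel(float *data, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                for (int k = 0; k < 1000; ++k)
                    data[i] = data[i] * 0.999f + 0.001f;
        }

        int main() {
            const int kNumKernels = 8;   // small-scale workloads to overlap
            const int kN = 1 << 14;      // each too small to fill the GPU alone
            float *buf[kNumKernels];
            cudaStream_t streams[kNumKernels];

            for (int s = 0; s < kNumKernels; ++s) {
                cudaMalloc(&buf[s], kN * sizeof(float));
                cudaStreamCreate(&streams[s]);
            }
            // Kernels in different streams are eligible to run concurrently;
            // with Hyper-Q each stream can map to its own hardware work queue.
            for (int s = 0; s < kNumKernels; ++s)
                busy_kernel<<<(kN + 255) / 256, 256, 0, streams[s]>>>(buf[s], kN);

            cudaDeviceSynchronize();
            for (int s = 0; s < kNumKernels; ++s) {
                cudaStreamDestroy(streams[s]);
                cudaFree(buf[s]);
            }
            return 0;
        }

    Launched into the default stream instead, the same kernels would serialize; whether they actually overlap can be checked with a profiler such as Nsight.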

  • On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering
    2012 Symposium on Application Accelerators in High Performance Computing, 2012
    Co-Authors: Florian Wende, Frank Cordes, Thomas Steinke
    Abstract:

    General-purpose graphics processing units (GPUs) have proven to be viable for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To avoid under-utilizing the GPU for small problem sizes, a sensible approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, the concept of concurrent kernel execution (CKE) allows up to 16 GPU kernels to execute on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can become a challenge to manage multiple host threads interacting with the GPU device and, in addition, to make the CKE concept work properly. CKE performance can be observed to break down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since in real-world applications it is common for multiple host threads to process their data independently, a mechanism is needed that helps avoid this CKE breakdown. We propose a producer-consumer approach that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.
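
    The producer-consumer principle can be sketched as follows (a simplified reconstruction under our own assumptions, not the authors' code; work_kernel, the queue layout, and the stream count are illustrative): host threads enqueue launch requests instead of calling CUDA directly, and a single proxy thread drains the queue and issues the kernels into a pool of streams, so launches from all threads are reordered at one point.

        #include <condition_variable>
        #include <mutex>
        #include <queue>
        #include <thread>
        #include <vector>
        #include <cuda_runtime.h>

        __global__ void work_kernel(float *d, int n) {   // illustrative kernel
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] += 1.0f;
        }

        struct LaunchReq { float *d; int n; };   // one queued kernel invocation

        std::queue<LaunchReq> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        // Consumer: the only thread that talks to the CUDA runtime; it issues
        // the queued kernels round-robin over a pool of streams.
        void proxy(std::vector<cudaStream_t> &streams) {
            size_t s = 0;
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [] { return !q.empty() || done; });
                if (q.empty() && done) break;
                LaunchReq r = q.front(); q.pop();
                lk.unlock();
                work_kernel<<<(r.n + 255) / 256, 256, 0, streams[s]>>>(r.d, r.n);
                s = (s + 1) % streams.size();
            }
            cudaDeviceSynchronize();
        }

        // Producer: what each host worker thread calls instead of launching directly.
        void submit(float *d, int n) {
            { std::lock_guard<std::mutex> lk(m); q.push({d, n}); }
            cv.notify_one();
        }

        int main() {
            const int kN = 1 << 12;
            std::vector<cudaStream_t> streams(4);
            for (auto &st : streams) cudaStreamCreate(&st);
            float *d;                       // shared buffer purely for brevity;
            cudaMalloc(&d, kN * sizeof(float));  // real uses would be independent

            std::thread consumer(proxy, std::ref(streams));
            std::vector<std::thread> producers;
            for (int t = 0; t < 4; ++t)
                producers.emplace_back([&] { for (int j = 0; j < 8; ++j) submit(d, kN); });
            for (auto &p : producers) p.join();
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_one();
            consumer.join();

            cudaFree(d);
            for (auto &st : streams) cudaStreamDestroy(st);
            return 0;
        }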

Kazutomo Yoshii - One of the best experts on this subject based on the ideXlab platform.

  • FPGA - Optimizations of Sequence Alignment on FPGA: A Case Study of Extended Sequence Alignment (Abstract Only)
    Proceedings of the 2018 ACM SIGDA International Symposium on Field-Programmable Gate Arrays, 2018
    Co-Authors: Zheming Jin, Kazutomo Yoshii
    Abstract:

    Detecting similarities between sequences is an important part of bioinformatics. In this poster, we explore the use of a high-level synthesis tool and a field-programmable gate array (FPGA) to optimize a sequence alignment algorithm. We demonstrate optimization techniques that improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying the optimizations to the algorithm using the Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms, while the power consumption is approximately 11 watts on the ADM-PCIE-8K5 FPGA platform.

Xuxian Jiang - One of the best experts on this subject based on the ideXlab platform.

  • LiveDM: Kernel malware analysis with un-tampered and temporal views of dynamic Kernel memory
    2011
    Co-Authors: Junghwan Rhee, Ryan Riley, Xuxian Jiang
    Abstract:

    Dynamic kernel memory has been a popular target of recent kernel malware due to the difficulty of determining the status of volatile dynamic kernel objects. Some existing approaches use kernel memory mapping to identify dynamic kernel objects and check kernel integrity. The snapshot-based memory maps generated by these approaches are based on kernel memory that may already have been manipulated by kernel malware. In addition, because a snapshot reflects the memory status at only a single instant, its usefulness for temporal analysis of kernel execution is limited. We introduce a new runtime kernel memory mapping scheme called allocation-driven mapping, which systematically identifies dynamic kernel objects, including their types and lifetimes. The scheme works by capturing kernel object allocation and deallocation events. Our system provides unique benefits for kernel malware analysis: (1) an un-tampered view, wherein the mapping of kernel data is unaffected by the manipulation of kernel memory, and (2) a temporal view of kernel objects to be used in temporal analysis of kernel execution. We demonstrate the effectiveness of allocation-driven mapping in two usage scenarios. First, we build a hidden kernel object detector that uses the un-tampered view to detect the data-hiding attacks of 10 kernel rootkits that directly manipulate kernel objects (DKOM). Second, we develop a temporal malware behavior monitor that tracks and visualizes malware behavior triggered by the manipulation of dynamic kernel objects. Allocation-driven mapping enables reliable analysis of such behavior by guiding the inspection to only those events relevant to the attack.
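
    A minimal sketch of the allocation-driven mapping data structure (our illustration, not the actual system; the event hooks, type strings, and addresses are invented): a live, address-ordered map of kernel objects maintained purely from allocation/deallocation events, so the view never depends on parsing possibly tampered memory contents.

        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <map>
        #include <string>

        // One live dynamic kernel object, identified at allocation time.
        struct KObject {
            std::string type;     // inferred from the allocation call site
            std::size_t size;
            uint64_t    alloc_t;  // allocation timestamp (event counter)
        };

        // Address-ordered map of live objects: the "un-tampered" view, built
        // only from observed events, never from kernel memory contents.
        std::map<uint64_t, KObject> live;
        uint64_t event_clock = 0;

        void on_alloc(uint64_t addr, std::size_t size, const std::string &type) {
            live[addr] = KObject{type, size, ++event_clock};
        }

        void on_free(uint64_t addr) {
            live.erase(addr);     // ends the object's lifetime in the temporal view
        }

        // Find which live object (if any) covers an address: the basis for
        // detecting objects a rootkit has unlinked from kernel lists.
        const KObject *find(uint64_t addr) {
            auto it = live.upper_bound(addr);
            if (it == live.begin()) return nullptr;
            --it;
            return (addr < it->first + it->second.size) ? &it->second : nullptr;
        }

        int main() {  // toy usage with invented addresses/types
            on_alloc(0xffff8800001000, 512, "task_struct");
            const KObject *o = find(0xffff8800001010);
            if (o) printf("address maps to a live %s\n", o->type.c_str());
            on_free(0xffff8800001000);
            return 0;
        }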

  • RAID - Kernel malware analysis with un-tampered and temporal views of dynamic Kernel memory
    Lecture Notes in Computer Science, 2010
    Co-Authors: Junghwan Rhee, Ryan Riley, Dongyan Xu, Xuxian Jiang
    Abstract:

    Dynamic kernel memory has been a popular target of recent kernel malware due to the difficulty of determining the status of volatile dynamic kernel objects. Some existing approaches use kernel memory mapping to identify dynamic kernel objects and check kernel integrity. The snapshot-based memory maps generated by these approaches are based on kernel memory that may already have been manipulated by kernel malware. In addition, because a snapshot reflects the memory status at only a single instant, its usefulness for temporal analysis of kernel execution is limited. We introduce a new runtime kernel memory mapping scheme called allocation-driven mapping, which systematically identifies dynamic kernel objects, including their types and lifetimes. The scheme works by capturing kernel object allocation and deallocation events. Our system provides unique benefits for kernel malware analysis: (1) an un-tampered view, wherein the mapping of kernel data is unaffected by the manipulation of kernel memory, and (2) a temporal view of kernel objects to be used in temporal analysis of kernel execution. We demonstrate the effectiveness of allocation-driven mapping in two usage scenarios. First, we build a hidden kernel object detector that uses the un-tampered view to detect the data-hiding attacks of 10 kernel rootkits that directly manipulate kernel objects (DKOM). Second, we develop a temporal malware behavior monitor that tracks and visualizes malware behavior triggered by the manipulation of dynamic kernel objects. Allocation-driven mapping enables reliable analysis of such behavior by guiding the inspection to only those events relevant to the attack.

Florian Wende - One of the best experts on this subject based on the ideXlab platform.

  • Concurrent Kernel Offloading
    High Performance Parallelism Pearls, 2015
    Co-Authors: Florian Wende, Thomas Steinke, Michael Klemm, Alexander Reinefeld
    Abstract:

    This chapter describes the principle of concurrent kernel offloading to the coprocessor and the aspects that need to be considered to optimize performance. Concurrent kernel offloading targets application scenarios with many small-scale workloads that cannot exploit the available resources on their own. The chapter explains how the computational throughput of multiple small-scale workloads can be improved on the Intel Xeon Phi coprocessor by concurrent kernel execution using the offload programming model. Each optimization step is elaborated and illustrated with working examples, and performance improvements are presented for two demonstrator scenarios.

  • Concurrent Kernel Execution on Xeon Phi within Parallel Heterogeneous Workloads
    European Conference on Parallel Processing, 2014
    Co-Authors: Florian Wende, Thomas Steinke, Frank Cordes
    Abstract:

    Computations with a sufficient amount of parallelism and a sufficient workload size can take advantage of many-core coprocessors, whereas small-scale workloads usually suffer from poor utilization of the coprocessor resources. For parallel applications with many small computational kernels, concurrent processing on a shared coprocessor may be a viable solution. We evaluate the Xeon Phi offload models Intel LEO and OpenMP4 within multi-threaded and multi-process host applications with concurrent coprocessor offloading. Limitations of OpenMP4 regarding data persistence across function calls, e.g., when used within libraries, can slow down the application. We propose an offload-proxy approach for OpenMP4 that recovers the performance in these cases. For concurrent kernel execution, we demonstrate the performance of the different offload models and of our offload-proxy using synthetic kernels and a parallel hybrid CPU/Xeon Phi molecular simulation application.

  • Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture
    2014
    Co-Authors: Florian Wende, Thomas Steinke, Frank Cordes
    Abstract:

    Small-scale computations usually cannot fully utilize the compute capabilities of modern GPGPUs. With the Fermi GPU architecture, Nvidia introduced the concurrent kernel execution feature, which allows up to 16 GPU kernels to execute simultaneously on a shared GPU device for better utilization of its resources. Insufficient scheduling capabilities, however, can keep the achieved concurrency well below that theoretical level. With the Kepler GPU architecture, Nvidia addresses this issue by introducing the Hyper-Q feature with 32 hardware-managed work queues for concurrent kernel execution. We investigate the Hyper-Q feature within heterogeneous workloads with multiple concurrent host threads or processes, each offloading computations to the GPU. Using a synthetic benchmark kernel and a hybrid parallel CPU-GPU real-world application, we evaluate the performance obtained with Hyper-Q and compare it against a kernel reordering mechanism introduced by the authors for the Fermi architecture.

  • On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering
    2012 Symposium on Application Accelerators in High Performance Computing, 2012
    Co-Authors: Florian Wende, Frank Cordes, Thomas Steinke
    Abstract:

    General-purpose graphics processing units (GPUs) have proven to be viable for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To avoid under-utilizing the GPU for small problem sizes, a sensible approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, the concept of concurrent kernel execution (CKE) allows up to 16 GPU kernels to execute on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can become a challenge to manage multiple host threads interacting with the GPU device and, in addition, to make the CKE concept work properly. CKE performance can be observed to break down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since in real-world applications it is common for multiple host threads to process their data independently, a mechanism is needed that helps avoid this CKE breakdown. We propose a producer-consumer approach that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.