Cuda

The Experts below are selected from a list of 29,463 Experts worldwide, ranked by the ideXlab platform

Douglas L Maskell - One of the best experts on this subject based on the ideXlab platform.

  • Cuda-MEME: Accelerating Motif Discovery in Biological Sequences Using Cuda-Enabled Graphics Processing Units
    Pattern Recognition Letters, 2010
    Co-Authors: Yongchao Liu, Weiguo Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Motif discovery in biological sequences is of prime importance and a major challenge in computational biology. Consequently, numerous motif discovery tools have been developed to date. However, the rapid growth of both genomic sequence and gene transcription data establishes the need for scalable motif discovery tools. One approach to improving the runtime of motif discovery by an order of magnitude without losing sensitivity is to employ emerging many-core architectures such as Cuda-enabled GPUs. In this paper, we present a highly parallel formulation and implementation of the MEME motif discovery algorithm using the Cuda programming model. To achieve high efficiency, we introduce two parallelization approaches: sequence-level and substring-level parallelization. Furthermore, a hybrid computing framework is described to take advantage of both CPU and GPU compute resources. Our performance evaluation on a GeForce GTX 280 GPU results in average runtime speedups of 21.4 (19.3) for the starting point search and 20.5 (16.4) for the overall runtime using the OOPS (ZOOPS) motif search model. The runtime speedups of Cuda-MEME on a single GPU are also comparable to those of ParaMEME running on 16 CPU cores of a high-performance workstation cluster. In addition to its fast speed, Cuda-MEME is capable of finding motif instances consistent with those of the sequential MEME.
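
    To make the substring-level parallelization concrete, the sketch below assigns one Cuda thread per candidate starting position, each scoring its substring against a position weight matrix. The kernel name, motif width, data layout, and scoring scheme are illustrative assumptions, not the actual Cuda-MEME code.

```cuda
// Minimal sketch of substring-level parallelization for a MEME-style
// starting-point search (names and layout are assumptions, not Cuda-MEME's).
#include <cstdio>
#include <cuda_runtime.h>

#define MOTIF_W 8   // assumed motif width
#define ALPHABET 4  // A, C, G, T encoded as 0..3

// Position weight matrix (log-odds scores) kept in constant memory.
__constant__ float d_pwm[MOTIF_W * ALPHABET];

// One thread scores one candidate starting position.
__global__ void score_starting_points(const unsigned char *seq, int seq_len,
                                       float *scores)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    int n_candidates = seq_len - MOTIF_W + 1;
    if (pos >= n_candidates) return;

    float s = 0.0f;
    for (int j = 0; j < MOTIF_W; ++j)
        s += d_pwm[j * ALPHABET + seq[pos + j]];  // accumulate log-odds
    scores[pos] = s;
}

int main()
{
    const int seq_len = 1024;
    const int n_candidates = seq_len - MOTIF_W + 1;

    unsigned char h_seq[seq_len];
    for (int i = 0; i < seq_len; ++i) h_seq[i] = i % ALPHABET;       // toy sequence
    float h_pwm[MOTIF_W * ALPHABET];
    for (int i = 0; i < MOTIF_W * ALPHABET; ++i) h_pwm[i] = 0.25f;   // toy PWM

    unsigned char *d_seq; float *d_scores;
    cudaMalloc(&d_seq, seq_len);
    cudaMalloc(&d_scores, n_candidates * sizeof(float));
    cudaMemcpy(d_seq, h_seq, seq_len, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_pwm, h_pwm, sizeof(h_pwm));

    score_starting_points<<<(n_candidates + 255) / 256, 256>>>(d_seq, seq_len,
                                                               d_scores);

    float h_scores[n_candidates];
    cudaMemcpy(h_scores, d_scores, n_candidates * sizeof(float),
               cudaMemcpyDeviceToHost);
    int best = 0;
    for (int i = 1; i < n_candidates; ++i)
        if (h_scores[i] > h_scores[best]) best = i;
    printf("best starting point: %d (score %.3f)\n", best, h_scores[best]);

    cudaFree(d_seq); cudaFree(d_scores);
    return 0;
}
```

    Sequence-level parallelization would instead assign whole sequences to thread blocks; the hybrid framework described in the paper additionally distributes work between the CPU and the GPU.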

  • ASAP - MSA-Cuda: Multiple Sequence Alignment on Graphics Processing Units with Cuda
    2009 20th IEEE International Conference on Application-specific Systems Architectures and Processors, 2009
    Co-Authors: Yongchao Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or even a few thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-Cuda, a parallel MSA program that parallelizes all three stages of the ClustalW processing pipeline using Cuda and achieves significant speedups over the sequential ClustalW for a variety of large protein sequence datasets. Our tests on a GeForce GTX 280 GPU demonstrate average speedups of 36.91 (for long protein sequences), 18.74 (for average-length protein sequences), and 11.27 (for short protein sequences) compared to the sequential ClustalW running on a Pentium 4 3.0 GHz processor. MSA-Cuda also outperforms ClustalW-MPI running on 32 cores of a high-performance workstation cluster.
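
    As a rough illustration of how the first ClustalW stage (pairwise distance computation) maps onto Cuda, the sketch below assigns one thread per sequence pair. For brevity it computes a simple fractional-identity distance over equal-length toy sequences, whereas MSA-Cuda performs full dynamic-programming alignments; all names and sizes are assumptions for illustration.

```cuda
// Sketch of inter-task parallelism for pairwise distances: one thread per pair.
// (Toy fractional-identity distance; not MSA-Cuda's actual alignment kernel.)
#include <cstdio>
#include <cuda_runtime.h>

#define N_SEQ 64     // assumed number of sequences
#define SEQ_LEN 128  // assumed (equal) sequence length

__global__ void pairwise_distances(const char *seqs, float *dist)
{
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    int n_pairs = N_SEQ * (N_SEQ - 1) / 2;
    if (pair >= n_pairs) return;

    // Map the linear pair index back to (i, j) with i < j.
    int i = 0, k = pair;
    while (k >= N_SEQ - 1 - i) { k -= N_SEQ - 1 - i; ++i; }
    int j = i + 1 + k;

    int matches = 0;
    for (int p = 0; p < SEQ_LEN; ++p)
        matches += (seqs[i * SEQ_LEN + p] == seqs[j * SEQ_LEN + p]);
    dist[pair] = 1.0f - (float)matches / SEQ_LEN;
}

int main()
{
    const int n_pairs = N_SEQ * (N_SEQ - 1) / 2;

    char *h_seqs = new char[N_SEQ * SEQ_LEN];
    for (int i = 0; i < N_SEQ * SEQ_LEN; ++i) h_seqs[i] = 'A' + (i % 4);  // toy data

    char *d_seqs; float *d_dist;
    cudaMalloc(&d_seqs, N_SEQ * SEQ_LEN);
    cudaMalloc(&d_dist, n_pairs * sizeof(float));
    cudaMemcpy(d_seqs, h_seqs, N_SEQ * SEQ_LEN, cudaMemcpyHostToDevice);

    pairwise_distances<<<(n_pairs + 127) / 128, 128>>>(d_seqs, d_dist);
    cudaDeviceSynchronize();
    printf("computed %d pairwise distances on the GPU\n", n_pairs);

    cudaFree(d_seqs); cudaFree(d_dist);
    delete[] h_seqs;
    return 0;
}
```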

Bertil Schmidt - One of the best experts on this subject based on the ideXlab platform.

  • Advanced Cuda Programming
    Parallel Programming, 2018
    Co-Authors: Bertil Schmidt, Jorge González-domínguez, Christian Hundt, Moritz Schlarb
    Abstract:

    In the recent past, Cuda has become the major framework for programming massively parallel accelerators. NVIDIA estimates the number of Cuda installations in the year 2016 to exceed one million. Moreover, with the rise of Deep Learning this number is expected to grow at an exponential rate in the foreseeable future. Hence, extensive Cuda knowledge is essential for every programmer in the field of High Performance Computing. The previous chapter focused on the basic programming model and the memory hierarchy of modern GPUs. We have seen that proper memory utilization is key to obtaining efficient code. While our examples from the previous chapter focused on thread-level implementations, we now investigate warp-level parallelization and the efficient use of atomic functions. Both techniques in combination enable further code optimization. Moreover, we discuss overlapping of communication and computation in single-GPU and multi-GPU scenarios using streams. We conclude the chapter with a brief discussion of Cuda 9 and its novel features.
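
    The warp-level and atomic techniques mentioned above combine naturally in a reduction: each warp collapses its partial sums with shuffle instructions and then issues a single atomicAdd, rather than one atomic per thread. The sketch below is a minimal, generic example of this pattern, not code from the book chapter.

```cuda
// Warp-level reduction plus one atomic per warp (generic sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_reduce_sum(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    // Shuffle-based reduction inside the warp: after 5 steps lane 0 holds the sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if ((threadIdx.x & 31) == 0)   // one atomic per warp, not per thread
        atomicAdd(out, val);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    float *h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    warp_reduce_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    float h_out;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, n);

    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

    Overlapping communication and computation, as discussed in the chapter, would additionally split the input into chunks, each copied and processed in its own cudaStream_t.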

  • Cuda-BLASTP: Accelerating BLASTP on Cuda-Enabled Graphics Hardware
    IEEE ACM Transactions on Computational Biology and Bioinformatics, 2011
    Co-Authors: Weiguo Liu, Bertil Schmidt, Wolfgang Müller-wittig
    Abstract:

    Scanning a protein sequence database is a frequently repeated task in computational biology and bioinformatics. However, scanning large protein databases, such as GenBank, with popular tools such as BLASTP requires long runtimes on sequential architectures. Due to the continuing rapid growth of sequence databases, there is a high demand to accelerate this task. In this paper, we demonstrate how GPUs, powered by the Compute Unified Device Architecture (Cuda), can be used as an efficient computational platform to accelerate the BLASTP algorithm. In order to exploit the GPU's capabilities for accelerating BLASTP, we have used a compressed deterministic finite-state automaton for hit detection as well as a hybrid parallelization scheme. Our implementation achieves speedups of up to 10.0 on an NVIDIA GeForce GTX 295 GPU compared to the sequential NCBI BLASTP 2.2.22. The Cuda-BLASTP source code is available at https://sites.google.com/site/liuweiguohome/software.
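
    To illustrate the idea of DFA-based hit detection, the sketch below streams each subject sequence through a transition table, one thread per sequence, and counts positions whose state is flagged as a query-word hit. The table layout, the flattened sequence arrays, and the hit encoding are illustrative assumptions; the actual Cuda-BLASTP automaton is compressed and considerably more involved.

```cuda
// Sketch of DFA-based hit detection: one thread scans one subject sequence.
// (Assumed data layout; not the compressed automaton used by Cuda-BLASTP.)
#include <cuda_runtime.h>

__global__ void dfa_hit_count(const unsigned char *subjects,   // concatenated sequences
                              const int *offsets,              // start of each sequence
                              const int *lengths,              // length of each sequence
                              int n_subjects,
                              const int *transition,           // [n_states * alphabet]
                              const unsigned char *is_hit_state,
                              int alphabet,
                              int *hit_counts)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n_subjects) return;

    const unsigned char *seq = subjects + offsets[s];
    int state = 0, hits = 0;
    for (int p = 0; p < lengths[s]; ++p) {
        state = transition[state * alphabet + seq[p]];  // one DFA step per residue
        hits += is_hit_state[state];                    // count query-word hits
    }
    hit_counts[s] = hits;
}
```

    A host-side driver that builds the automaton from the query's neighborhood words and uploads the flattened database is assumed; the detected hits would then feed the subsequent extension stages of BLASTP.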

  • Cuda-MEME: Accelerating Motif Discovery in Biological Sequences Using Cuda-Enabled Graphics Processing Units
    Pattern Recognition Letters, 2010
    Co-Authors: Yongchao Liu, Weiguo Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Motif discovery in biological sequences is of prime importance and a major challenge in computational biology. Consequently, numerous motif discovery tools have been developed to date. However, the rapid growth of both genomic sequence and gene transcription data establishes the need for scalable motif discovery tools. One approach to improving the runtime of motif discovery by an order of magnitude without losing sensitivity is to employ emerging many-core architectures such as Cuda-enabled GPUs. In this paper, we present a highly parallel formulation and implementation of the MEME motif discovery algorithm using the Cuda programming model. To achieve high efficiency, we introduce two parallelization approaches: sequence-level and substring-level parallelization. Furthermore, a hybrid computing framework is described to take advantage of both CPU and GPU compute resources. Our performance evaluation on a GeForce GTX 280 GPU results in average runtime speedups of 21.4 (19.3) for the starting point search and 20.5 (16.4) for the overall runtime using the OOPS (ZOOPS) motif search model. The runtime speedups of Cuda-MEME on a single GPU are also comparable to those of ParaMEME running on 16 CPU cores of a high-performance workstation cluster. In addition to its fast speed, Cuda-MEME is capable of finding motif instances consistent with those of the sequential MEME.

  • ASAP - MSA-Cuda: Multiple Sequence Alignment on Graphics Processing Units with Cuda
    2009 20th IEEE International Conference on Application-specific Systems Architectures and Processors, 2009
    Co-Authors: Yongchao Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or even a few thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-Cuda, a parallel MSA program that parallelizes all three stages of the ClustalW processing pipeline using Cuda and achieves significant speedups over the sequential ClustalW for a variety of large protein sequence datasets. Our tests on a GeForce GTX 280 GPU demonstrate average speedups of 36.91 (for long protein sequences), 18.74 (for average-length protein sequences), and 11.27 (for short protein sequences) compared to the sequential ClustalW running on a Pentium 4 3.0 GHz processor. MSA-Cuda also outperforms ClustalW-MPI running on 32 cores of a high-performance workstation cluster.

Yongchao Liu - One of the best experts on this subject based on the ideXlab platform.

  • Cuda-MEME: Accelerating Motif Discovery in Biological Sequences Using Cuda-Enabled Graphics Processing Units
    Pattern Recognition Letters, 2010
    Co-Authors: Yongchao Liu, Weiguo Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Motif discovery in biological sequences is of prime importance and a major challenge in computational biology. Consequently, numerous motif discovery tools have been developed to date. However, the rapid growth of both genomic sequence and gene transcription data establishes the need for scalable motif discovery tools. One approach to improving the runtime of motif discovery by an order of magnitude without losing sensitivity is to employ emerging many-core architectures such as Cuda-enabled GPUs. In this paper, we present a highly parallel formulation and implementation of the MEME motif discovery algorithm using the Cuda programming model. To achieve high efficiency, we introduce two parallelization approaches: sequence-level and substring-level parallelization. Furthermore, a hybrid computing framework is described to take advantage of both CPU and GPU compute resources. Our performance evaluation on a GeForce GTX 280 GPU results in average runtime speedups of 21.4 (19.3) for the starting point search and 20.5 (16.4) for the overall runtime using the OOPS (ZOOPS) motif search model. The runtime speedups of Cuda-MEME on a single GPU are also comparable to those of ParaMEME running on 16 CPU cores of a high-performance workstation cluster. In addition to its fast speed, Cuda-MEME is capable of finding motif instances consistent with those of the sequential MEME.

  • ASAP - MSA-Cuda: Multiple Sequence Alignment on Graphics Processing Units with Cuda
    2009 20th IEEE International Conference on Application-specific Systems Architectures and Processors, 2009
    Co-Authors: Yongchao Liu, Bertil Schmidt, Douglas L Maskell
    Abstract:

    Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or even a few thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-Cuda, a parallel MSA program that parallelizes all three stages of the ClustalW processing pipeline using Cuda and achieves significant speedups over the sequential ClustalW for a variety of large protein sequence datasets. Our tests on a GeForce GTX 280 GPU demonstrate average speedups of 36.91 (for long protein sequences), 18.74 (for average-length protein sequences), and 11.27 (for short protein sequences) compared to the sequential ClustalW running on a Pentium 4 3.0 GHz processor. MSA-Cuda also outperforms ClustalW-MPI running on 32 cores of a high-performance workstation cluster.

Timothy R. Anderson - One of the best experts on this subject based on the ideXlab platform.

  • A Cuda implementation of the Continuous Space Language Model
    The Journal of Supercomputing, 2014
    Co-Authors: Elizabeth A Thompson, Timothy R. Anderson
    Abstract:

    The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (Cuda). A detailed explanation of the CSLM algorithm is provided. The implementation was accomplished using a combination of CUBLAS library routines, NVIDIA NPP functions, and Cuda kernel calls on three Cuda-enabled devices of varying compute capability, and a time savings over the traditional CPU approach was demonstrated. The efficiency of the Cuda version of the open-source implementation is analyzed and compared to that of the Intel Math Kernel Library (MKL) version on a variety of Cuda-enabled and multi-core CPU platforms. It is demonstrated that a substantial performance benefit can be obtained using Cuda, even with non-optimal code. Techniques for optimizing performance are then provided. Furthermore, an analysis is performed to determine the conditions under which the performance of Cuda exceeds that of the multi-core MKL realization.
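
    As an illustration of how a CSLM-style layer maps onto a single CUBLAS call, the sketch below computes the hidden activations of a mini-batch with one cublasSgemm. The layer sizes, buffer names, and the omitted bias/nonlinearity and NPP steps are assumptions for illustration, not the authors' implementation.

```cuda
// Sketch: one CSLM-style hidden-layer forward pass as a single CUBLAS GEMM.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int input = 512, hidden = 1024, batch = 128;  // assumed layer sizes

    float *d_W, *d_X, *d_H;
    cudaMalloc(&d_W, sizeof(float) * hidden * input);   // weight matrix
    cudaMalloc(&d_X, sizeof(float) * input * batch);    // projected inputs
    cudaMalloc(&d_H, sizeof(float) * hidden * batch);   // hidden activations
    // (in a real run, d_W and d_X would hold trained weights and the projected
    //  word vectors of the current mini-batch)

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: H(hidden x batch) = W(hidden x input) * X(input x batch)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                hidden, batch, input,
                &alpha, d_W, hidden, d_X, input,
                &beta, d_H, hidden);

    cudaDeviceSynchronize();
    printf("forward-pass GEMM issued for a %d-example mini-batch\n", batch);

    cublasDestroy(handle);
    cudaFree(d_W); cudaFree(d_X); cudaFree(d_H);
    return 0;
}
```

    The bias addition, the nonlinearity, and the backward pass would follow as further GEMMs and element-wise kernels; this is only the core mapping onto CUBLAS.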

  • HPEC - Use of Cuda for the Continuous Space Language Model
    2012 IEEE Conference on High Performance Extreme Computing, 2012
    Co-Authors: Elizabeth A Thompson, Timothy R. Anderson
    Abstract:

    The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (Cuda). The implementation was accomplished using a combination of CUBLAS library routines and Cuda kernel calls on three Cuda-enabled devices of varying compute capability, and a time savings over the traditional CPU approach was demonstrated.

Chunhui Deng - One of the best experts on this subject based on the ideXlab platform.

  • IIH-MSP - GPU-based Real-time Decoding Technique for High-definition Videos
    2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2012
    Co-Authors: Huifang Deng, Chunhui Deng, Jingjing Li
    Abstract:

    In this paper, we first discuss the video decoding standard and its architecture, and then analyze the decoding complexity of each stage. Exploiting the Cuda programming model and the GPU to optimize the very time-consuming MC (motion compensation) and CSC (color space conversion) stages of decoding, we propose an MC acceleration method based on Cuda and a CSC acceleration method based on Cuda and an OpenGL shader. The experiments show that it is feasible to decode high-definition video in real time using GPGPU-Cuda.
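
    A per-pixel CSC kernel of the kind described above can be sketched as follows: each thread converts one YUV sample to packed RGB. A planar 4:4:4 layout and BT.601 coefficients are assumed for brevity; the paper's decoder would additionally handle chroma subsampling and the OpenGL-shader path.

```cuda
// Sketch of a per-pixel YUV-to-RGB color space conversion kernel.
// Launch with a 2D grid, e.g. dim3 block(16,16), grid((w+15)/16, (h+15)/16).
#include <cuda_runtime.h>

__global__ void yuv444_to_rgb(const unsigned char *y_plane,
                              const unsigned char *u_plane,
                              const unsigned char *v_plane,
                              unsigned char *rgb, int width, int height)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;
    int yy = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || yy >= height) return;

    int idx = yy * width + x;
    float Y = (float)y_plane[idx];
    float U = (float)u_plane[idx] - 128.0f;
    float V = (float)v_plane[idx] - 128.0f;

    // BT.601 full-range conversion (assumed coefficients)
    float r = Y + 1.402f * V;
    float g = Y - 0.344f * U - 0.714f * V;
    float b = Y + 1.772f * U;

    rgb[3 * idx + 0] = (unsigned char)fminf(fmaxf(r, 0.0f), 255.0f);
    rgb[3 * idx + 1] = (unsigned char)fminf(fmaxf(g, 0.0f), 255.0f);
    rgb[3 * idx + 2] = (unsigned char)fminf(fmaxf(b, 0.0f), 255.0f);
}
```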