OpenCL Specification

The Experts below are selected from a list of 75 Experts worldwide ranked by ideXlab platform

Gianni De Fabritiis - One of the best experts on this subject based on the ideXlab platform.

  • Swan: A tool for porting CUDA programs to OpenCL
    Computer Physics Communications, 2011
    Co-Authors: Matthew J. Harvey, Gianni De Fabritiis
    Abstract:

    The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model, supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan”, for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

    Program summary
    Program title: Swan
    Catalogue identifier: AEIH_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU Public License version 2
    No. of lines in distributed program, including test data, etc.: 17 736
    No. of bytes in distributed program, including test data, etc.: 131 177
    Distribution format: tar.gz
    Programming language: C
    Computer: PC
    Operating system: Linux
    RAM: 256 Mbytes
    Classification: 6.5
    External routines: NVIDIA CUDA, OpenCL
    Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore x86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.
    Solution method: Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.
    Restrictions: No support for CUDA C++ features
    Running time: Nominal
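To give a feel for what translating CUDA kernel source into an OpenCL equivalent involves, here is a toy Python sketch. It is not Swan's implementation and is not derived from its source code: the function `translate_kernel`, the table `CUDA_TO_OPENCL` and the sample kernel `scale` are all hypothetical names introduced for illustration. The sketch only rewrites a handful of well-known CUDA built-ins to their OpenCL C counterparts by text substitution. Its output is not yet complete OpenCL C, since kernel pointer arguments would still need address-space qualifiers such as `__global`.

```python
import re

# Well-known CUDA -> OpenCL C correspondences (illustrative subset only;
# this table is an assumption for the sketch, not taken from Swan).
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
}


def translate_kernel(cuda_src: str) -> str:
    """Hypothetical helper: rewrite CUDA built-ins to OpenCL equivalents."""
    # Try longer keys first so a short key never clobbers a longer match.
    keys = sorted(CUDA_TO_OPENCL, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: CUDA_TO_OPENCL[m.group(0)], cuda_src)


# A made-up CUDA kernel used only as input for the sketch.
cuda_kernel = """
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = a * v[i];
}
"""

opencl_kernel = translate_kernel(cuda_kernel)
print(opencl_kernel)
```

A real source-to-source tool would parse the kernel rather than substitute text, and would also have to generate the host-side entry points that the abstract mentions. The sketch covers neither.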

Matthew J. Harvey - One of the best experts on this subject based on the ideXlab platform.

  • Swan: A tool for porting CUDA programs to OpenCL
    Computer Physics Communications, 2011
    Co-Authors: Matthew J. Harvey, Gianni De Fabritiis
    Abstract:

    The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model, supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan”, for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

    Program summary
    Program title: Swan
    Catalogue identifier: AEIH_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU Public License version 2
    No. of lines in distributed program, including test data, etc.: 17 736
    No. of bytes in distributed program, including test data, etc.: 131 177
    Distribution format: tar.gz
    Programming language: C
    Computer: PC
    Operating system: Linux
    RAM: 256 Mbytes
    Classification: 6.5
    External routines: NVIDIA CUDA, OpenCL
    Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore x86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.
    Solution method: Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.
    Restrictions: No support for CUDA C++ features
    Running time: Nominal

O. U. Khan - One of the best experts on this subject based on the ideXlab platform.

  • Fast parallel sorting algorithms on GPUs
    'Academy and Industry Research Collaboration Center (AIRCC)', 2012
    Co-Authors: B. Jan, B. Montrucchio, C. S. Ragusa, F. G. Khan, O. U. Khan
    Abstract:

    This paper presents a comparative analysis of three widely used parallel sorting algorithms, odd-even sort, rank sort and bitonic sort, in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside, we have implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. All algorithms have been implemented exploiting the data-parallelism model, for achieving high performance, as available on multi-core GPUs using the OpenCL specification. Our results show a minimum speed-up of 19x for bitonic sort against the odd-even sorting technique for small queue sizes on CPU, and a maximum speed-up of 2300x for very large queue sizes on the Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting results in relatively better performance than all three sorting techniques: bitonic, odd-even and rank sort. For the min-max butterfly network, our findings report high speed-up on the Nvidia Quadro 6000 GPU for large data set sizes reaching 2^24, with much lower sorting time.

  • Analysis of Fast Parallel Sorting Algorithms for GPU Architectures
    'Institute of Electrical and Electronics Engineers (IEEE)', 2011
    Co-Authors: F. G. Khan, B. Montrucchio, O. U. Khan, Paolo Giaccone
    Abstract:

    Sorting algorithms have been studied extensively over the past three decades. They are used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on its use of a sorting algorithm. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on GPUs and present an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over odd-even sorting algorithms is shown on different GPUs and a CPU. The algorithms have been written to exploit the task-parallelism model as available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19x speed-up of bitonic sort against the odd-even sorting technique for small queue sizes on CPU and a maximum 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
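For readers unfamiliar with the three algorithms compared in these abstracts, the following Python sketch shows their textbook forms. These are sequential simulations written for illustration, not the authors' OpenCL kernels, and the function names are assumptions. In each function, the iterations of the inner loop are independent of one another, which is the part a GPU implementation would typically distribute across work-items.

```python
def odd_even_sort(values):
    """Odd-even transposition sort: n phases of neighbour compare-exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        # The pairs within one phase are disjoint, so they could run in parallel.
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


def rank_sort(values):
    """Rank sort: each element's final index is the count of elements before it."""
    a = list(values)
    out = [None] * len(a)
    for i, x in enumerate(a):
        # Ties are broken by original index so that every rank is unique.
        rank = sum(1 for j, y in enumerate(a) if y < x or (y == x and j < i))
        out[rank] = x
    return out


def bitonic_sort(values):
    """Bitonic sorting network; the input length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


data = [7, 3, 9, 1, 8, 2, 6, 4]
results = [odd_even_sort(data), rank_sort(data), bitonic_sort(data)]
print(results)
```

Odd-even transposition sort needs n phases for n elements, while the bitonic network needs on the order of log² n stages. That difference in parallel depth is one plausible reason for a large speed-up of bitonic sort over odd-even sort at large input sizes, although the abstracts themselves do not give an explanation.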

F. G. Khan - One of the best experts on this subject based on the ideXlab platform.

  • Fast parallel sorting algorithms on GPUs
    'Academy and Industry Research Collaboration Center (AIRCC)', 2012
    Co-Authors: B. Jan, B. Montrucchio, C. S. Ragusa, F. G. Khan, O. U. Khan
    Abstract:

    This paper presents a comparative analysis of three widely used parallel sorting algorithms, odd-even sort, rank sort and bitonic sort, in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside, we have implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. All algorithms have been implemented exploiting the data-parallelism model, for achieving high performance, as available on multi-core GPUs using the OpenCL specification. Our results show a minimum speed-up of 19x for bitonic sort against the odd-even sorting technique for small queue sizes on CPU, and a maximum speed-up of 2300x for very large queue sizes on the Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting results in relatively better performance than all three sorting techniques: bitonic, odd-even and rank sort. For the min-max butterfly network, our findings report high speed-up on the Nvidia Quadro 6000 GPU for large data set sizes reaching 2^24, with much lower sorting time.

  • Analysis of Fast Parallel Sorting Algorithms for GPU Architectures
    'Institute of Electrical and Electronics Engineers (IEEE)', 2011
    Co-Authors: F. G. Khan, B. Montrucchio, O. U. Khan, Paolo Giaccone
    Abstract:

    Sorting algorithms have been studied extensively over the past three decades. They are used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on its use of a sorting algorithm. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on GPUs and present an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over odd-even sorting algorithms is shown on different GPUs and a CPU. The algorithms have been written to exploit the task-parallelism model as available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19x speed-up of bitonic sort against the odd-even sorting technique for small queue sizes on CPU and a maximum 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.

Henry Sylvain - One of the best experts on this subject based on the ideXlab platform.

  • Programming Models and Runtime Systems for Heterogeneous Architectures
    2020
    Co-Authors: Henry Sylvain
    Abstract:

    This work takes place in the context of high-performance computing on heterogeneous architectures. Runtime systems are increasingly used to make programming these architectures easier and to ensure performance portability by automatically handling some tasks (management of the distributed memory, scheduling of the computational kernels, and so on). We propose a low-level approach based on the OpenCL specification as well as a higher-level approach based on parallel functional programming, the latter making it possible to overcome certain difficulties encountered with the former (notably the adaptation of granularity).

  • Programming Models and Runtime Systems for Heterogeneous Architectures
    2013
    Co-Authors: Henry Sylvain, Barthou Denis, Denis Alexandre
    Abstract:

    This work takes place in the context of high-performance computing on heterogeneous architectures. Runtime systems are increasingly used to make programming these architectures easier and to ensure performance portability by automatically handling some tasks (management of the distributed memory, scheduling of the computational kernels, and so on). We propose a low-level approach based on the OpenCL specification as well as a higher-level approach based on parallel functional programming, the latter making it possible to overcome certain difficulties encountered with the former (notably the adaptation of granularity).

  • Modèles de programmation et supports exécutifs pour architectures hétérogènes
    HAL CCSD, 2013
    Co-Authors: Henry Sylvain
    Abstract:

    This work takes place in the context of high-performance computing on heterogeneous architectures. Runtime systems are increasingly used to make programming these architectures easier and to ensure performance portability by automatically handling some tasks (management of the distributed memory, scheduling of the computational kernels, and so on). We propose a low-level approach based on the OpenCL specification as well as a higher-level approach based on parallel functional programming, the latter making it possible to overcome certain difficulties encountered with the former (notably the adaptation of granularity).