OpenCL Execution

The Experts below are selected from a list of 78 Experts worldwide ranked by ideXlab platform

Niko Neufeld - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    Field-Programmable Logic and Applications, 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld
    Abstract:

    The upgrade of the LHCb experiment at CERN envisions a Data Acquisition and Event Filtering system that captures 100% of the data generated by the various sub-detectors, which measure with great precision the 40 million proton collisions per second in CERN's Large Hadron Collider. The sensor readings result in about 40 Tbit/s of data, which need to be processed on a large computer farm. Since the computation on CPUs, as it is currently done, does not scale well, it is necessary to accelerate a good portion of the code to meet the computational demands of the proposed system. We are therefore looking for means to accelerate the most time-consuming parts of the event-filtering code. The Ring Imaging Cherenkov (RICH) detectors are among the component detectors of the LHCb experiment. The Cherenkov photons that hit the detector are processed to determine the track of the original particle that caused these photons. The particle velocity and mass, derived from the Cherenkov angle, are used to identify the particle. The entire RICH photon reconstruction algorithm accounts for 50% of the second High Level Trigger (HLT) process, and Cherenkov angle reconstruction comprises about 20% of the RICH reconstruction, making it a good candidate for acceleration. An OpenCL implementation of the Cherenkov angle reconstruction algorithm, which calculates the trajectory of photons in the RICH detector, was developed. The paper looks at the results of the OpenCL implementation of the algorithm on the Nallatech 385 card with an Altera Stratix V FPGA, an Nvidia GeForce GTX 690 GPU card, and, for comparison, an Intel Xeon processor. While the two GPUs are 3.6× faster than a single FPGA, the FPGA is 3.4× better than two GPUs and 6.6× better than two multicore CPUs when energy efficiency is factored in. Although significant speedup of computation was achieved on all the above architectures by using OpenCL, a good portion of the gain was lost due to the overhead of data transfer and parallelism. Different strategies are put forth for improving the speedup. Some optimizations currently possible, low-latency links that can replace PCIe, and some possible changes to the OpenCL execution model itself are discussed.
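
    The figures quoted in this abstract allow a few back-of-the-envelope checks. The sketch below is not taken from the paper; it assumes that the quoted percentages are shares of runtime, that "energy efficiency" means throughput per watt, and that Amdahl's law applies. All names in it are illustrative.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Shares quoted in the abstract (assumed here to be shares of runtime).
RICH_SHARE_OF_HLT2 = 0.50    # RICH photon reconstruction within the second HLT
ANGLE_SHARE_OF_RICH = 0.20   # Cherenkov angle reconstruction within RICH
angle_share_of_hlt2 = RICH_SHARE_OF_HLT2 * ANGLE_SHARE_OF_RICH

# Upper bounds on the overall HLT speedup, even with an infinitely fast accelerator.
bound_angle_only = amdahl_speedup(angle_share_of_hlt2, float("inf"))
bound_whole_rich = amdahl_speedup(RICH_SHARE_OF_HLT2, float("inf"))

# Average data volume per collision: 40 Tbit/s spread over 40 million collisions/s.
bits_per_collision = (40 * 10**12) // (40 * 10**6)
bytes_per_collision = bits_per_collision // 8

# If efficiency = throughput / power, a GPU pair that is 3.6x faster but 3.4x
# less efficient than the FPGA must draw roughly 3.6 * 3.4 times its power.
gpu_to_fpga_power = 3.6 * 3.4

print(f"Angle reconstruction share of HLT2: {angle_share_of_hlt2:.0%}")
print(f"Max overall speedup (angle only):   {bound_angle_only:.3f}x")
print(f"Max overall speedup (whole RICH):   {bound_whole_rich:.3f}x")
print(f"Data per collision:                 {bytes_per_collision} bytes")
print(f"Implied GPU/FPGA power ratio:       {gpu_to_fpga_power:.2f}")
```

    Under these assumptions, accelerating the angle reconstruction alone cannot speed up the second HLT stage by much more than about 11%, which is consistent with the abstract's remark that transfer overhead can consume a good portion of the gain.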

  • FPL - Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

Srikanth Sridharan - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    Field-Programmable Logic and Applications, 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

  • FPL - Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

Onur Mutlu - One of the best experts on this subject based on the ideXlab platform.

  • FPGA - Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs
    Proceedings of the 2020 ACM SIGDA International Symposium on Field-Programmable Gate Arrays, 2020
    Co-Authors: Jiantong Jiang, Zeke Wang, Xue Liu, Juan Gómez-Luna, Nan Guan, Qingxu Deng, Wei Zhang, Onur Mutlu
    Abstract:

    FPGA vendors provide OpenCL software development kits for easier programmability, with the goal of replacing time-consuming and error-prone register-transfer level (RTL) programming. Many studies explore optimization methods (e.g., loop unrolling, local memory) to accelerate OpenCL programs running on FPGAs. These programs typically follow the default OpenCL execution model, where a kernel deploys multiple work-items arranged into work-groups. However, the default execution model is not always a good fit for an application mapped to the FPGA architecture, which is very different from the multithreaded architecture of GPUs, for which OpenCL was originally designed. In this work, we identify three other execution models that can better utilize the FPGA resources for OpenCL applications that do not fit well into the default execution model. These three execution models are based on two OpenCL features devised for FPGA programming (namely, the single work-item kernel and the OpenCL channel). We observe that the selection of the right execution model determines the performance upper bound of a particular application, which can vary by two orders of magnitude between the most suitable execution model and the most unsuitable one. However, there is no way to select the most suitable execution model other than empirically exploring the optimization space for all four of them, which can be prohibitive. To help FPGA programmers identify the right execution model, we propose Boyi, a systematic framework that makes automatic decisions by analyzing OpenCL programming patterns in an application. After finding the right execution model with the help of Boyi, programmers can apply other conventional optimizations to reach the performance upper bound. Our experimental evaluation shows that Boyi can 1) accurately determine the right execution model, and 2) greatly reduce the exploration space of conventional optimization methods.
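
    To make the contrast between the default execution model and a single work-item kernel concrete, here is a minimal plain-Python emulation. It is only a sketch: it does not call any OpenCL API, it is not part of Boyi, and every name in it (`run_ndrange`, `vadd_ndrange`, `vadd_single_work_item`) is hypothetical. It assumes a one-dimensional index space whose global size is a multiple of the work-group size.

```python
from typing import Callable, List


def run_ndrange(kernel: Callable[..., None], global_size: int,
                local_size: int, *args) -> None:
    """Emulate the default model: invoke the kernel once per work-item,
    visiting the work-items one work-group at a time."""
    if global_size % local_size != 0:
        raise ValueError("this sketch requires global_size % local_size == 0")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            kernel(global_id, *args)


def vadd_ndrange(global_id: int, a: List[int], b: List[int],
                 out: List[int]) -> None:
    """Default-model kernel body: each work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]


def vadd_single_work_item(a: List[int], b: List[int], out: List[int]) -> None:
    """Single work-item kernel: one invocation loops over all elements,
    which an FPGA compiler can turn into a deep pipeline."""
    for i in range(len(out)):
        out[i] = a[i] + b[i]


a = list(range(8))
b = [10 * x for x in range(8)]

out_ndrange = [0] * 8
run_ndrange(vadd_ndrange, 8, 4, a, b, out_ndrange)  # 2 work-groups of 4 items

out_single = [0] * 8
vadd_single_work_item(a, b, out_single)

print(out_ndrange == out_single)
```

    Both variants compute the same result. What differs is how the work is expressed: as many independent work-items, or as one loop. That difference is what an FPGA toolchain maps to hardware in different ways.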

  • Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs
    Field Programmable Gate Arrays, 2020
    Co-Authors: Jiantong Jiang, Zeke Wang, Xue Liu, Nan Guan, Qingxu Deng, Wei Zhang, Juan Gómez-Luna, Onur Mutlu

Christian Faerber - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    Field-Programmable Logic and Applications, 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

  • FPL - Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

Paolo Durante - One of the best experts on this subject based on the ideXlab platform.

  • Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    Field-Programmable Logic and Applications, 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld

  • FPL - Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures
    2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
    Co-Authors: Srikanth Sridharan, Paolo Durante, Christian Faerber, Niko Neufeld