Microbenchmarks - Explore the Science & Experts

The Experts below are selected from a list of 2721 Experts worldwide ranked by ideXlab platform

Luca Benini - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

IEEE Transactions on Parallel and Distributed Systems, 2021

Co-Authors: Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Abstract:

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable Microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the Microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average.

15 days free trial to Access Article

Florian Glaser - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

IEEE Transactions on Parallel and Distributed Systems, 2021

Co-Authors: Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Abstract:

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable Microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the Microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average.

15 days free trial to Access Article

Qiuting Huang - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

IEEE Transactions on Parallel and Distributed Systems, 2021

Co-Authors: Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Abstract:

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable Microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the Microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average.

15 days free trial to Access Article

Giuseppe Tagliavini - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

IEEE Transactions on Parallel and Distributed Systems, 2021

Co-Authors: Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Abstract:

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable Microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the Microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average.

15 days free trial to Access Article

Davide Rossi - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

IEEE Transactions on Parallel and Distributed Systems, 2021

Co-Authors: Florian Glaser, Giuseppe Tagliavini, Davide Rossi, Germain Haugou, Qiuting Huang, Luca Benini

Abstract:

The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable Microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the Microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average.

15 days free trial to Access Article

Discover everything there is to know about the scientific topic Microbenchmarks with ideXlab!

Luca Benini - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

Florian Glaser - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

Qiuting Huang - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

Giuseppe Tagliavini - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters

Davide Rossi - One of the best experts on this subject based on the ideXlab platform.

energy efficient hardware accelerated synchronization for shared l1 memory multiprocessor clusters