Hardware Accelerator - Explore the Science & Experts

The Experts below are selected from a list of 360 Experts worldwide ranked by the ideXlab platform.

John Kubiatowicz - One of the best experts on this subject based on the ideXlab platform.

A Hardware Accelerator for Tracing Garbage Collection

IEEE Micro, 2019

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz

Abstract:

Many workloads are written in garbage-collected languages and GC consumes a significant fraction of resources for these workloads. We propose to decrease this overhead by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2× the performance of an in-order CPU, at just 18.5% the area. By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area, and energy.

A Hardware Accelerator for Tracing Garbage Collection

International Symposium on Computer Architecture, 2018

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz

Abstract:

A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype of this design, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2x the performance of an in-order CPU, at just 18.5% the area (an amount equivalent to 64KB of SRAM). By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area and energy.
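
The mark phase mentioned in the abstract is the part of a tracing collector that finds every object reachable from the roots. As a rough software illustration only (the abstract does not describe the Accelerator's internals, and the `heap` dictionary, object names, and `mark` function here are hypothetical), a queue-based mark phase might be sketched as:

```python
from collections import deque


def mark(heap, roots):
    """Return the set of object ids reachable from roots.

    heap maps an object id to the list of object ids it references.
    This is a generic software mark phase, not the Accelerator's design.
    """
    marked = set()
    queue = deque(roots)  # mark queue: objects discovered but not yet traced
    while queue:
        obj = queue.popleft()
        if obj in marked:
            continue  # already traced via another path
        marked.add(obj)
        for ref in heap[obj]:
            if ref not in marked:
                queue.append(ref)
    return marked


# Hypothetical heap with a cycle (A -> B -> C -> A) and two unreachable objects.
heap = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"], "E": []}
live = mark(heap, roots=["A"])
garbage = set(heap) - live
```

Everything left in `garbage` after marking is unreachable and could be reclaimed by a later sweep or compaction step.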


Ram Krishnamurthy - One of the best experts on this subject based on the ideXlab platform.

A 230mV-950mV 2.8Tbps/W Unified SHA256/SM3 Secure Hashing Hardware Accelerator in 14nm Tri-Gate CMOS

European Solid-State Circuits Conference, 2018

Co-Authors: Vikram B Suresh, Sudhir K Satpathy, Sanu Mathew, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Ram Krishnamurthy

Abstract:

A unified SHA256/SM3 secure hashing Hardware Accelerator for cross-geo authentication is fabricated in 14nm tri-gate CMOS, with a throughput of 9.5/8.3Gbps respectively measured at 0.75V, 25°C. Message digest pre-addition, with mode-multiplexed digest/scheduler completion adders and distributed final hash computation reduces critical path delay by 14% and Accelerator area by 48%, resulting in a compact layout of 5992µm². 2/4-way parallel message scheduler enables 0.5/0.25× frequency scaling at iso-hash throughput enabling 35/62% scheduler power reduction. Robust sub-threshold voltage operation down to 230mV enables a peak energy-efficiency of 2.8Tbps/W measured at 300mV.
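
The abstract does not say how the parallel message scheduler is built. The sketch below only illustrates the data dependency in the standard SHA-256 message schedule that makes producing two schedule words per step possible: word t+1 reads only words older than t. The function names are hypothetical, and SM3 is not covered.

```python
MASK = 0xFFFFFFFF  # SHA-256 works on 32-bit words


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def next_word(w, t):
    """Standard SHA-256 schedule recurrence for word t (t >= 16)."""
    return (sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK


def schedule_sequential(block):
    """Expand 16 input words to 64, one word per step."""
    w = list(block)
    for t in range(16, 64):
        w.append(next_word(w, t))
    return w


def schedule_two_way(block):
    """Expand 16 input words to 64, two words per step.

    Word t+1 depends on w[t-1], w[t-6], w[t-14], w[t-15], all of which
    exist before word t is produced, so both can be computed side by side.
    """
    w = list(block)
    for t in range(16, 64, 2):
        a = next_word(w, t)
        b = (sigma1(w[t - 1]) + w[t - 6] + sigma0(w[t - 14]) + w[t - 15]) & MASK
        w.extend([a, b])
    return w


# Padded single block for the message "abc".
abc_block = [0x61626380] + [0] * 14 + [0x18]
w_seq = schedule_sequential(abc_block)
w_par = schedule_two_way(abc_block)
```

Both functions produce the same 64 words; the two-way version simply needs half as many steps, which is the kind of trade that allows a lower clock frequency at the same hash throughput.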

340 mV-1.1 V, 289 Gbps/W, 2090-Gate NanoAES Hardware Accelerator With Area-Optimized Encrypt/Decrypt GF(2⁴)² Polynomials in 22 nm Tri-Gate CMOS

IEEE Journal of Solid-State Circuits, 2015

Co-Authors: Sanu Mathew, Vikram B Suresh, Sudhir K Satpathy, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Gregory K Chen, Ram Krishnamurthy

Abstract:

This paper describes an on-die lightweight nanoAES Hardware Accelerator, fabricated in 22 nm tri-gate high-k/metal-gate CMOS, targeted for ultra-low power symmetric-key encryption and decryption on mobile SOCs. Compared to conventional 128 bit AES implementations, this design uses a single 8 bit Sbox circuit along with ShiftRows byte-order data processing to compute all AES rounds in native GF(2⁴)² composite-field. This approach along with a serial-accumulating MixColumns circuit, area-optimized encrypt and decrypt Galois-field polynomials and integrated on-the-fly key generation circuit results in a compact encrypt/decrypt layout occupying 2200/2736 µm² and lowest-reported gate count of 1947/2090 respectively, while achieving: (i) maximum operating frequency of 1.133 GHz and total power consumption of 13 mW with leakage component of 500 µW, measured at 0.9 V, 25 °C, (ii) nominal AES-128 encrypt/decrypt throughput of 432/671 Mbps respectively, with peak energy-efficiency of 289 Gbps/W measured at near-threshold operation of 430 mV (11× higher than previously reported implementations), (iii) encrypt/decrypt latencies of 336/216 cycles and total energy consumption of 3.9/2.5 nJ respectively, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 45 Mbps, 170 µW, measured at 340 mV, 25 °C and (v) first-reported Galois-field polynomial-based micro-architectural co-optimization, resulting in distinct area-optimized encrypt and decrypt polynomials with up to 9% area reduction at iso-performance.
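
The throughput and energy figures in items (ii) and (iii) can be cross-checked against the stated clock frequency, power, and latencies. The sketch below assumes one 128-bit block is processed per latency period with no overlap, and that the 13 mW and 1.133 GHz figures apply to the same operating point; both are assumptions about how the abstract's numbers relate, and the variable names are illustrative.

```python
BLOCK_BITS = 128          # AES block size
FREQ_HZ = 1.133e9         # maximum operating frequency from the abstract
POWER_W = 13e-3           # total power at that operating point
LATENCY_CYCLES = {"encrypt": 336, "decrypt": 216}


def throughput_mbps(cycles):
    """Bits per second when one block completes every `cycles` clock cycles."""
    return BLOCK_BITS * FREQ_HZ / cycles / 1e6


def energy_nj(cycles):
    """Energy per block: power multiplied by the time one block takes."""
    return POWER_W * cycles / FREQ_HZ * 1e9


results = {
    op: (throughput_mbps(cycles), energy_nj(cycles))
    for op, cycles in LATENCY_CYCLES.items()
}
for op, (mbps, nj) in results.items():
    print(f"{op}: {mbps:.1f} Mbps, {nj:.2f} nJ per block")
```

Under these assumptions the computed values round to the abstract's 432/671 Mbps and 3.9/2.5 nJ, which suggests the reported figures are mutually consistent.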


Martin Maas - One of the best experts on this subject based on the ideXlab platform.

A Hardware Accelerator for Tracing Garbage Collection

IEEE Micro, 2019

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz

A Hardware Accelerator for Tracing Garbage Collection

International Symposium on Computer Architecture, 2018

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz


Sanu Mathew - One of the best experts on this subject based on the ideXlab platform.

A 230mV-950mV 2.8Tbps/W Unified SHA256/SM3 Secure Hashing Hardware Accelerator in 14nm Tri-Gate CMOS

European Solid-State Circuits Conference, 2018

Co-Authors: Vikram B Suresh, Sudhir K Satpathy, Sanu Mathew, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Ram Krishnamurthy

340 mV-1.1 V, 289 Gbps/W, 2090-Gate NanoAES Hardware Accelerator With Area-Optimized Encrypt/Decrypt GF(2⁴)² Polynomials in 22 nm Tri-Gate CMOS

IEEE Journal of Solid-State Circuits, 2015

Co-Authors: Sanu Mathew, Vikram B Suresh, Sudhir K Satpathy, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Gregory K Chen, Ram Krishnamurthy


Krste Asanovic - One of the best experts on this subject based on the ideXlab platform.

A Hardware Accelerator for Tracing Garbage Collection

IEEE Micro, 2019

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz

A Hardware Accelerator for Tracing Garbage Collection

International Symposium on Computer Architecture, 2018

Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz

A Hardware Accelerator for Computing an Exact Dot Product

Symposium on Computer Arithmetic, 2017

Co-Authors: Jack Koenig, David Biancolin, Jonathan Bachrach, Krste Asanovic

Abstract:

We study the implementation of a Hardware Accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The Accelerator uses a wide (640 or 4288 bits for single or double-precision respectively) fixed-point representation into which intermediate floating-point products are accumulated. We designed the Accelerator as a generator in Chisel, which can synthesize various configurations of the Accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprised of a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the Accelerator area ranges from 0.05 mm² to 0.32 mm², and all configurations could be clocked at frequencies in excess of 900 MHz. The Accelerator successfully saturates the SoC's memory system, achieving the same per-element efficiency (1 cycle-per-element) as Intel MKL running on an x86 machine with a similar cache configuration.
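
The idea of accumulating products into a wide fixed-point register can be imitated in software. The sketch below is not the Accelerator's design: it uses Python's arbitrary-precision integers as a stand-in for the wide register, with an assumed scale of 2148 fractional bits (enough for the product of the two smallest subnormal doubles), and rounds only once at the end. The names `exact_dot` and `naive_dot` are illustrative, and infinities and NaNs are not handled.

```python
SCALE_BITS = 2148  # 2 * 1074 fractional bits; an assumption of this sketch


def exact_dot(xs, ys):
    """Dot product of finite doubles with a single rounding at the end."""
    acc = 0  # wide fixed-point accumulator, scaled by 2**SCALE_BITS
    for x, y in zip(xs, ys):
        nx, dx = x.as_integer_ratio()  # x == nx / dx, dx is a power of two
        ny, dy = y.as_integer_ratio()
        k = (dx * dy).bit_length() - 1  # dx * dy == 2**k, with k <= SCALE_BITS
        acc += (nx * ny) << (SCALE_BITS - k)  # exact product, aligned to scale
    return acc / (1 << SCALE_BITS)  # int / int is correctly rounded


def naive_dot(xs, ys):
    """Ordinary floating-point dot product, rounding after every step."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
exact = exact_dot(xs, ys)
naive = naive_dot(xs, ys)
```

With these inputs the naive loop loses the small term to rounding when it is added to 1e16, while the wide accumulator keeps it until the final conversion.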
