Hardware Accelerator

The experts below are selected from a list of 360 experts worldwide, ranked by the ideXlab platform.

John Kubiatowicz - One of the best experts on this subject based on the ideXlab platform.

  • A Hardware Accelerator for Tracing Garbage Collection
    IEEE Micro, 2019
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    Many workloads are written in garbage-collected languages and GC consumes a significant fraction of resources for these workloads. We propose to decrease this overhead by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2× the performance of an in-order CPU, at just 18.5% the area. By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area, and energy.

  • A Hardware Accelerator for Tracing Garbage Collection
    International Symposium on Computer Architecture, 2018
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype of this design, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2x the performance of an in-order CPU, at just 18.5% the area (an amount equivalent to 64KB of SRAM). By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area and energy.
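
As a rough software illustration of what the "mark phase of a tracing GC" in the abstracts above computes, the following Python sketch marks every object reachable from a set of roots. The heap representation (a dict mapping an object id to the ids it references) and the names `mark`, `heap`, and `roots` are assumptions made for this example. It is not the accelerator's design, only the graph traversal that such a unit would have to perform.

```python
from collections import deque


def mark(heap, roots):
    """Return the set of object ids reachable from `roots`.

    `heap` is a hypothetical stand-in for real object memory:
    a dict mapping each object id to the list of ids it references.
    """
    marked = set()
    queue = deque(roots)          # work list of objects still to visit
    while queue:
        obj = queue.popleft()
        if obj in marked:         # already visited, skip to avoid cycles
            continue
        marked.add(obj)
        queue.extend(heap[obj])   # enqueue every outgoing reference
    return marked


# "d" is not reachable from the root "a", so a sweep phase could reclaim it.
example_heap = {"a": ["b"], "b": ["a", "c"], "c": [], "d": ["c"]}
live = mark(example_heap, ["a"])
```

Anything left unmarked after the traversal is garbage. The traversal is dominated by pointer chasing through memory, which may be one reason the abstracts place the unit close to the memory controller.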

Ram Krishnamurthy - One of the best experts on this subject based on the ideXlab platform.

  • A 230mV-950mV 2.8Tbps/W Unified SHA256/SM3 Secure Hashing Hardware Accelerator in 14nm Tri-Gate CMOS
    European Solid-State Circuits Conference, 2018
    Co-Authors: Vikram B Suresh, Sudhir K Satpathy, Sanu Mathew, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Ram Krishnamurthy
    Abstract:

    A unified SHA256/SM3 secure hashing Hardware Accelerator for cross-geo authentication is fabricated in 14nm tri-gate CMOS, with a throughput of 9.5/8.3Gbps respectively measured at 0.75V, 25°C. Message digest pre-addition, with mode-multiplexed digest/scheduler completion adders and distributed final hash computation reduces critical path delay by 14% and Accelerator area by 48%, resulting in a compact layout of 5992µm². 2/4-way parallel message scheduler enables 0.5/0.25× frequency scaling at iso-hash throughput enabling 35/62% scheduler power reduction. Robust sub-threshold voltage operation down to 230mV enables a peak energy-efficiency of 2.8Tbps/W measured at 300mV.

  • 340 mV-1.1 V, 289 Gbps/W, 2090-Gate NanoAES Hardware Accelerator with Area-Optimized Encrypt/Decrypt GF(2^4)^2 Polynomials in 22 nm Tri-Gate CMOS
    IEEE Journal of Solid-state Circuits, 2015
    Co-Authors: Sanu Mathew, Vikram B Suresh, Sudhir K Satpathy, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Gregory K Chen, Ram Krishnamurthy
    Abstract:

    This paper describes an on-die lightweight nanoAES Hardware Accelerator, fabricated in 22 nm tri-gate high-k/metal-gate CMOS, targeted for ultra-low power symmetric-key encryption and decryption on mobile SOCs. Compared to conventional 128 bit AES implementations, this design uses a single 8 bit Sbox circuit along with ShiftRows byte-order data processing to compute all AES rounds in native GF(2^4)^2 composite-field. This approach along with a serial-accumulating MixColumns circuit, area-optimized encrypt and decrypt Galois-field polynomials and integrated on-the-fly key generation circuit results in a compact encrypt/decrypt layout occupying 2200/2736 µm² and lowest-reported gate count of 1947/2090 respectively, while achieving: (i) maximum operating frequency of 1.133 GHz and total power consumption of 13 mW with leakage component of 500 µW, measured at 0.9 V, 25 °C, (ii) nominal AES-128 encrypt/decrypt throughput of 432/671 Mbps respectively, with peak energy-efficiency of 289 Gbps/W measured at near-threshold operation of 430 mV (11× higher than previously reported implementations), (iii) encrypt/decrypt latencies of 336/216 cycles and total energy consumption of 3.9/2.5 nJ respectively, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 45 Mbps, 170 µW, measured at 340 mV, 25 °C and (v) first-reported Galois-field polynomial-based micro-architectural co-optimization, resulting in distinct area-optimized encrypt and decrypt polynomials with up to 9% area reduction at iso-performance.
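
The throughput and energy figures in the nanoAES abstract above can be cross-checked with a few lines of arithmetic. The sketch below assumes (this is not stated explicitly in the abstract) that the nominal numbers all refer to the 1.133 GHz, 13 mW operating point and that one 128-bit block is processed at a time. The function names are made up for this example.

```python
FREQ_HZ = 1.133e9   # maximum operating frequency quoted at 0.9 V
POWER_W = 13e-3     # total power quoted at the same operating point
BLOCK_BITS = 128    # AES block size in bits


def throughput_mbps(latency_cycles):
    """Bits per second for one block every `latency_cycles`, in Mbps."""
    return BLOCK_BITS * FREQ_HZ / latency_cycles / 1e6


def energy_nj(latency_cycles):
    """Energy spent on one block (power times block time), in nJ."""
    return POWER_W * latency_cycles / FREQ_HZ * 1e9


encrypt_mbps = throughput_mbps(336)   # encrypt latency: 336 cycles
decrypt_mbps = throughput_mbps(216)   # decrypt latency: 216 cycles
encrypt_nj = energy_nj(336)
decrypt_nj = energy_nj(216)
```

Under these assumptions the results come out at roughly 432 and 671 Mbps and roughly 3.9 and 2.5 nJ, which agrees with the values reported in the abstract.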

Martin Maas - One of the best experts on this subject based on the ideXlab platform.

  • A Hardware Accelerator for Tracing Garbage Collection
    IEEE Micro, 2019
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    Many workloads are written in garbage-collected languages and GC consumes a significant fraction of resources for these workloads. We propose to decrease this overhead by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2× the performance of an in-order CPU, at just 18.5% the area. By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area, and energy.

  • A Hardware Accelerator for Tracing Garbage Collection
    International Symposium on Computer Architecture, 2018
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype of this design, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2x the performance of an in-order CPU, at just 18.5% the area (an amount equivalent to 64KB of SRAM). By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area and energy.

Sanu Mathew - One of the best experts on this subject based on the ideXlab platform.

  • A 230mV-950mV 2.8Tbps/W Unified SHA256/SM3 Secure Hashing Hardware Accelerator in 14nm Tri-Gate CMOS
    European Solid-State Circuits Conference, 2018
    Co-Authors: Vikram B Suresh, Sudhir K Satpathy, Sanu Mathew, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Ram Krishnamurthy
    Abstract:

    A unified SHA256/SM3 secure hashing Hardware Accelerator for cross-geo authentication is fabricated in 14nm tri-gate CMOS, with a throughput of 9.5/8.3Gbps respectively measured at 0.75V, 25°C. Message digest pre-addition, with mode-multiplexed digest/scheduler completion adders and distributed final hash computation reduces critical path delay by 14% and Accelerator area by 48%, resulting in a compact layout of 5992µm². 2/4-way parallel message scheduler enables 0.5/0.25× frequency scaling at iso-hash throughput enabling 35/62% scheduler power reduction. Robust sub-threshold voltage operation down to 230mV enables a peak energy-efficiency of 2.8Tbps/W measured at 300mV.

  • 340 mV-1.1 V, 289 Gbps/W, 2090-Gate NanoAES Hardware Accelerator with Area-Optimized Encrypt/Decrypt GF(2^4)^2 Polynomials in 22 nm Tri-Gate CMOS
    IEEE Journal of Solid-state Circuits, 2015
    Co-Authors: Sanu Mathew, Vikram B Suresh, Sudhir K Satpathy, Mark A Anders, Himanshu Kaul, Amit Agarwal, Steven K Hsu, Gregory K Chen, Ram Krishnamurthy
    Abstract:

    This paper describes an on-die lightweight nanoAES Hardware Accelerator, fabricated in 22 nm tri-gate high-k/metal-gate CMOS, targeted for ultra-low power symmetric-key encryption and decryption on mobile SOCs. Compared to conventional 128 bit AES implementations, this design uses a single 8 bit Sbox circuit along with ShiftRows byte-order data processing to compute all AES rounds in native GF(2^4)^2 composite-field. This approach along with a serial-accumulating MixColumns circuit, area-optimized encrypt and decrypt Galois-field polynomials and integrated on-the-fly key generation circuit results in a compact encrypt/decrypt layout occupying 2200/2736 µm² and lowest-reported gate count of 1947/2090 respectively, while achieving: (i) maximum operating frequency of 1.133 GHz and total power consumption of 13 mW with leakage component of 500 µW, measured at 0.9 V, 25 °C, (ii) nominal AES-128 encrypt/decrypt throughput of 432/671 Mbps respectively, with peak energy-efficiency of 289 Gbps/W measured at near-threshold operation of 430 mV (11× higher than previously reported implementations), (iii) encrypt/decrypt latencies of 336/216 cycles and total energy consumption of 3.9/2.5 nJ respectively, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 45 Mbps, 170 µW, measured at 340 mV, 25 °C and (v) first-reported Galois-field polynomial-based micro-architectural co-optimization, resulting in distinct area-optimized encrypt and decrypt polynomials with up to 9% area reduction at iso-performance.
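
The SHA256/SM3 entry in this list refers to a "message scheduler". For readers unfamiliar with the term, the sketch below shows the SHA-256 message schedule as the hash standard defines it in software: 16 input words are expanded into 64 round words. It says nothing about the circuit techniques in the paper (pre-addition, 2/4-way parallelism), and the function names are chosen for this example only.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def message_schedule(block_words):
    """Expand one 512-bit block (16 words of 32 bits) into 64 round words."""
    if len(block_words) != 16:
        raise ValueError("a SHA-256 block is exactly 16 32-bit words")
    w = list(block_words)
    for t in range(16, 64):
        # Each new word depends on four earlier words of the schedule.
        w.append((small_sigma1(w[t - 2]) + w[t - 7]
                  + small_sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w


# Padded single block for the message "abc" (24 bits long).
abc_block = [0x61626380] + [0] * 14 + [0x00000018]
schedule = message_schedule(abc_block)
```

Each word depends on words produced 2, 7, 15, and 16 steps earlier, so the recurrence is the quantity a parallel scheduler has to unroll when it computes several words per cycle.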

Krste Asanovic - One of the best experts on this subject based on the ideXlab platform.

  • A Hardware Accelerator for Tracing Garbage Collection
    IEEE Micro, 2019
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    Many workloads are written in garbage-collected languages and GC consumes a significant fraction of resources for these workloads. We propose to decrease this overhead by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2× the performance of an in-order CPU, at just 18.5% the area. By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area, and energy.

  • A Hardware Accelerator for Tracing Garbage Collection
    International Symposium on Computer Architecture, 2018
    Co-Authors: Martin Maas, Krste Asanovic, John Kubiatowicz
    Abstract:

    A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small Hardware Accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC Accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype of this design, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2x the performance of an in-order CPU, at just 18.5% the area (an amount equivalent to 64KB of SRAM). By prototyping our design in a real system, we show that our Accelerator can be adopted without invasive changes to the SoC, and estimate its performance, area and energy.

  • A Hardware Accelerator for Computing an Exact Dot Product
    Symposium on Computer Arithmetic, 2017
    Co-Authors: Jack Koenig, David Biancolin, Jonathan Bachrach, Krste Asanovic
    Abstract:

    We study the implementation of a Hardware Accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The Accelerator uses a wide (640 or 4288 bits for single or double-precision respectively) fixed-point representation into which intermediate floating-point products are accumulated. We designed the Accelerator as a generator in Chisel, which can synthesize various configurations of the Accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the Accelerator area ranges from 0.05 mm² to 0.32 mm², and all configurations could be clocked at frequencies in excess of 900 MHz. The Accelerator successfully saturates the SoC's memory system, achieving the same per-element efficiency (1 cycle-per-element) as Intel MKL running on an x86 machine with a similar cache configuration.
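
The idea of accumulating floating-point products into a wide fixed-point register can be imitated in software. The Python sketch below is an illustration under assumptions, not the paper's design: it uses Python's arbitrary-precision integers in place of a fixed-width accumulator, handles finite double-precision inputs only, and the names `to_fixed`, `exact_dot`, and `naive_dot` are invented for this example.

```python
from fractions import Fraction

# The smallest positive double is 2**-1074, so every finite double is an
# integer multiple of 2**-1074 and every product a multiple of 2**-2148.
UNIT_BITS = 1074
FRAC_BITS = 2 * UNIT_BITS


def to_fixed(x):
    """Exact integer n such that x == n * 2**-1074 (finite x only)."""
    num, den = x.as_integer_ratio()        # den is a power of two
    return num * ((1 << UNIT_BITS) // den)


def exact_dot(xs, ys):
    """Dot product with exact accumulation and a single final rounding."""
    acc = 0                                # stands in for the wide register
    for x, y in zip(xs, ys):
        acc += to_fixed(x) * to_fixed(y)   # exact, in units of 2**-2148
    return float(Fraction(acc, 1 << FRAC_BITS))


def naive_dot(xs, ys):
    """Ordinary floating-point accumulation, rounding after every step."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


xs, ys = [1e16, 1.0, -1e16], [1.0, 1.0, 1.0]
exact_result = exact_dot(xs, ys)
naive_result = naive_dot(xs, ys)
```

For this input the naive loop loses the middle term to rounding and returns 0.0, while the exact accumulation returns 1.0. A hardware accumulator would bound the register width instead of growing it on demand.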