Banked Register

The Experts below are selected from a list of 144 Experts worldwide ranked by the ideXlab platform.

Mazen A. R. Saghir - One of the best experts on this subject based on the ideXlab platform.

  • FPL - Microarchitectural Enhancements for Configurable Multi-Threaded Soft Processors
    2007 International Conference on Field Programmable Logic and Applications, 2007
    Co-Authors: Roger Moussali, Nabil Ghanem, Mazen A. R. Saghir
    Abstract:

    This paper describes a number of microarchitectural techniques for supporting multithreading in soft processor cores. These include a new thread scheduler that combines interleaved and block multithreading; a table of operation latencies (TOOL) for determining instruction latencies; support for arbitrary-latency custom computational units; and a multi-Banked Register file for supporting simultaneous write-back operations from different threads. Our results show that four-way multithreaded processors achieve speedups of up to 26% over a single-threaded processor executing benchmarks that only use regular instructions, and up to 47% when executing benchmarks that include long-latency instructions.
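
    As an illustration of why a multi-Banked Register file can accept write-backs from several threads in the same cycle, the following is a minimal sketch and not the authors' design. It assumes one bank per thread and a single write port per bank; the class name, method names, and sizes are hypothetical.

```python
from typing import List, Tuple

Write = Tuple[int, int, int]  # (thread_id, register_index, value)


class BankedRegisterFile:
    """Toy model: one bank per thread, one write port per bank (assumed)."""

    def __init__(self, num_threads: int = 4, regs_per_thread: int = 32) -> None:
        self.banks = [[0] * regs_per_thread for _ in range(num_threads)]

    def write_back(self, writes: List[Write]) -> List[Write]:
        """Apply all writes that retire in the same cycle.

        Writes aimed at different banks proceed in parallel. A second write
        to an already-used bank is returned so the caller can retry it later.
        """
        busy_banks = set()
        deferred = []
        for thread_id, reg, value in writes:
            if thread_id in busy_banks:
                deferred.append((thread_id, reg, value))
            else:
                busy_banks.add(thread_id)
                self.banks[thread_id][reg] = value
        return deferred

    def read(self, thread_id: int, reg: int) -> int:
        return self.banks[thread_id][reg]


rf = BankedRegisterFile()
# Threads 0 and 1 write in the same cycle; the second write from thread 0 waits.
leftover = rf.write_back([(0, 1, 10), (1, 1, 20), (0, 2, 30)])
```

    In this sketch, two threads never contend for the same bank, so conflicts arise only when one thread retires more than one result in a cycle.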

Roger Moussali - One of the best experts on this subject based on the ideXlab platform.

  • FPL - Microarchitectural Enhancements for Configurable Multi-Threaded Soft Processors
    2007 International Conference on Field Programmable Logic and Applications, 2007
    Co-Authors: Roger Moussali, Nabil Ghanem, Mazen A. R. Saghir
    Abstract:

    This paper describes a number of microarchitectural techniques for supporting multithreading in soft processor cores. These include a new thread scheduler that combines interleaved and block multithreading; a table of operation latencies (TOOL) for determining instruction latencies; support for arbitrary-latency custom computational units; and a multi-Banked Register file for supporting simultaneous write-back operations from different threads. Our results show that four-way multithreaded processors achieve speedups of up to 26% over a single-threaded processor executing benchmarks that only use regular instructions, and up to 47% when executing benchmarks that include long-latency instructions.

Murali Annavaram - One of the best experts on this subject based on the ideXlab platform.

  • Warped-Compression: Enabling power efficient GPUs through Register compression
    2015 ACM IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015
    Co-Authors: Hyeran Jeon, Won Woo Ro, Murali Annavaram
    Abstract:

    This paper presents Warped-Compression, a warp-level Register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the Register values of threads within the same warp are similar, namely that the arithmetic differences between two successive thread Registers are small. Removing data redundancy of Register values through Register compression reduces the effective Register width, thereby enabling power reduction opportunities. GPU Register files are huge, as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result, the Register file consumes a large fraction of the total GPU chip power. GPU design trends show that the Register file size will continue to increase to enable even more thread-level parallelism. To reduce Register file data redundancy, Warped-Compression uses a low-cost and implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the Banked Register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single Register, or one of the Register banks, as the primary base and then computing delta values of all the other Registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing Register values, each warp-level Register access activates fewer Register banks, which leads to a reduction in dynamic power. When fewer banks are used to store the Register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that Register compression saves 25% of the total Register file power consumption.
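
    The base-plus-delta idea described in this abstract can be sketched as follows. This is a simplified illustration and not the paper's hardware design: it assumes thread 0's value is the base, 32 threads per warp, 4-byte Register values, and 16-byte banks. All function names and parameters are hypothetical.

```python
from typing import List, Optional, Tuple

WARP_SIZE = 32   # assumed threads per warp
VALUE_BYTES = 4  # assumed width of one thread's register value
BANK_BYTES = 16  # assumed width of one register bank


def compress(values: List[int], delta_bytes: int) -> Optional[Tuple[int, List[int]]]:
    """Encode a warp-wide register as (base, deltas), or None if it does not fit.

    The first thread's value is the base; every delta must fit in a signed
    integer of `delta_bytes` bytes.
    """
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [v - base for v in values]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None


def decompress(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the per-thread values from the base and the deltas."""
    return [base + d for d in deltas]


def banks_touched(delta_bytes: Optional[int]) -> int:
    """Banks needed to hold one warp register; None means uncompressed."""
    if delta_bytes is None:
        total = VALUE_BYTES * WARP_SIZE
    else:
        total = VALUE_BYTES + delta_bytes * WARP_SIZE
    return -(-total // BANK_BYTES)  # ceiling division


# Typical pattern: each thread holds an address offset by its thread index.
warp_values = [1000 + 4 * t for t in range(WARP_SIZE)]
encoded = compress(warp_values, delta_bytes=1)
```

    Under these assumptions, a compressed Register occupies fewer banks than an uncompressed one, which is the property the abstract links to lower dynamic power (fewer banks activated) and lower leakage power (unused banks can be power gated).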

Nabil Ghanem - One of the best experts on this subject based on the ideXlab platform.

  • FPL - Microarchitectural Enhancements for Configurable Multi-Threaded Soft Processors
    2007 International Conference on Field Programmable Logic and Applications, 2007
    Co-Authors: Roger Moussali, Nabil Ghanem, Mazen A. R. Saghir
    Abstract:

    This paper describes a number of microarchitectural techniques for supporting multithreading in soft processor cores. These include a new thread scheduler that combines interleaved and block multithreading; a table of operation latencies (TOOL) for determining instruction latencies; support for arbitrary-latency custom computational units; and a multi-Banked Register file for supporting simultaneous write-back operations from different threads. Our results show that four-way multithreaded processors achieve speedups of up to 26% over a single-threaded processor executing benchmarks that only use regular instructions, and up to 47% when executing benchmarks that include long-latency instructions.

O. Ergin - One of the best experts on this subject based on the ideXlab platform.

  • IEEE PACT - Compiler directed early Register release
    14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), 2005
    Co-Authors: T.M. Jones, A. Gonzalez, J. Abella, M.F.P. O'Boyle, O. Ergin
    Abstract:

    This paper presents a novel compiler-directed technique to reduce the Register pressure and power of the Register file by releasing Registers early. The compiler identifies Registers that will only be read once and renames them to different logical Registers. Upon issuing an instruction with one of these logical Registers as a source, the processor knows that there will be no more uses of it and can release the Register through checkpointing. This reduces the occupancy of our Banked Register file, allowing banks to be turned off for power savings. Our scheme is faster, simpler and requires less hardware than recently proposed techniques. It also maintains precise interrupts and exceptions where many other techniques do not. We reduce Register occupancy by 28% in a large Register file and gain in performance too; this translates into dynamic and static power savings of 18%. When compared to state-of-the-art approaches for varying Register file sizes, our scheme is always faster (higher IPC) and always achieves a greater reduction in Register file occupancy.
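
    The link between early release and bank power-down can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's mechanism: it omits the checkpointing that the abstract relies on for precise exceptions, it assumes a lowest-index-first allocation policy so that live Registers cluster in few banks, and every name in it is hypothetical.

```python
from typing import Iterable, Set


class BankedFreeList:
    """Toy physical register allocator over a banked register file."""

    def __init__(self, num_banks: int = 4, regs_per_bank: int = 8) -> None:
        self.num_banks = num_banks
        self.regs_per_bank = regs_per_bank
        self.in_use: Set[int] = set()

    def allocate(self) -> int:
        # Lowest free index first (assumed policy), so occupied registers
        # stay packed in the low-numbered banks.
        for phys in range(self.num_banks * self.regs_per_bank):
            if phys not in self.in_use:
                self.in_use.add(phys)
                return phys
        raise RuntimeError("no free physical register")

    def release(self, phys: int) -> None:
        self.in_use.discard(phys)

    def powered_banks(self) -> int:
        """Banks that hold at least one live register and must stay on."""
        return len({phys // self.regs_per_bank for phys in self.in_use})


def issue(free_list: BankedFreeList, sources: Iterable[int], single_use: Set[int]) -> None:
    """Release, at issue time, each source the compiler marked as read-once."""
    for phys in sources:
        if phys in single_use:
            free_list.release(phys)


fl = BankedFreeList(num_banks=2, regs_per_bank=2)
a, b, c = fl.allocate(), fl.allocate(), fl.allocate()
banks_before = fl.powered_banks()
issue(fl, sources=[c], single_use={c})  # c is read once, so it is freed here
banks_after = fl.powered_banks()
```

    In this model, freeing the read-once Register at issue empties the second bank, which could then be turned off until it is needed again.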
