Register File

The Experts below are selected from a list of 8757 Experts worldwide ranked by the ideXlab platform

M. Valero - One of the best experts on this subject based on the ideXlab platform.

  • Scalable Distributed Register File
    2014
    Co-Authors: Ruben Gonzalez, Alexander Veidenbaum, Miquel Pericàs, M. Valero
    Abstract:

    In microarchitectural design, conceptual simplicity does not always lead to reduced technological complexity. VLSI design offers several standard structures which get very inefficient when they are scaled up. For instance, the superscalar OOO processing model is conceptually simple – with the control-flow oriented front-end and the dataflow oriented backend – but simply scaling the structures in this architecture for wider issue soon shows several bottlenecks, particularly related to power and timing issues. In this paper we propose a superscalar architecture obtained after identifying the Register File as a core bottleneck structure of a microprocessor and replacing it with a more efficient model that also scales better. Taking into account the status of the Registers and the source of operands we redesigned the microarchitecture around a distributed set of Register Files and an extended Instruction Queue. We compare our architecture to widely used superscalar architectures using either a central physical/architectural Register File or a Future File. We found that the traditional Future File architecture is much less scalable than our Promotional Architecture. With a 256-entry ROB, the Future File Architecture (FFA) loses 7% speed on average I

  • An optimized front-end physical Register File with banking and writeback filtering
    Lecture Notes in Computer Science, 2005
    Co-Authors: Miquel Pericàs, Ruben Gonzalez, Alexander Veidenbaum, A. Cristal, M. Valero
    Abstract:

    Register File design is one of the critical issues facing designers of out-of-order processors. Scaling up its size and number of ports with issue width and instruction window size is difficult in terms of both performance and power consumption. Two types of Register File architectures have been proposed in the past: a future logical File and a centralized physical File. The centralized Register File does not scale well but allows fast branch misprediction recovery. The Future File scales well, but requires reservation stations and has slow misprediction recovery. This paper proposes a Register File architecture that combines the best features of both approaches. The new Register File has the large size of the centralized File and its ability to quickly recover from branch misprediction. It has the advantage of the future File in that it is accessed in the front end allowing about 1/3rd of the source operands that are ready when an instruction enters the window to be read immediately. The remaining operands come from bypass logic / instruction queues and do not require Register File access. The new architecture does require reservation stations for operand storage and it investigates two approaches in terms of power-efficiency. Another advantage of the new architecture is that banking is much easier to use in this case as compared to the centralized Register File. Banking further improves the scalability of the new architecture. A technique for early release of short-lived Registers called writeback filtering is used in combination with banking to further improve the new architecture. The use of a large front-end Register File results in significant power savings and a slight IPC degradation (less than 1%). Overall, the resulting energy-delay product is lower than in previous proposals.

  • A content aware integer Register File organization
    International Symposium on Computer Architecture, 2004
    Co-Authors: Ruben Gonzalez, A. Cristal, Alexander Veidenbaum, D. Ortega, M. Valero
    Abstract:

    A Register File is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the Register File's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer Register File organization, which reduces energy consumption, area, and access time of the Register File with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new Register File is described and shown to obtain proposed optimized Register File designs. Overall, an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in the access time are achieved in the new Register File. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increased due to a reduction in the Register File access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.

  • Multiple-banked Register File architectures
    Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201), 2000
    Co-Authors: J L Cruz, M. Valero, A. González, N.p. Topham
    Abstract:

    The Register File access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more Register ports) and the size of the instruction window (which implies more Registers), and to use some kind of multithreading. Under this scenario, the Register File access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage Register File has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a Register File architecture composed of multiple banks. In particular we focus on a multi-level organization of the Register File, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked Register File, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the Register File access time is factored in.

  • ISCA - Multiple-banked Register File architectures
    Proceedings of the 27th annual international symposium on Computer architecture - ISCA '00, 2000
    Co-Authors: J L Cruz, M. Valero, A. González, N.p. Topham
    Abstract:

    The Register File access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more Register ports) and the size of the instruction window (which implies more Registers), and to use some kind of multithreading. Under this scenario, the Register File access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage Register File has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a Register File architecture composed of multiple banks. In particular we focus on a multi-level organization of the Register File, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked Register File, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the Register File access time is factored in.
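The IPC and performance figures quoted in the multiple-banked Register File abstract above can be related with a little arithmetic. The sketch below assumes that performance is simply IPC multiplied by clock frequency; that model, and the function name `implied_clock_ratio`, are assumptions made for illustration and are not taken from the paper.

```python
def implied_clock_ratio(ipc_change: float, perf_change: float) -> float:
    """Clock-frequency ratio implied by an IPC change and a net performance change.

    Assumes performance = IPC * clock frequency (an assumption of this sketch,
    not a formula stated in the abstract).
    """
    return (1.0 + perf_change) / (1.0 + ipc_change)


# Figures quoted in the abstract: IPC drops by 10% / 2%,
# performance rises by 87% / 92% once access time is factored in.
specint = implied_clock_ratio(ipc_change=-0.10, perf_change=0.87)
specfp = implied_clock_ratio(ipc_change=-0.02, perf_change=0.92)

print(f"SpecInt95 implied clock ratio: {specint:.2f}")
print(f"SpecFP95 implied clock ratio: {specfp:.2f}")
```

Under this simplified model, both benchmark suites imply a clock rate roughly twice that of the non-pipelined single-banked baseline, which is how a lower IPC can still yield a large net gain.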

William J. Dally - One of the best experts on this subject based on the ideXlab platform.

  • A Compile-Time Managed Multi-Level Register File Hierarchy
    2012
    Co-Authors: Mark Gebhart, Stephen W. Keckler, William J. Dally
    Abstract:

    As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware based, combined hardware and software techniques likely will be the most productive. This work redesigns the Register File system of a modern throughput processor with a combined hardware and software solution that reduces Register File energy without harming system performance. Throughput processors utilize a large number of threads to tolerate latency, requiring a large, energy-intensive Register File to store thread context. Our results show that a compiler controlled Register File hierarchy can reduce Register File energy by up to 54%, compared to a hardware only caching approach that reduces Register File energy by 34%. We explore Register allocation algorithms that are specifically targeted to improve energy efficiency by sharing temporary Register File resources across concurrently running threads and conduct a detailed limit study on the further potential to optimize operand delivery for throughput processors. Our efficiency gains represent a direct performance gain for power limited systems, such as GPUs.

  • A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
    ACM Transactions on Computer Systems, 2012
    Co-Authors: Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, Kevin Skadron
    Abstract:

    Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large Register File, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic Register File found on modern designs with a hierarchical Register File. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the Register File hierarchy. Combined with a hierarchical Register File, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the Register File hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed Register File hierarchy reduces Register File energy by 54%.

  • MICRO - A compile-time managed multi-level Register File hierarchy
    Proceedings of the 44th Annual IEEE ACM International Symposium on Microarchitecture - MICRO-44 '11, 2011
    Co-Authors: Mark Gebhart, Stephen W. Keckler, William J. Dally
    Abstract:

    As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware based, combined hardware and software techniques likely will be the most productive. This work redesigns the Register File system of a modern throughput processor with a combined hardware and software solution that reduces Register File energy without harming system performance. Throughput processors utilize a large number of threads to tolerate latency, requiring a large, energy-intensive Register File to store thread context. Our results show that a compiler controlled Register File hierarchy can reduce Register File energy by up to 54%, compared to a hardware only caching approach that reduces Register File energy by 34%. We explore Register allocation algorithms that are specifically targeted to improve energy efficiency by sharing temporary Register File resources across concurrently running threads and conduct a detailed limit study on the further potential to optimize operand delivery for throughput processors. Our efficiency gains represent a direct performance gain for power limited systems, such as GPUs.

  • HPCA - The Named-State Register File: implementation and performance
    Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture, 1
    Co-Authors: P.r. Nuth, William J. Dally
    Abstract:

    Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper introduces the Named-State Register File, a fine-grain associative Register File. The NSF uses hardware and software techniques to efficiently manage Registers among sequential or parallel procedure activations. The NSF holds more live data per Register than conventional Register Files, and requires much less spill and reload traffic to switch between concurrent contexts. The NSF speeds execution of some sequential and parallel programs by 9% to 17% over alternative Register File organizations. The NSF has access time comparable to a conventional Register File and only adds 5% to the area of a typical processor chip.
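The idea of a fine-grain associative Register File, as described in the Named-State Register File abstract above, can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the class name, the one-register line granularity, the LRU replacement policy, and the dictionary standing in for memory are all hypothetical choices, not details from the paper.

```python
from collections import OrderedDict


class NamedStateRegisterFile:
    """Toy associative register file keyed by (context_id, register offset).

    Assumptions of this sketch: one register per line and LRU replacement.
    """

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # (context_id, offset) -> value, oldest first
        self.backing = {}           # spilled registers (stand-in for memory)
        self.spills = 0
        self.reloads = 0

    def _make_room(self):
        # Spill a single least-recently-used register, not a whole context.
        if len(self.lines) >= self.num_lines:
            victim, value = self.lines.popitem(last=False)
            self.backing[victim] = value
            self.spills += 1

    def write(self, context_id, offset, value):
        key = (context_id, offset)
        if key not in self.lines:
            self._make_room()
            self.backing.pop(key, None)  # new value supersedes any spilled copy
        self.lines[key] = value
        self.lines.move_to_end(key)

    def read(self, context_id, offset):
        key = (context_id, offset)
        if key not in self.lines:
            # Miss: reload just this register on demand.
            value = self.backing.pop(key)
            self._make_room()
            self.lines[key] = value
            self.reloads += 1
        self.lines.move_to_end(key)
        return self.lines[key]


nsf = NamedStateRegisterFile(num_lines=4)
for ctx in (0, 1):
    for reg in range(2):
        nsf.write(ctx, reg, 10 * ctx + reg)

# Two contexts are resident at once, so switching between them moves no data.
print(nsf.read(0, 1), nsf.read(1, 0), "spills:", nsf.spills)

# A fifth live register evicts exactly one register, not a whole context.
nsf.write(2, 0, 99)
print("spills after new context:", nsf.spills)
```

Because registers are named by context and offset rather than by position in a fixed frame, only registers that are actually live occupy storage, which is the property the abstract credits for the reduced spill and reload traffic.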

Murali Annavaram - One of the best experts on this subject based on the ideXlab platform.

  • GPU Register File virtualization
    International Symposium on Microarchitecture, 2015
    Co-Authors: Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, Murali Annavaram
    Abstract:

    To support a massive number of parallel thread contexts, Graphics Processing Units (GPUs) use a huge Register File, which is responsible for a large fraction of a GPU's total power and area. The conventional belief is that a large Register File is inevitable for accommodating more parallel thread contexts, and technology scaling makes it feasible to incorporate an ever-increasing Register File size. In this paper, we demonstrate that the Register File size need not be large to accommodate more thread contexts. We first characterize the useful lifetime of a Register and show that Register lifetimes vary drastically across various Registers that are allocated to a kernel. While some Registers are alive for the entire duration of the kernel execution, some Registers have a short lifespan. We propose GPU Register File virtualization that allows multiple warps to share physical Registers. Since warps may be scheduled for execution at different points in time, we propose to proactively release dead Registers from one warp and re-allocate them to a different warp that may occur later in time, thereby reducing the needless demand for physical Registers. By using Register virtualization, we shrink the architected Register space to a smaller physical Register space. By under-provisioning the physical Register File to be smaller than the architected Register File we reduce dynamic and static power consumption. We then develop a new Register throttling mechanism to run applications that exceed the size of the under-provisioned Register File without any deadlock. Our evaluation shows that even after halving the architected Register File size using our proposed GPU Register File virtualization, applications run successfully with negligible performance overhead.

  • MICRO - GPU Register File virtualization
    Proceedings of the 48th International Symposium on Microarchitecture, 2015
    Co-Authors: Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, Murali Annavaram
    Abstract:

    To support a massive number of parallel thread contexts, Graphics Processing Units (GPUs) use a huge Register File, which is responsible for a large fraction of a GPU's total power and area. The conventional belief is that a large Register File is inevitable for accommodating more parallel thread contexts, and technology scaling makes it feasible to incorporate an ever-increasing Register File size. In this paper, we demonstrate that the Register File size need not be large to accommodate more thread contexts. We first characterize the useful lifetime of a Register and show that Register lifetimes vary drastically across various Registers that are allocated to a kernel. While some Registers are alive for the entire duration of the kernel execution, some Registers have a short lifespan. We propose GPU Register File virtualization that allows multiple warps to share physical Registers. Since warps may be scheduled for execution at different points in time, we propose to proactively release dead Registers from one warp and re-allocate them to a different warp that may occur later in time, thereby reducing the needless demand for physical Registers. By using Register virtualization, we shrink the architected Register space to a smaller physical Register space. By under-provisioning the physical Register File to be smaller than the architected Register File we reduce dynamic and static power consumption. We then develop a new Register throttling mechanism to run applications that exceed the size of the under-provisioned Register File without any deadlock. Our evaluation shows that even after halving the architected Register File size using our proposed GPU Register File virtualization, applications run successfully with negligible performance overhead.

  • Warped Register File: A power efficient Register File for GPGPUs
    High-Performance Computer Architecture, 2013
    Co-Authors: Mohammad Abdelmajeed, Murali Annavaram
    Abstract:

    General purpose graphics processing units (GPGPUs) have the ability to execute hundreds of concurrent threads. To support massive parallelism GPGPUs provide a very large Register File, even larger than a cache, to hold the state of each thread. As technology scales, the leakage power consumption of the SRAM cells is getting worse making the Register File static power consumption a major concern. As the supply voltage scaling slows, dynamic power consumption of a Register File is not reducing. These concerns are particularly acute in GPGPUs due to their large Register File size. This paper presents two techniques to reduce the GPGPU Register File power consumption. By exploiting the unique software execution model of GPGPUs, we propose a tri-modal Register access control unit to reduce the leakage power. This unit first turns off any unallocated Register, and places all allocated Registers into drowsy state immediately after each access. The average inter-access distance to a Register is 789 cycles in GPGPUs. Hence, aggressively moving a Register into drowsy state immediately after each access results in 90% reduction in leakage power with negligible performance impact. To reduce dynamic power this paper proposes an active mask aware activity gating unit that avoids charging bit lines and wordlines of Registers associated with all inactive threads within a warp. Due to insufficient parallelism and branch divergence warps have many inactive threads. Hence, Registers associated with inactive threads can be identified precisely using the active mask. By combining the two techniques we show that the power consumption of the Register File can be reduced by 69% on average.

  • HPCA - Warped Register File: A power efficient Register File for GPGPUs
    2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013
    Co-Authors: Mohammad Abdel-majeed, Murali Annavaram
    Abstract:

    General purpose graphics processing units (GPGPUs) have the ability to execute hundreds of concurrent threads. To support massive parallelism GPGPUs provide a very large Register File, even larger than a cache, to hold the state of each thread. As technology scales, the leakage power consumption of the SRAM cells is getting worse making the Register File static power consumption a major concern. As the supply voltage scaling slows, dynamic power consumption of a Register File is not reducing. These concerns are particularly acute in GPGPUs due to their large Register File size. This paper presents two techniques to reduce the GPGPU Register File power consumption. By exploiting the unique software execution model of GPGPUs, we propose a tri-modal Register access control unit to reduce the leakage power. This unit first turns off any unallocated Register, and places all allocated Registers into drowsy state immediately after each access. The average inter-access distance to a Register is 789 cycles in GPGPUs. Hence, aggressively moving a Register into drowsy state immediately after each access results in 90% reduction in leakage power with negligible performance impact. To reduce dynamic power this paper proposes an active mask aware activity gating unit that avoids charging bit lines and wordlines of Registers associated with all inactive threads within a warp. Due to insufficient parallelism and branch divergence warps have many inactive threads. Hence, Registers associated with inactive threads can be identified precisely using the active mask. By combining the two techniques we show that the power consumption of the Register File can be reduced by 69% on average.
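The two techniques described in the Warped Register File abstract above, tri-modal access control and active-mask-aware gating, can be sketched as a small state model. Everything named here is hypothetical (`Mode`, `TriModalRegister`, `lanes_to_drive`, and the default warp size of 32); the sketch only mirrors the behaviour stated in the abstract and is not the authors' hardware design.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    OFF = "off"        # unallocated register, power gated
    DROWSY = "drowsy"  # allocated, retains data at reduced leakage
    ON = "on"          # fully powered for an access


class TriModalRegister:
    """Toy model of one register under tri-modal access control."""

    def __init__(self):
        self.mode = Mode.OFF
        self.wakeups = 0

    def allocate(self):
        # Allocated registers rest in the drowsy state between accesses.
        self.mode = Mode.DROWSY

    def access(self):
        if self.mode is Mode.OFF:
            raise RuntimeError("access to an unallocated register")
        self.mode = Mode.ON      # wake up for the read or write
        self.wakeups += 1
        self.mode = Mode.DROWSY  # return to drowsy immediately after the access

    def release(self):
        self.mode = Mode.OFF


def lanes_to_drive(active_mask: int, warp_size: int = 32) -> List[int]:
    """Lanes whose registers need their lines charged for this access.

    warp_size=32 is an assumption of this sketch.
    """
    return [lane for lane in range(warp_size) if (active_mask >> lane) & 1]


reg = TriModalRegister()
reg.allocate()
reg.access()
print(reg.mode, reg.wakeups)

# With a divergent warp, only the active lanes are driven.
print(lanes_to_drive(0b1011, warp_size=4))
```

The long average gap between accesses quoted in the abstract (789 cycles) is what makes the immediate return to the drowsy state attractive: a register spends almost all of its allocated lifetime idle.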

Nader Bagherzadeh - One of the best experts on this subject based on the ideXlab platform.

  • A scalable Register File architecture for superscalar processors
    Microprocessors and Microsystems, 1998
    Co-Authors: Steven Wallace, Nader Bagherzadeh
    Abstract:

    A major obstacle in designing superscalar processors is the size and port requirement of the Register File. Multiple Register Files of a scalar processor can be used in a superscalar processor if results are renamed when they are written to the Register File. Consequently, a scalable Register File architecture can be implemented without performance degradation. Another benefit is that the cycle time of the Register File is significantly shortened, potentially producing an increase in the speed of the processor.

  • IEEE PACT - A scalable Register File architecture for dynamically scheduled processors
    Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique, 1996
    Co-Authors: Steven Wallace, Nader Bagherzadeh
    Abstract:

    A major obstacle in designing dynamically scheduled processors is the size and port requirement of the Register File. By using a multiple banked Register File and performing dynamic result renaming, a scalable Register File architecture can be implemented without performance degradation. In addition, a new hybrid Register renaming technique to efficiently map the logical to physical Registers and reduce the branch misprediction penalty is introduced. The performance was simulated using the SPEC95 benchmark suite.

Alexander Veidenbaum - One of the best experts on this subject based on the ideXlab platform.

  • Scalable Distributed Register File
    2014
    Co-Authors: Ruben Gonzalez, Alexander Veidenbaum, Miquel Pericàs, M. Valero
    Abstract:

    In microarchitectural design, conceptual simplicity does not always lead to reduced technological complexity. VLSI design offers several standard structures which get very inefficient when they are scaled up. For instance, the superscalar OOO processing model is conceptually simple – with the control-flow oriented front-end and the dataflow oriented backend – but simply scaling the structures in this architecture for wider issue soon shows several bottlenecks, particularly related to power and timing issues. In this paper we propose a superscalar architecture obtained after identifying the Register File as a core bottleneck structure of a microprocessor and replacing it with a more efficient model that also scales better. Taking into account the status of the Registers and the source of operands we redesigned the microarchitecture around a distributed set of Register Files and an extended Instruction Queue. We compare our architecture to widely used superscalar architectures using either a central physical/architectural Register File or a Future File. We found that the traditional Future File architecture is much less scalable than our Promotional Architecture. With a 256-entry ROB, the Future File Architecture (FFA) loses 7% speed on average I

  • An optimized front-end physical Register File with banking and writeback filtering
    Lecture Notes in Computer Science, 2005
    Co-Authors: Miquel Pericàs, Ruben Gonzalez, Alexander Veidenbaum, A. Cristal, M. Valero
    Abstract:

    Register File design is one of the critical issues facing designers of out-of-order processors. Scaling up its size and number of ports with issue width and instruction window size is difficult in terms of both performance and power consumption. Two types of Register File architectures have been proposed in the past: a future logical File and a centralized physical File. The centralized Register File does not scale well but allows fast branch misprediction recovery. The Future File scales well, but requires reservation stations and has slow misprediction recovery. This paper proposes a Register File architecture that combines the best features of both approaches. The new Register File has the large size of the centralized File and its ability to quickly recover from branch misprediction. It has the advantage of the future File in that it is accessed in the front end allowing about 1/3rd of the source operands that are ready when an instruction enters the window to be read immediately. The remaining operands come from bypass logic / instruction queues and do not require Register File access. The new architecture does require reservation stations for operand storage and it investigates two approaches in terms of power-efficiency. Another advantage of the new architecture is that banking is much easier to use in this case as compared to the centralized Register File. Banking further improves the scalability of the new architecture. A technique for early release of short-lived Registers called writeback filtering is used in combination with banking to further improve the new architecture. The use of a large front-end Register File results in significant power savings and a slight IPC degradation (less than 1%). Overall, the resulting energy-delay product is lower than in previous proposals.

  • A content aware integer Register File organization
    International Symposium on Computer Architecture, 2004
    Co-Authors: Ruben Gonzalez, A. Cristal, Alexander Veidenbaum, D. Ortega, M. Valero
    Abstract:

    A Register File is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the Register File's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer Register File organization, which reduces energy consumption, area, and access time of the Register File with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new Register File is described and shown to obtain proposed optimized Register File designs. Overall, an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in the access time are achieved in the new Register File. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increased due to a reduction in the Register File access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.

  • Power-Aware Compilation for Register File Energy Reduction
    International Journal of Parallel Programming, 2003
    Co-Authors: José L. Ayala, Alexander Veidenbaum, Marisa López-vallejo
    Abstract:

    Most power reduction techniques have focused on gating the clock to unused functional units to minimize static power consumption, while system level optimizations have been used to deal with dynamic power consumption. Once these techniques are applied, Register File power consumption becomes a dominant factor in the processor. This paper proposes a power-aware reconfiguration mechanism in the Register File driven by a compiler. Optimal usage of the Register File in terms of size is achieved and unused Registers are put into a low-power state. Total energy consumption in the Register File is reduced by 65% with no appreciable performance penalty for MiBench benchmarks on an embedded processor. The effect of reconfiguration granularity on energy savings is also analyzed, and the compiler approach to optimize energy results is presented.

  • A content aware integer Register File organization
    Proceedings. 31st Annual International Symposium on Computer Architecture 2004., 1
    Co-Authors: Ruben Gonzalez, Alexander Veidenbaum, A. Cristal, D. Ortega, M. Valero
    Abstract:

    A Register File is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the Register File's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer Register File organization, which reduces energy consumption, area, and access time of the Register File with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new Register File is described and shown to obtain proposed optimized Register File designs. Overall, an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in the access time are achieved in the new Register File. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increased due to a reduction in the Register File access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.
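Partial value locality, defined in the abstract above as multiple live values that are identical in a subset of their bits, can be measured on a set of register values with a few lines of code. The sketch below is a rough illustration only: the choice of comparing the upper 48 bits of 64-bit values, the function names, and the sample values are all assumptions and do not come from the paper.

```python
from collections import Counter


def upper_bits(value: int, low_bits: int = 16, width: int = 64) -> int:
    """Return the bits of `value` above the lowest `low_bits` bits."""
    mask = (1 << width) - 1
    return (value & mask) >> low_bits


def partial_value_locality(live_values, low_bits: int = 16, width: int = 64):
    """Fraction of live values sharing their upper bits with another live value.

    The 16-bit split point is an assumption of this sketch.
    """
    if not live_values:
        return 0.0, Counter()
    counts = Counter(upper_bits(v, low_bits, width) for v in live_values)
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(live_values), counts


# Hypothetical live register contents: nearby pointers, small integers,
# and one unrelated value.
live = [
    0x7FFF_0000_1000_0010,
    0x7FFF_0000_1000_0FF8,
    0x7FFF_0000_1000_2340,
    0x0000_0000_0000_002A,
    0x0000_0000_0000_0001,
    0xDEAD_BEEF_0123_4567,
]

fraction, groups = partial_value_locality(live)
print(f"{fraction:.2f} of live values share their upper 48 bits")
print(f"{len(groups)} distinct upper-bit patterns for {len(live)} values")
```

When many values fall into a few upper-bit groups, as pointers into the same region and small integers tend to, the shared bits need to be stored only once per group, which is the kind of redundancy a content-aware organization aims to exploit.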