Register Set

The experts below are selected from a list of 306 experts worldwide, ranked by the ideXlab platform.

Vivek Sarkar - One of the best experts on this subject based on the ideXlab platform.

  • RegMutex: Inter-Warp GPU Register Time-Sharing
    2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018
    Co-Authors: Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-farahani, Nuwan Jayasena, Vivek Sarkar
    Abstract:

    Registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads. Due to the large number of concurrently executing threads and the high cost of context-switching mechanisms, contemporary GPUs are equipped with large register files. However, to avoid over-complicating the hardware, registers are statically assigned to threads and exclusively dedicated to them for the entire duration of a thread's lifetime. This static assignment is sized for the maximum number of live registers at any point in the GPU binary, although the points at which all the requested registers are actually in use may constitute only a small fraction of the whole program. Therefore, a considerable portion of the register file remains under-utilized. In this paper, we propose a software-hardware co-mechanism named RegMutex (Register Mutual Exclusion) to share a subset of physical registers between warps during GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While the physical registers corresponding to the base register set are statically and exclusively assigned to each warp, the hardware time-shares the remaining physical registers across warps to provision their extended register sets. Therefore, GPU programs can sustain approximately the same performance with a lower number of registers, hence yielding higher performance per dollar. For programs that require a large number of registers for execution, RegMutex enables a higher number of concurrent warps to be resident in the hardware by sharing their register allocations with each other, leading to higher device occupancy. Since some aspects of register-sharing orchestration are offloaded to the compiler, RegMutex introduces lower hardware complexity compared to existing approaches. Our experiments show that RegMutex improves register utilization and reduces the number of execution cycles by up to 23% for kernels demanding a high number of registers.
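
    The base/extended split described above can be illustrated with a small model. The following Python sketch is purely illustrative and is not the authors' hardware design: the register counts, the occupancy formulas, and the ExtendedRegisterPool arbiter are assumptions chosen only to show how statically owned base registers plus a time-shared extended pool can raise the number of resident warps.

```python
# Minimal, illustrative model of base/extended register splitting (not the
# authors' hardware design). All sizes below are made-up example numbers.

TOTAL_PHYS_REGS = 256        # physical registers available per SM (assumed)
REGS_PER_WARP   = 64         # registers the kernel requests per warp (assumed)
BASE_REGS       = 40         # compiler-chosen base set, always resident
EXT_REGS        = REGS_PER_WARP - BASE_REGS   # extended set, time-shared

def occupancy_exclusive():
    """Warps resident when every warp owns its full request exclusively."""
    return TOTAL_PHYS_REGS // REGS_PER_WARP

def occupancy_regmutex(shared_pool_slots=2):
    """Warps resident when only the base set is owned exclusively and a small
    pool provisions the extended set to a few warps at a time."""
    shared_pool = shared_pool_slots * EXT_REGS
    return (TOTAL_PHYS_REGS - shared_pool) // BASE_REGS

class ExtendedRegisterPool:
    """Toy arbiter: a warp must hold a pool slot before executing the code
    region that needs its extended registers, and releases it afterwards."""
    def __init__(self, slots):
        self.free_slots = slots
        self.waiting = []            # warp ids blocked on the pool

    def acquire(self, warp_id):
        if self.free_slots > 0:
            self.free_slots -= 1
            return True              # warp may enter its high-pressure region
        self.waiting.append(warp_id) # warp is descheduled until a slot frees
        return False

    def release(self, warp_id):
        self.free_slots += 1
        if self.waiting:
            print(f"warp {self.waiting.pop(0)} can retry after warp {warp_id}")

if __name__ == "__main__":
    print("exclusive allocation :", occupancy_exclusive(), "warps resident")
    print("base/extended split  :", occupancy_regmutex(), "warps resident")
    pool = ExtendedRegisterPool(slots=2)
    for w in range(4):
        print(f"warp {w} acquired extended set: {pool.acquire(w)}")
    pool.release(0)
```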

B. Wess - One of the best experts on this subject based on the ideXlab platform.

  • List coloring of interval graphs with application to register assignment for heterogeneous register-set architectures
    Signal Processing, 2003
    Co-Authors: T. Zeitlhofer, B. Wess
    Abstract:

    This article focuses on register assignment problems for heterogeneous register-set VLIW-DSP architectures. It is assumed that an instruction schedule has already been generated. The register assignment problem is then equivalent to the well-known coloring of an interference graph. Typically, machine-related constraints are mapped onto the structure of the interference graph; thereby, characteristics favorable for coloring, namely the interval graph properties, are lost. In contrast, we present an approach that does not change the structure of the interference graph. Constraints implied by heterogeneous architectures are mapped to a specific coloring problem known as list coloring. Exploiting the interval graph properties of the interference graph, we derive a list-coloring algorithm that allows us to generate optimum solutions even for large basic blocks. The proposed technique can also be applied to similar resource assignment problems such as functional unit assignment.
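
    As a concrete illustration of the list-coloring view, the following Python sketch assigns registers to live ranges whose allowed register lists differ, a toy stand-in for a heterogeneous register set. The live ranges, register names, and the plain backtracking search are assumptions for illustration; the authors' algorithm exploits the interval structure far more directly than this sketch does.

```python
# Toy list-coloring of an interval graph: each live range may only use the
# registers in its own allowed list (a stand-in for a heterogeneous register
# set). Live ranges and register names below are made-up examples.

LIVE_RANGES = [                      # (name, start, end, allowed registers)
    ("a", 0, 4, ["A0", "A1"]),
    ("b", 1, 6, ["A0", "A1", "D0"]),
    ("c", 2, 5, ["D0", "D1"]),
    ("d", 5, 9, ["A0"]),
]

def overlap(r1, r2):
    """Two live ranges interfere if their [start, end) intervals intersect."""
    return r1[1] < r2[2] and r2[1] < r1[2]

def list_color(ranges, assignment=None, idx=0):
    """Backtracking search over ranges sorted by start point. Returns a
    register for every live range, or None if no list-coloring exists."""
    ranges = sorted(ranges, key=lambda r: r[1])
    if assignment is None:
        assignment = {}
    if idx == len(ranges):
        return assignment
    name, start, end, allowed = ranges[idx]
    for reg in allowed:
        conflict = any(
            assignment.get(other[0]) == reg and overlap(ranges[idx], other)
            for other in ranges[:idx]
        )
        if not conflict:
            assignment[name] = reg
            result = list_color(ranges, assignment, idx + 1)
            if result is not None:
                return result
            del assignment[name]
    return None                      # no feasible assignment for this prefix

if __name__ == "__main__":
    print(list_color(LIVE_RANGES))
```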

  • Optimum register assignment for heterogeneous register-set architectures
    Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), 2003
    Co-Authors: T. Zeitlhofer, B. Wess
    Abstract:

    This paper focuses on the register assignment problem for basic blocks, assuming a given instruction schedule. This is equivalent to the well-known coloring of an interference graph, which satisfies the interval graph properties if no other constraints are considered. For a set of equivalent colors (homogeneous registers), an interval graph is colorable with linear complexity. In contrast, however, we assume a heterogeneous register set as often found in general-purpose digital signal processors. In this case, register assignment corresponds to a list-coloring problem, which is NP-complete even for interval graphs. Typically, heuristics have to be applied. However, we present a search-space pruning technique, based on a decomposition of the graph into maximum cliques, that allows us to find proper colorings if any exist. Our optimum technique is applicable even to large graphs, since the maximum number of colorings that have to be investigated depends only on the maximum clique size.
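
    The clique-based pruning relies on a property of interval graphs: every maximal clique consists of the live ranges that are simultaneously live at some interval start point, so all maximal cliques can be enumerated with one sweep. The following Python sketch shows only that enumeration step on made-up live ranges; it does not reproduce the paper's full search procedure.

```python
# Enumerate the maximal cliques of an interval graph by sweeping the interval
# start points: every set of live ranges active at a start point is a clique.
# The live ranges below are made-up examples.

LIVE_RANGES = [("a", 0, 4), ("b", 1, 6), ("c", 2, 5), ("d", 5, 9)]

def maximal_cliques(ranges):
    """Return the distinct sets of live ranges that are simultaneously live
    at some interval start point, keeping only the maximal ones."""
    cliques = []
    for _, start, _ in ranges:
        active = frozenset(n for n, s, e in ranges if s <= start < e)
        if active and active not in cliques:
            cliques.append(active)
    # Drop cliques contained in a larger one so only maximal cliques remain.
    return [c for c in cliques if not any(c < other for other in cliques)]

if __name__ == "__main__":
    cliques = maximal_cliques(LIVE_RANGES)
    for c in cliques:
        print(sorted(c))
    # The largest clique size bounds the register demand the assignment
    # search has to reconcile at any single point of the schedule.
    print("max clique size:", max(len(c) for c in cliques))
```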

  • Code Generation for Embedded Processors - Code Generation Based on Trellis Diagrams
    Code Generation for Embedded Processors, 2002
    Co-Authors: B. Wess
    Abstract:

    In this chapter, the task of code generation is considered as a search problem for optimal weighted paths in trellis trees. These trees are made up of trellis diagrams containing information about the target machine's instructions along with the required registers. An algorithm is discussed which integrates the highly interdependent tasks of scheduling, register allocation, and instruction selection. The trellis diagram concept is particularly useful when generating code for heterogeneous register set machines. It has been successfully applied to implement compilers for general-purpose digital signal processors.
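
    To make the trellis idea concrete, the following Python sketch runs a dynamic-programming shortest-path search over a tiny hand-built trellis in which each stage is one value to compute and each node is a choice of instruction plus destination register, with edge weights standing in for instruction and register-transfer costs. The stages, costs, and register names are invented for illustration and are not taken from the chapter.

```python
# Tiny trellis: each stage is one value to produce, each node a (instruction,
# destination register) alternative, and edge weights model instruction plus
# register-transfer costs. All stages and costs are invented for illustration.

TRELLIS = [
    # stage 0: load operand x
    {"ld x->A0": 1, "ld x->D0": 1},
    # stage 1: load operand y
    {"ld y->A1": 1, "ld y->D1": 1},
    # stage 2: multiply, result must end up in an accumulator-style register
    {"mul ->ACC": 2},
]

def edge_cost(prev_choice, next_choice):
    """Transition cost between consecutive choices, e.g. a move penalty when
    the multiplier cannot read the previously chosen register directly."""
    if "mul" in next_choice and prev_choice.endswith("A1"):
        return 1                      # pretend a move is needed in this case
    return 0

def best_path(trellis):
    """Dynamic programming over the trellis stages: keep, for every node of
    the current stage, the cheapest path from stage 0 that reaches it."""
    paths = {choice: (cost, [choice]) for choice, cost in trellis[0].items()}
    for stage in trellis[1:]:
        new_paths = {}
        for choice, node_cost in stage.items():
            new_paths[choice] = min(
                (cost + node_cost + edge_cost(prev, choice), path + [choice])
                for prev, (cost, path) in paths.items()
            )
        paths = new_paths
    return min(paths.values())        # (total cost, chosen code sequence)

if __name__ == "__main__":
    cost, code = best_path(TRELLIS)
    print("cost:", cost)
    print("code:", code)
```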

  • Integrated scheduling and register assignment for VLIW-DSP architectures
    Proceedings of the 14th Annual IEEE International ASIC/SOC Conference (IEEE Cat. No.01TH8558), 2001
    Co-Authors: T. Zeitlhofer, B. Wess
    Abstract:

    This paper describes code generation techniques for VLIW-DSP architectures. We focus on architectures with heterogeneous functional units and heterogeneous register sets. When generating code, scheduling and register allocation/assignment are typically done in separate steps, because these tasks are complex combinatorial optimization problems, particularly in the case of irregular data paths. However, these phases are strongly interdependent, and traditional approaches are therefore often suboptimal. This paper proposes a new technique to integrate scheduling and register assignment. Our approach ensures that only schedules are produced for which a register assignment is guaranteed to exist. This is achieved by mapping the register set of the architecture onto a set of virtual resources. The concept of virtual resources provides a powerful methodology to check easily whether a register assignment for a specific schedule exists, without the necessity of generating one. This allows any schedule generation or optimization strategy to be applied, with register assignment performed only for the optimized final schedule, for which a solution is known to exist.
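
    A simplified way to picture the virtual-resource check is to count, cycle by cycle, how many values a candidate schedule keeps live in each register class and to compare the counts against the class capacities: if no class is ever over-subscribed, a register assignment exists without having to construct one. The following Python sketch implements only this counting check under that simplified model; the register classes, capacities, and example schedule are assumptions, not the paper's exact virtual-resource mapping.

```python
# Simplified feasibility check in the spirit of "virtual resources": a
# schedule is accepted only if, at every cycle, the number of simultaneously
# live values per register class stays within that class's capacity.
# Classes, capacities, and the example schedule are invented for illustration.

from collections import Counter

CAPACITY = {"addr": 2, "data": 2}     # registers available per class (assumed)

# Each value: (name, register class, def cycle, last-use cycle) for one
# candidate schedule produced by whatever scheduling strategy is in use.
SCHEDULE = [
    ("p", "addr", 0, 3),
    ("q", "addr", 1, 4),
    ("r", "data", 1, 2),
    ("s", "data", 2, 5),
]

def assignment_exists(values, capacity):
    """Return True iff no register class is over-subscribed at any cycle,
    i.e. a register assignment is guaranteed to exist for this schedule."""
    last_cycle = max(last for _, _, _, last in values)
    for cycle in range(last_cycle + 1):
        live = Counter(cls for _, cls, d, last in values if d <= cycle <= last)
        for cls, count in live.items():
            if count > capacity[cls]:
                return False          # this schedule would need spilling
    return True

if __name__ == "__main__":
    print("register assignment exists:", assignment_exists(SCHEDULE, CAPACITY))
```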

Farzad Khorasani - One of the best experts on this subject based on the ideXlab platform.

  • RegMutex: Inter-Warp GPU Register Time-Sharing
    2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018
    Co-Authors: Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-farahani, Nuwan Jayasena, Vivek Sarkar

M. Schlett - One of the best experts on this subject based on the ideXlab platform.

  • A 32-b RISC/DSP microprocessor with reduced complexity
    IEEE Journal of Solid-State Circuits, 1997
    Co-Authors: M. Dolle, S. Jhand, W. Lehner, O. Muller, M. Schlett
    Abstract:

    This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general-purpose microprocessor and, in parallel, as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow that are absolutely necessary in DSP applications. Besides offering a basis for general-purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized the transistor count and reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.
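
    The subword processing mentioned in the abstract can be pictured in isolation: one 32-bit register holds two 16-bit values that are added lane by lane, which is how a 32-bit datapath delivers two 16-bit DSP operations per instruction. The following Python sketch is a behavioural model of such a packed addition with wrap-around lanes; it illustrates the general technique, not this processor's actual instruction set.

```python
# Behavioural model of a packed 16-bit ("subword") addition on a 32-bit
# register: the two 16-bit lanes are added independently, with wrap-around
# within each lane. This is an illustration, not the processor's actual ISA.

MASK16 = 0xFFFF

def pack16(hi, lo):
    """Pack two unsigned 16-bit values into one 32-bit register image."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack16(word):
    return (word >> 16) & MASK16, word & MASK16

def add16x2(a, b):
    """Lane-wise 16-bit addition of two packed 32-bit register values;
    carries do not propagate from the low lane into the high lane."""
    a_hi, a_lo = unpack16(a)
    b_hi, b_lo = unpack16(b)
    return pack16((a_hi + b_hi) & MASK16, (a_lo + b_lo) & MASK16)

if __name__ == "__main__":
    x = pack16(0x0003, 0xFFFF)        # lanes: 3 and 65535
    y = pack16(0x0001, 0x0001)        # lanes: 1 and 1
    print(hex(add16x2(x, y)))         # 0x40000: lanes 4 and 0 (low lane wraps)
```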

Hodjat Asghari Esfeden - One of the best experts on this subject based on the ideXlab platform.

  • RegMutex: Inter-Warp GPU Register Time-Sharing
    2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018
    Co-Authors: Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-farahani, Nuwan Jayasena, Vivek Sarkar