Loop Tiling

The Experts below are selected from a list of 1881 Experts worldwide ranked by ideXlab platform

Chun Jason Xue - One of the best experts on this subject based on the ideXlab platform.

  • Checkpointing-Aware Loop Tiling for Energy Harvesting Powered Nonvolatile Processors
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019
    Co-Authors: Keni Qiu, Mengying Zhao, Yongpan Liu, Yong Guan, Chun Jason Xue
    Abstract:

    As power failures often occur in energy harvesting powered nonvolatile processors (NVPs), checkpointing is needed during program execution. Checkpointing incurs high overhead in applications with loops, because a large amount of data must be backed up during loop execution. We are therefore motivated to reduce the amount of checkpointing data by analyzing data locality and shortening data lifetime in loops. This paper proposes a checkpointing-aware loop tiling technique that aims to reduce the checkpointing and recovery overheads for loops. Specifically, we first derive the optimal tile size for nested loops, considering checkpointing distance and data dependencies. Then, the implementations of checkpointing and recovery for tiled loops are presented. Finally, experiments are conducted to evaluate the effectiveness of the proposed method. The experimental results show that, compared to the no-tiling method, the checkpointing-aware loop tiling method reduces the checkpointing and recovery data by 36.2% on average and reduces the total execution time and dynamic energy for checkpointing and recovery by 27.2% and 22.9% on average, respectively.

  • Write Mode Aware Loop Tiling for High Performance Low Power Volatile PCM in Embedded Systems
    IEEE Transactions on Computers, 2016
    Co-Authors: Keni Qiu, Weigong Zhang, Chun Jason Xue
    Abstract:

    Architecting PCM, especially MLC PCM, as main memory for MCUs is a promising technique to replace conventional DRAM deployment. However, PCM/MLC PCM suffers from long write latency and large write energy. Recent work has proposed a compiler-directed dual-write (CDDW) scheme to combat the drawbacks of PCM by adopting a fast or slow mode for different write operations. For large-scale loops, we observe that the lifetime of write instances is very long, so they can only be written in the expensive slow mode. This paper proposes a write mode aware loop tiling approach to effectively reduce the lifetime of write instances and maximize the number of efficient fast writes in loops. The experimental results show that, on average, the proposed approach improves performance by 50.8 percent and reduces dynamic energy by 32.0 percent across a set of benchmarks compared to the CDDW approach.

  • An Adaptive Non-Uniform Loop Tiling for DMA-Based Bulk Data Transfers on Many-Core Processor
    2016 IEEE 34th International Conference on Computer Design (ICCD), 2016
    Co-Authors: Keni Qiu, Weigong Zhang, Jing Wang, Chun Jason Xue
    Abstract:

    Mesh Network-on-Chip (NoC) is a key fabric for interconnecting many cores with desirable scalability, reliability and interoperability. We observe that DMA-based bulk data block transfer exhibits non-negligible NoC latency due to heavy congestion. Loop tiling is an effective way to partition the data space for SPM+DMA-based data block transfer. Nevertheless, we observe that unbalanced NoC latency can degrade the effectiveness of loop tiling applied in a uniform fashion. In this paper, we propose a NoC-aware Non-Uniform Loop Tiling (NULT) scheme to improve DMA performance. A NULT framework is built on the proposed model to adaptively hide DMA latency behind computation time and reduce the overall execution time. The framework first groups cores into different families, taking into account their distance-to-data in the NoC. Then a heuristic method is presented to solve for near-optimal tiling factors for each core family. In this way, different core families are assigned non-uniform tiling sizes. We evaluate the NULT scheme on the NIRGAM platform. Compared to the traditional uniform tiling approach, the proposed NULT technique better overlaps memory access time with computation time and thus reduces the overall execution time of a loop nest.

  • Write Mode Aware Loop Tiling for High Performance Low Power Volatile PCM
    Design Automation Conference, 2014
    Co-Authors: Keni Qiu, Chun Jason Xue
    Abstract:

    Architecting PCM, especially MLC PCM, as main memory for MCUs is a promising technique to replace conventional DRAM deployment. However, PCM/MLC PCM suffers from long write latency and large write energy. Recent work has proposed a compiler-directed dual-write (CDDW) scheme to combat the drawbacks of PCM by adopting a fast or slow write mode for different write operations. We observe that, for large-scale loops, the lifetime of write instances is very long, so they can only be written in the expensive slow mode. This paper proposes a write mode aware loop tiling approach to effectively reduce the lifetime of write instances and maximize the number of efficient fast writes in loops. The experimental results show that, on average, the proposed approach improves performance by 50.8% and reduces dynamic energy by 32.0% across a set of benchmarks compared to the CDDW approach.
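
Several of the abstracts above rely on the same property of loop tiling: it shortens the time a written value stays live before its last use. The following is a minimal sketch of that effect, not the authors' method. It assumes a hypothetical loop nest in which cell (i, j) is written at iteration (i, j) and read again at iteration (i + 1, j); the function names `untiled_order`, `tiled_order`, and `max_lifetime` and the sizes used are illustrative only.

```python
def untiled_order(n_rows, n_cols):
    """Row-major iteration order of the original (untiled) loop nest."""
    return [(i, j) for i in range(n_rows) for j in range(n_cols)]


def tiled_order(n_rows, n_cols, tile):
    """Iteration order after tiling the column loop with width `tile`.

    Each column strip is swept top to bottom before moving to the next strip.
    This keeps the assumed (i, j) -> (i + 1, j) dependence satisfied.
    """
    order = []
    for jj in range(0, n_cols, tile):                      # loop over tiles
        for i in range(n_rows):                            # rows inside a tile
            for j in range(jj, min(jj + tile, n_cols)):    # columns inside a tile
                order.append((i, j))
    return order


def max_lifetime(order):
    """Longest distance, in iterations, between writing cell (i, j)
    and reading it at iteration (i + 1, j)."""
    position = {cell: step for step, cell in enumerate(order)}
    return max(
        position[(i + 1, j)] - position[(i, j)]
        for (i, j) in order
        if (i + 1, j) in position
    )


if __name__ == "__main__":
    plain = untiled_order(8, 64)
    tiled = tiled_order(8, 64, tile=8)
    print("untiled max lifetime:", max_lifetime(plain))
    print("tiled max lifetime:  ", max_lifetime(tiled))
```

In this toy model with 8 rows, 64 columns, and a tile width of 8, the longest write-to-read distance drops from 64 iterations to 8. Choosing a tile size for a real loop nest would also have to account for the factors the abstracts name, such as checkpointing distance, write modes, and data dependencies.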

Yan Solihin - One of the best experts on this subject based on the ideXlab platform.

  • WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
    2020 57th ACM/IEEE Design Automation Conference (DAC), 2020
    Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin
    Abstract:

    Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. Our experiments show that WET reduces writes by 81% while simultaneously improving performance.
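
As a rough illustration of hierarchical tiling for matrix multiplication, the sketch below wraps a conventional inner tile in an additional outer tile. It is not the WET implementation: the function names, the tile sizes `inner` and `outer`, and the loop order are assumptions chosen for readability, and the code does not model caches or NVMM writes. The outer tile confines updates to one block of the output matrix at a time, which is the general idea of keeping the write working set small.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a @ b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_two_level_tiled(a, b, inner=4, outer=8):
    """Two-level tiled product (illustrative tile sizes, not tuned values).

    The outer tile selects a block of c that receives all of its updates
    before the next block is touched. The inner tile is ordinary tiling
    over i, j, and k inside that block.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for io in range(0, n, outer):                  # outer tile rows of c
        i_end = min(io + outer, n)
        for jo in range(0, p, outer):              # outer tile columns of c
            j_end = min(jo + outer, p)
            for kk in range(0, m, inner):          # inner tile over k
                k_end = min(kk + inner, m)
                for ii in range(io, i_end, inner):         # inner tile rows
                    for jj in range(jo, j_end, inner):     # inner tile columns
                        for i in range(ii, min(ii + inner, i_end)):
                            for j in range(jj, min(jj + inner, j_end)):
                                acc = c[i][j]
                                for k in range(kk, k_end):
                                    acc += a[i][k] * b[k][j]
                                c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[(3 * i + k) % 7 for k in range(7)] for i in range(10)]
    b = [[(2 * k + j) % 5 for j in range(9)] for k in range(7)]
    print(matmul_two_level_tiled(a, b, inner=3, outer=5) == matmul_naive(a, b))
```

The matrix sizes in the usage example are deliberately not multiples of the tile sizes, so the boundary handling through `min(...)` is exercised as well.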

Keni Qiu - One of the best experts on this subject based on the ideXlab platform.

  • Checkpointing-Aware Loop Tiling for Energy Harvesting Powered Nonvolatile Processors
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019
    Co-Authors: Keni Qiu, Mengying Zhao, Yongpan Liu, Yong Guan, Chun Jason Xue
    Abstract:

    As power failures often occur in energy harvesting powered nonvolatile processors (NVPs), checkpointing is needed during program execution. Checkpointing incurs high overhead in applications with loops, because a large amount of data must be backed up during loop execution. We are therefore motivated to reduce the amount of checkpointing data by analyzing data locality and shortening data lifetime in loops. This paper proposes a checkpointing-aware loop tiling technique that aims to reduce the checkpointing and recovery overheads for loops. Specifically, we first derive the optimal tile size for nested loops, considering checkpointing distance and data dependencies. Then, the implementations of checkpointing and recovery for tiled loops are presented. Finally, experiments are conducted to evaluate the effectiveness of the proposed method. The experimental results show that, compared to the no-tiling method, the checkpointing-aware loop tiling method reduces the checkpointing and recovery data by 36.2% on average and reduces the total execution time and dynamic energy for checkpointing and recovery by 27.2% and 22.9% on average, respectively.

  • Low Power Driven Loop Tiling for RRAM Crossbar-Based CNN
    Proceedings of the 33rd Annual ACM Symposium on Applied Computing (SAC), 2018
    Co-Authors: Keni Qiu, Weiwen Chen, Lixue Xia, Yu Wang
    Abstract:

    Convolutional neural networks (CNNs) have been widely adopted to make predictions on large amounts of data in modern embedded systems. Multiply and accumulate (MAC) operations are the most computationally expensive portion of a CNN. Compared to executing MAC operations on GPUs and FPGAs, CNN implementation in the RRAM crossbar-based computing system (RCS) demonstrates the outstanding advantages of high performance and low power. However, the current design incurs a very high overhead on peripheral circuits and memory accesses, limiting the gains of RCS. To address this problem, a Multi-CLP (Convolutional Layer Processor) structure has recently been proposed, in which the FPGA controlling resources can be shared by multiple computation units. Exploiting this idea, the Peripheral Circuit Unit (PeriCU)-Reuse scheme has been proposed, whose underlying idea is to put the expensive AD/DAs in the spotlight and arrange for multiple convolution layers to be served sequentially by the same PeriCU. This paper adopts the above structures. It is further observed that memory accesses can be bypassed if two adjacent layers are assigned to different CLPs. A loop tiling technique is proposed to enable memory access bypassing and further improve the energy efficiency of RCS. To guarantee correct data dependency between layers, the safe starting time for a layer is discussed for the case where its previous layer is tiled in a different CLP. Experiments on two convolutional applications validate that the loop tiling technique integrated with the Multi-CLP structure can efficiently meet power budgets and further reduce energy consumption by 61.7%.

  • Power Optimization Through Peripheral Circuit Reusing Integrated with Loop Tiling for RRAM Crossbar-Based CNN
    2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018
    Co-Authors: Weiwen Chen, Wenjuan Cui, Yuanchun Zhou, Keni Qiu
    Abstract:

    Convolutional neural networks (CNNs) have been widely adopted to make predictions on large amounts of data in modern embedded systems. Prior studies have shown that convolutional computations, which consist of many multiply and accumulate (MAC) operations, are the most computationally expensive portion of a CNN. Compared to executing MAC operations on GPUs and FPGAs, CNN implementation in the RRAM crossbar-based computing system (RCS) demonstrates the outstanding advantages of high performance and low power. However, the current design is energy-unbalanced among its three parts (RRAM crossbar computation, peripheral circuits and memory accesses), and the latter two can significantly limit the potential gains of RCS. To address the high power overhead of peripheral circuits in RCS, the Peripheral Circuit Unit (PeriCU)-Reuse scheme has been proposed to meet a given power budget. In this paper, it is further observed that memory accesses can be bypassed if two adjacent layers are assigned to different PeriCUs. In this way, memory accesses can be reduced, and thus performance and power can be improved. A loop tiling technique is proposed to save memory accesses. Experiments on two convolutional applications validate that the proposed loop tiling technique can reduce energy consumption by 61.7%.

Mohammad Alshboul - One of the best experts on this subject based on the ideXlab platform.

  • WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
    2020 57th ACM/IEEE Design Automation Conference (DAC), 2020
    Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin
    Abstract:

    Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. Our experiments show that WET reduces writes by 81% while simultaneously improving performance.

James Tuck - One of the best experts on this subject based on the ideXlab platform.

  • WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
    2020 57th ACM/IEEE Design Automation Conference (DAC), 2020
    Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin
    Abstract:

    Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. Our experiments show that WET reduces writes by 81% while simultaneously improving performance.