Loop Tiling

The Experts below are selected from a list of 1881 Experts worldwide ranked by ideXlab platform

Chun Jason Xue - One of the best experts on this subject based on the ideXlab platform.

Checkpointing-Aware Loop Tiling for Energy Harvesting Powered Nonvolatile Processors

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019

Co-Authors: Keni Qiu, Mengying Zhao, Yongpan Liu, Yong Guan, Chun Jason Xue

Abstract:

As power failures often occur in energy harvesting powered nonvolatile processors (NVPs), checkpointing is needed during program execution. Checkpointing incurs high overhead in applications with loops, because a large amount of data must be backed up during loop execution. As such, we are motivated to reduce the amount of checkpointing data by analyzing data locality and shortening data lifetime in loops. This paper proposes a checkpointing-aware loop tiling technique that aims to reduce the checkpointing and recovery overheads for loops. Specifically, we first derive the optimal tile size for nested loops considering checkpointing distance and data dependencies. Then, the implementations of checkpointing and recovery for tiled loops are presented. Finally, experiments are conducted to evaluate the effectiveness of the proposed method. The experimental results show that, compared to the no-tiling method, the checkpointing-aware loop tiling method reduces the checkpointing and recovery data by 36.2% on average and reduces the total execution time and dynamic energy for checkpointing and recovery by 27.2% and 22.9% on average, respectively.

An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor

International Conference on Computer Design, 2016

Co-Authors: Keni Qiu, Weigong Zhang, Jing Wang, Chun Jason Xue

Abstract:

Mesh Network-on-Chip (NoC) is a key fabric to interconnect many cores with desirable scalability, reliability and interoperability. We observe that DMA-based bulk data block transfer exhibits non-negligible NoC latency due to heavy congestion. Loop tiling is an effective way to partition the data space for SPM+DMA-based data block transfer. Nevertheless, we observe that unbalanced NoC latency can degrade the effectiveness of uniform loop tiling. In this paper, we propose a NoC-aware Non-Uniform Loop Tiling (NULT) scheme to improve DMA performance. A NULT framework is built on the proposed model to adaptively hide DMA latency behind computation time and reduce the overall execution time. The framework first groups cores into different families, taking into account their distance-to-data in the NoC. Then a heuristic method is presented to find near-optimal tiling factors for each core family. In this way, different core families are assigned non-uniform tile sizes. We evaluate the NULT scheme on the NIRGAM platform. Compared to the traditional uniform tiling approach, the proposed NULT technique better overlaps memory access time with computation time and thus reduces the overall execution time of a loop nest.

Write Mode Aware Loop Tiling for High Performance Low Power Volatile PCM in Embedded Systems

IEEE Transactions on Computers, 2016

Co-Authors: Keni Qiu, Weigong Zhang, Chun Jason Xue

Abstract:

Architecting PCM, especially MLC PCM, as main memory for MCUs is a promising technique to replace conventional DRAM deployment. However, PCM/MLC PCM suffers from long write latency and large write energy. Recent work has proposed a compiler directed dual-write (CDDW) scheme to combat the drawbacks of PCM by adopting a fast or slow mode for different write operations. For large-scale loops, we observe that the lifetime of write instances is very long, so they can only be written in the expensive slow mode. This paper proposes a write mode aware loop tiling approach to effectively reduce the lifetime of write instances and maximize the number of efficient fast writes in loops. The experimental results show that, on average across a set of benchmarks, the proposed approach improves performance by 50.8 percent and reduces dynamic energy by 32.0 percent compared to the CDDW approach.

Write Mode Aware Loop Tiling for High Performance Low Power Volatile PCM

Design Automation Conference, 2014

Co-Authors: Keni Qiu, Chun Jason Xue

Abstract:

Architecting PCM, especially MLC PCM, as main memory for MCUs is a promising technique to replace conventional DRAM deployment. However, PCM/MLC PCM suffers from long write latency and large write energy. Recent work has proposed a compiler directed dual-write (CDDW) scheme to combat the drawbacks of PCM by adopting a fast or slow write mode for different write operations. We observe that, for large-scale loops, the lifetime of write instances is very long, so they can only be written in the expensive slow mode. This paper proposes a write mode aware loop tiling approach to effectively reduce the lifetime of write instances and maximize the number of efficient fast writes in loops. The experimental results show that, on average across a set of benchmarks, the proposed approach improves performance by 50.8% and reduces dynamic energy by 32.0% compared to the CDDW approach.


Yan Solihin - One of the best experts on this subject based on the ideXlab platform.

WET: Write Efficient Loop Tiling for Non-Volatile Main Memory

Design Automation Conference, 2020

Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin

Abstract:

Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling on matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, through choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. In our experiments, WET reduces writes by 81% while simultaneously improving performance.
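For readers unfamiliar with the optimization the abstract refers to, the following is a minimal sketch of ordinary single-level loop tiling applied to matrix multiplication. It is not the WET method from the paper; the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical and chosen only for illustration.

```python
def matmul_naive(a, b):
    """Untiled reference: each output element is stored exactly once."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0
            for k in range(m):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_tiled(a, b, tile=2):
    """Single-level tiled version (illustrative sketch, not the paper's WET).

    The three outer loops walk over tiles; the three inner loops work
    inside one tile so that a small block of a, b and c is reused
    before moving on.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # min(...) handles sizes that are not a multiple of tile.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        # One store per (i, j) for every k-tile.
                        c[i][j] = acc
    return c
```

In this sketch the tiled loop nest stores each element of `c` once per tile along the `k` dimension, whereas the untiled version stores it once overall. That is only a statement about this code, not the paper's measurement: whether such stores reach main memory depends on the cache hierarchy, which is the aspect the abstract says WET targets with an additional outer tile.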


Keni Qiu - One of the best experts on this subject based on the ideXlab platform.

Checkpointing-Aware Loop Tiling for Energy Harvesting Powered Nonvolatile Processors

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019

Co-Authors: Keni Qiu, Mengying Zhao, Yongpan Liu, Yong Guan, Chun Jason Xue

Abstract:

As power failures often occur in energy harvesting powered nonvolatile processors (NVPs), checkpointing is needed during program execution. Checkpointing incurs high overhead in applications with loops, because a large amount of data must be backed up during loop execution. As such, we are motivated to reduce the amount of checkpointing data by analyzing data locality and shortening data lifetime in loops. This paper proposes a checkpointing-aware loop tiling technique that aims to reduce the checkpointing and recovery overheads for loops. Specifically, we first derive the optimal tile size for nested loops considering checkpointing distance and data dependencies. Then, the implementations of checkpointing and recovery for tiled loops are presented. Finally, experiments are conducted to evaluate the effectiveness of the proposed method. The experimental results show that, compared to the no-tiling method, the checkpointing-aware loop tiling method reduces the checkpointing and recovery data by 36.2% on average and reduces the total execution time and dynamic energy for checkpointing and recovery by 27.2% and 22.9% on average, respectively.

Low power driven Loop Tiling for RRAM crossbar-based CNN

ACM Symposium on Applied Computing, 2018

Co-Authors: Keni Qiu, Weiwen Chen, Lixue Xia, Yu Wang

Abstract:

Convolutional neural networks (CNNs) have been widely adopted to make predictions on large amounts of data in modern embedded systems. Multiply and accumulate (MAC) operations are the most computationally expensive portion of a CNN. Compared to executing MAC operations on GPUs and FPGAs, CNN implementation in the RRAM crossbar-based computing system (RCS) demonstrates the outstanding advantages of high performance and low power. However, the current design incurs a very high overhead on peripheral circuits and memory accesses, limiting the gains of RCS. To address this problem, a Multi-CLP (Convolutional Layer Processor) structure has recently been proposed, in which the FPGA controlling resources can be shared by multiple computation units. Exploiting this idea, the Peripheral Circuit Unit (PeriCU)-Reuse scheme has been proposed, whose underlying idea is to focus on the expensive AD/DAs and arrange multiple convolution layers to be served sequentially by the same PeriCU. This paper adopts the above structures. It is further observed that memory accesses can be bypassed if two adjacent layers are assigned to different CLPs. A loop tiling technique is proposed to enable memory access bypassing and further improve the energy efficiency of RCS. To guarantee correct data dependency between layers, the safe starting time for a layer is discussed for the case where its previous layer is tiled in a different CLP. Experiments on two convolutional applications validate that the loop tiling technique integrated with the Multi-CLP structure can efficiently meet power budgets and further reduce energy consumption by 61.7%.

Power optimization through peripheral circuit reusing integrated with Loop Tiling for RRAM crossbar-based CNN

Design Automation and Test in Europe, 2018

Co-Authors: Weiwen Chen, Wenjuan Cui, Yuanchun Zhou, Keni Qiu

Abstract:

Convolutional neural networks (CNNs) have been widely adopted to make predictions on large amounts of data in modern embedded systems. Prior studies have shown that convolutional computations, which consist of many multiply and accumulate (MAC) operations, are the most computationally expensive portion of a CNN. Compared to executing MAC operations on GPUs and FPGAs, CNN implementation in the RRAM crossbar-based computing system (RCS) demonstrates the outstanding advantages of high performance and low power. However, the current design is energy-unbalanced among its three parts: RRAM crossbar computation, peripheral circuits, and memory accesses. The latter two can significantly limit the potential gains of RCS. To address the high power overhead of peripheral circuits in RCS, the Peripheral Circuit Unit (PeriCU)-Reuse scheme has been proposed to meet a given power budget. In this paper, it is further observed that memory accesses can be bypassed if two adjacent layers are assigned to different PeriCUs. In this way, memory accesses can be reduced and thus performance and power can be improved. A loop tiling technique is proposed to save memory accesses. Experiments on two convolutional applications validate that the proposed loop tiling technique can reduce energy consumption by 61.7%.


Mohammad Alshboul - One of the best experts on this subject based on the ideXlab platform.

WET: Write Efficient Loop Tiling for Non-Volatile Main Memory

Design Automation Conference, 2020

Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin

Abstract:

Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling on matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, through choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. In our experiments, WET reduces writes by 81% while simultaneously improving performance.


James Tuck - One of the best experts on this subject based on the ideXlab platform.

WET: Write Efficient Loop Tiling for Non-Volatile Main Memory

Design Automation Conference, 2020

Co-Authors: Mohammad Alshboul, James Tuck, Yan Solihin

Abstract:

Future systems are expected to increasingly include a Non-Volatile Main Memory (NVMM). However, due to the limited NVMM write endurance, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few or no studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observations include that tiling on matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM friendly, through choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set into the Last Level Cache (LLC) to reduce the number of writes to NVMM. In our experiments, WET reduces writes by 81% while simultaneously improving performance.

