Memory Latency

The experts below are selected from a list of 4152 experts worldwide ranked by the ideXlab platform.

Chia-lin Yang - One of the best experts on this subject based on the ideXlab platform.

  • Memory Latency Reduction via Thread Throttling
    43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010
    Co-Authors: Hsiang-yun Cheng, Jian Li, Chia-lin Yang
    Abstract:

    The memory wall is a well-known obstacle to processor performance improvement. The popularity of multi-core architectures further exacerbates the problem, since the memory resource is shared by all cores. Interference among requests from different cores may prolong the latency of memory accesses and thereby degrade system performance. To tackle the problem, this paper proposes to decouple application threads into compute and memory tasks, and to restrict the number of concurrent memory tasks to avoid interference among memory requests. With this scheduling restriction, however, a CPU core may unnecessarily stay idle, which hurts overall performance. We therefore develop a memory thread throttling mechanism that dynamically tunes the number of allowable memory threads under workload variation to improve system performance. The proposed run-time mechanism monitors the memory and computation ratios of a program for phase detection. It then decides the memory thread constraint for the next program phase based on an analytical model that estimates system performance under different constraint values. To prove the concept, we prototype the mechanism in several real-world applications as well as synthetic workloads and evaluate their performance on real machines. The experimental results demonstrate up to 20% speedup for a pool of synthetic workloads on an Intel i7 (Nehalem) machine, matching the speedup estimated by the proposed analytical model. Furthermore, the intelligent run-time scheduling leads to a geometric-mean performance improvement of 12% for real-world applications on the same hardware.
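
    A minimal software sketch of the throttling idea, assuming a semaphore-based cap on how many threads may be in their memory-intensive phase at once; the thread counts, task bodies, and names below are our own illustration, not the paper's run-time system, and the fixed limit stands in for the constraint the paper retunes per program phase.

      import threading

      # Hypothetical sketch: at most MAX_MEM_THREADS threads may be in their
      # memory phase at once; the paper's run-time would retune this limit per
      # program phase using its analytical model.
      MAX_MEM_THREADS = 2                      # assumed constraint for this sketch
      mem_slots = threading.Semaphore(MAX_MEM_THREADS)

      def worker(thread_id, compute_task, memory_task):
          compute_task(thread_id)              # compute phase: runs unrestricted
          with mem_slots:                      # memory phase: throttled
              memory_task(thread_id)

      def run(num_threads, compute_task, memory_task):
          threads = [threading.Thread(target=worker,
                                      args=(i, compute_task, memory_task))
                     for i in range(num_threads)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()

      if __name__ == "__main__":
          run(8,
              lambda i: sum(x * x for x in range(10_000)),   # stand-in compute task
              lambda i: bytearray(1 << 20))                  # stand-in memory task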

  • Tolerating Memory Latency through push prefetching for pointer-intensive applications
    ACM Transactions on Architecture and Code Optimization, 2004
    Co-Authors: Chia-lin Yang, Alvin R. Lebeck, Hung-wei Tseng
    Abstract:

    Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of their irregular memory access patterns and the pointer-chasing problem. In this paper, we propose a cooperative hardware/software prefetching framework, the push architecture, which is designed specifically for linked data structures (LDS). The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse an LDS and uses a prefetch engine to execute them ahead of the CPU execution. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data movement model. It attaches a prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor, so a series of pointer dereferences becomes a pipelined process rather than a serial one. Simulation results show that the push architecture can reduce memory stall time by up to 100% on a suite of pointer-intensive applications, reducing overall execution time by an average of 15%.
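
    A back-of-the-envelope timing model, assuming hypothetical cycle counts, that illustrates why pushing nodes up the hierarchy pipelines a pointer chase while a conventional pull-based traversal serializes it; the numbers and function names are ours, not the paper's simulation parameters.

      # Serial pull: each dereference must complete before the next address is known.
      def pull_latency(n_nodes, mem_latency):
          return n_nodes * mem_latency

      # Pipelined push: a prefetch engine near memory chases pointers at a short
      # per-hop latency and pushes each node upward, overlapping the transfers.
      def push_latency(n_nodes, mem_latency, hop_latency):
          return mem_latency + (n_nodes - 1) * hop_latency

      if __name__ == "__main__":
          N, MEM, HOP = 100, 200, 20           # assumed cycle counts for illustration
          print("pull:", pull_latency(N, MEM), "cycles")
          print("push:", push_latency(N, MEM, HOP), "cycles")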

Sanjay J. Patel - One of the best experts on this subject based on the ideXlab platform.

  • OUTRIDER: Efficient Memory Latency tolerance with decoupled strands
    38th Annual International Symposium on Computer Architecture (ISCA), 2011
    Co-Authors: Neal C. Crago, Sanjay J. Patel
    Abstract:

    We present OUTRIDER, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. OUTRIDER enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order microarchitecture. Moreover, instead of adding more threads as is done in modern GPUs, OUTRIDER can tolerate memory latency with fewer threads and reduced contention for resources shared among threads. We demonstrate that OUTRIDER can outperform single-threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% on data-parallel applications in a 1024-core system. Moreover, OUTRIDER achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved area efficiency compared to a multithreaded core.
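
    A software analogue of the decoupling idea, assuming a bounded queue between a memory-accessing strand and a memory-consuming strand; this is our illustration of the access/execute split, not the OUTRIDER hardware or its ISA mechanism.

      import queue
      import threading

      def access_strand(data, out_q):
          # memory-accessing strand: "loads" values and forwards them downstream
          for value in data:
              out_q.put(value)
          out_q.put(None)                      # end-of-stream marker

      def execute_strand(in_q):
          # memory-consuming strand: computes on values as they arrive, decoupled
          # from the latency of any individual access
          total = 0
          while (value := in_q.get()) is not None:
              total += value * value
          return total

      if __name__ == "__main__":
          data = list(range(1_000))
          q = queue.Queue(maxsize=64)          # bounded, like a decoupling FIFO
          producer = threading.Thread(target=access_strand, args=(data, q))
          producer.start()
          print(execute_strand(q))
          producer.join()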

Shih-lien Lu - One of the best experts on this subject based on the ideXlab platform.

  • Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency
    IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), 2013
    Co-Authors: Jianmin Chen, Zhen Yang, Jih-kwon Peir, Xiaoyuan Li, Shih-lien Lu
    Abstract:

    Modern general-purpose graphics processing units (GPGPUs) exploit parallelism in applications through massively parallel architectures and use multithreading to hide instruction and memory latencies. Such architectures have become increasingly popular for parallel applications written in CUDA/OpenCL. In this paper, we investigate thread scheduling algorithms on such highly threaded GPGPUs. Traditional round-robin scheduling schemes are inefficient in handling instruction execution and memory accesses with disparate latencies. We introduce a new GPGPU thread (warp) scheduling algorithm that enables a flexible round-robin distance for efficiently utilizing multithreaded parallelism and uses program-guided priority shifts among concurrent warps to allow more overlap between short-latency compute instructions and long-latency memory accesses. Performance evaluations demonstrate that the new scheduling algorithm improves a set of kernel execution times by an average of 12%, with a 52% reduction in scheduler stall cycles, over the fine-granularity round-robin scheme. We also provide a thorough evaluation of various thread scheduling algorithms with respect to the number of hardware threads, the scheduling overhead, and the global memory latency.
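
    A simplified sketch of a warp scheduler with a configurable round-robin distance, assuming that "distance" means how many ready warps are rotated through before the scheduler returns to the front of the queue; the data layout and names are ours, and the sketch omits the program-guided priority shifts described above.

      from collections import deque

      def schedule(warps, distance):
          ready = deque(warps)
          issue_order = []
          while ready:
              # rotate through at most `distance` ready warps before revisiting
              window = [ready.popleft() for _ in range(min(distance, len(ready)))]
              for warp in window:
                  issue_order.append(warp["id"])
                  warp["pc"] += 1
                  if warp["pc"] < warp["length"]:    # warp still has instructions
                      ready.append(warp)
          return issue_order

      if __name__ == "__main__":
          warps = [{"id": i, "pc": 0, "length": 3} for i in range(4)]
          print(schedule(warps, distance=2))         # e.g. [0, 1, 2, 3, 0, 1, ...]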

Mohamed Hassan - One of the best experts on this subject based on the ideXlab platform.

  • On the Off-Chip Memory Latency of Real-Time Systems: Is DDR DRAM Really the Best Option?
    IEEE Real-Time Systems Symposium (RTSS), 2018
    Co-Authors: Mohamed Hassan
    Abstract:

    Predictable execution time upon accessing shared memories in multi-core real-time systems is a stringent requirement. A plethora of existing works focus on analyzing Double Data Rate Dynamic Random Access Memories (DDR DRAMs) or on redesigning the memory controller to provide predictable memory behavior. In this paper, we show that DDR DRAMs by construction suffer from inherent limitations with respect to such predictability. These limitations lead to 1) highly variable access latencies that fluctuate with factors such as access patterns and the memory state left by previous accesses, and 2) overly pessimistic latency bounds. As a result, DDR DRAMs can be ill-suited for real-time systems that mandate strictly predictable performance with tight timing constraints. Targeting these systems, we promote an alternative off-chip memory solution based on the emerging Reduced Latency DRAM (RLDRAM) protocol and propose a predictable memory controller (RLDC) to manage accesses to this memory. Compared with state-of-the-art predictable DDR controllers, the proposed solution provides up to 11× less timing variability and a 6.4× reduction in worst-case memory latency.
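
    A small worked example, assuming hypothetical service times and round-robin arbitration among cores, of why a fixed-latency memory part tightens the worst-case bound; the numbers are illustrative only and are not taken from the paper's analysis.

      # Under round-robin arbitration, a request may wait behind one request from
      # every other core, so the bound scales with cores * per-request service time.
      def worst_case_latency(cores, service_time):
          return cores * service_time

      if __name__ == "__main__":
          DDR_ROW_CONFLICT = 60      # assumed cycles for a DDR row-conflict access
          DDR_ROW_HIT = 20           # assumed cycles for a DDR row-hit access
          RLD_FIXED = 25             # assumed fixed cycles for an RLDRAM-like access
          for cores in (2, 4, 8):
              print(f"{cores} cores: DDR bound {worst_case_latency(cores, DDR_ROW_CONFLICT)}"
                    f" (row hits alone would need {worst_case_latency(cores, DDR_ROW_HIT)}),"
                    f" fixed-latency bound {worst_case_latency(cores, RLD_FIXED)}")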

Jeanloup Baer - One of the best experts on this subject based on the ideXlab platform.

  • Effective hardware-based data prefetching for high-performance processors
    IEEE Transactions on Computers, 1995
    Co-Authors: Tienfu Chen, Jeanloup Baer
    Abstract:

    Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. The three designs differ mostly in the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency ahead of the real program counter and that is used as the control mechanism to generate prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) all three hardware prefetching schemes yield significant reductions in the data access penalty compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one in terms of cost-performance.
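
    A minimal sketch of the reference prediction table idea in the "basic" one-iteration-ahead variant, assuming a simplified two-state entry (the paper's RPT uses a richer state machine); the class and field names are ours.

      class RPT:
          def __init__(self):
              self.table = {}                       # load PC -> (last_addr, stride, stable)

          def access(self, pc, addr):
              """Record a load and return a predicted prefetch address, if any."""
              if pc not in self.table:
                  self.table[pc] = (addr, 0, False)
                  return None
              last_addr, stride, _ = self.table[pc]
              new_stride = addr - last_addr
              stable = new_stride == stride and stride != 0
              self.table[pc] = (addr, new_stride, stable)
              return addr + new_stride if stable else None

      if __name__ == "__main__":
          rpt = RPT()
          for i in range(6):                        # a[i] accessed with stride 8
              addr = 0x1000 + 8 * i
              pred = rpt.access(pc=0x400, addr=addr)
              print(hex(addr), "->", hex(pred) if pred is not None else "no prefetch")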

  • Reducing Memory Latency via non-blocking and prefetching caches
    International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), 1992
    Co-Authors: Tienfu Chen, Jeanloup Baer
    Abstract:

    Non-blocking caches and prefetching caches are two techniques for hiding memory latency by overlapping processor computation with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of the two approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction-level simulations of the SPEC benchmarks show that hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.
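
    A toy timing model, assuming each miss is followed by some amount of independent work before the missing data is used, that illustrates why a non-blocking cache reduces stall time relative to a blocking one; the latency and overlap figures are hypothetical, not the paper's simulation results.

      def blocking_stall(independent_work_per_miss, mem_latency):
          # A blocking cache stalls for the full miss latency on every miss.
          return mem_latency * len(independent_work_per_miss)

      def non_blocking_stall(independent_work_per_miss, mem_latency):
          # A non-blocking cache overlaps each miss with the independent
          # instructions available before the first use of the missing data.
          return sum(max(0, mem_latency - work) for work in independent_work_per_miss)

      if __name__ == "__main__":
          MEM = 50                                  # assumed miss latency in cycles
          overlap = [10, 40, 5, 60]                 # assumed independent work per miss
          print("blocking    :", blocking_stall(overlap, MEM), "stall cycles")
          print("non-blocking:", non_blocking_stall(overlap, MEM), "stall cycles")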
