Memory Latency

The experts below are selected from a list of 4152 experts worldwide ranked by the ideXlab platform.

Chia-lin Yang - One of the best experts on this subject based on the ideXlab platform.

  • Memory Latency Reduction via Thread Throttling
    43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010
    Co-Authors: Hsiang-yun Cheng, Jian Li, Chia-lin Yang
    Abstract:

    The memory wall is a well-known obstacle to processor performance improvement. The popularity of multi-core architectures further exacerbates the problem, since the memory resource is shared by all cores. Interference among requests from different cores may prolong the latency of memory accesses and thereby degrade system performance. To tackle the problem, this paper proposes to decouple application threads into compute and memory tasks, and to restrict the number of concurrent memory tasks to avoid interference among memory requests. With this scheduling restriction, however, a CPU core may unnecessarily stay idle, which hurts overall performance. We therefore develop a memory thread throttling mechanism that dynamically tunes the number of allowable memory threads under workload variation to improve system performance. The proposed run-time mechanism monitors the memory and computation ratios of a program for phase detection. It then decides the memory thread constraint for the next program phase based on an analytical model that estimates system performance under different constraint values. To prove the concept, we prototype the mechanism in several real-world applications as well as synthetic workloads and evaluate their performance on real machines. The experimental results demonstrate up to 20% speedup for a pool of synthetic workloads on an Intel i7 (Nehalem) machine, matching the speedup estimated by the proposed analytical model. Furthermore, the intelligent run-time scheduling leads to a geometric-mean performance improvement of 12% for real-world applications on the same hardware.
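
    A minimal software sketch of the throttling idea, assuming a semaphore-based cap on how many threads may be in their memory-intensive phase at once; the thread counts, task bodies, and names below are our own illustration, not the paper's run-time system, and the fixed limit stands in for the constraint the paper retunes per program phase.

      import threading

      # Hypothetical sketch: at most MAX_MEM_THREADS threads may be in their
      # memory phase at once; the paper's run-time would retune this limit per
      # program phase using its analytical model.
      MAX_MEM_THREADS = 2                      # assumed constraint for this sketch
      mem_slots = threading.Semaphore(MAX_MEM_THREADS)

      def worker(thread_id, compute_task, memory_task):
          compute_task(thread_id)              # compute phase: runs unrestricted
          with mem_slots:                      # memory phase: throttled
              memory_task(thread_id)

      def run(num_threads, compute_task, memory_task):
          threads = [threading.Thread(target=worker,
                                      args=(i, compute_task, memory_task))
                     for i in range(num_threads)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()

      if __name__ == "__main__":
          run(8,
              lambda i: sum(x * x for x in range(10_000)),   # stand-in compute task
              lambda i: bytearray(1 << 20))                  # stand-in memory task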

  • Tolerating Memory Latency through push prefetching for pointer-intensive applications
    ACM Transactions on Architecture and Code Optimization, 2004
    Co-Authors: Chia-lin Yang, Alvin R. Lebeck, Hung-wei Tseng
    Abstract:

    Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of their irregular memory access patterns and the pointer-chasing problem. In this paper, we propose a cooperative hardware/software prefetching framework, the push architecture, which is designed specifically for linked data structures (LDS). The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse an LDS and uses a prefetch engine to execute them ahead of the CPU execution. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data movement model. It attaches a prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor, so a series of pointer dereferences becomes a pipelined process rather than a serial one. Simulation results show that the push architecture can reduce memory stall time by up to 100% on a suite of pointer-intensive applications, reducing overall execution time by an average of 15%.
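
    A back-of-the-envelope timing model, assuming hypothetical cycle counts, that illustrates why pushing nodes up the hierarchy pipelines a pointer chase while a conventional pull-based traversal serializes it; the numbers and function names are ours, not the paper's simulation parameters.

      # Serial pull: each dereference must complete before the next address is known.
      def pull_latency(n_nodes, mem_latency):
          return n_nodes * mem_latency

      # Pipelined push: a prefetch engine near memory chases pointers at a short
      # per-hop latency and pushes each node upward, overlapping the transfers.
      def push_latency(n_nodes, mem_latency, hop_latency):
          return mem_latency + (n_nodes - 1) * hop_latency

      if __name__ == "__main__":
          N, MEM, HOP = 100, 200, 20           # assumed cycle counts for illustration
          print("pull:", pull_latency(N, MEM), "cycles")
          print("push:", push_latency(N, MEM, HOP), "cycles")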

Sanjay J. Patel - One of the best experts on this subject based on the ideXlab platform.

  • OUTRIDER: Efficient Memory Latency tolerance with decoupled strands
    38th Annual International Symposium on Computer Architecture (ISCA), 2011
    Co-Authors: Neal C. Crago, Sanjay J. Patel
    Abstract:

    We present OUTRIDER, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. OUTRIDER enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams that separate memory-accessing and memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order microarchitecture. Moreover, instead of adding more threads as is done in modern GPUs, OUTRIDER can tolerate memory latency with fewer threads and reduced contention for resources shared among threads. We demonstrate that OUTRIDER can outperform single-threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% on data-parallel applications in a 1024-core system. Moreover, OUTRIDER achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved area efficiency compared to a multithreaded core.
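
    A software analogue of the decoupling idea, assuming a bounded queue between a memory-accessing strand and a memory-consuming strand; this is our illustration of the access/execute split, not the OUTRIDER hardware or its ISA mechanism.

      import queue
      import threading

      def access_strand(data, out_q):
          # memory-accessing strand: "loads" values and forwards them downstream
          for value in data:
              out_q.put(value)
          out_q.put(None)                      # end-of-stream marker

      def execute_strand(in_q):
          # memory-consuming strand: computes on values as they arrive, decoupled
          # from the latency of any individual access
          total = 0
          while (value := in_q.get()) is not None:
              total += value * value
          return total

      if __name__ == "__main__":
          data = list(range(1_000))
          q = queue.Queue(maxsize=64)          # bounded, like a decoupling FIFO
          producer = threading.Thread(target=access_strand, args=(data, q))
          producer.start()
          print(execute_strand(q))
          producer.join()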

Shih-lien Lu - One of the best experts on this subject based on the ideXlab platform.

  • Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency
    IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), 2013
    Co-Authors: Jianmin Chen, Zhen Yang, Jih-kwon Peir, Xiaoyuan Li, Shih-lien Lu
    Abstract:

    Modern general-purpose graphics processing units (GPGPUs) exploit parallelism in applications through massively parallel architectures and use multithreading to hide instruction and memory latencies. Such architectures have become increasingly popular for parallel applications written in CUDA/OpenCL. In this paper, we investigate thread scheduling algorithms on such highly threaded GPGPUs. Traditional round-robin scheduling schemes are inefficient in handling instruction execution and memory accesses with disparate latencies. We introduce a new GPGPU thread (warp) scheduling algorithm that enables a flexible round-robin distance for efficiently utilizing multithreaded parallelism and uses program-guided priority shifts among concurrent warps to allow more overlap between short-latency compute instructions and long-latency memory accesses. Performance evaluations demonstrate that the new scheduling algorithm improves a set of kernel execution times by an average of 12%, with a 52% reduction in scheduler stall cycles, over the fine-granularity round-robin scheme. We also provide a thorough evaluation of various thread scheduling algorithms with respect to the number of hardware threads, the scheduling overhead, and the global memory latency.
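
    A simplified sketch of a warp scheduler with a configurable round-robin distance, assuming that "distance" means how many ready warps are rotated through before the scheduler returns to the front of the queue; the data layout and names are ours, and the sketch omits the program-guided priority shifts described above.

      from collections import deque

      def schedule(warps, distance):
          ready = deque(warps)
          issue_order = []
          while ready:
              # rotate through at most `distance` ready warps before revisiting
              window = [ready.popleft() for _ in range(min(distance, len(ready)))]
              for warp in window:
                  issue_order.append(warp["id"])
                  warp["pc"] += 1
                  if warp["pc"] < warp["length"]:    # warp still has instructions
                      ready.append(warp)
          return issue_order

      if __name__ == "__main__":
          warps = [{"id": i, "pc": 0, "length": 3} for i in range(4)]
          print(schedule(warps, distance=2))         # e.g. [0, 1, 2, 3, 0, 1, ...]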

Mohamed Hassan - One of the best experts on this subject based on the ideXlab platform.

  • On the Off-Chip Memory Latency of Real-Time Systems: Is DDR DRAM Really the Best Option?
    IEEE Real-Time Systems Symposium (RTSS), 2018
    Co-Authors: Mohamed Hassan
    Abstract:

    Predictable execution time upon accessing shared memories in multi-core real-time systems is a stringent requirement. A plethora of existing works focus on analyzing Double Data Rate Dynamic Random Access Memories (DDR DRAMs) or on redesigning the memory controller to provide predictable memory behavior. In this paper, we show that DDR DRAMs by construction suffer from inherent limitations with respect to such predictability. These limitations lead to 1) highly variable access latencies that fluctuate with factors such as access patterns and the memory state left by previous accesses, and 2) overly pessimistic latency bounds. As a result, DDR DRAMs can be ill-suited for real-time systems that mandate strictly predictable performance with tight timing constraints. Targeting these systems, we promote an alternative off-chip memory solution based on the emerging Reduced Latency DRAM (RLDRAM) protocol and propose a predictable memory controller (RLDC) to manage accesses to this memory. Compared with state-of-the-art predictable DDR controllers, the proposed solution provides up to 11× less timing variability and a 6.4× reduction in worst-case memory latency.
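
    A small worked example, assuming hypothetical service times and round-robin arbitration among cores, of why a fixed-latency memory part tightens the worst-case bound; the numbers are illustrative only and are not taken from the paper's analysis.

      # Under round-robin arbitration, a request may wait behind one request from
      # every other core, so the bound scales with cores * per-request service time.
      def worst_case_latency(cores, service_time):
          return cores * service_time

      if __name__ == "__main__":
          DDR_ROW_CONFLICT = 60      # assumed cycles for a DDR row-conflict access
          DDR_ROW_HIT = 20           # assumed cycles for a DDR row-hit access
          RLD_FIXED = 25             # assumed fixed cycles for an RLDRAM-like access
          for cores in (2, 4, 8):
              print(f"{cores} cores: DDR bound {worst_case_latency(cores, DDR_ROW_CONFLICT)}"
                    f" (row hits alone would need {worst_case_latency(cores, DDR_ROW_HIT)}),"
                    f" fixed-latency bound {worst_case_latency(cores, RLD_FIXED)}")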

Jeanloup Baer - One of the best experts on this subject based on the ideXlab platform.

  • Effective hardware-based data prefetching for high-performance processors
    IEEE Transactions on Computers, 1995
    Co-Authors: Tienfu Chen, Jeanloup Baer
    Abstract:

    Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. The three designs differ mostly in the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency ahead of the real program counter and that is used as the control mechanism to generate prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) all three hardware prefetching schemes yield significant reductions in the data access penalty compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one in terms of cost-performance.
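
    A minimal sketch of the reference prediction table idea in the "basic" one-iteration-ahead variant, assuming a simplified two-state entry (the paper's RPT uses a richer state machine); the class and field names are ours.

      class RPT:
          def __init__(self):
              self.table = {}                       # load PC -> (last_addr, stride, stable)

          def access(self, pc, addr):
              """Record a load and return a predicted prefetch address, if any."""
              if pc not in self.table:
                  self.table[pc] = (addr, 0, False)
                  return None
              last_addr, stride, _ = self.table[pc]
              new_stride = addr - last_addr
              stable = new_stride == stride and stride != 0
              self.table[pc] = (addr, new_stride, stable)
              return addr + new_stride if stable else None

      if __name__ == "__main__":
          rpt = RPT()
          for i in range(6):                        # a[i] accessed with stride 8
              addr = 0x1000 + 8 * i
              pred = rpt.access(pc=0x400, addr=addr)
              print(hex(addr), "->", hex(pred) if pred is not None else "no prefetch")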

  • Reducing Memory Latency via non-blocking and prefetching caches
    International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), 1992
    Co-Authors: Tienfu Chen, Jeanloup Baer
    Abstract:

    Non-blocking caches and prefetching caches are two techniques for hiding memory latency by overlapping processor computation with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of the two approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction-level simulations of the SPEC benchmarks show that hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.
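
    A toy timing model, assuming each miss is followed by some amount of independent work before the missing data is used, that illustrates why a non-blocking cache reduces stall time relative to a blocking one; the latency and overlap figures are hypothetical, not the paper's simulation results.

      def blocking_stall(independent_work_per_miss, mem_latency):
          # A blocking cache stalls for the full miss latency on every miss.
          return mem_latency * len(independent_work_per_miss)

      def non_blocking_stall(independent_work_per_miss, mem_latency):
          # A non-blocking cache overlaps each miss with the independent
          # instructions available before the first use of the missing data.
          return sum(max(0, mem_latency - work) for work in independent_work_per_miss)

      if __name__ == "__main__":
          MEM = 50                                  # assumed miss latency in cycles
          overlap = [10, 40, 5, 60]                 # assumed independent work per miss
          print("blocking    :", blocking_stall(overlap, MEM), "stall cycles")
          print("non-blocking:", non_blocking_stall(overlap, MEM), "stall cycles")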
