Memory Access

The experts below are selected from a list of 125,664 experts worldwide, ranked by the ideXlab platform.

Torsten Hoefler - One of the best experts on this subject based on the ideXlab platform.

  • Enabling highly scalable remote memory access programming with MPI-3 One Sided
    Communications of the ACM, 2018
    Co-Authors: Robert Gerstenberger, Maciej Besta, Torsten Hoefler
    Abstract:

    Modern high-performance networks offer remote direct memory access (RDMA) that exposes a process's virtual address space to other processes in the network. The Message Passing Interface (MPI) specification has recently been extended with a programming interface called MPI-3 Remote Memory Access (MPI-3 RMA) for efficiently exploiting state-of-the-art RDMA features. MPI-3 RMA enables a powerful programming model that alleviates many message passing downsides. In this work, we design and develop bufferless protocols that demonstrate how to implement this interface and support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for RMA functions that enable rigorous mathematical analysis of application performance and facilitate the development of codes that solve given tasks within specified time and energy budgets. We validate the usability of our library and models with several application studies with up to half a million processes. In a wider sense, our work illustrates how to use RMA principles to accelerate computation- and data-intensive codes.
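
    As a concrete illustration of the one-sided style, here is a minimal C sketch using only standard MPI-3 RMA calls (MPI_Win_allocate, MPI_Put, and a passive-target lock_all epoch). It shows the generic interface the paper targets, not the authors' library or protocols: each rank writes its rank number into its right neighbor's window without any matching receive on the target.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* Expose one int of this process's memory as an RMA window. */
          int *buf;
          MPI_Win win;
          MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &buf, &win);
          *buf = -1;

          /* Passive-target epoch: the target does not participate. */
          int target = (rank + 1) % size;
          MPI_Win_lock_all(0, win);
          MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
          MPI_Win_flush(target, win);   /* wait for remote completion */
          MPI_Win_unlock_all(win);

          MPI_Barrier(MPI_COMM_WORLD);  /* all puts have landed */
          printf("rank %d received %d\n", rank, *buf);

          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }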

  • Notified Access: extending remote memory access programming models for producer-consumer synchronization
    International Parallel and Distributed Processing Symposium, 2015
    Co-Authors: Roberto Belli, Torsten Hoefler
    Abstract:

    Remote Memory Access (RMA) programming enables direct access to low-level hardware features to achieve high performance for distributed-memory programs. However, the design of RMA programming schemes focuses on the memory accesses and less on the synchronization. For example, in contemporary RMA programming systems, the widely used producer-consumer pattern can only be implemented inefficiently, incurring the overhead of an additional round-trip message. We propose Notified Access, a scheme where the target process of an access can receive a completion notification. This scheme enables direct and efficient synchronization with a minimum number of messages. We implement our scheme in an open source MPI-3 RMA library and demonstrate lower overheads (two cache misses per notification) than other point-to-point synchronization mechanisms. We also evaluate our implementation on three real-world benchmarks: a stencil computation, a tree computation, and a Cholesky factorization implemented with tasks. Our scheme always performs better than traditional message passing and other existing RMA synchronization schemes, providing up to 50% speedup on small messages. Our analysis shows that Notified Access is a valuable primitive for any RMA system. Furthermore, we provide guidance for the design of low-level network interfaces to support Notified Access efficiently.
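
    The round trip the paper eliminates is visible in a standard MPI-3 sketch of the pattern (this is the baseline, not the paper's Notified Access API, which extends the interface so that the data transfer itself carries the notification): the producer must issue a second operation, a flag update, before the consumer can detect that the data has arrived.

      #include <mpi.h>

      /* Producer: write n doubles into the consumer's window, then raise
       * a flag in a second window so the consumer can poll for arrival.
       * Notified Access would fold this second operation into the put. */
      void produce(const double *data, int n, int consumer,
                   MPI_Win data_win, MPI_Win flag_win) {
          int one = 1;
          MPI_Win_lock(MPI_LOCK_SHARED, consumer, 0, data_win);
          MPI_Put(data, n, MPI_DOUBLE, consumer, 0, n, MPI_DOUBLE, data_win);
          MPI_Win_unlock(consumer, data_win);   /* data remotely complete */

          MPI_Win_lock(MPI_LOCK_SHARED, consumer, 0, flag_win);
          MPI_Accumulate(&one, 1, MPI_INT, consumer, 0, 1, MPI_INT,
                         MPI_REPLACE, flag_win);
          MPI_Win_unlock(consumer, flag_win);   /* the extra message */
      }

      /* Consumer: poll the local flag inside a lock epoch on its own
       * window, refreshing the local view with MPI_Win_sync. */
      void consume(int *flag, int myrank, MPI_Win flag_win) {
          MPI_Win_lock(MPI_LOCK_SHARED, myrank, 0, flag_win);
          while (*flag == 0)
              MPI_Win_sync(flag_win);
          MPI_Win_unlock(myrank, flag_win);
      }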

  • Fault tolerance for remote memory access programming models
    High Performance Distributed Computing, 2014
    Co-Authors: Maciej Besta, Torsten Hoefler
    Abstract:

    Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly-scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.
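
    To suggest the flavor of the in-memory checkpointing ingredient, here is a deliberately simplified C sketch (an illustration under assumed buddy pairing, not the paper's protocol, which additionally covers logging of remote accesses and transparent recovery): each rank periodically puts a copy of its critical state into a window on a buddy rank, so the snapshot survives the failure of either rank alone.

      #include <mpi.h>

      /* Copy n doubles of local state into the buddy's checkpoint window.
       * If this rank later fails, the buddy still holds the snapshot. */
      void checkpoint(const double *state, int n, int buddy, MPI_Win ckpt_win) {
          MPI_Win_lock(MPI_LOCK_EXCLUSIVE, buddy, 0, ckpt_win);
          MPI_Put(state, n, MPI_DOUBLE, buddy, 0, n, MPI_DOUBLE, ckpt_win);
          MPI_Win_unlock(buddy, ckpt_win);   /* snapshot remotely complete */
      }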

Roger D Chamberlain - One of the best experts on this subject based on the ideXlab platform.

  • A memory access model for highly-threaded many-core architectures
    Future Generation Computer Systems, 2014
    Co-Authors: Kunal Agrawal, Roger D Chamberlain
    Abstract:

    A number of highly-threaded, many-core architectures hide memory-access latency by low-overhead context switching among a large number of threads. The speedup of a program on these machines depends on how well the latency is hidden. If the number of threads were infinite, theoretically, these machines could provide the performance predicted by the PRAM analysis of these programs. However, the number of threads per processor is not infinite, and is constrained by both hardware and algorithmic limits. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to provide a more fine-grained and accurate performance prediction than the PRAM analysis. We analyze 4 algorithms for the classic all pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the dynamic programming algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the dynamic programming algorithm performs better on these machines. We validate several predictions made by our model using empirical measurements on an instantiation of a highly-threaded, many-core machine, namely the NVIDIA GTX 480.

    Highlights: We design a memory model to analyze algorithms for highly-threaded many-core systems. The model captures significant factors of performance: work, span, and memory accesses. We show the model is better than PRAM by applying both to 4 shortest paths algorithms. Empirical performance is effectively predicted by our model in many circumstances. It is the first formalized asymptotic model helpful for algorithm design on many-cores.
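
    The core intuition, that a finite thread count bounds how much latency can be hidden, can be shown with a back-of-envelope calculation in C (these formulas are a simplified illustration, not the TMM model's actual cost equations):

      #include <stdio.h>

      /* Rough per-access cost for one thread when (threads - 1) peers can
       * overlap a stalled access: latency is fully hidden only if the
       * peers have at least `latency` cycles of work between accesses. */
      double cycles_per_access(int threads, double latency, double work) {
          double hidden = (threads - 1) * work;
          double exposed = latency > hidden ? latency - hidden : 0.0;
          return work + exposed;
      }

      int main(void) {
          /* 400-cycle latency, 10 cycles of work between accesses:
           * 8 threads still expose 330 cycles; 48 threads hide it all. */
          printf("%.0f\n", cycles_per_access(8, 400.0, 10.0));   /* 340 */
          printf("%.0f\n", cycles_per_access(48, 400.0, 10.0));  /* 10 */
          return 0;
      }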

  • A memory access model for highly-threaded many-core architectures
    International Conference on Parallel and Distributed Systems, 2012
    Co-Authors: Kunal Agrawal, Roger D Chamberlain
    Abstract:

    Many-core architectures excel at hiding memory-access latency by low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, then theoretically these machines should provide the performance predicted by the PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to give more fine-grained performance prediction than the PRAM analysis. We analyze 4 algorithms for the classic all pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
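
    For reference, the Floyd-Warshall algorithm compared in the abstract is the classic triply-nested dynamic program; its Theta(V^3) work with a regular, streaming access pattern is one reason it maps well onto latency-hiding many-core machines:

      /* Floyd-Warshall all-pairs shortest paths on a dense V x V matrix
       * stored in row-major order; dist[i*v + j] is the current best
       * known distance from vertex i to vertex j. */
      void floyd_warshall(double *dist, int v) {
          for (int k = 0; k < v; k++)
              for (int i = 0; i < v; i++)
                  for (int j = 0; j < v; j++) {
                      double through_k = dist[i*v + k] + dist[k*v + j];
                      if (through_k < dist[i*v + j])
                          dist[i*v + j] = through_k;
                  }
      }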

Thomas Moscibroda - One of the best experts on this subject based on the ideXlab platform.

  • Stall-time fair memory access scheduling for chip multiprocessors
    International Symposium on Microarchitecture, 2007
    Co-Authors: Onur Mutlu, Thomas Moscibroda
    Abstract:

    DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into account inter-thread interference. Therefore, different threads running together on the same chip can experience extremely different memory system performance: one thread can experience a severe slowdown or starvation while another is unfairly prioritized by the memory scheduler. This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system. The goal of the proposed scheduler is to "equalize" the DRAM-related slowdown experienced by each thread due to interference from other threads, without hurting overall system performance. As such, STFM takes into account inherent memory characteristics of each thread and does not unfairly penalize threads that use the DRAM system without interfering with other threads. We show that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput (i.e., weighted speedup of threads) on a wide variety of workloads and systems. For example, averaged over 32 different workloads running on an 8-core CMP, the ratio between the highest DRAM-related slowdown and the lowest DRAM-related slowdown reduces from 5.26X to 1.4X, while the average system throughput improves by 7.6%. We qualitatively and quantitatively compare STFM to one new and three previously-proposed memory access scheduling algorithms, including network fair queueing. Our results show that STFM provides the best fairness, system throughput, and scalability.
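
    The fairness rule at the heart of STFM can be sketched in a few lines of C (a software illustration of the decision logic described in the abstract; the real mechanism is a hardware scheduler with its own slowdown estimators):

      typedef struct {
          double t_shared;  /* estimated memory stall time with interference */
          double t_alone;   /* estimated memory stall time if run alone */
      } thread_stats;

      /* Return the index of the thread to prioritize, or -1 to fall back
       * to the throughput-oriented baseline ordering. Unfairness is the
       * ratio of the largest to the smallest per-thread slowdown. */
      int stfm_pick(const thread_stats *t, int n, double alpha) {
          double max_s = 0.0, min_s = 1e300;
          int slowest = -1;
          for (int i = 0; i < n; i++) {
              double s = t[i].t_shared / t[i].t_alone;  /* slowdown */
              if (s > max_s) { max_s = s; slowest = i; }
              if (s < min_s) min_s = s;
          }
          return (max_s / min_s > alpha) ? slowest : -1;
      }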

M Kandemir - One of the best experts on this subject based on the ideXlab platform.

  • Addressing end-to-end memory access latency in NoC-based multicores
    International Symposium on Microarchitecture, 2012
    Co-Authors: Akbar Sharifi, Emre Kultursay, M Kandemir
    Abstract:

    To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this should be done with care, as improving the latency of one access can worsen the latency of another as a result of resource sharing. Therefore, the goal should be to balance the latencies of memory accesses issued by an application in an execution phase, while ensuring a low average latency value. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages such that, in a given period of time, messages of an application that experience higher latencies than the average message latency for that application are expedited and a more uniform memory latency pattern is achieved. Our second scheme prioritizes the request messages that are destined for idle memory banks over others, with the goal of improving bank utilization and preventing long queues from building up in front of the memory banks. These two network prioritization-based optimizations together lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes can achieve 15%, 10% and 13% performance improvement on memory-intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.
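
    A toy arbitration rule capturing both schemes might look as follows in C (an illustrative sketch under assumed per-message metadata, not the authors' router design):

      #include <stdbool.h>

      typedef struct {
          bool is_request;        /* request to memory vs. response to core */
          bool dest_bank_idle;    /* for requests: target bank has no queue */
          double age;             /* cycles since the message was injected */
          double app_avg_latency; /* running average for this application */
      } message;

      /* Scheme 2 boosts requests headed to idle banks; scheme 1 expedites
       * responses already slower than their application's average. */
      bool has_priority(const message *m) {
          if (m->is_request)
              return m->dest_bank_idle;
          return m->age > m->app_avg_latency;
      }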

Lizy K John - One of the best experts on this subject based on the ideXlab platform.

  • Statistical pattern-based modeling of GPU memory access streams
    Design Automation Conference, 2017
    Co-Authors: Reena Panda, Xinnian Zheng, Jiajun Wang, Andreas Gerstlauer, Lizy K John
    Abstract:

    Recent research studies have shown that modern GPU performance is often limited by the memory system performance. Optimizing memory hierarchy performance requires GPU designers to draw design insights based on the cache and memory behavior of end-user applications. Unfortunately, it is often difficult to get access to end-user workloads due to the confidential or proprietary nature of the software/data. Furthermore, the efficiency of early design space exploration of cache and memory systems is often limited due to either the slow speed of detailed simulation techniques or the limited scope of state-of-the-art cache analytical models. To enable efficient GPU memory system exploration, we present a novel methodology and framework that statistically models the GPU memory access stream locality. The proposed G-MAP (GPU Memory Access Proxy) framework models the regularity in code-localized memory access patterns of GPGPU applications and the parallelism in the GPU's execution model to create miniaturized memory proxies. We evaluate G-MAP using 18 GPGPU benchmarks and show that G-MAP proxies can replicate the cache/memory performance of the original applications with over 90% accuracy across over 5000 different L1/L2 cache, prefetcher and memory configurations.
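
    To suggest the flavor of statistical, pattern-based proxy generation (a toy sketch in C, not the actual G-MAP framework): a synthetic address stream can be replayed by sampling strides from a measured stride histogram, preserving the locality profile without shipping the original code or data.

      #include <stdint.h>
      #include <stdlib.h>

      /* Advance a synthetic address stream: stride_val[i] occurs with
       * probability stride_prob[i] (the probabilities sum to 1). */
      uint64_t next_address(uint64_t addr, const int64_t *stride_val,
                            const double *stride_prob, int n) {
          double r = (double)rand() / RAND_MAX, acc = 0.0;
          for (int i = 0; i < n; i++) {
              acc += stride_prob[i];
              if (r <= acc)
                  return addr + (uint64_t)stride_val[i];
          }
          return addr + (uint64_t)stride_val[n - 1];  /* rounding fallback */
      }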