Data Locality


The experts below are selected from a list of 360 experts worldwide, ranked by the ideXlab platform.

Michelle Mills Strout - One of the best experts on this subject based on the ideXlab platform.

  • MatRox: modular approach for improving Data Locality in hierarchical matrix approximation
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
    Co-Authors: Bangtian Liu, Michelle Mills Strout, Kazem Cheshmi, Saeed Soori, Maryam Mehri Dehnavi
    Abstract:

    Hierarchical matrix approximations have gained significant traction in the machine learning and scientific computing communities because they exploit available low-rank structure in kernel methods to compress the kernel matrix. The resulting compressed matrix, the HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplication, with tuneable accuracy, in an evaluation phase. Existing implementations of HMatrix evaluation do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Current solutions also require the compression phase to re-execute if the kernel method or the required accuracy changes. MatRox is a framework that uses novel structure-analysis strategies, together with code specialization and a storage format, to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplication. Modularizing the matrix compression phase enables computations to be reused when the input accuracy or the kernel function changes. MatRox-generated code for matrix-matrix multiplication is 2.98x, 1.60x, and 5.98x faster than the library implementations available in GOFMM, SMASH, and STRUMPACK, respectively. Additionally, the ability to reuse portions of the compression computation across accuracy changes leads to up to a 2.64x improvement with MatRox over five changes to accuracy using GOFMM.
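The core idea the abstract relies on can be sketched in a few lines: once an off-diagonal kernel block is approximated by low-rank factors, a block-vector product no longer needs the full block. The sketch below (illustrative only, not MatRox's actual code; all names are hypothetical) uses a rank-1 block U·Vᵀ stored as two vectors, turning an O(n²) product into O(n).

```python
# Sketch (not MatRox itself): a rank-1 off-diagonal block u * v^T of a
# kernel matrix is stored as two vectors, so a block-vector product costs
# O(n) instead of O(n^2). Names are illustrative.

def lowrank_block_matvec(u, v, x):
    """Compute (u v^T) x = u * (v . x) without forming the n x n block."""
    s = sum(vi * xi for vi, xi in zip(v, x))  # dot product v . x, O(n)
    return [ui * s for ui in u]               # scale u by the scalar, O(n)

def dense_block_matvec(block, x):
    """Reference: explicit n x n block times x, O(n^2)."""
    return [sum(bij * xj for bij, xj in zip(row, x)) for row in block]

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]
x = [1.0, 0.0, 1.0]
block = [[ui * vj for vj in v] for ui in u]  # the full rank-1 block

# both paths agree; only the cost differs
assert lowrank_block_matvec(u, v, x) == dense_block_matvec(block, x)
```

Real HMatrix codes apply this recursively over a tree of blocks with ranks tuned to the requested accuracy; the compressed factors are also what MatRox's modular compression phase can reuse when the accuracy changes.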

  • using the loop chain abstraction to schedule across loops in existing code
    International Journal of High Performance Computing and Networking, 2019
    Co-Authors: Ian J Bertolacci, Michelle Mills Strout, Jordan Riley, Stephen M Guzik, Eddie C Davis, Catherine Olschanowsky
    Abstract:

    Exposing opportunities for parallelisation while explicitly managing data locality is the primary challenge in porting and optimising computational science simulation codes for performance. OpenMP provides mechanisms for expressing parallelism, but it remains the programmer's responsibility to group computations to improve data locality. The loop chain abstraction, in which a summary of data access patterns is attached as pragmas to parallel loops, gives compilers sufficient information to automate the parallelism versus data locality trade-off. We present the syntax and semantics of loop chain pragmas for describing the loops belonging to a loop chain and for specifying a high-level schedule for the chain. We show example usage of the pragmas, detail attempts to automate the transformation of a legacy scientific code written under specific language constraints into loop chain code, describe the compiler implementation for loop chain pragmas, and present performance results for a computational fluid dynamics benchmark.
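The scheduling choice that loop chain pragmas let a compiler automate can be shown with the simplest case: two loops that share an array can be fused so each element is consumed while still hot in cache. This is an illustrative sketch of the trade-off, not the pragma implementation itself.

```python
# Sketch of the fusion schedule the loop chain abstraction can choose
# (illustrative, not the actual compiler transformation): the unfused
# version streams the intermediate array b through memory twice; the
# fused version keeps each b[i] in a register.

def unfused(a):
    b = [0.0] * len(a)
    c = [0.0] * len(a)
    for i in range(len(a)):       # loop 1: produces b
        b[i] = 2.0 * a[i]
    for i in range(len(a)):       # loop 2: consumes b after it left cache
        c[i] = b[i] + 1.0
    return c

def fused(a):
    c = [0.0] * len(a)
    for i in range(len(a)):       # one pass: the intermediate never hits memory
        c[i] = 2.0 * a[i] + 1.0
    return c

# same result, better locality
assert fused([1.0, 2.0, 3.0]) == unfused([1.0, 2.0, 3.0])
```

Fusion is only legal when the loops' data access patterns permit it, which is exactly the information the pragmas summarize for the compiler.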

  • Generalizing Run-Time Tiling with the Loop Chain Abstraction
    Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014
    Co-Authors: Michelle Mills Strout, C.D. Krieger, Jeyakumar Ramanujam, G. T. Bercea, Catherine Olschanowsky, Carlo Bertolli, Fabio Luporini, P.h.j. Kelly
    Abstract:

    Many scientific applications are organized in a data-parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well but does not convert data reuse between loops into data locality. This paper focuses on parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are common in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops and improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel; however, the run-time routines for performing sparse tiling were hand-coded per application. In this paper, we present a generalized full sparse tiling algorithm that takes the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark.
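Why A[B[i]] forces a runtime approach can be seen in a toy version of the tiling step: which data element each iteration touches is only known once the index array B exists, so grouping iterations by the data they touch has to happen at runtime. The sketch below (illustrative, far simpler than the paper's generalized algorithm; names are hypothetical) buckets iterations by data region.

```python
# Sketch of runtime iteration grouping for indirect accesses like A[B[i]]
# (illustrative, not the paper's generalized full sparse tiling): tiles
# that keep data reuse local can only be built once B is known.

def build_tiles(B, num_tiles, data_len):
    """Assign iteration i to a tile based on the data region B[i] falls in."""
    region = data_len // num_tiles
    tiles = [[] for _ in range(num_tiles)]
    for i, bi in enumerate(B):
        tiles[min(bi // region, num_tiles - 1)].append(i)
    return tiles

B = [0, 5, 1, 4, 2, 7]            # indirect index array, known only at runtime
tiles = build_tiles(B, num_tiles=2, data_len=8)
# iterations touching data[0..3] vs data[4..7] land in separate tiles,
# so each tile's working set stays small
assert tiles == [[0, 2, 4], [1, 3, 5]]
```

The full algorithm additionally tracks loop-to-loop dependences so tiles can span multiple loops and be ordered in a task graph for shared-memory parallel execution.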

  • A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers
    SC '14: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis, 2014
    Co-Authors: Catherine Olschanowsky, Michelle Mills Strout, Stephen Guzik, John Loffeld, Jeffrey Hittinger
    Abstract:

    Structured-grid PDE solver frameworks parallelize over boxes, which are rectangular domains of cells or faces in a structured grid. In the Chombo framework, box sizes are typically 16³ or 32³, but larger box sizes such as 128³ would have less surface area relative to volume and therefore less storage, copying, and/or ghost-cell communication overhead. Unfortunately, current on-node parallelization schemes perform poorly for these larger box sizes. In this paper, we investigate 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128³ boxes achieve close to ideal parallel scaling and come close to matching the performance of 16³ boxes on three different multicore systems for a benchmark that is a proxy for program idioms found in computational fluid dynamics (CFD) codes.
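The surface-to-volume argument behind the larger box sizes is simple arithmetic, sketched below assuming a single ghost layer for illustration: ghost cells live on the surface of a box, so their share of storage and communication shrinks as the box grows.

```python
# Back-of-envelope for the surface-to-volume argument in the abstract,
# assuming one ghost layer (g=1) for illustration.

def ghost_fraction(n, g=1):
    """Fraction of an (n+2g)^3 allocation that is ghost cells."""
    interior = n ** 3
    total = (n + 2 * g) ** 3
    return (total - interior) / total

# 16^3 boxes pay roughly 30% overhead for one ghost layer;
# 128^3 boxes pay under 5%
assert ghost_fraction(16) > 0.29
assert ghost_fraction(128) < 0.05
```

This is why 128³ boxes reduce ghost-cell overhead, and why it matters that the paper's loop variants recover good on-node parallel scaling at that size.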

Maryam Mehri Dehnavi - One of the best experts on this subject based on the ideXlab platform.

  • MatRox: modular approach for improving Data Locality in hierarchical matrix approximation
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
    Co-Authors: Bangtian Liu, Michelle Mills Strout, Kazem Cheshmi, Saeed Soori, Maryam Mehri Dehnavi
    Abstract:

    Hierarchical matrix approximations have gained significant traction in the machine learning and scientific computing communities because they exploit available low-rank structure in kernel methods to compress the kernel matrix. The resulting compressed matrix, the HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplication, with tuneable accuracy, in an evaluation phase. Existing implementations of HMatrix evaluation do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Current solutions also require the compression phase to re-execute if the kernel method or the required accuracy changes. MatRox is a framework that uses novel structure-analysis strategies, together with code specialization and a storage format, to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplication. Modularizing the matrix compression phase enables computations to be reused when the input accuracy or the kernel function changes. MatRox-generated code for matrix-matrix multiplication is 2.98x, 1.60x, and 5.98x faster than the library implementations available in GOFMM, SMASH, and STRUMPACK, respectively. Additionally, the ability to reuse portions of the compression computation across accuracy changes leads to up to a 2.64x improvement with MatRox over five changes to accuracy using GOFMM.

Bangtian Liu - One of the best experts on this subject based on the ideXlab platform.

  • MatRox: modular approach for improving Data Locality in hierarchical matrix approximation
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
    Co-Authors: Bangtian Liu, Michelle Mills Strout, Kazem Cheshmi, Saeed Soori, Maryam Mehri Dehnavi
    Abstract:

    Hierarchical matrix approximations have gained significant traction in the machine learning and scientific computing communities because they exploit available low-rank structure in kernel methods to compress the kernel matrix. The resulting compressed matrix, the HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplication, with tuneable accuracy, in an evaluation phase. Existing implementations of HMatrix evaluation do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Current solutions also require the compression phase to re-execute if the kernel method or the required accuracy changes. MatRox is a framework that uses novel structure-analysis strategies, together with code specialization and a storage format, to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplication. Modularizing the matrix compression phase enables computations to be reused when the input accuracy or the kernel function changes. MatRox-generated code for matrix-matrix multiplication is 2.98x, 1.60x, and 5.98x faster than the library implementations available in GOFMM, SMASH, and STRUMPACK, respectively. Additionally, the ability to reuse portions of the compression computation across accuracy changes leads to up to a 2.64x improvement with MatRox over five changes to accuracy using GOFMM.

Catherine Olschanowsky - One of the best experts on this subject based on the ideXlab platform.

  • using the loop chain abstraction to schedule across loops in existing code
    International Journal of High Performance Computing and Networking, 2019
    Co-Authors: Ian J Bertolacci, Michelle Mills Strout, Jordan Riley, Stephen M Guzik, Eddie C Davis, Catherine Olschanowsky
    Abstract:

    Exposing opportunities for parallelisation while explicitly managing data locality is the primary challenge in porting and optimising computational science simulation codes for performance. OpenMP provides mechanisms for expressing parallelism, but it remains the programmer's responsibility to group computations to improve data locality. The loop chain abstraction, in which a summary of data access patterns is attached as pragmas to parallel loops, gives compilers sufficient information to automate the parallelism versus data locality trade-off. We present the syntax and semantics of loop chain pragmas for describing the loops belonging to a loop chain and for specifying a high-level schedule for the chain. We show example usage of the pragmas, detail attempts to automate the transformation of a legacy scientific code written under specific language constraints into loop chain code, describe the compiler implementation for loop chain pragmas, and present performance results for a computational fluid dynamics benchmark.

  • Generalizing Run-Time Tiling with the Loop Chain Abstraction
    Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014
    Co-Authors: Michelle Mills Strout, C.D. Krieger, Jeyakumar Ramanujam, G. T. Bercea, Catherine Olschanowsky, Carlo Bertolli, Fabio Luporini, P.h.j. Kelly
    Abstract:

    Many scientific applications are organized in a data-parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well but does not convert data reuse between loops into data locality. This paper focuses on parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are common in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops and improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel; however, the run-time routines for performing sparse tiling were hand-coded per application. In this paper, we present a generalized full sparse tiling algorithm that takes the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark.

  • A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers
    SC '14: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis, 2014
    Co-Authors: Catherine Olschanowsky, Michelle Mills Strout, Stephen Guzik, John Loffeld, Jeffrey Hittinger
    Abstract:

    Structured-grid PDE solver frameworks parallelize over boxes, which are rectangular domains of cells or faces in a structured grid. In the Chombo framework, box sizes are typically 16³ or 32³, but larger box sizes such as 128³ would have less surface area relative to volume and therefore less storage, copying, and/or ghost-cell communication overhead. Unfortunately, current on-node parallelization schemes perform poorly for these larger box sizes. In this paper, we investigate 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128³ boxes achieve close to ideal parallel scaling and come close to matching the performance of 16³ boxes on three different multicore systems for a benchmark that is a proxy for program idioms found in computational fluid dynamics (CFD) codes.

Umar Ibrahim - One of the best experts on this subject based on the ideXlab platform.

  • Efficient concurrent search trees using portable fine-grained Locality
    IEEE, 2019
    Co-Authors: Hoai Phuong Ha, Otto Anshus, Umar Ibrahim
    Abstract:

    Concurrent search trees are crucial data abstractions widely used in many important systems such as databases, file systems, and data storage. Like other fundamental abstractions for energy-efficient computing, concurrent search trees should support both high concurrency and fine-grained data locality in a platform-independent manner. However, existing portable fine-grained locality-aware search trees, such as those based on the van Emde Boas layout (vEB-based trees), poorly support concurrent update operations, while existing highly concurrent search trees, such as non-blocking search trees, do not consider fine-grained data locality. In this paper, we first present a novel methodology for achieving both portable fine-grained data locality and high concurrency in search trees. Based on this methodology, we devise a novel locality-aware concurrent search tree called GreenBST. To the best of our knowledge, GreenBST is the first practical search tree that achieves both portable fine-grained data locality and high concurrency. We analyze and compare GreenBST's energy efficiency (in operations/Joule) and performance (in operations/second) with seven prominent concurrent search trees on a high performance computing (HPC) platform (Intel Xeon), an embedded platform (ARM), and an accelerator platform (Intel Xeon Phi) using parallel micro-benchmarks (Synchrobench). Our experimental results show that GreenBST achieves the best energy efficiency and performance on all platforms: up to 50 percent more energy efficiency and 60 percent higher throughput than the best competitor in the parallel benchmarks. These results confirm the viability of our methodology for achieving both portable fine-grained data locality and high concurrency in search trees.
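The van Emde Boas layout the abstract contrasts against can be sketched concretely: a tree of height h is split at h//2, the top subtree is laid out first, then each bottom subtree contiguously, recursively. Any root-to-leaf path then touches only O(log_B n) memory blocks for every block size B at once, which is what makes the locality portable. The sketch below is illustrative only, not GreenBST's actual layout code.

```python
# Sketch of the van Emde Boas (vEB) recursive layout behind vEB-based
# trees (illustrative, not GreenBST's implementation). Nodes carry heap
# numbering: node k has children 2k and 2k+1.

def veb_order(root=1, height=4):
    """Return node labels of a complete binary tree in vEB memory order."""
    if height == 1:
        return [root]
    top_h = height // 2           # split the tree at half its height
    bottom_h = height - top_h
    order = veb_order(root, top_h)          # top subtree first, contiguously
    d = 2 ** (top_h - 1)                    # number of leaves of the top subtree
    for leaf in (root * d + i for i in range(d)):
        for child in (2 * leaf, 2 * leaf + 1):
            order += veb_order(child, bottom_h)  # then each bottom subtree
    return order

# 15-node tree: root triangle {1,2,3}, then four bottom triangles,
# each stored contiguously in memory
assert veb_order(1, 4) == [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```

The paper's observation is that this layout is great for searches but awkward to update concurrently, which motivates GreenBST's combination of coarser locality-aware blocks with concurrent update support.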
