Load Imbalance

The experts below are selected from a list of 10,383 experts worldwide, ranked by the ideXlab platform

Santosh G Abraham - One of the best experts on this subject based on the ideXlab platform.

  • Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
    Hawaii International Conference on System Sciences, 1995
    Co-Authors: A E Eichenberger, Santosh G Abraham
    Abstract:

    We propose an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, whose first-level caches use random replacement, confirm the general nature of the analytic results.
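    To make the fuzzy-barrier idea concrete, below is a minimal split-phase barrier sketch in C++17: arrive() announces that a thread has finished the parallel region, wait() blocks only at the point where post-barrier data is actually needed, and independent work issued between the two calls overlaps the remaining imbalance of the other threads. The barrier is single-use and all identifiers are illustrative; it is not the mechanism modeled in the paper.

      #include <atomic>
      #include <cstdio>
      #include <thread>
      #include <vector>

      class FuzzyBarrier {
      public:
          explicit FuzzyBarrier(int n) : num_threads_(n) {}

          // Phase 1: announce arrival and return immediately.
          void arrive() { arrived_.fetch_add(1, std::memory_order_acq_rel); }

          // Phase 2: block until every thread has announced arrival.
          void wait() const {
              while (arrived_.load(std::memory_order_acquire) < num_threads_)
                  std::this_thread::yield();
          }

      private:
          const int num_threads_;
          std::atomic<int> arrived_{0};
      };

      int main() {
          constexpr int kThreads = 4;
          FuzzyBarrier barrier(kThreads);
          std::vector<std::thread> workers;
          for (int id = 0; id < kThreads; ++id) {
              workers.emplace_back([&barrier, id] {
                  // ... imbalanced work of the parallel region ...
                  barrier.arrive();
                  // Fuzzy region: work that does not depend on the other threads,
                  // overlapped with their remaining parallel-region work.
                  std::printf("thread %d doing independent work\n", id);
                  barrier.wait();  // block only where post-barrier data is needed
              });
          }
          for (auto& w : workers) w.join();
          return 0;
      }

    The benefit comes from hiding the non-deterministic imbalance behind the independent work instead of exposing it as idle spin time at a conventional barrier.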

  • Impact of Load Imbalance on the Design of Software Barriers
    International Conference on Parallel Processing, 1995
    Co-Authors: A E Eichenberger, Santosh G Abraham
    Abstract:

    Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When this assumption is relaxed, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance grows. The optimum degree calculated with our analytic model yields performance within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic-placement barrier in which slow processors migrate toward the root of the software combining tree; when sufficient slack is available, dynamic placement reduces the synchronization delay by a factor close to the depth of the tree. By choosing a suitable tree degree and using dynamic placement, software barriers that scale to large numbers of processors can be constructed. We demonstrate the applicability of our results with measurements of a small SOR relaxation program running on a 56-processor KSR1.
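    The degree of the combining tree is the tunable parameter in this study. Below is a compact C++17 sketch of a sense-reversing combining-tree barrier with a configurable degree; it assumes degree >= 2, omits the dynamic-placement optimization, and all names are illustrative rather than the paper's implementation.

      #include <algorithm>
      #include <atomic>
      #include <cstddef>
      #include <memory>
      #include <thread>
      #include <vector>

      class CombiningTreeBarrier {
          struct Node {
              std::atomic<int> count{0};  // arrivals seen in the current episode
              int expected;               // threads or sub-nodes feeding this node
              Node* parent;               // nullptr at the root
              Node(int e, Node* p) : expected(e), parent(p) {}
          };

      public:
          // Assumes threads >= 1 and degree >= 2.
          CombiningTreeBarrier(int threads, int degree) {
              std::vector<Node*> level;   // previously built level
              int width = threads;        // entities feeding the level being built
              while (true) {
                  const int nodes = (width + degree - 1) / degree;
                  std::vector<Node*> next;
                  for (int j = 0; j < nodes; ++j) {
                      const int fan_in = std::min(degree, width - j * degree);
                      storage_.push_back(std::make_unique<Node>(fan_in, nullptr));
                      next.push_back(storage_.back().get());
                  }
                  for (std::size_t j = 0; j < level.size(); ++j)
                      level[j]->parent = next[j / degree];
                  if (leaves_.empty())    // first level: map threads to leaf nodes
                      for (int t = 0; t < threads; ++t)
                          leaves_.push_back(next[t / degree]);
                  level = std::move(next);
                  if (nodes == 1) break;  // single root built
                  width = nodes;
              }
          }

          void arrive_and_wait(int tid) {
              const bool my_sense = !sense_.load(std::memory_order_acquire);
              Node* n = leaves_[tid];
              // Climb while this thread is the last arriver at each node on its path.
              while (n != nullptr &&
                     n->count.fetch_add(1, std::memory_order_acq_rel) + 1 == n->expected) {
                  n->count.store(0, std::memory_order_relaxed);  // reset for next episode
                  n = n->parent;
              }
              if (n == nullptr)
                  sense_.store(my_sense, std::memory_order_release);  // root done: release
              else
                  while (sense_.load(std::memory_order_acquire) != my_sense)
                      std::this_thread::yield();
          }

      private:
          std::vector<std::unique_ptr<Node>> storage_;
          std::vector<Node*> leaves_;     // leaves_[tid]: thread tid's entry node
          std::atomic<bool> sense_{false};
      };

    Under heavy Load Imbalance the slowest processor arrives last at every node on its path to the root, so a larger degree (a shallower tree) shortens that critical climb at the cost of more contention per node; this trade-off is what shifts the optimum degree above four.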

A E Eichenberger - One of the best experts on this subject based on the ideXlab platform.

  • Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
    Hawaii International Conference on System Sciences, 1995
    Co-Authors: A E Eichenberger, Santosh G Abraham
    Abstract:

    We propose an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, whose first-level caches use random replacement, confirm the general nature of the analytic results.

  • Impact of Load Imbalance on the Design of Software Barriers
    International Conference on Parallel Processing, 1995
    Co-Authors: A E Eichenberger, Santosh G Abraham
    Abstract:

    Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When this assumption is relaxed, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance grows. The optimum degree calculated with our analytic model yields performance within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic-placement barrier in which slow processors migrate toward the root of the software combining tree; when sufficient slack is available, dynamic placement reduces the synchronization delay by a factor close to the depth of the tree. By choosing a suitable tree degree and using dynamic placement, software barriers that scale to large numbers of processors can be constructed. We demonstrate the applicability of our results with measurements of a small SOR relaxation program running on a 56-processor KSR1.

Laxmi N. Bhuyan - One of the best experts on this subject based on the ideXlab platform.

  • Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan
    Abstract:

    Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are composed of a number of fine-grained ones. However, efficient execution of irregular nested patterns, whose coarse-grained tasks vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition in which one thread, or a fixed number of threads, inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs, and the resulting intra-warp Load Imbalance leaves warps underutilized. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and lets the threads inside the warp carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and then participate in a reduction phase with the appropriate lanes to combine the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We provide a CUDA C++ device-side template library that lets developers easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% higher warp execution efficiency and up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
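    As a rough host-side illustration of the scheduling idea (not the paper's CUDA C++ template library), the C++17 sketch below has one 32-lane "warp" collaboratively walk the flattened list of fine-grained tasks belonging to a group of coarse-grained tasks: each round gives every lane at most one fine task regardless of which coarse task owns it, and each mapped value is then reduced into its owning coarse task. All names, the task sizes, and the stand-in map function are illustrative.

      #include <algorithm>
      #include <cstddef>
      #include <cstdio>
      #include <numeric>
      #include <vector>

      constexpr int kWarpSize = 32;

      // Stand-in for the per-fine-task "map" work.
      static int map_fine_task(int coarse_id, int local_fine_idx) {
          return coarse_id + local_fine_idx;
      }

      int main() {
          // Coarse tasks with very different fine-task counts: the source of
          // intra-warp Load Imbalance in a one-thread-per-coarse-task scheme.
          const std::vector<int> fine_counts = {1, 37, 3, 90, 0, 12};
          std::vector<int> offsets(fine_counts.size() + 1, 0);
          std::partial_sum(fine_counts.begin(), fine_counts.end(), offsets.begin() + 1);
          const int total_fine = offsets.back();

          std::vector<long long> results(fine_counts.size(), 0);

          // Rounds: every lane of the warp handles at most one fine task per
          // round, so no lane idles while a neighbour drains a huge coarse task.
          for (int base = 0; base < total_fine; base += kWarpSize) {
              for (int lane = 0; lane < kWarpSize; ++lane) {
                  const int fine = base + lane;
                  if (fine >= total_fine) break;
                  // Find the coarse task that owns this fine task (the lane's segment).
                  const int owner = static_cast<int>(
                      std::upper_bound(offsets.begin(), offsets.end(), fine) -
                      offsets.begin()) - 1;
                  const int value = map_fine_task(owner, fine - offsets[owner]);
                  // On the GPU this is a warp-level segmented reduction across
                  // lanes that share an owner; serially we simply accumulate.
                  results[owner] += value;
              }
          }

          for (std::size_t i = 0; i < results.size(); ++i)
              std::printf("coarse task %zu -> %lld\n", i, results[i]);
          return 0;
      }

    The prefix sum over the fine-task counts plus the per-lane owner search are what keep every lane busy each round; a real device implementation would perform the owner lookup and the per-owner reduction with warp-wide primitives rather than serially.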

Farzad Khorasani - One of the best experts on this subject based on the ideXlab platform.

  • Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan
    Abstract:

    Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are composed of a number of fine-grained ones. However, efficient execution of irregular nested patterns, whose coarse-grained tasks vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition in which one thread, or a fixed number of threads, inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs, and the resulting intra-warp Load Imbalance leaves warps underutilized. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and lets the threads inside the warp carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and then participate in a reduction phase with the appropriate lanes to combine the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We provide a CUDA C++ device-side template library that lets developers easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% higher warp execution efficiency and up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.

Bryan Rowe - One of the best experts on this subject based on the ideXlab platform.

  • Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
    2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
    Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan
    Abstract:

    Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are composed of a number of fine-grained ones. However, efficient execution of irregular nested patterns, whose coarse-grained tasks vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition in which one thread, or a fixed number of threads, inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs, and the resulting intra-warp Load Imbalance leaves warps underutilized. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and lets the threads inside the warp carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and then participate in a reduction phase with the appropriate lanes to combine the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We provide a CUDA C++ device-side template library that lets developers easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% higher warp execution efficiency and up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
