The Experts below are selected from a list of 10,383 Experts worldwide, ranked by the ideXlab platform.
Santosh G Abraham - One of the best experts on this subject based on the ideXlab platform.
-
Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
Hawaii International Conference on System Sciences, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Proposes an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and the variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, which has random first-level caches, confirm the general nature of the analytic results.
-
Impact of Load Imbalance on the Design of Software Barriers
International Conference on Parallel Processing, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic placement barrier in which slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of processors can be constructed. We demonstrate the applicability of our results by performing measurements on a small SOR relaxation program running on a 56-processor KSR1.
A E Eichenberger - One of the best experts on this subject based on the ideXlab platform.
-
Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
Hawaii International Conference on System Sciences, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Proposes an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and the variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, which has random first-level caches, confirm the general nature of the analytic results.
-
Impact of Load Imbalance on the Design of Software Barriers
International Conference on Parallel Processing, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic placement barrier in which slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of processors can be constructed. We demonstrate the applicability of our results by performing measurements on a small SOR relaxation program running on a 56-processor KSR1.
Laxmi N. Bhuyan - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
Farzad Khorasani - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
Bryan Rowe - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.