The Experts below are selected from a list of 10,383 Experts worldwide, ranked by the ideXlab platform.
Santosh G Abraham - One of the best experts on this subject based on the ideXlab platform.
-
Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
Hawaii International Conference on System Sciences, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Proposes an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and the variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, which has random first-level caches, confirm the general nature of the analytic results.
-
Impact of Load Imbalance on the Design of Software Barriers
International Conference on Parallel Processing, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic placement barrier in which slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of processors can be constructed. We demonstrate the applicability of our results by performing measurements on a small SOR relaxation program running on a 56-processor KSR1.
A E Eichenberger - One of the best experts on this subject based on the ideXlab platform.
-
Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors
Hawaii International Conference on System Sciences, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Proposes an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic Load Imbalance introduced by network contention and by a random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and the variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR (Kendall Square Research) system, which has random first-level caches, confirm the general nature of the analytic results.
-
Impact of Load Imbalance on the Design of Software Barriers
International Conference on Parallel Processing, 1995. Co-Authors: A E Eichenberger, Santosh G Abraham. Abstract: Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four, as previously thought, but increases from four to as much as 128 in a 4K-processor system as the Load Imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation over a range of degrees. We also investigate a dynamic placement barrier in which slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of processors can be constructed. We demonstrate the applicability of our results by performing measurements on a small SOR relaxation program running on a 56-processor KSR1.
Laxmi N. Bhuyan - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
Farzad Khorasani - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.
Bryan Rowe - One of the best experts on this subject based on the ideXlab platform.
-
Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. Co-Authors: Farzad Khorasani, Bryan Rowe, Rajiv Gupta, Laxmi N. Bhuyan. Abstract: Nested patterns are one of the most frequently occurring algorithmic themes in GPU applications, where coarse-grained tasks are constituted from a number of fine-grained ones. However, efficient execution of irregular nested patterns, with coarse-grained tasks that vary substantially in size, has remained an open problem for the GPU's SIMT architecture. Existing methods rely on static task decomposition, where one thread or a fixed number of threads inside the SIMD grouping (warp) carries out the fine-grained tasks. These approaches fail to provide portable performance across a diversity of irregular inputs. Moreover, due to intra-warp Load Imbalance, they incur warp underutilization. In this paper, we introduce a novel software technique called Collaborative Task Engagement (CTE) that, unlike previous methods, achieves sustained high warp execution efficiency across irregular inputs and provides portable performance. CTE assigns a group of coarse-grained tasks to the warp and allows the threads inside the warp to carry out the expanded list of fine-grained tasks collaboratively. In multiple rounds, all the warp threads perform the mapping portion of the fine-grained tasks and participate in a reduction phase with the appropriate lanes to reduce the calculated values. This scheme avoids over-subscription or under-subscription of threads while preserving the benefits of parallel reduction. We prepared a CUDA C++ device-side template library for developers to easily express nested patterns in GPU kernels using our technique. Our experiments show that CTE delivers up to 37% warp execution efficiency improvement and gives up to 1.51x speedup over sub-warp decomposition with the best sub-warp width.