The Experts below are selected from a list of 360 Experts worldwide, ranked by the ideXlab platform.
Erven Rohou - One of the best experts on this subject based on the ideXlab platform.
-
Towards Automatic Binary Runtime Loop De-Parallelization Using On-Stack Replacement
Information Processing Letters, 2019. Co-Authors: Marwa Yusuf, Ahmed Elmahdy, Erven Rohou. Abstract: Runtime compilation has opportunities to parallelize code that are generally not available to static parallelization approaches. However, the parallelized code can slow down performance due to unforeseen parallel overheads, such as the synchronization and speculation support required by the chosen parallelization strategy and the underlying parallel platform. Moreover, with the wide use of heterogeneous architectures, these choices become even more consequential. In this paper, we consider, for the first time, an adaptive form of the parallelization operation. We propose a method for performing on-stack de-parallelization of a parallelized binary loop at runtime, thereby allowing rapid replacement of the loop with a more optimized one. For this paper, we consider one loop parallelization strategy and propose a corresponding de-parallelization method. The method relies on stopping execution at safe points, gathering the threads' states, producing corresponding serial code, and continuing execution serially. The decision of whether to de-parallelize is based on the anticipated speedup. To assess the potential of our approach, we conducted an initial study on a small set of programs with various parallelization overheads. Results show up to a 4× performance improvement for a synchronization-intensive program on a 4-core Intel processor.
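The control structure the abstract describes (stop at safe points, gather thread state, resume serially) can be sketched in ordinary threaded C. The sketch below is only an illustration under assumed names (worker, deparallelize, body); the paper operates on binary code via on-stack replacement, which this does not attempt, and the de-parallelization trigger here is a hard-coded stand-in for a real profiling-based speedup estimate.

```c
/* Minimal sketch of runtime loop de-parallelization at safe points. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static atomic_long next_iter;       /* shared iteration counter          */
static atomic_bool deparallelize;   /* set when serial fallback should win */

static void body(long i) { data[i] = data[i] * 2.0 + 1.0; }

static void *worker(void *arg) {
    (void)arg;
    /* Safe point at every iteration boundary: check the flag before
     * claiming the next iteration, so no claimed work is ever lost. */
    while (!atomic_load(&deparallelize)) {
        long i = atomic_fetch_add(&next_iter, 1);
        if (i >= N) break;
        body(i);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    atomic_init(&next_iter, 0);
    atomic_init(&deparallelize, false);
    for (int k = 0; k < NTHREADS; k++) pthread_create(&t[k], NULL, worker, NULL);

    /* Stand-in for the speedup decision: assume profiling found that
     * synchronization overhead dominates, so parallel execution loses. */
    int overhead_dominates = 1;
    if (overhead_dominates) atomic_store(&deparallelize, true);

    for (int k = 0; k < NTHREADS; k++) pthread_join(t[k], NULL);

    /* "Gather state": the shared counter records how far the threads
     * got; finish the remaining iterations serially. */
    for (long i = atomic_load(&next_iter); i < N; i++) body(i);
    printf("done\n");
    return 0;
}
```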
-
Runtime Speculative On-Stack Parallelization of For-Loops in Binary Programs
IEEE Letters of the Computer Society, 2018. Co-Authors: Marwa Yusuf, Ahmed Elmahdy, Erven Rohou. Abstract: Nowadays almost every device has a parallel architecture, so parallelization is almost always desirable. Parallelizing legacy running programs, however, is very challenging: source code is usually unavailable, and parallelizing at runtime without restarting the program is difficult. Detecting parallelizable code is also hard, owing to possible dependencies and execution paths that are statically undecidable. Speculation is therefore a typical approach, whereby wrongly parallelized code is detected and rolled back at runtime. This paper considers using processes to implement speculative parallelization with on-stack replacement, allowing for a generally simple and portable design in which forking a new process enters the speculative state and killing a faulty process performs the rollback. While the cost of these operations is high, the approach is promising when the parallel section is long and dependency violations are rare. Our proposed system performs speculative parallelization on binary code at runtime, without source code, program restarts, or special hardware support. Initial experiments show about 2× to 3× speedup for speculative execution over serial execution when three fourths of the loop iterations are parallelizable, and a maximum measured speculation overhead over pure parallel execution of 5.8 percent.
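A minimal sketch of the fork/kill idea from the abstract: forking enters the speculative state (the child gets a copy-on-write snapshot of memory), and killing the child is the rollback. The dependence check and the pipe-based commit below are invented stand-ins, not the paper's mechanism, and the child's loop stands in for the parallelized region.

```c
/* Process-based speculation: fork = enter speculation, kill = rollback. */
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>

#define N 8

int data[N];

int main(void) {
    int fd[2];
    pipe(fd);

    pid_t spec = fork();
    if (spec == 0) {                    /* speculative state: private COW copy */
        for (int i = 0; i < N; i++)     /* optimistic version of the loop      */
            data[i] = i * i;            /* (the paper runs it in parallel)     */
        write(fd[1], data, sizeof data);    /* offer results for commit */
        _exit(0);
    }

    int violated = 0;   /* stand-in for a runtime dependence-violation check */
    if (violated) {
        kill(spec, SIGKILL);            /* rollback: discard the private copy */
        waitpid(spec, NULL, 0);
        for (int i = 0; i < N; i++) data[i] = i * i;   /* safe serial redo */
    } else {
        read(fd[0], data, sizeof data); /* commit the speculative results */
        waitpid(spec, NULL, 0);
    }

    for (int i = 0; i < N; i++) printf("%d ", data[i]);
    printf("\n");
    return 0;
}
```

Because the child's memory is copy-on-write, the rollback really is just a kill(): no undo log is needed, which is what makes the design simple and portable at the cost of expensive process operations.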
Marwa Yusuf - One of the best experts on this subject based on the ideXlab platform.
-
(Marwa Yusuf's publications on this subject are the same two papers listed above under Erven Rohou: "Towards Automatic Binary Runtime Loop De-Parallelization Using On-Stack Replacement," Information Processing Letters, 2019, and "Runtime Speculative On-Stack Parallelization of For-Loops in Binary Programs," IEEE Letters of the Computer Society, 2018.)
Scott Rixner - One of the best experts on this subject based on the ideXlab platform.
-
Predictive Parallelization: Taming Tail Latencies in Web Search
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2014. Co-Authors: Myeongjae Jeon, Saehoon Kim, Seungwon Hwang, Sameh Elnikety, Alan L. Cox, Scott Rixner. Abstract: Web search engines are optimized to reduce the high-percentile response time so as to consistently provide fast responses to almost all user queries. This is a challenging task because the query workload exhibits large variability: many short-running queries, and a few long-running queries that significantly impact the high-percentile response time. With modern multicore servers, parallelizing the processing of an individual query is a promising way to reduce query execution time, but it gives limited benefit over sequential execution because most queries see little or no speedup when parallelized. The root of this problem is that short-running queries, which dominate the workload, do not benefit from parallelization; they incur a large parallelization overhead, taking scarce resources from long-running queries. Parallelization, on the other hand, substantially reduces the execution time of long-running queries with low overhead and high parallelization efficiency. Motivated by these observations, we propose a predictive parallelization framework with two parts: (1) predicting long-running queries, and (2) selectively parallelizing them. For the first part, prediction should be accurate and efficient. For accuracy, we study a comprehensive feature set covering both term features (reflecting dynamic pruning efficiency) and query features (reflecting query complexity). For efficiency, to keep overhead low, we avoid expensive features with excessive requirements such as large memory footprints. For the second part, we use the predicted query execution time to parallelize long-running queries and process short-running queries sequentially. We implement and evaluate the predictive parallelization framework in Microsoft Bing search. Our measurements show that under moderate to heavy load, the predictive strategy reduces the 99th-percentile response time by 50% (from 200 ms to 100 ms) compared with prior approaches that parallelize all queries.
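The two-part policy (predict, then selectively parallelize) can be illustrated with a toy cost model. The feature names, weights, and the 30 ms cutoff below are made-up placeholders; Bing's actual predictor is trained on a far richer feature set.

```c
/* Sketch of selective parallelization: only queries predicted to be
 * long-running get a parallel degree greater than one. */
#include <stdio.h>

typedef struct {
    int num_terms;        /* query-complexity feature                     */
    double sum_term_cost; /* term-level feature, e.g. posting-list sizes  */
} QueryFeatures;

/* Hypothetical linear predictor; a real system would train this. */
static double predict_ms(QueryFeatures f) {
    return 2.0 * f.num_terms + 0.5 * f.sum_term_cost;
}

static void process(const char *q, int degree) {
    printf("%-12s -> degree of parallelism %d\n", q, degree);
}

int main(void) {
    const double LONG_QUERY_MS = 30.0;  /* assumed cutoff */
    QueryFeatures short_q = { 2, 10.0 }, long_q = { 6, 80.0 };

    /* Short-running queries run sequentially: parallelizing them would
     * only add overhead and steal cores from the long-running ones. */
    process("short query", predict_ms(short_q) > LONG_QUERY_MS ? 4 : 1);
    process("long query",  predict_ms(long_q)  > LONG_QUERY_MS ? 4 : 1);
    return 0;
}
```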
-
Adaptive Parallelism for Web Search
European Conference on Computer Systems, 2013. Co-Authors: Myeongjae Jeon, Sameh Elnikety, Alan L. Cox, Scott Rixner. Abstract: A web search query made to Microsoft Bing is currently parallelized by distributing the query processing across many servers. Within each of these servers, however, the query is processed sequentially. Although each server may be processing multiple queries concurrently, with modern multicore servers, parallelizing the processing of an individual query within a server may nonetheless improve the user's experience by reducing the response time. In this paper, we describe the issues that make the parallelization of an individual query within a server challenging, and we present a parallelization approach that effectively addresses these challenges. Since each server may be processing multiple queries concurrently, we also present an adaptive resource management algorithm that chooses the degree of parallelism at run-time for each query, taking into account system load and parallelization efficiency. As a result, the servers now execute queries with a high degree of parallelism at low loads, gracefully reduce the degree of parallelism as load increases, and choose sequential execution under high load. We have implemented our parallelization approach and adaptive resource management algorithm in Bing servers and evaluated them experimentally with production workloads. The experimental results show that the mean and 95th-percentile response times for queries are reduced by more than 50% under light or moderate load. Moreover, under high load, where parallelization would adversely degrade system performance, response times are kept the same as when queries are executed sequentially. In all cases, we observe no degradation in the relevance of the search results.
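A sketch of the load-adaptive degree-of-parallelism choice the abstract describes: wide at low load, degrading gracefully toward sequential execution under high load. The utilization breakpoints are invented for illustration; the paper's algorithm also weighs per-query parallelization efficiency.

```c
/* Load-adaptive degree-of-parallelism selection, with made-up breakpoints. */
#include <stdio.h>

static int choose_degree(double utilization, int cores) {
    if (utilization < 0.3) return cores;      /* light load: go wide       */
    if (utilization < 0.6) return cores / 2;  /* moderate load: back off   */
    if (utilization < 0.8) return 2;
    return 1;                                 /* heavy load: sequential    */
}

int main(void) {
    for (double u = 0.1; u < 1.0; u += 0.2)
        printf("load %.1f -> degree %d\n", u, choose_degree(u, 8));
    return 0;
}
```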
Weng Cho Chew - One of the best experts on this subject based on the ideXlab platform.
-
Analysis and Performance of a Distributed-Memory Multilevel Fast Multipole Algorithm
IEEE Transactions on Antennas and Propagation, 2005. Co-Authors: S. Velamparambil, Weng Cho Chew. Abstract: In this paper, we analyze the communication pattern and study the scalability of a distributed-memory implementation of the multilevel fast multipole algorithm (MLFMA) called ScaleME. ScaleME uses the Message Passing Interface (MPI) for communication between processors. The parallelization of MLFMA uses a novel hybrid scheme for distributing the workload across the processors. We study the communication and computational behavior and demonstrate the effectiveness of the parallelization scheme on realistic problems.
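ScaleME itself is not reproduced here, but the basic MPI distributed-workload pattern the abstract refers to looks roughly like this block-partition-and-reduce skeleton. The kernel is a stand-in, not MLFMA, whose actual hybrid distribution is considerably more involved.

```c
/* MPI skeleton: partition work across ranks, combine partial results. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-partition the iteration space across processes. */
    long lo = (long)N * rank / size, hi = (long)N * (rank + 1) / size;
    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++) local += 1.0 / (i + 1);  /* stand-in kernel */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("combined result: %f\n", total);
    MPI_Finalize();
    return 0;
}
```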
David I August - One of the best experts on this subject based on the ideXlab platform.
-
Commutative Set: A Language Extension for Implicit Parallel Programming
Programming Language Design and Implementation, 2011. Co-Authors: Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, David I. August. Abstract: Sequential programming models express a total program order, of which a partial order must be respected. This inhibits parallelizing tools from extracting scalable performance. Programmer-written semantic commutativity assertions provide a natural way to relax this partial order, thereby exposing parallelism implicitly in a program. Existing implicit parallel programming models based on semantic commutativity either require additional programming extensions or have limited expressiveness. This paper presents a generalized semantic-commutativity-based programming extension called Commutative Set (COMMSET), and associated compiler technology that enables multiple forms of parallelism. COMMSET expressions are syntactically succinct and let the programmer specify commutativity relations between groups of arbitrary structured code blocks. Using only this construct, serializing constraints that inhibit parallelization can be relaxed, independent of any particular parallelization strategy or concurrency control mechanism. COMMSET enables well-performing parallelizations in cases where they were previously inapplicable or non-performing. By extending eight sequential programs with only eight annotations per program on average, COMMSET and the associated compiler technology produced a geomean speedup of 5.7× on eight cores, compared to 1.5× for the best non-COMMSET parallelization.
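The COMMSET syntax is the paper's own; the effect it legalizes can be mimicked by hand in plain pthreads. Set insertions conflict at the memory level but commute semantically (the final set contents do not depend on insertion order), so once each insertion is made atomic the enclosing loop can run in parallel. The code below is that hand-written analogue, not COMMSET itself.

```c
/* Semantic commutativity by hand: order-independent set insertion. */
#include <pthread.h>
#include <stdio.h>

#define N 16

static int set[N], set_len;
static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;

/* "Commutative" operation: any interleaving yields the same set. */
static void set_insert(int v) {
    pthread_mutex_lock(&set_lock);
    int found = 0;
    for (int i = 0; i < set_len; i++) if (set[i] == v) found = 1;
    if (!found) set[set_len++] = v;
    pthread_mutex_unlock(&set_lock);
}

static void *worker(void *arg) {
    long base = (long)arg;
    for (int i = 0; i < N / 2; i++)
        set_insert((int)(base + i) % 10);   /* duplicates are harmless */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0L);
    pthread_create(&b, NULL, worker, (void *)5L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("set size: %d\n", set_len);   /* always 10, whatever the order */
    return 0;
}
```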
-
Decoupled Software Pipelining Creates Parallelization Opportunities
Symposium on Code Generation and Optimization, 2010. Co-Authors: Jialu Huang, Arun Raman, Yun Zhang, Thomas B. Jablin, Tzuhan Hung, David I. August. Abstract: Decoupled software pipelining (DSWP) is one approach to automatically extracting threads from loops. It partitions loops into long-running threads that communicate in a pipelined manner via inter-core queues. This work recognizes that DSWP can also be an enabling transformation for other loop parallelization techniques. This use of DSWP, called DSWP+, splits a loop into new loops with dependence patterns amenable to parallelization by techniques that were originally either inapplicable or poorly performing. By parallelizing each stage of the DSWP+ pipeline with (potentially) different techniques, not only is the benefit of DSWP increased, but the applicability and performance of other parallelization techniques are enhanced. This paper evaluates DSWP+ as an enabling framework for other transformations by applying it in conjunction with DOALL, LOCALWRITE, and SpecDOALL to individual stages of the pipeline. It demonstrates significant performance gains on a commodity 8-core multicore machine running a variety of codes transformed with DSWP+.
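The basic DSWP shape that DSWP+ builds on, one loop split into stages communicating through an inter-core queue, can be sketched as follows. The mutex/condvar ring buffer is a toy stand-in for the compiler's queues, and the split (recurrence in stage 1, independent work in stage 2) is exactly what exposes a DOALL-friendly pattern in the second stage.

```c
/* One loop split into two pipeline stages joined by a queue. */
#include <pthread.h>
#include <stdio.h>

#define ITEMS 100
#define QCAP  8

static int q[QCAP], qhead, qtail, qcount;
static pthread_mutex_t qm = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcv = PTHREAD_COND_INITIALIZER;

static void q_push(int v) {
    pthread_mutex_lock(&qm);
    while (qcount == QCAP) pthread_cond_wait(&qcv, &qm);
    q[qtail] = v; qtail = (qtail + 1) % QCAP; qcount++;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qm);
}

static int q_pop(void) {
    pthread_mutex_lock(&qm);
    while (qcount == 0) pthread_cond_wait(&qcv, &qm);
    int v = q[qhead]; qhead = (qhead + 1) % QCAP; qcount--;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qm);
    return v;
}

/* Stage 1 carries the loop recurrence, so it stays sequential. */
static void *stage1(void *arg) {
    (void)arg;
    int acc = 0;
    for (int i = 0; i < ITEMS; i++) {
        acc += i;        /* the recurrence */
        q_push(acc);     /* hand off to the next stage */
    }
    q_push(-1);          /* end-of-stream token */
    return NULL;
}

/* Stage 2 has no cross-iteration dependence: DOALL-friendly. */
static void *stage2(void *arg) {
    (void)arg;
    long out = 0;
    for (int v; (v = q_pop()) != -1; ) out += (long)v * v;
    printf("stage2 result: %ld\n", out);
    return NULL;
}

int main(void) {
    pthread_t s1, s2;
    pthread_create(&s1, NULL, stage1, NULL);
    pthread_create(&s2, NULL, stage2, NULL);
    pthread_join(s1, NULL);
    pthread_join(s2, NULL);
    return 0;
}
```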
-
Parallel-Stage Decoupled Software Pipelining
Symposium on Code Generation and Optimization, 2008. Co-Authors: Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, David I. August. Abstract: In recent years, the microprocessor industry has embraced chip multiprocessors (CMPs), also known as multi-core architectures, as the dominant design paradigm. For existing and new applications to make effective use of CMPs, it is desirable that compilers automatically extract thread-level parallelism from single-threaded applications. DOALL is a popular automatic technique for loop-level parallelization, employed successfully in the domains of scientific and numeric computing. While DOALL generally scales well with the number of iterations of the loop, its applicability is limited by the presence of loop-carried dependences. A parallelization technique with greater applicability is decoupled software pipelining (DSWP), which parallelizes loops even in the presence of loop-carried dependences. However, the scalability of DSWP is limited by the size of the loop body and the number of recurrences it contains, which are usually smaller than the loop iteration count. This work proposes a novel non-speculative compiler parallelization technique called parallel-stage decoupled software pipelining (PS-DSWP). The goal of PS-DSWP is to combine the applicability of DSWP with the scalability of DOALL parallelization. A key insight of PS-DSWP is that, after the recurrences are isolated in their own DSWP stages, portions of the loop suitable for DOALL parallelization may be exposed. PS-DSWP extends DSWP to exploit these opportunities, utilizing multiple threads to execute the same stage of a DSWPed loop in parallel. This paper describes the PS-DSWP transformation in detail and discusses its implementation in a research compiler. PS-DSWP produces an average speedup of 114% (up to a maximum of 155%) with six threads on loops from a set of five applications. Our experiments also demonstrate that PS-DSWP scales better with the number of threads than DSWP.
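PS-DSWP's key move, keeping the recurrence stage sequential while replicating the DOALL-friendly stage, can be sketched with one producer and several identical consumers. The two-counter publish/claim queue and the per-item work below are illustrative stand-ins, not the paper's implementation.

```c
/* PS-DSWP shape: sequential recurrence stage, replicated parallel stage. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITEMS   1000
#define WORKERS 3

static int buf[ITEMS];
static atomic_int produced;   /* items stage 1 has published        */
static atomic_int claimed;    /* items handed to some worker        */
static atomic_long total;

/* Sequential stage: owns the loop-carried recurrence. */
static void *recurrence_stage(void *arg) {
    (void)arg;
    int acc = 0;
    for (int i = 0; i < ITEMS; i++) {
        acc += i;                        /* the recurrence */
        buf[i] = acc;
        atomic_store(&produced, i + 1);  /* publish to the consumers */
    }
    return NULL;
}

/* Replicated parallel stage: iterations are independent, so any
 * number of copies of this stage can run (the DOALL part). */
static void *parallel_stage(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&claimed, 1);
        if (i >= ITEMS) break;
        while (atomic_load(&produced) <= i) ;   /* wait for stage 1 */
        atomic_fetch_add(&total, (long)buf[i] % 7);
    }
    return NULL;
}

int main(void) {
    pthread_t seq, w[WORKERS];
    pthread_create(&seq, NULL, recurrence_stage, NULL);
    for (int k = 0; k < WORKERS; k++)
        pthread_create(&w[k], NULL, parallel_stage, NULL);
    pthread_join(seq, NULL);
    for (int k = 0; k < WORKERS; k++) pthread_join(w[k], NULL);
    printf("total: %ld\n", atomic_load(&total));
    return 0;
}
```

Raising WORKERS scales only the replicated stage, which is precisely why PS-DSWP scales better with thread count than plain DSWP, whose stage count is fixed by the loop's structure.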