Thread Scheduling

The Experts below are selected from a list of 5502 Experts worldwide ranked by the ideXlab platform.

Walter Binder - One of the best experts on this subject based on the ideXlab platform.

  • improving execution unit occupancy on smt based processors through hardware aware Thread Scheduling
    Future Generation Computer Systems, 2014
    Co-Authors: Achille Peternier, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso, Walter Binder
    Abstract:

    Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one Thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves Thread Scheduling by increasing the performance of floating-point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive Threads and schedules them more efficiently without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their Threads, without any modification of the workload. Highlights: we present WorkOver to improve Thread Scheduling for better performance; we use performance counters to profile integer- and floating-point Threads; Threads are scheduled according to hardware execution-unit availability; WorkOver optimizes unit occupancy on AMD Bulldozer and IBM POWER7 processors; we measured up to 20% speedup using SPEC CPU and SciMark 2.0.
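
As a concrete illustration of the approach the abstract describes — classifying Threads by their floating-point instruction share and spreading them so each core receives a mixed instruction stream — here is a minimal Python sketch. The counter readings, the 0.5 threshold, and the thread/core names are all hypothetical; WorkOver itself reads hardware performance counters on Linux rather than taking them as literals.

```python
def classify(fp_ops, total_ops, threshold=0.5):
    """Label a thread FP-intensive when FP ops dominate its mix."""
    return "fp" if total_ops and fp_ops / total_ops >= threshold else "int"

def assign(threads, cores):
    """Spread FP and integer threads round-robin over the cores so each
    core receives a varied instruction mix (one FP plus one integer
    thread per core, where counts allow)."""
    fp = [t for t, kind in threads if kind == "fp"]
    ints = [t for t, kind in threads if kind == "int"]
    placement = {c: [] for c in cores}
    for group in (fp, ints):
        for i, t in enumerate(group):
            placement[cores[i % len(cores)]].append(t)
    return placement

# Hypothetical per-thread counter readings: (name, fp_ops, total_ops).
samples = [("t0", 900, 1000), ("t1", 100, 1000),
           ("t2", 800, 1000), ("t3", 50, 1000)]
threads = [(name, classify(fp, tot)) for name, fp, tot in samples]
print(assign(threads, ["core0", "core1"]))
# → {'core0': ['t0', 't1'], 'core1': ['t2', 't3']}
```

Each core ends up with one FP-heavy and one integer-heavy Thread, which is the mixed stream the paper's occupancy argument relies on.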

  • hardware aware Thread Scheduling the case of asymmetric multicore processors
    International Conference on Parallel and Distributed Systems, 2012
    Co-Authors: Achille Peternier, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso, Walter Binder
    Abstract:

    Modern processor architectures are increasingly complex and heterogeneous, often requiring solutions tailored to the specific characteristics of each processor model. In this paper we address this problem by targeting the AMD Bulldozer processor as a case study for specific hardware-oriented performance optimizations. The Bulldozer architecture features an asymmetric simultaneous multiThreading implementation with shared floating-point units (FPUs) and per-core arithmetic logic units (ALUs). BulldOver, presented in this paper, improves Thread Scheduling by exploiting this hardware characteristic to increase the performance of floating-point-intensive workloads on Linux-based operating systems. BulldOver is a user-space monitoring tool that automatically identifies FPU-intensive Threads and schedules them more efficiently without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 10% can be achieved by simply allowing BulldOver to monitor applications, without any modification of the workload.
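
A sketch of the Bulldozer-specific placement idea: each Bulldozer module pairs two integer cores with a single shared FPU, so co-locating two FPU-intensive Threads on one module makes them contend for it. The module and thread names below are invented for illustration; the real tool discovers the topology and counters at runtime.

```python
def place_fp_threads(fp_threads, modules):
    """Give each FPU-intensive thread its own module (own FPU) while any
    module is free; overflow threads double up, which is exactly the
    contention this placement tries to avoid when enough modules exist."""
    placement = {}
    for i, t in enumerate(fp_threads):
        placement[t] = modules[i % len(modules)]
    return placement

print(place_fp_threads(["fft", "lu"], ["module0", "module1", "module2"]))
# → {'fft': 'module0', 'lu': 'module1'}  — each FP thread has a dedicated FPU
```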

Achille Peternier - One of the best experts on this subject based on the ideXlab platform.

  • improving execution unit occupancy on smt based processors through hardware aware Thread Scheduling
    Future Generation Computer Systems, 2014
    Co-Authors: Achille Peternier, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso, Walter Binder
    Abstract:

    Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one Thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves Thread Scheduling by increasing the performance of floating-point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive Threads and schedules them more efficiently without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their Threads, without any modification of the workload. Highlights: we present WorkOver to improve Thread Scheduling for better performance; we use performance counters to profile integer- and floating-point Threads; Threads are scheduled according to hardware execution-unit availability; WorkOver optimizes unit occupancy on AMD Bulldozer and IBM POWER7 processors; we measured up to 20% speedup using SPEC CPU and SciMark 2.0.

  • hardware aware Thread Scheduling the case of asymmetric multicore processors
    International Conference on Parallel and Distributed Systems, 2012
    Co-Authors: Achille Peternier, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso, Walter Binder
    Abstract:

    Modern processor architectures are increasingly complex and heterogeneous, often requiring solutions tailored to the specific characteristics of each processor model. In this paper we address this problem by targeting the AMD Bulldozer processor as a case study for specific hardware-oriented performance optimizations. The Bulldozer architecture features an asymmetric simultaneous multiThreading implementation with shared floating-point units (FPUs) and per-core arithmetic logic units (ALUs). BulldOver, presented in this paper, improves Thread Scheduling by exploiting this hardware characteristic to increase the performance of floating-point-intensive workloads on Linux-based operating systems. BulldOver is a user-space monitoring tool that automatically identifies FPU-intensive Threads and schedules them more efficiently without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 10% can be achieved by simply allowing BulldOver to monitor applications, without any modification of the workload.

Alberto Scionti - One of the best experts on this subject based on the ideXlab platform.

  • A scalable Thread Scheduling co-processor based on data-flow principles
    Future Generation Computer Systems, 2015
    Co-Authors: Roberto Giorgi, Alberto Scionti
    Abstract:

    Large synchronization and communication overhead will become a major concern in future extreme-scale machines (e.g., HPC systems, supercomputers). These systems will push performance limits upward by adopting chips equipped with an order of magnitude more cores than today. Alternative execution models can be explored to exploit the high parallelism offered by future massive many-core chips. This paper proposes the integration of standard cores with dedicated co-processing units that enable the system to support a fine-grain data-flow execution model developed within the TERAFLUX project. An instruction set architecture extension for supporting fine-grain Thread Scheduling and execution is proposed. This instruction set extension is supported by the co-processor, which provides hardware units for accelerating Thread Scheduling and distribution among the available cores. Two fundamental aspects underlie the proposed system: programmers can adopt their preferred programming model, and the compilation tools can produce a large set of Threads communicating mainly in a producer-consumer fashion, hence enabling data-flow execution. Experimental results demonstrate the feasibility of the proposed approach and its capability of scaling with an increasing number of cores. Highlights: we present a data-flow based co-processor supporting the execution of fine-grain Threads; we propose a minimalistic core ISA extension for data-flow Threads; we propose a two-level hierarchical Scheduling co-processor that implements the ISA extension; we show the scalability of the proposed system through a set of experimental results.
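
The core data-flow mechanism the abstract relies on can be sketched in a few lines: a fine-grain Thread carries a synchronization counter equal to its number of pending producer writes, and it becomes ready to run only when that counter reaches zero. The class and method names below are illustrative, not the paper's ISA encoding.

```python
from collections import deque

class DFThread:
    def __init__(self, name, inputs):
        self.name = name
        self.sync_count = inputs   # pending producer writes
        self.frame = {}            # per-thread input frame

class Scheduler:
    def __init__(self):
        self.ready = deque()

    def twrite(self, thread, slot, value):
        """Producer writes one input into the consumer's frame; the
        consumer becomes ready once its synchronization counter hits 0."""
        thread.frame[slot] = value
        thread.sync_count -= 1
        if thread.sync_count == 0:
            self.ready.append(thread)

sched = Scheduler()
adder = DFThread("add", inputs=2)
sched.twrite(adder, "a", 3)     # not ready yet: one input still pending
sched.twrite(adder, "b", 4)     # now ready
t = sched.ready.popleft()
print(t.name, t.frame["a"] + t.frame["b"])   # → add 7
```

In the paper this bookkeeping is done by hardware units in the co-processor, which is where the claimed scalability comes from; the sketch only shows the producer-consumer protocol itself.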

  • simulating a multi core x86-64 architecture with hardware isa extension supporting a data flow execution model
    International Conference on Artificial Intelligence, 2014
    Co-Authors: Antoni Portero, Alberto Scionti, Marcos Solinas, Andrea Mondelli, Paolo Faraboschi, Roberto Giorgi
    Abstract:

    The trend toward increasingly intelligent systems leads directly to a considerable demand for more and more computational power. Programming models that help exploit application parallelism on current multi-core systems exist, but with limitations. From this perspective, new execution models are arising to overcome the limits on scaling up the number of processing elements, while dedicated hardware can help with the Scheduling of Threads in many-core systems. This paper presents a data-flow based execution model that exposes up to millions of fine-grain Threads to the multi-core x86-64 architecture. We propose to augment the existing architecture with a hardware Thread Scheduling unit. The functionality of this unit is exposed by means of four dedicated instructions. Results with a pure data-flow application (i.e., recursive Fibonacci) show that the hardware Scheduling unit can load the computing cores (up to 32 in our tests) more efficiently than run-time managed Threads generated by programming models (e.g., OpenMP and Cilk). Further, our solution shows better scaling and less saturation as the number of workers increases.
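
To make the "recursive Fibonacci as fine-grain data-flow Threads" workload concrete, here is a hedged software mock of a thread-scheduling unit driven by a small schedule/write-style interface. The operation names (`tschedule`, `twrite`) and the thread representation are placeholders, not the paper's four actual instructions; the point is only how a recursive computation decomposes into continuation Threads.

```python
import collections

class TSU:
    """Toy thread-scheduling unit: threads fire when all inputs arrive."""
    def __init__(self):
        self.ready = collections.deque()
        self.threads = {}
        self.next_id = 0

    def tschedule(self, func, inputs):
        """Create a thread that runs once 'inputs' values are written."""
        tid = self.next_id; self.next_id += 1
        self.threads[tid] = {"func": func, "need": inputs, "args": []}
        if inputs == 0:
            self.ready.append(tid)
        return tid

    def twrite(self, tid, value):
        t = self.threads[tid]
        t["args"].append(value)
        if len(t["args"]) == t["need"]:
            self.ready.append(tid)

    def run(self):
        result = None
        while self.ready:
            tid = self.ready.popleft()
            t = self.threads.pop(tid)
            result = t["func"](self, *t["args"])
        return result

def fib(tsu, n, dest=None):
    """fib(n) as data-flow: two child threads feed an adder thread."""
    if n < 2:
        return n if dest is None else tsu.twrite(dest, n)
    adder = tsu.tschedule(lambda ts, a, b: finish(ts, a + b, dest), 2)
    tsu.tschedule(lambda ts: fib(ts, n - 1, adder), 0)
    tsu.tschedule(lambda ts: fib(ts, n - 2, adder), 0)

def finish(tsu, value, dest):
    return value if dest is None else tsu.twrite(dest, value)

tsu = TSU()
tsu.tschedule(lambda ts: fib(ts, 8), 0)
print(tsu.run())   # → 21
```

In the paper this dispatch loop is replaced by hardware, so the many short-lived Threads this decomposition generates can be distributed across up to 32 cores without run-time overhead.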

Sandip Kundu - One of the best experts on this subject based on the ideXlab platform.

  • an opportunistic prediction based Thread Scheduling to maximize throughput watt in amps
    International Conference on Parallel Architectures and Compilation Techniques, 2013
    Co-Authors: Arunachalam Annamalai, Rance Rodrigues, Israel Koren, Sandip Kundu
    Abstract:

    The importance of dynamic Thread Scheduling is increasing with the emergence of Asymmetric Multicore Processors (AMPs). Since the computing needs of a Thread often vary during its execution, a fixed Thread-to-core assignment is sub-optimal. Reassigning Threads to cores (Thread swapping) when the Threads start a new phase with different computational needs can significantly improve the energy efficiency of AMPs. Although identifying phase changes in the Threads is not difficult, determining the appropriate Thread-to-core assignment is a challenge. Furthermore, the problem of Thread reassignment is aggravated by the multiple power states that may be available in the cores. To this end, we propose a novel technique to dynamically assess the program phase needs and determine whether swapping Threads between core types and/or changing the voltage/frequency levels (DVFS) of the cores will result in higher throughput/Watt. This is achieved by predicting the expected throughput/Watt of the current program phase at different voltage/frequency levels on all the available core types in the AMP. We show that the benefits from Thread swapping and DVFS are orthogonal, demonstrating the potential of the proposed scheme to achieve significant benefits by seamlessly combining the two. We illustrate our approach using a dual-core High-Performance (HP)/Low-Power (LP) AMP with two power states and demonstrate significant throughput/Watt improvement over different baselines.
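
The selection step the abstract describes reduces to an argmax over the cross-product of core types and voltage/frequency levels. A minimal sketch, with entirely made-up throughput and power predictions (the paper derives these from phase-level prediction, not literals):

```python
def best_assignment(predictions):
    """predictions: {(core_type, vf_level): (throughput, watts)}.
    Pick the (core, V/f) pair maximizing predicted throughput/Watt."""
    return max(predictions,
               key=lambda k: predictions[k][0] / predictions[k][1])

# Hypothetical predictions for the current program phase.
phase = {
    ("HP", "high"): (4.0, 20.0),   # 0.200 throughput/Watt
    ("HP", "low"):  (2.5, 8.0),    # 0.3125
    ("LP", "high"): (1.8, 5.0),    # 0.360  <- best
    ("LP", "low"):  (1.0, 3.0),    # 0.333
}
print(best_assignment(phase))   # → ('LP', 'high')
```

Because the search ranges over both dimensions at once, it captures the orthogonality the paper claims: the best point may change core type, V/f level, or both.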

  • scalable Thread Scheduling in asymmetric multicores for power efficiency
    Symposium on Computer Architecture and High Performance Computing, 2012
    Co-Authors: Rance Rodrigues, Arunachalam Annamalai, Israel Koren, Sandip Kundu
    Abstract:

    The emergence of asymmetric multicore processors (AMPs) has elevated the problem of Thread Scheduling in such systems. The computing needs of a Thread often vary during its execution (phases); hence, reassigning Threads to cores (Thread swapping) upon detection of such a change can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the Thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best Thread-to-core assignment in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current Thread-to-core assignment and determine whether swapping the Threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed Thread Scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high-performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic Thread Scheduling technique. We compare our proposed scheme against previously published schemes based on online learning and two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed Thread Scheduling decisions in AMPs.
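
The sampling-free decision itself is a simple comparison once the counter-driven models have produced estimates. In this sketch the per-(thread, core) performance/Watt estimates are invented numbers standing in for the model outputs:

```python
def should_swap(est, t_big, t_small):
    """est[(thread, core)] = estimated performance/Watt for that thread
    on that core type. Swap only if exchanging the threads between the
    big and small core improves the summed estimate — no trial runs."""
    current = est[(t_big, "big")] + est[(t_small, "small")]
    swapped = est[(t_big, "small")] + est[(t_small, "big")]
    return swapped > current

# Hypothetical estimates: "mem" barely benefits from the big core,
# "cpu" scales well on it.
est = {
    ("mem", "big"): 1.0, ("mem", "small"): 0.9,
    ("cpu", "big"): 1.8, ("cpu", "small"): 0.8,
}
print(should_swap(est, "mem", "cpu"))   # → True: give the big core to "cpu"
```

This is the contrast with online-learning schemes: the comparison costs one table lookup per pair instead of a sampling run on each core type, which is what keeps it viable as core counts grow.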

  • dynamic Thread Scheduling in asymmetric multicores to maximize performance per watt
    International Parallel and Distributed Processing Symposium, 2012
    Co-Authors: Arunachalam Annamalai, Rance Rodrigues, Israel Koren, Sandip Kundu
    Abstract:

    Recent trends in technology scaling have enabled the incorporation of multiple processor cores on a single die. Depending on the characteristics of the cores, the multicore may be either symmetric (SMP) or asymmetric (AMP). Several studies have shown that, in general, for a given resource and power budget, AMPs are likely to outperform their SMP counterparts. However, due to the heterogeneity in AMPs, Scheduling Threads is always a challenge. To address the issue of Thread Scheduling in AMPs, we propose a novel dynamic Thread Scheduling scheme that continuously monitors the current characteristics of the executing Threads and determines the best Thread-to-core assignment. The real-time monitoring is done using hardware performance counters that capture several microarchitecture-independent characteristics of the Threads in order to determine the Thread-to-core affinity. By controlling Thread Scheduling in hardware, the Operating System (OS) need not be aware of the underlying microarchitecture, significantly simplifying the OS scheduler for an AMP architecture. The proposed scheme is compared against simple Round-Robin Scheduling and a recently published dynamic Thread Scheduling technique that allows swapping of Threads (between asymmetric cores) at coarse-grain time intervals, once every context switch (20 ms for the Linux scheduler). The presented results indicate that our proposed scheme is able to achieve, on average, a performance/Watt benefit of 10.5% over the previously published dynamic Scheduling scheme and about 12.9% over the Round-Robin scheme.
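
The advantage over once-per-context-switch swapping comes from reacting the moment the monitored characteristics drift. A hedged sketch of such a trigger (the counter choice and the 25% tolerance are invented; the paper's hardware monitors its own microarchitecture-independent metrics):

```python
def phase_changed(prev, curr, tol=0.25):
    """Compare per-interval counter vectors (e.g. [IPC, miss rate]);
    relative drift beyond 'tol' in any component flags a new phase,
    triggering reassignment immediately rather than at the next 20 ms
    context switch."""
    return any(abs(c - p) / max(abs(p), 1e-9) > tol
               for p, c in zip(prev, curr))

print(phase_changed([1.0, 0.05], [1.0, 0.09]))   # → True (miss rate up 80%)
print(phase_changed([1.0, 0.05], [1.1, 0.05]))   # → False (within tolerance)
```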

Markovic Nikola - One of the best experts on this subject based on the ideXlab platform.

  • Thread Lock Section-Aware Scheduling on Asymmetric Single-ISA Multi-Core
    Institute of Electrical and Electronics Engineers (IEEE), 2016
    Co-Authors: Markovic Nikola, Nemirovsky Daniel, Unsal Osman, Valero Mateo, Cristal Adrian
    Abstract:

    As Thread-level parallelism in applications has continued to expand, so has research in chip multi-core processors. As more and more applications become multi-Threaded, we expect to find a growing number of Threads executing on a machine. As a consequence, the operating system will require increasingly larger amounts of CPU time to schedule these Threads efficiently. Instead of perpetuating the trend of performing more complex Thread Scheduling in the operating system, we propose a Scheduling mechanism that can be efficiently implemented in hardware as well. Our approach of identifying multi-Threaded application bottlenecks, such as Thread synchronization sections, complements the Fairness-aware Scheduler method. It achieves an average speedup of 11.5 percent (geometric mean) compared to the state-of-the-art Fairness-aware Scheduler.
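
The bottleneck-identification idea can be sketched as a core-selection rule: when one Thread is inside a critical section that others are blocked on, it gets the large core so the serialized section drains faster. All names and the waiter counts below are invented for illustration:

```python
def pick_for_big_core(threads):
    """threads: list of (name, in_lock_section, waiters). Prefer the
    thread whose critical section blocks the most waiters — the
    synchronization bottleneck; otherwise fall back to the first
    runnable thread (a stand-in for the fairness-aware rotation)."""
    holders = [t for t in threads if t[1] and t[2] > 0]
    if holders:
        return max(holders, key=lambda t: t[2])[0]
    return threads[0][0]

workload = [("t0", False, 0), ("t1", True, 3), ("t2", True, 1)]
print(pick_for_big_core(workload))   # → t1: its section blocks 3 waiters
```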

  • Hardware Thread Scheduling algorithms for single-ISA asymmetric CMPs
    Universitat Politècnica de Catalunya, 2015
    Co-Authors: Markovic Nikola
    Abstract:

    Through the past several decades, following Moore's law, the semiconductor industry doubled the number of transistors on a single chip roughly every eighteen months. For a long time this continuous increase in transistor budget drove the increase in performance, as processors continued to exploit the instruction-level parallelism (ILP) of sequential programs. This pattern hit a wall in the early years of the twenty-first century, when designing larger and more complex cores became difficult for power and complexity reasons. Computer architects responded by integrating many cores on the same die, thereby creating Chip Multicore Processors (CMPs). In the last decade, computing technology experienced tremendous developments: Chip Multiprocessors (CMPs) expanded from symmetric and homogeneous to asymmetric or heterogeneous designs. Having cores of different types in a single processor enables optimizing performance, power, and energy efficiency for a wider range of workloads. It enables chip designers to employ specialization (that is, each type of core can be used for the type of computation where it delivers the best performance/energy trade-off). The benefits of Asymmetric Chip Multiprocessors (ACMPs) are intuitive, as it is well known that different workloads have different resource requirements. CMPs improve the performance of applications by exploiting Thread-Level Parallelism (TLP). Parallel applications relying on multiple Threads must be efficiently managed and dispatched for execution if the parallelism is to be properly exploited. Since more and more applications become multi-Threaded, we expect to find a growing number of Threads executing on a machine. Consequently, the operating system will require increasingly larger amounts of CPU time to schedule these Threads efficiently.
    Thus, dynamic Thread Scheduling techniques are of paramount importance in ACMP designs, since they can make or break the performance benefits derived from the asymmetric hardware or parallel software. Several Thread Scheduling methods have been proposed and applied to ACMPs. In this thesis, we first study state-of-the-art Thread Scheduling techniques and identify the main reasons limiting Thread-level parallelism in ACMP systems. We propose three novel approaches to schedule and manage Threads and exploit Thread-level parallelism implemented in hardware, instead of perpetuating the trend of performing more complex Thread Scheduling in the operating system. Our first goal is to improve the performance of ACMP systems by improving Thread Scheduling at the hardware level. We also show that hardware Thread Scheduling reduces the energy consumption of ACMP systems by allowing better utilization of the underlying hardware.

  • Hardware Scheduling algorithms for asymmetric single-ISA CMPs
    Barcelona Supercomputing Center, 2015
    Co-Authors: Markovic Nikola, Nemirovsky Daniel, Unsal, Osman Sabri, Valero Cortés Mateo, Cristal Kestelman Adrián
    Abstract:

    As Thread-level parallelism in applications has continued to expand, so has research in chip multi-core processors. Since more and more applications become multi-Threaded, we expect to find a growing number of Threads executing on a machine. Consequently, the operating system will require increasingly larger amounts of CPU time to schedule these Threads efficiently. Instead of perpetuating the trend of performing more complex Thread Scheduling in the operating system, we propose two lightweight hardware Thread Scheduling mechanisms. The first is a Hardware Round-Robin Scheduling (HRRS) policy, influenced by fairness Scheduling techniques, which reduces Thread serialization and improves parallel Thread performance. The second is a Thread Lock Section-aware Scheduling (TLSS) policy, which extends the HRRS policy. TLSS is influenced by Fairness-aware Scheduling and bottleneck-identification techniques; it complements the HRRS scheduler by identifying multi-Threaded application bottlenecks such as Thread synchronization sections. We show that HRRS outperforms the Fairness-aware scheduler by 17 percent, while TLSS outperforms HRRS by 11 percent, on an ACMP consisting of one large (out-of-order) core and three small (in-order) cores.
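
The HRRS rotation over the paper's 1-large/3-small configuration can be sketched in a couple of lines (the quantum length and thread names are illustrative): each quantum every Thread shifts one slot, so over four quanta every Thread gets equal time on the out-of-order core.

```python
def rotate(assignment):
    """assignment: threads ordered by core slot [large, small, small,
    small]. One rotation step moves each thread to the next slot, so
    slot 0 (the large core) is shared fairly over time."""
    return assignment[-1:] + assignment[:-1]

slots = ["t0", "t1", "t2", "t3"]     # t0 currently on the large core
for _ in range(2):
    slots = rotate(slots)
print(slots)   # → ['t2', 't3', 't0', 't1']: t2 now holds the large core
```

TLSS then overrides this rotation whenever a synchronization-section bottleneck is detected, which is where its additional 11 percent over plain HRRS comes from.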