Thread-Level Parallelism: Multithreading
Instruction-level parallelism (ILP) can be effectively realised by means of pipelining, superpipelining, and then more aggressively by the use of superscalar architecture employing more hardware resources within the processor. To further improve throughput, designers have created even more complex mechanisms, such as executing some instructions in an order different from the way they occur in the instruction stream (out-of-order execution). All these architectures, however, extract ILP from a single stream of executing instructions, i.e. a single thread of execution. Such processors often suffer from many different types of dependencies among machine instructions, which limit the amount of parallelism that can be obtained. These may be resource dependencies, control dependencies caused mostly by the presence of conditional branch instructions, or true data dependencies (RAW). Here, we are not considering WAR and WAW hazards, because they can be efficiently handled using some form of register renaming.
Out of many possible alternatives, one efficient way to minimise the burden of dependencies, and thereby extract greater ILP without much increase in circuit complexity or power consumption, is multithreading. In essence, the instruction stream is divided into several smaller streams of execution, known as threads, such that the threads can be executed in parallel. With appropriate hardware support within the processor, it is possible to combine instructions from multiple independent threads of execution. Such hardware support provides the processor with a pool of instructions, in various stages of execution, that have relatively few data dependencies amongst them, since the threads are mostly independent of one another. Moreover, as critical control dependencies are also separated across a number of threads, less aggressive branch prediction (a method to resolve control dependencies) suffices.
For example, consider a processor with an instruction pipeline consisting of eight stages and a targeted superscalar performance of four instructions completed in every clock cycle. Now, assume that these instructions come from four independent threads of execution. Then, on average, the number of instructions within the processor at any point of time from one thread would be (4 × 8) ÷ 4 = 8.
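The arithmetic of this example can be checked with a short sketch (the figures are exactly those assumed above):

```python
# In-flight instructions per thread for the example above:
# an 8-stage pipeline, a 4-wide issue target, and 4 threads.
pipeline_stages = 8
issue_width = 4      # instructions completed per clock cycle (target)
num_threads = 4

in_flight_total = issue_width * pipeline_stages   # 32 instructions in flight
per_thread = in_flight_total // num_threads       # (4 x 8) / 4

print(per_thread)  # 8
```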
One distinct advantage of such hardware-supported multithreading is that pipeline stalls can be effectively utilised. If one thread running in the pipeline comes to a stall, say, while waiting for an access to main memory, then another thread that is ready can make use of the corresponding processor clock cycles, which would otherwise simply be wasted. Thus, a hardware-supported multithreading approach eventually emerges as an important latency-hiding technique.
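The latency-hiding effect can be illustrated with a toy switch-on-stall model. Everything here is an illustrative assumption, not taken from the text: thread names, a fixed 3-cycle "memory" stall after every fourth instruction, and a simple ready queue.

```python
# Toy model of switch-on-stall multithreading (all figures illustrative).
from collections import deque

def run(threads, cycles, stall_cycles=3, stall_every=4):
    """Each thread stalls for `stall_cycles` after every `stall_every`
    instructions (a crude stand-in for a main-memory access); the
    processor switches to the next ready thread instead of idling."""
    ready = deque(threads)
    stalled = {}                       # thread -> cycle when ready again
    executed = {t: 0 for t in threads}
    for cycle in range(cycles):
        # wake up threads whose stall has expired
        for t, until in list(stalled.items()):
            if cycle >= until:
                ready.append(t)
                del stalled[t]
        if not ready:
            continue                   # wasted cycle: all threads stalled
        t = ready[0]
        executed[t] += 1
        if executed[t] % stall_every == 0:
            ready.popleft()
            stalled[t] = cycle + 1 + stall_cycles
    return executed

print(run(["T0"], 20))                # single thread: stalls waste cycles
print(run(["T0", "T1", "T2"], 20))    # other threads hide the latency
```

With one thread, only 12 of the 20 cycles do useful work; with three threads, all 20 do, because a ready thread absorbs every stall cycle.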
To provide required support for multithreading, the processor must offer a separate program counter for each thread of execution to be executed concurrently. In fact, here the processor designs essentially differ in the amount and the type of additional hardware used to support concurrent thread execution. In general, instruction fetching here takes place in terms of a thread. The processor treats each thread separately and may use a number of techniques for optimizing single-thread execution including branch prediction, out-of-order execution, register renaming, and superscalar approaches. In fact, the thread- level parallelism (TLP), when appropriately combined with ILP is expected to yield an even more improved performance.
The processor design must also include mechanisms to switch between threads, mostly either on the occurrence of a pipeline stall or while scheduling in a round-robin manner. Similar to the operating system switching between running processes, the hardware context of a thread within the processor must be preserved. This basically includes the full set of registers (programmable registers and those used in register renaming), program counter, stack pointer, relevant memory map information, interrupt control bits, protection bits, and others. For N-way multithreading support, the processor must hold the thread contexts of N executing threads at a time. When the processor switches, say, from thread X to thread Y, the control logic must ensure that the execution of subsequent instruction(s) occurs with reference to the context of thread Y. It is to be noted that thread contexts need not always be saved and later restored. As long as the processor preserves multiple thread contexts within itself, all that is required is that the processor be able to switch between thread contexts from one cycle to the next.
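The idea of N resident contexts with zero save/restore cost can be sketched as follows. The class and field names (`ThreadContext`, `MultithreadedCore`, a 32-register file) are illustrative assumptions, not from the text:

```python
# Minimal sketch of per-thread hardware contexts (names illustrative).
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                    # program counter
    sp: int = 0                    # stack pointer
    regs: list = field(default_factory=lambda: [0] * 32)  # register set

class MultithreadedCore:
    """N-way multithreading: N contexts held in hardware at once, so a
    switch just redirects control logic to another context, with no
    save/restore traffic."""
    def __init__(self, n_threads):
        self.contexts = [ThreadContext() for _ in range(n_threads)]
        self.current = 0           # index of the active thread context

    def switch_to(self, thread_id):
        # No memory traffic: the context is already resident in the core.
        self.current = thread_id

    def active(self):
        return self.contexts[self.current]

core = MultithreadedCore(4)
core.active().pc = 0x1000          # thread 0 executes at 0x1000
core.switch_to(1)
core.active().pc = 0x2000          # thread 1 has its own PC
core.switch_to(0)
print(hex(core.active().pc))       # thread 0's PC is preserved
```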
A wide variety of multithreading designs has been realised in both commercial and experimental systems. Depending mainly on the specific strategy adopted for switching between threads, hardware support for multithreading may be classified into the following four principal approaches:
• Coarse-grained (blocked) multithreading: instructions of a single thread are executed until an event causing a long-latency stall (such as a cache miss) occurs, upon which the processor switches to another thread.
• Fine-grained (interleaved) multithreading: the processor switches between threads on every clock cycle, usually in a round-robin fashion, skipping any thread that is stalled.
• Simultaneous hardware multithreading (SHMT): instructions from more than one thread are issued together within a single clock cycle, filling the issue slots of a wide superscalar processor.
• Chip multiprocessing: multiple complete processor cores are replicated on a single chip, each executing a separate thread.
It is important to note that in the case of the first two approaches mentioned above, i.e. coarse-grained multithreading and fine-grained multithreading, instructions from different threads are not executed simultaneously. Instead, the processor is able to quickly switch from one thread to another, using a different set of registers and other context information. As a result, this provides better utilisation of processor resources and helps avoid a significant penalty due to cache misses and other sources of latency. The third one, the SHMT approach, however, carries out true simultaneous execution of instructions from different threads, using the replicated resources available for execution. Chip multiprocessing likewise provides simultaneous execution of instructions from different threads.
In this connection, the following observations are of immense interest. Figures 8.24 to 8.28 illustrate some of the possible alternatives in pipeline architecture in different types of processors that involve multithreading, and contrast these approaches with the corresponding ones that do not use multithreading. In all the figures, each horizontal row represents the potential issue slot or slots for a single execution cycle; that is, the width of each row corresponds to the maximum number of instructions that can be issued in a single clock cycle. (Issue slots are defined as the positions from which instructions can be issued in a given clock cycle. Instruction issue is the process of initiating instructions for execution in the processor's functional units; this occurs when an instruction moves from the decoding stage of the pipeline to the first execution stage.) The vertical dimension represents the time sequence of clock cycles. An empty (shaded) slot represents an unused execution slot, and an N indicates a no-operation (NOP) in the pipeline.
Figure 8.24 shows different approaches with a scalar (single-issue) processor:
Executing multiple threads in a scalar pipelined processor. (a) Single-threaded scalar, (b) coarse-grained multithreading scalar, and (c) fine-grained multithreading scalar.
It is interesting to observe a difference between the two forms of multithreading depicted in Figures 8.24b and 8.24c. Figure 8.24b shows a situation in which the time to perform a thread switch is one cycle (shown as a shaded portion), whereas Figure 8.24c indicates that thread switching incurs zero-cycle overhead. This is due to the fact that in fine-grained multithreading, it is assumed that there are no control or data dependencies between threads, which not only simplifies the pipeline design but also allows a thread switch to occur without any delay. On the other hand, coarse-grained multithreading, depending on the specific design and implementation, may require a clock cycle to perform a thread switch, as shown in Figure 8.24b. This happens if a fetched instruction triggers the thread switch and must then be flushed from the pipeline.
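The cost of that one-cycle switch penalty can be made concrete with a small model. The run length, cycle budget, and penalty below are illustrative assumptions in the spirit of Figure 8.24, not values from the text:

```python
# Compare coarse-grained (one-cycle switch penalty) and fine-grained
# (zero-overhead interleaving) scheduling on a single-issue pipeline.

def coarse_grained(run_length, total_cycles, switch_cost=1):
    """Run the current thread for `run_length` cycles, then pay
    `switch_cost` idle cycles on every thread switch."""
    useful = idle = cycle = 0
    while cycle < total_cycles:
        work = min(run_length, total_cycles - cycle)
        useful += work
        cycle += work
        penalty = min(switch_cost, total_cycles - cycle)
        idle += penalty
        cycle += penalty
    return useful, idle

def fine_grained(total_cycles):
    """Round-robin one instruction per cycle with no switch overhead."""
    return total_cycles, 0

print(coarse_grained(run_length=4, total_cycles=20))  # (16, 4)
print(fine_grained(20))                               # (20, 0)
```

Under these assumptions the coarse-grained pipeline loses one cycle in five to switch overhead, while the fine-grained pipeline loses none.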
While fine-grained multithreading appears to offer better processor utilisation than its counterpart, coarse-grained multithreading, it does so at the cost of single-thread performance. In the case of multiple-thread execution, the threads being executed may contend for cache resources, which may eventually culminate in costly cache misses for one or more threads.
For parallel execution, more options and opportunities can be obtained if the processor is able to issue multiple instructions per cycle. A number of principal alternative approaches in processor design are illustrated in the following section. In each case, it is presumed that the processor hardware is capable of issuing four instructions per clock cycle. However, in all these cases, only instructions from a single thread are issued in a single cycle. The popular alternatives are as follows.
• Traditional superscalar: This approach is the primary form of superscalar processor with a single thread of execution that we have already discussed in previous sections, i.e. without multithreading. This is depicted in Figure 8.25a. Here, it is presumed that the single-threaded superscalar processor hardware is capable of issuing four instructions per clock cycle (i.e. each row contains four columns).
Approaches to executing multiple threads in superscalar pipelined processors. (a) Single-threaded superscalar, (b) coarse-grained multithreading superscalar, and (c) fine-grained multithreading superscalar.
This architecture was probably the most powerful approach to providing parallelism within the processor until the relatively recent past. It is to be noted that during some cycles, not all of the available issue slots are occupied (e.g., the first row of Figure 8.25a); during these cycles, the number of instructions issued is less than the maximum. This situation is referred to as horizontal loss. In some other cycles, no issue slots are used at all (the third and fifth rows of Figure 8.25a); these are the cycles in which no instruction can be issued, often referred to as vertical loss.
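The two kinds of loss can be counted directly from a per-cycle issue trace. The trace below is hypothetical, merely in the spirit of Figure 8.25a, and assumes a 4-wide superscalar:

```python
# Classify wasted issue slots in a per-cycle issue trace (data illustrative).
# Each entry is the number of instructions issued in one cycle by a
# 4-wide superscalar processor.
ISSUE_WIDTH = 4
trace = [3, 4, 0, 2, 0, 4, 1]   # hypothetical cycles, two of them empty

# Vertical loss: whole cycles in which nothing at all is issued.
vertical_loss = sum(ISSUE_WIDTH for n in trace if n == 0)
# Horizontal loss: slots left empty in partially filled cycles.
horizontal_loss = sum(ISSUE_WIDTH - n for n in trace if 0 < n < ISSUE_WIDTH)

print("vertical loss:", vertical_loss)      # 8 slots (rows 3 and 5)
print("horizontal loss:", horizontal_loss)  # 6 slots
```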
Schematic diagram of executing multiple threads in a VLIW processor. (a) Traditional VLIW, (b) coarse-grained multithreading VLIW, and (c) fine-grained multithreading VLIW.
• Fine-grained VLIW: This approach provides efficiencies similar to those of fine-grained multithreading on a superscalar architecture. Figure 8.26c depicts a representative example of fine-grained multithreading operation on this type of processor.
The most recent trend in processor design, aimed at effective utilisation of chip area and efficient use of the enormous number of transistors (gates) available within the chip after balancing all the critical design and operational factors, is to use two notable approaches, namely, SHMT and chip multiprocessors (also known as multicore processors). Both approaches enable the parallel, simultaneous execution of multiple threads. These two approaches are discussed in detail in the following sections.
Simultaneous Hardware Multithreaded Processor (SHMT)
Figure 8.27 exhibits an SHMT processor capable of issuing eight instructions at a time (i.e. eight columns in a row). If one thread has a high degree of ILP, it may be able to consume all of the horizontal slots on some cycles. However, on other cycles, instructions from two or more threads may be issued. If enough threads are active, then it would usually be possible to issue the maximum number of instructions in every cycle and thereby yield a high level of processor efficiency.
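An SHMT issue stage can be sketched as a greedy round robin over the threads' ready instructions. The 8-slot width matches Figure 8.27; the per-thread ready counts and the fair round-robin policy are illustrative assumptions:

```python
# Toy SHMT issue stage: each cycle, fill up to ISSUE_WIDTH slots with
# ready instructions drawn from several threads (figures illustrative).
ISSUE_WIDTH = 8

def issue_cycle(ready_per_thread):
    """ready_per_thread[i] = instructions thread i could issue this cycle.
    Returns how many each thread actually issues, shared round-robin."""
    issued = [0] * len(ready_per_thread)
    slots = ISSUE_WIDTH
    progress = True
    # keep cycling over the threads until slots or ready work run out
    while slots > 0 and progress:
        progress = False
        for i, r in enumerate(ready_per_thread):
            if slots > 0 and issued[i] < r:
                issued[i] += 1
                slots -= 1
                progress = True
    return issued

print(issue_cycle([8, 0, 0, 0]))   # one high-ILP thread fills all 8 slots
print(issue_cycle([3, 2, 1, 4]))   # mixed ILP: slots shared across threads
```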
Chip Multiprocessors (Multicore Processors)
Figure 8.28 illustrates a CPU chip consisting of four processor cores; for the sake of convenience, each of these is assumed to be a two-issue superscalar processor (two columns for each of the four cores). Each core is assigned a thread, from which it can issue up to two instructions per cycle. Alternatively, each core could have been built with the SHMT approach in place of a superscalar design. Determining which design strategy is to be exploited in the processor core at the time of CPU fabrication is a critical design trade-off, made after considering many other related issues apart from the primary objectives the processor is intended to meet. Various types of multicore processors are discussed in detail in Section 8.12.
Simultaneous hardware multithreading (SHMT).
Chip multiprocessor (multicore processor).
From the above discussions, it can be concluded that the multithreading approach, when exploited in any form of processor design, provides better processor efficiency than the corresponding design approach with no multithreading. It is found that a chip multiprocessor has an advantage over any form of superscalar processor with the same instruction issue capability, because the horizontal losses will be greater for the superscalar processor. In addition, it is also observed by comparing Figures 8.27 and 8.28 that a chip multiprocessor with the same instruction issue capability as an SHMT processor cannot achieve the same degree of ILP (there exist empty slots, shown shaded, in Figure 8.28). This is due to the fact that the chip multiprocessor is not able to hide latencies by issuing instructions from other threads. That is why adopting a multithreading design within each of the processor cores of a chip multiprocessor (multicore processor) is nowadays a strategic design consideration. Some contemporary processors implement exactly this combined approach.
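The closing comparison can be quantified with a two-line model. The per-thread ready counts are hypothetical; the widths (four 2-issue cores versus one 8-issue SHMT processor) follow the assumptions of Figures 8.27 and 8.28:

```python
# Why a CMP cannot match SHMT utilisation at equal total issue width:
# each core's slots are reserved for its own thread (data illustrative).

def cmp_issued(ready, core_width=2):
    # Four 2-wide cores, one thread per core: a low-ILP thread leaves
    # its core's slots empty, and no other thread may use them.
    return sum(min(r, core_width) for r in ready)

def shmt_issued(ready, total_width=8):
    # One 8-wide SHMT core: any thread's surplus ILP can fill the slots.
    return min(sum(ready), total_width)

ready = [4, 1, 0, 3]          # instructions each thread could issue
print(cmp_issued(ready))      # 2 + 1 + 0 + 2 = 5 slots used
print(shmt_issued(ready))     # min(8, 8) = 8 slots used
```

Under these assumptions the SHMT processor fills all eight slots in the cycle, while the chip multiprocessor wastes three, exactly the latency-hiding gap described above.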