
Thread-Level Parallelism: Multithreading

Instruction-level parallelism (ILP) can be effectively realised by means of pipelining, superpipelining, and then more aggressively by the use of superscalar architecture employing more hardware resources within the processor. To further improve throughput, designers have created even more complex mechanisms, such as the execution of some instructions in an order different from the way they occur in the instruction stream (out-of-order execution). All these architectures, however, extract ILP from a single stream of executing instructions, i.e. a single thread of execution. Such processors suffer from many different types of dependencies among machine instructions, which often limit the amount of parallelism obtained. The dependencies may be resource dependencies, control dependencies caused mostly by the presence of conditional branch instructions, or true data dependencies (RAW). Here, we are not considering WAR and WAW, because they can be handled efficiently using some form of register renaming.

Out of many possible alternatives, one efficient way to minimise the burden of dependencies and thereby yield greater ILP, without much increase in circuit complexity or power consumption, is multithreading. In essence, the instruction stream is divided into several smaller streams of execution, known as threads, such that the threads can be executed in parallel. With appropriate hardware support within the processor, it is possible to combine instructions from multiple independent threads of execution. Such hardware support for multithreading provides the processor with a pool of instructions, in various stages of execution, that have relatively few data dependencies amongst them, since the threads are mostly independent of one another. Moreover, as the critical control dependencies are also separated across a number of threads, less aggressive branch prediction (a method to resolve control dependencies) is needed.

For example, consider a processor with an instruction pipeline consisting of eight stages and with a targeted superscalar performance of four instructions completed in every clock cycle. Now, assume that these instructions come from four independent threads of execution. Then, on average, the number of instructions within the processor at any point of time from one thread would be (4 × 8) ÷ 4 = 8.
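This back-of-envelope calculation can be checked with a small sketch (the numbers are the ones used in the example above):

```python
# Instructions in flight per thread for a multithreaded pipelined processor.
# Assumes the pipeline is kept full: (issue width x pipeline depth) instructions
# are in some stage of execution at any instant, shared evenly across threads.

def in_flight_per_thread(issue_width, pipeline_depth, num_threads):
    """Average number of a single thread's instructions inside the processor."""
    total_in_flight = issue_width * pipeline_depth
    return total_in_flight / num_threads

# The example from the text: 4-wide issue, 8-stage pipeline, 4 threads.
print(in_flight_per_thread(4, 8, 4))  # -> 8.0
```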

One distinct advantage of such hardware-supported multithreading is that pipeline stalls can be effectively utilised. If one thread running in the pipeline comes to a stall, say, while waiting for access to main memory, then another thread that is ready can make use of the corresponding processor clock cycles, which would otherwise be simply wasted. Thus, a hardware-supported multithreading approach eventually emerges as an important latency-hiding technique.
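The latency-hiding effect can be illustrated with a toy simulation; the stall patterns below are invented purely for illustration:

```python
# Toy simulation of latency hiding: when the running thread stalls (e.g. on a
# memory access), a ready thread takes over the freed cycle instead of the
# pipeline idling. Stall patterns here are hypothetical illustrative data.

def simulate(threads, cycles):
    """threads: dict mapping thread name -> set of cycles on which it stalls.
    Returns the number of cycles in which useful work was done."""
    useful = 0
    for c in range(cycles):
        # A cycle is useful if any thread is not stalled; idle only when all are.
        if any(c not in stalls for stalls in threads.values()):
            useful += 1
    return useful

single = simulate({"T0": {2, 3, 4, 7}}, 10)   # T0's stalls waste 4 cycles
multi = simulate({"T0": {2, 3, 4, 7},
                  "T1": {0, 1, 8}}, 10)       # T1 covers T0's stall cycles
print(single, multi)  # -> 6 10
```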

To provide required support for multithreading, the processor must offer a separate program counter for each thread of execution to be executed concurrently. In fact, here the processor designs essentially differ in the amount and the type of additional hardware used to support concurrent thread execution. In general, instruction fetching here takes place in terms of a thread. The processor treats each thread separately and may use a number of techniques for optimizing single-thread execution including branch prediction, out-of-order execution, register renaming, and superscalar approaches. In fact, the thread- level parallelism (TLP), when appropriately combined with ILP is expected to yield an even more improved performance.

The processor design must also include the mechanisms to switch between threads, mostly either on the occurrence of a pipeline stall or while scheduling in a round-robin manner. Similar to the operating system switching between running processes, in this case, the hardware context of a thread within the processor must be preserved. This basically includes the full set of registers (programmable registers and those used in register renaming), program counter, stack pointer, relevant memory map information, interrupt control bits, protection bits, and others. For N-way multithreading support, the processor must hold the thread contexts of N executing threads at a time. When the processor switches, say, from thread X to thread Y, the control logic must ensure that the execution of subsequent instruction(s) occurs with reference to the context of thread Y. It is to be noted that thread contexts need not always be saved and later restored. As long as the processor preserves multiple thread contexts within itself, all that is required is that the processor be able to switch between thread contexts from one cycle to the next.
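A minimal sketch of such resident per-thread contexts follows; the field names and register count are illustrative, not taken from any real ISA:

```python
# Sketch of the per-thread hardware context an N-way multithreaded processor
# keeps resident. Because all N contexts live in hardware at once, a thread
# switch is just a selection change -- nothing is saved to or restored from memory.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    program_counter: int = 0
    stack_pointer: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)
    interrupt_enable: bool = True

class MultithreadedCore:
    def __init__(self, n_ways):
        self.contexts = [ThreadContext() for _ in range(n_ways)]
        self.active = 0  # index of the currently executing thread

    def switch_to(self, thread_id):
        # Selecting another resident context; no memory traffic is needed.
        self.active = thread_id

core = MultithreadedCore(n_ways=4)
core.contexts[1].program_counter = 0x4000
core.switch_to(1)
print(hex(core.contexts[core.active].program_counter))  # -> 0x4000
```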

A wide variety of multithreading designs has been realised in both commercial and experimental systems. Depending mainly on the specific strategy adopted for switching between threads, hardware support for multithreading may be classified into the following four principal approaches:

  • Coarse-grained multithreading: This is also sometimes referred to as block multithreading. Here, the instructions of a thread are executed successively until an event occurs that may cause delay, such as a cache miss; this event then induces a switch to another thread. This approach is, however, found effective mostly on an in-order issue processor.
  • Fine-grained multithreading: This is also known as interleaved multithreading. Here, the processor deals with two or more thread contexts at a time, switching from one thread to another at each clock cycle. If a thread is blocked at any time for some reason, say, due to data dependencies or memory latencies, that thread is skipped and set aside, and a thread that is ready is then executed at once.
  • Simultaneous hardware multithreading (SHMT): Here, instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. This combines the superscalar capability of issuing multiple instructions per cycle with the use of multiple thread contexts.
  • Chip multiprocessing: Here, a number of processors, called cores, are replicated on a single CPU chip (a single piece of silicon, called a die). Typically, each processor core consists of all of the components of an independent processor and, as such, is able to handle separate threads independently. The advantage of this design approach is that the available logic area and the on-hand transistor count are used effectively without venturing too far into ever-increasing pipeline design complexity. This design approach is commonly referred to as multicore design, which is described in detail in the next section.

It is important to note that in the case of the first two approaches mentioned above, i.e. coarse-grained and fine-grained multithreading, instructions from different threads are not executed simultaneously. Instead, the processor is able to switch quickly from one thread to another, using a different set of registers and other context information. This provides better utilisation of processor resources and helps to avoid significant penalties due to cache misses and other latency aspects. The third approach, SHMT, carries out true simultaneous execution of instructions from different threads, using the available replicated execution resources. Chip multiprocessing also provides simultaneous execution of instructions from different threads.
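The switching behaviour of the first two approaches can be sketched as a toy scheduler on a single-issue core; the thread workloads are invented, with 'M' marking an instruction that misses in the cache:

```python
# Toy comparison of coarse-grained vs fine-grained thread switching.
# Each thread is a list of instructions; 'M' marks a long-latency event
# (e.g. a cache miss). Workloads are hypothetical illustrative data.

def coarse_grained(threads):
    """Run one thread until it hits a long-latency event, then switch."""
    order, i = [], 0
    pos = [0] * len(threads)
    while any(p < len(t) for p, t in zip(pos, threads)):
        if pos[i] < len(threads[i]):
            instr = threads[i][pos[i]]
            order.append((i, instr))
            pos[i] += 1
            if instr == "M":              # delay event: switch threads
                i = (i + 1) % len(threads)
        else:
            i = (i + 1) % len(threads)    # thread finished: move on
    return order

def fine_grained(threads):
    """Switch threads every clock cycle, skipping finished threads."""
    order, i = [], 0
    pos = [0] * len(threads)
    while any(p < len(t) for p, t in zip(pos, threads)):
        if pos[i] < len(threads[i]):
            order.append((i, threads[i][pos[i]]))
            pos[i] += 1
        i = (i + 1) % len(threads)
    return order

t = [["a", "M", "b"], ["x", "y"]]
print(coarse_grained(t))  # thread 0 runs until the miss, then thread 1 runs
print(fine_grained(t))    # the two threads alternate every cycle
```

In both policies only one thread issues per cycle, which is exactly the contrast with SHMT drawn in the text.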

In this connection, the following observations are of interest. Figures 8.24 to 8.28 illustrate some of the possible alternatives in pipeline architecture in different types of processors that involve multithreading and contrast these approaches with the corresponding ones that do not use multithreading. In all the figures, each horizontal row represents the potential issue slot or slots for a single execution cycle; that is, the width of each row corresponds to the maximum number of instructions that can be issued in a single clock cycle. (Issue slots are defined as the positions from which instructions can be issued in a given clock cycle. Instruction issue is the process of initiating execution of instructions in the processor's functional units; it occurs when an instruction moves from the decoding stage of the pipeline to the first execution stage.) The vertical dimension represents the time sequence of clock cycles. An empty (shaded) slot represents an unused execution slot, and an N indicates a NOP.

Scalar Processor

Figure 8.24 shows different approaches with a scalar (single-issue) processor:

  • Single-threaded scalar: This is the simple pipeline found in traditional CISC and RISC machines, as depicted in Figure 8.24a, with a single thread of execution, i.e. with no multithreading.
  • Coarse-grained multithreading: In this case, a single thread is executed until a latency event occurs that would stop the operation of the executing thread and the pipeline, at which time the processor switches to another thread that is ready for execution. This is shown in Figure 8.24b.
  • Fine-grained multithreading: This is possibly the easiest approach to implementing multithreading. Here, the processor switches from one thread to another at each clock cycle, thereby attempting to keep the pipeline stages fully occupied or close to fully occupied. The processor hardware is adequately equipped to bring about such switching from one thread context to another between cycles. This is illustrated in Figure 8.24c.

FIGURE 8.24

Executing multiple threads in a scalar pipelined processor: (a) single-threaded scalar, (b) coarse-grained multithreading scalar, and (c) fine-grained multithreading scalar.

It is interesting to observe a difference between the two forms of multithreading depicted in Figures 8.24b and 8.24c. Figure 8.24b shows a situation in which the time to perform a thread switch is one cycle (shown as a shaded portion), whereas Figure 8.24c indicates that thread switching occurs with zero cycles of overhead. This is because fine-grained multithreading assumes that there are no control or data dependencies between threads, which not only simplifies the pipeline design but also allows a thread switch to occur without any delay. On the other hand, coarse-grained multithreading, depending on the specific design and implementation, may require a clock cycle to perform a thread switch, as shown in Figure 8.24b. This happens if the fetched instruction that triggers the thread switch must then be thrown out of the pipeline.
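The effect of the switch penalty can be quantified with a one-line model; the workload figures are invented for illustration:

```python
# Thread-switch cost on a single-issue pipeline: coarse-grained switching may
# cost one bubble cycle (the triggering instruction is flushed), fine-grained
# switching costs none. Workload numbers are hypothetical.

def cycles_to_run(n_instructions, n_switches, switch_cost):
    """Total cycles to complete the work, ignoring all other stalls."""
    return n_instructions + n_switches * switch_cost

work, switches = 100, 10
print(cycles_to_run(work, switches, switch_cost=1))  # coarse-grained -> 110
print(cycles_to_run(work, switches, switch_cost=0))  # fine-grained   -> 100
```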

While fine-grained multithreading appears to offer better processor utilisation than its counterpart, coarse-grained multithreading, it does so at the cost of single-thread performance. With multiple threads executing, there may also be contention for cache resources amongst the threads, which may eventually culminate in cache misses for a given thread.

For parallel execution, more options and opportunities can be obtained if the processor is able to issue multiple instructions per cycle. A number of principal alternative approaches in processor design are illustrated in the following section. In each case, it is presumed that the processor hardware is capable of issuing four instructions per clock cycle. However, in all these cases, only instructions from a single thread are issued in a single cycle. The popular alternatives are as follows.

Superscalar Processor

• Traditional superscalar: This approach is the primary form of superscalar processor with a single thread of execution that we have already discussed in previous sections, i.e. without multithreading. This is depicted in Figure 8.25a. Here, it is presumed that the single-threaded superscalar processor hardware is capable of issuing four instructions per clock cycle (i.e. each row contains four columns).

FIGURE 8.25

Approaches to executing multiple threads in superscalar pipelined processors: (a) single-threaded superscalar, (b) coarse-grained multithreading superscalar, and (c) fine-grained multithreading superscalar.

This architecture was probably the most powerful approach to providing parallelism within the processor in the relatively recent past. It is to be noted that during some cycles, not all of the available issue slots are occupied (e.g., the first row of Figure 8.25a). During these cycles, the number of instructions issued is less than the maximum for various reasons; this situation is referred to as horizontal loss. In some other cycles, no issue slots are used at all (the third and fifth rows of Figure 8.25a). These are the cycles when no instruction can be issued; this is often referred to as vertical loss.
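The two kinds of loss can be counted mechanically from an issue-slot grid like those in the figures; the pattern below is made up, not copied from any figure:

```python
# Counting horizontal and vertical losses from an issue-slot grid.
# Each row is one cycle; 1 = slot used, 0 = slot empty. The grid is a
# hypothetical example pattern.

def issue_losses(grid):
    width = len(grid[0])
    vertical = sum(1 for row in grid if not any(row))               # whole cycle idle
    horizontal = sum(width - sum(row) for row in grid if any(row))  # cycle partly idle
    return horizontal, vertical

grid = [
    [1, 1, 0, 0],  # 2 slots wasted -> horizontal loss
    [1, 1, 1, 1],  # full issue
    [0, 0, 0, 0],  # nothing issued -> vertical loss
    [1, 0, 0, 0],  # 3 slots wasted -> horizontal loss
    [0, 0, 0, 0],  # nothing issued -> vertical loss
]
h, v = issue_losses(grid)
print(h, v)  # -> 5 2
```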

  • Coarse-grained multithreaded superscalar: In this approach, during each cycle, as many instructions as possible are issued from only one thread, and coarse-grained multithreading (as already discussed) is used. This is shown in Figure 8.25b.
  • Fine-grained multithreaded superscalar: Here too, during each cycle, as many instructions as possible are issued from a single thread. In this approach, potential delays due to thread switches are eliminated, as already mentioned. However, the number of instructions that can be issued in any given cycle is again limited by dependencies that exist within any given thread. This is depicted in Figure 8.25c.

VLIW Processor

  • Traditional VLIW: A VLIW architecture, such as that of the Intel Itanium 64-bit processors (IA-64), already described in detail in Section 8.10, places multiple instructions in a single word. Typically, a VLIW is realised essentially by the compiler, which places operations that may be executed in parallel in the same word. In a simple VLIW processor, as shown in Figure 8.26a, if it is not possible to completely fill the word with instructions to be issued in parallel, then NOPs are used in the remaining slots of the VLIW word.
  • Coarse-grained VLIW: This approach should provide efficiencies similar to those of coarse-grained multithreading on a superscalar architecture. Figure 8.26b illustrates a representative pattern of coarse-grained multithreading operation on this type of processor.

FIGURE 8.26

Schematic diagram of executing multiple threads in a VLIW processor: (a) traditional VLIW, (b) coarse-grained multithreading VLIW, and (c) fine-grained multithreading VLIW.

  • Fine-grained VLIW: This approach should provide efficiencies similar to those of fine-grained multithreading on a superscalar architecture. Figure 8.26c depicts a representative example of fine-grained multithreading operation on this type of processor.
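The NOP padding described for the traditional VLIW case can be sketched as follows; the operations and the word width are invented for illustration:

```python
# Sketch of how a VLIW compiler pads an instruction word: operations found to
# be independent share one word, and leftover slots are filled with NOPs.
# The operation strings and 4-slot width are hypothetical.

def pack_vliw_word(parallel_ops, width=4):
    """Place up to `width` independent operations in one word, NOP-padding."""
    assert len(parallel_ops) <= width, "too many ops for one word"
    return parallel_ops + ["NOP"] * (width - len(parallel_ops))

# The compiler found only three independent ops; the fourth slot becomes a NOP.
word = pack_vliw_word(["add r1,r2,r3", "mul r4,r5,r6", "ld r7,[r8]"])
print(word)  # -> ['add r1,r2,r3', 'mul r4,r5,r6', 'ld r7,[r8]', 'NOP']
```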

The most recent trend in processor design, aimed at effective utilisation of chip area and efficient use of the enormous number of transistors (gates) available within the chip, after balancing all the critical design and operational factors, is to use two notable approaches: SHMT and chip multiprocessing (also known as multicore processors). Both approaches enable the parallel, simultaneous execution of multiple threads. These two approaches are discussed in detail in the following sections.

Simultaneous Hardware Multithreaded Processor (SHMT)

Figure 8.27 exhibits an SHMT processor capable of issuing eight instructions at a time (i.e. eight columns in a row). If one thread has a high degree of ILP, it may be able to consume all of the horizontal slots in some cycles. In other cycles, instructions from two or more threads may be issued. If enough threads are active, it would usually be possible to issue the maximum number of instructions in every cycle, thereby yielding a high level of processor efficiency.
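A toy issue stage makes this slot-filling behaviour concrete; the per-thread ready counts are invented for illustration:

```python
# Toy SHMT issue stage: each cycle, fill up to 8 issue slots with ready
# instructions drawn from several threads. Ready counts are hypothetical.

def shmt_issue(ready_per_thread, width=8):
    """Greedily fill `width` slots from the threads' ready instructions.
    Returns the list of thread ids occupying the slots this cycle."""
    slots, remaining = [], dict(ready_per_thread)
    for tid in sorted(remaining):
        take = min(remaining[tid], width - len(slots))
        slots.extend([tid] * take)
        if len(slots) == width:
            break
    return slots

# A cycle where thread 0 alone has high ILP, and one where threads share slots.
print(shmt_issue({0: 8, 1: 3}))        # thread 0 fills all eight slots
print(shmt_issue({0: 3, 1: 2, 2: 4}))  # slots shared: [0, 0, 0, 1, 1, 2, 2, 2]
```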

Chip Multiprocessors (Multicore Processors)

Figure 8.28 illustrates a CPU chip consisting of four processor cores, each of which is assumed, for convenience, to be a two-issue superscalar processor (two columns for each of the four cores). Each core is assigned a thread, from which it can issue up to two instructions per cycle. Each processor core could alternatively have been built with the SHMT approach in place of a superscalar design. Determining which design strategy is to be exploited in the processor core at the time of CPU fabrication is a critical design trade-off, made after considering many other related important issues apart from the primary objectives that the

FIGURE 8.27

Simultaneous hardware multithreading (SHMT).

FIGURE 8.28

Chip multiprocessor (multicore processor).

processor is intended to meet. Various types of multicore processors are discussed in detail in Section 8.12.

From the above discussion, it can be concluded that the multithreading approach, exploited in any form of processor design, provides better processor efficiency than the corresponding design with no multithreading. A chip multiprocessor is superior to any form of superscalar processor with the same instruction issue capability, because the horizontal losses will be greater for the superscalar processor. In addition, comparing Figures 8.27 and 8.28 shows that a chip multiprocessor with the same instruction issue capability as an SHMT processor cannot achieve the same degree of ILP (there exist empty slots, shown shaded, in Figure 8.28). This is because the chip multiprocessor is not able to hide latencies by issuing instructions from other threads. That is why a multithreading design approach within each processor core of a chip multiprocessor (multicore processor) is nowadays a strategic design consideration. Some contemporary processors implement this design approach.
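The comparison between Figures 8.27 and 8.28 can be modelled for a single cycle; the per-thread ILP values below are invented:

```python
# Why an 8-slot SHMT core can out-issue four 2-issue cores of the same total
# width: SHMT can hand a stalled thread's slots to another thread, while a
# chip multiprocessor's slots stay bound to their own thread. ILP values
# per thread in this cycle are hypothetical.

def cmp_issued(ilp_per_thread, core_width=2):
    """Chip multiprocessor: each thread is pinned to one core, so it gets at
    most core_width issue slots, even if other cores sit idle."""
    return sum(min(ilp, core_width) for ilp in ilp_per_thread)

def shmt_issued(ilp_per_thread, total_width=8):
    """SHMT: all threads share one wide issue stage."""
    return min(sum(ilp_per_thread), total_width)

# One cycle: thread 0 has lots of ILP, thread 2 is stalled on memory.
ilp = [5, 1, 0, 1]
print(cmp_issued(ilp))   # -> 4 (thread 0 capped at 2; thread 2's core idles)
print(shmt_issued(ilp))  # -> 7 (thread 0's extra ILP fills the freed slots)
```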
