
Multipipeline Scheduling

The instruction-issue policy schedules the order in which instructions are fetched and subsequently completed so as to maximise the utilisation of the various pipeline elements, thereby substantially enhancing the performance of a superscalar processor. Although these policies always attempt to preserve the order of the instructions in the user's program as far as possible, they may sometimes need to alter the orderings (fetch ordering, execution ordering, etc.) with respect to the ordering found in a strict sequential execution in order to keep most of the pipeline functional elements busy. In doing so, they also ensure that the result is correct while nullifying the adverse effects of the various dependencies and conflicts (as already discussed) that may arise (Sima, D.).

When instructions are issued in program order, it is called an in-order issue; otherwise, it is an out-of-order issue. Likewise, if the instructions are completed in program order, it is called an in-order completion; otherwise, an out-of-order completion results. An in-order issue is comparatively easier to implement but may not yield optimal performance. Moreover, an in-order issue may end in either an in-order or an out-of-order completion. An out-of-order issue, however, usually results in an out-of-order completion. In general, instruction-issue policies can be broadly grouped into the following categories:

i. In-order issue with in-order completion

ii. In-order issue with out-of-order completion

iii. Out-of-order issue with out-of-order completion.

In-order issue with in-order completion is the simplest policy to implement, but it is seldom used, as it is difficult to maintain both the order of issue and the order of completion even in a conventional scalar processor. In-order issue with out-of-order completion requires more complex instruction-issue logic, but it has the advantage that out-of-order completion allows any number of instructions to be in the execution stages at any point of time, up to, of course, the maximum degree of machine parallelism. This approach is found in both scalar and superscalar processors. In-order issue of instructions, as usual, causes execution to stall when there is a resource conflict, a data dependency, or a procedural dependency; the last of these is particularly difficult to deal with, especially at the time of interrupt servicing and exception handling. Out-of-order issue with out-of-order completion has several distinct advantages, but it also suffers from the constraints already described, including output dependence and antidependence, which arise mainly from storage conflicts. To maximise the use of registers, register renaming (duplication of resources) is used here, and more functional units are also added. Although this policy gives the processor more freedom to exploit parallelism, thereby offering enhanced performance, it is still very expensive to realise optimal scheduling. The Intel Pentium Pro, and the subsequent processors from Intel, implemented this technique in their architecture.
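To make the role of register renaming concrete, the following Python fragment is a minimal sketch of a renamer, not the scheme of any particular processor; the three-operand instruction format, the free list of physical registers, and the sample program are assumptions made only for illustration.

    # Minimal sketch of register renaming: each architectural destination
    # register is mapped to a fresh physical register, so WAR and WAW
    # conflicts on architectural names disappear, leaving only the true
    # (RAW) dependencies carried through the latest mapping.

    def rename(instructions, num_phys_regs=32):
        """instructions: list of (dest, src1, src2) architectural register names."""
        mapping = {}                                  # architectural -> current physical register
        free = [f"p{i}" for i in range(num_phys_regs)]
        renamed = []
        for dest, src1, src2 in instructions:
            # Sources read the latest physical copy (preserves true dependencies).
            ps1 = mapping.get(src1, src1)
            ps2 = mapping.get(src2, src2)
            # The destination gets a brand-new physical register (removes WAR/WAW).
            pd = free.pop(0)
            mapping[dest] = pd
            renamed.append((pd, ps1, ps2))
        return renamed

    # Example: r1 is written twice (output dependence) and read in between
    # (antidependence); after renaming, the two writes go to different
    # physical registers, so they no longer conflict.
    prog = [("r1", "r2", "r3"),   # r1 <- r2 op r3
            ("r4", "r1", "r5"),   # r4 <- r1 op r5  (true dependence on r1)
            ("r1", "r6", "r7")]   # r1 <- r6 op r7  (WAW with first, WAR with second)
    print(rename(prog))

After renaming, only the true dependence (the second instruction reading the value produced by the first) remains; the storage conflicts on r1 have been eliminated by giving each write its own physical register.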

Control parallelism, as obtained from pipelining or the use of multiple functional units, is once again limited by the pipeline length and by the degree of multiplicity of the functional units. However, both pipelining and functional parallelism are handled by the hardware automatically, requiring no special software action to activate them.

Superscalar designs exploit spatial parallelism by duplicating hardware resources and hence are well served by CMOS technology, which provides the needed storage space but requires more transistors in the circuit design, inevitably compromising on clock rates (the speed parameter). An ideal superscalar processor should have simple data-dependence checking, a small lookahead window, and an optimising compiler, with provision to implement register-renaming techniques along with a scoreboarding mechanism, so as to extract maximum ILP from the available hardwired parallel pipelines.
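As an illustration of what data-dependence checking over a small lookahead window involves, the Python sketch below classifies the register dependencies (RAW, WAR, WAW) between instructions in such a window. The window representation and the per-instruction register sets are assumptions made purely for the example; real processors perform these comparisons in dedicated issue-stage logic rather than in software.

    # Sketch of the dependence checking a superscalar issue stage performs
    # on the instructions inside its lookahead window.

    def classify_dependencies(window):
        """window: list of dicts with 'reads' and 'writes' register sets,
        given in program order.  Returns (i, j, kind) for each dependent pair."""
        deps = []
        for i, early in enumerate(window):
            for j in range(i + 1, len(window)):
                late = window[j]
                if early["writes"] & late["reads"]:
                    deps.append((i, j, "RAW (true dependence)"))
                if early["reads"] & late["writes"]:
                    deps.append((i, j, "WAR (antidependence)"))
                if early["writes"] & late["writes"]:
                    deps.append((i, j, "WAW (output dependence)"))
        return deps

    window = [
        {"reads": {"r2", "r3"}, "writes": {"r1"}},   # i0: r1 <- r2 op r3
        {"reads": {"r1", "r5"}, "writes": {"r4"}},   # i1: r4 <- r1 op r5
        {"reads": {"r6"},       "writes": {"r1"}},   # i2: r1 <- op r6
    ]
    for i, j, kind in classify_dependencies(window):
        print(f"i{i} -> i{j}: {kind}")

Only the RAW pair must be honoured by the schedule; the WAR and WAW pairs are exactly the storage conflicts that register renaming, as sketched earlier, removes.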

A brief description of this topic, covering each of the three cases (i), (ii), and (iii), has been provided separately with appropriate figures on the website: http://routledge.com/9780367255732.

Superscalar Performance

With an ordinary scalar pipeline machine having k stages, the minimum time required to execute N independent instructions (assuming the pipeline cycle time for the completion of each stage is one clock cycle) is

    T(1, 1) = k + N - 1 base cycles

With a superscalar machine of m issues having the same k stages in the pipeline (assuming the base pipeline cycle time for the completion of each stage is one clock cycle), the minimum time required to execute the same N instructions is

    T(m, 1) = k + (N - m)/m base cycles

The second term corresponds to the time required to execute the remaining N - m instructions through m pipelines at the rate of m instructions per cycle. Thus, the ideal speed-up gained by a superscalar machine over the base machine is

    S(m, 1) = T(1, 1) / T(m, 1) = (k + N - 1) / [k + (N - m)/m] = m(N + k - 1) / [N + m(k - 1)]

It is obvious that the speed-up S(m, 1) → m as N → ∞.
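This limit can be checked numerically; the short Python sketch below evaluates the expressions above for arbitrary example values of k, m, and N (the particular numbers are assumptions chosen only to show the trend).

    # Numerical check of the speed-up expression above.

    def t_scalar(k, n):
        return k + n - 1                  # base pipeline: k cycles to fill, then one instruction per cycle

    def t_superscalar(k, m, n):
        return k + (n - m) / m            # first m issue with the fill, then m instructions per cycle

    def speedup(k, m, n):
        return t_scalar(k, n) / t_superscalar(k, m, n)

    k, m = 5, 4                           # example pipeline depth and issue width
    for n in (16, 256, 4096, 1_000_000):
        print(f"N={n:>9}: S(m,1) = {speedup(k, m, n):.3f}")
    # As N grows, the speed-up approaches m = 4, as stated above.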

Superscalar Processors: Key Factors

Based on the discussions presented, it can be inferred that the superscalar operation requires some specific hardware support along with a few predefined policies and mechanisms to exploit ILP hidden in the program it executes. Some of the key factors that must be taken into consideration while implementing this approach are as follows:

  • Appropriate mechanisms for initiating, or issuing, multiple instructions in parallel.
  • Instruction-fetching strategies that simultaneously fetch multiple instructions, which require the presence of specific pipeline fetch and decode stages capable of implementing the chalked-out strategies.
  • Suitable techniques to implement branch prediction logic (control dependencies) so as to reduce the negative impact of branch instructions on pipeline efficiency (a minimal predictor sketch follows this list).
  • Appropriate logic for determining true data dependencies involving register values among the operands of the actively executing instructions, so that conflicting use of registers is avoided.
  • Availability of sufficient hardware resources to carry out parallel execution of the multiple instructions thus fetched, including multiple pipelined functional units and befitting memory hierarchies capable of servicing several concurrent memory references simultaneously.
  • Appropriate ordering of the instructions of the program (submitted for execution) in an order other than that specified by the program, without damaging the semantics of the executing program, for the sake of smooth execution of the instructions to improve the CPU's performance.
  • Suitable mechanisms to keep track of process states and appropriate measures to keep the process states in correct order.
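As a concrete illustration of the branch prediction logic mentioned in the list, the following Python sketch implements a 2-bit saturating-counter predictor, one common textbook scheme rather than the mechanism of any particular processor; the table size, the branch address, and the outcome stream are assumptions made only for the example.

    # Minimal sketch of a 2-bit saturating-counter branch predictor.
    # Counter states: 0-1 predict not taken, 2-3 predict taken; each
    # outcome nudges the counter one step toward the observed direction.

    class TwoBitPredictor:
        def __init__(self, entries=1024):
            self.counters = [1] * entries          # start weakly not-taken
            self.entries = entries

        def predict(self, pc):
            return self.counters[pc % self.entries] >= 2   # True = predict taken

        def update(self, pc, taken):
            i = pc % self.entries
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)

    # A loop-closing branch at one address: taken nine times, then not taken once.
    predictor = TwoBitPredictor()
    outcomes = [True] * 9 + [False]
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(0x400) == taken
        predictor.update(0x400, taken)
    print(f"correct predictions: {hits}/{len(outcomes)}")

The two-bit hysteresis is what lets the predictor survive the single not-taken exit of a loop without immediately mispredicting the next iteration of the same loop.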

Implementation: Superscalar Processors

Although the implementation of superscalar architecture was originally tied to, and fully exploited in, various ways befitting the RISC philosophy, its principles were later used in CISC machines as well. The first commercially available superscalar processor, introduced sometime in 1989, was the Intel 80960CA (RISC), operating at a 25 MHz clock rate; later came its enhanced version, the 80960 MM. Neither of these was actually a full-fledged stand-alone processor; they were used as an integrated part of a master equipment, mostly intended for real-time embedded system control and multiprocessor applications. The Intel Pentium OverDrive (P24T) processor was probably the first in the CISC line of processors with built-in two-issue superscalar technology, providing dual integer-processing units and using a branch prediction mechanism for improved performance. From then on, superscalar technology was continuously nurtured by Intel, and at last, sometime in 1995, Intel launched the Pentium Pro (also known as the P6 processor) with a full-fledged, true superscalar design. After that, each of the subsequent versions of the Pentium, up to the arrival of the current one, the Pentium 4, has gone through constant enhancement and more and more fine-tuning to further improve the underlying superscalar design.

Motorola implemented this superscalar architecture in the design of its top-of-the-line CISC processor, the 68040, and thereafter in the latest member of its 68000 CISC family, the 68060 processor, with clock rates ranging from 50 to 75 MHz. The Motorola RISC 88110 was an early superscalar processor designed with 1.3 million transistors in a 299-pin package and driven by a 50-MHz clock. It was essentially a single-chip implementation of the earlier three-chip set comprising a CPU chip (88100) and two cache chips (88200).

The MIPS R10000, introduced in 1996, uses the 64-bit MIPS IV architecture, fully compatible with its older 32-bit R2000/3000 series, and is one of the most powerful single-chip superscalar microprocessors. This design uses around 6.7-6.8 million transistors operating at a clock frequency of 200 MHz, issuing four instructions per clock cycle under an out-of-order issue and out-of-order completion policy, which ultimately offers a net throughput of 800 million instructions per second. The PA-RISC 8500 introduced by Hewlett-Packard (HP) is one of the most powerful processors and is comparatively superior to most of the contemporary superscalar processors released by different vendors. All the useful and prominent features of a superscalar processor have been realised in a 0.25 μm fabrication process using a total transistor count of over 120 million, which ultimately permits an increase in the system clock frequency and creates enough room for on-chip caches. The beauty of this architecture is that all of it is achieved in a single, simple hardware structure. The UltraSPARC is a superscalar processor (which is also superpipelined and is discussed in the next section) enriched with many salient features. The UltraSPARC III has been fabricated with 0.18 μm technology, has a clock speed in the range of 750-900 MHz, and is targeted to attain around 1.5 GHz in the forthcoming days. This chip is most suitable for use in multiprocessor configurations, with a provision for attaching hundreds of such processors. The IBM RISC System/6000 is a RISC-like superscalar system launched in 1990, built on the ideas of the IBM 801 system and the PC/RT architecture. It places heavy demands on instruction and data flow between memory, registers, and the functional units.

A brief description of each of the processors mentioned above has been provided separately, with appropriate figures, in 8.8.5.1-8.8.5.7 on the website: http://routledge.com/9780367255732.
