II. High-End Processor Organisation
Computer architecture is a joint venture of its hardware organisation and its supporting software and programming facilities. The performance of a computer system results from the coordinated effort of its hardware resources and its system and application software. The hardware core consists of processors, memory, peripheral devices, and the interconnecting buses. Software interfaces with this hardware and exploits its potential to improve the speed, performance, and portability of user programs across different machine architectures. After the release of von Neumann's architectural design concept, computers were built around these fundamental resources as sequential machines operating on scalar data. This design has since gone through a series of evolutionary changes to improve performance, but it could not cross a certain limit because of its design constraints and the obligation to execute the instructions of a program sequentially. Following this conventional path, computer performance has kept increasing through faster hardware technologies and improved processor design, and it is steadily approaching its physical limit.
Many important computational problems still remain beyond the capabilities of the fastest contemporary machines, even after the capacity and speed of their resources have been increased. One way to handle this issue is to exploit functional parallelism, which can be realised in two ways. One possibility is to build computers using multiple functional units, perhaps hundreds of low-cost processors (or processing elements) and their allied circuits, that work in parallel on common tasks. This is known as processor-level parallelism. Another possibility is to speed up a single processor by arranging the hardware so that more than one operation can be performed at the same time. This is called instruction-level parallelism. It increases the number of operations performed per unit time and so yields a substantial speedup, although no single instruction is executed in less than its predefined allotted time. This approach, in other words, encourages the practice of pipelining at various processing levels.
The basic concept of a pipeline is very simple. A pipeline is similar to an assembly-line operation used in manufacturing plants. Henry Ford introduced the moving assembly line in 1913 to build cars in stages. In an automobile assembly line, many steps are followed to build a car. Each step (e.g., preparing the chassis, installing the engine, adding the body, etc.) contributes something to the production of the car, and each step operates in parallel with the other steps, each working on a different car. This means that while the second group of workers is installing the engine on a chassis already prepared by the first group, the third group is adding the body to another car whose chassis and engine are already in place. At the same time, the first group has started work on the chassis of a new car. As a result, new cars roll off the assembly line in quick succession. Some ideas have stood the test of time and can be applied equally well in many different ways in diverse environments. The assembly-line idea, in particular, has been implemented in processor design in the form of a pipeline (Kogge, P. M.).
Pipeline Approach: Instruction-Level Parallelism
While executing a program, the processor follows an instruction cycle (fetch-decode-execute) for each instruction, one after another. The pipeline technique splits this sequential fetch-decode-execute process into suboperations. Each suboperation is then executed in a dedicated segment (stage) that operates concurrently with all the other segments in the line, each of which executes a different suboperation on a different instruction. Pipelining is thus essentially an implementation technique whereby the execution of multiple instructions can be overlapped. It gives rise to a form of parallelism, but only at the instruction level, and is therefore legitimately called virtual parallelism.
A linear pipeline is visualised as a collection (cascade) of processing segments; each segment completes one part of an instruction's execution, in line with the way the task has been partitioned. The result computed in each segment is then passed to the next segment in the pipeline. Instructions enter the pipeline at one end, progress through the stages (segments), and usually, though not necessarily, exit at the other end, just as cars move along an assembly line. The pipelines employed in processor design may be of various types, as we will see in later sections. Whatever the type, in this pipeline architecture only one instruction is issued to the pipeline in each clock cycle. That is why this pipeline is sometimes referred to as a scalar pipeline.
A characteristic of pipelines is that several different computations can be in progress in different segments with different instructions at the same time. The overlapping of computation is made possible by associating a register (buffer) with each segment in the pipeline. The register provides isolation between (adjacent) segments so that each can operate on distinct data simultaneously.
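The overlap described above can be sketched as a small space-time table. This is only an illustrative sketch: the function names `pipeline_schedule` and `format_schedule` are hypothetical, and the model assumes that one instruction enters the pipeline per clock cycle, that every segment takes exactly one cycle, and that no stalls occur.

```python
def pipeline_schedule(num_instructions, num_stages):
    """Return a table whose rows are clock cycles and whose columns are segments.

    Entry [c][s] is the 1-based number of the instruction occupying segment s
    during cycle c, or None if that segment is idle in that cycle.
    Assumption: one instruction enters per cycle and nothing ever stalls.
    """
    total_cycles = num_stages + num_instructions - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            # Instruction i (0-based) reaches segment s in cycle i + s.
            instr = cycle - stage
            row.append(instr + 1 if 0 <= instr < num_instructions else None)
        table.append(row)
    return table


def format_schedule(table):
    """Render the table as text, one line per clock cycle."""
    lines = []
    for cycle, row in enumerate(table, start=1):
        cells = " ".join(f"I{i}" if i is not None else "--" for i in row)
        lines.append(f"cycle {cycle}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Four instructions flowing through five segments.
    print(format_schedule(pipeline_schedule(num_instructions=4, num_stages=5)))
```

Each row of the table corresponds to the contents of the inter-segment registers during one clock cycle: every segment holds a different instruction, which is exactly the isolation the buffers are there to provide.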
To demonstrate the principle of pipelining, we use, for the sake of simplicity, an instruction that can be implemented in at most five clock cycles. The same principle can be applied to more complex (CISC-like) instruction sets as well as to RISC-like ones, although the resulting pipelines would then naturally be more complex. The five clock cycles are described as follows.
Figure 8.1: A five-segment pipeline.
A CPU as shown in Figure 8.1 would then comprise five processing units, P1-P5, each of which is assumed to take one clock cycle to finish its task. The stages of execution are then as follows:
1. Instruction Fetch Cycle (P1): IR ← MEM(PC), PC ← PC + 1
The content of the PC is the address of the instruction to be fetched from memory into the instruction register (IR). The content of the PC will then be incremented, and the new content of the PC will hold the address of the next sequential instruction needed in subsequent clock cycles.
2. Instruction Decode/Analysis/Register Fetch Cycle (P2)
The instruction now held in the IR is decoded, and the registers it names are read into two temporary registers (the operands) for use in later clock cycles. Decoding proceeds in parallel with reading the registers, which is possible because the register fields sit at fixed locations in the instruction format. This technique is usually known as fixed-field decoding.
3. Effective Address Calculation Cycle (P3)
The effective address of the operand is now computed, whatever the type of instruction (register-register, register-immediate, branch, etc.), and is placed in the ALU output register.
4. Memory Access/Data Fetch/Branch Completion Cycle (P4)
The operand address obtained in the preceding cycle is used to access memory, if needed. For a load, the data returns from memory and is placed in the LMD (Load Memory Data)/MBR (Memory Buffer Register) register; for a store, the data is written into memory. For a branch, the PC is either replaced with the branch target address held in the ALU output register or left as already incremented in step 1, so that it points to the next sequential instruction.
5. Instruction Execution/Write Back Cycle (P5)
The instruction now completes: the result is written into the register file, whether it comes from the memory system (held in LMD) or from the ALU (held in the ALU output register).
The above descriptions (P1-P5) show how an instruction flows through the datapath. At the end of each clock cycle, every value computed during that clock cycle and required for a later clock cycle (whether for the current instruction or for the next instruction) is written into a storage device, which may be memory, a general-purpose register, the PC, or a temporary register. The temporary registers hold values between clock cycles for the current instruction while the other storage elements hold values between successive instructions.
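As a rough illustration of steps P1-P5, the following sketch walks one instruction at a time through a toy datapath. Everything about the toy machine is an assumption made for this example and is not taken from the text: the tuple instruction format `(op, rd, rs, x)`, the four opcodes `LOAD`, `ADD`, `STORE` and `BEQZ`, and the helper names `run_instruction` and `run_program`. It is also assumed that the ALU output register receives the ALU result of a register-register instruction in step P3, alongside effective addresses and branch targets. The sketch runs the steps sequentially and does not model overlap between instructions.

```python
def run_instruction(state):
    """Walk one instruction through steps P1-P5, updating `state` in place.

    Hypothetical instruction format: (op, rd, rs, x), where x is a register
    number for ADD and an immediate (displacement or offset) otherwise.
    """
    regs, mem, prog = state["regs"], state["mem"], state["prog"]

    # P1: instruction fetch. IR <- MEM(PC), PC <- PC + 1
    ir = prog[state["pc"]]
    state["pc"] += 1

    # P2: decode and register fetch into the temporary registers A and B.
    op, rd, rs, x = ir
    a = regs[rs]
    b = None
    if op == "ADD":
        b = regs[x]       # second source operand
    elif op == "STORE":
        b = regs[rd]      # value that will be written to memory

    # P3: the ALU fills the ALU output register.
    if op in ("LOAD", "STORE"):
        alu_output = a + x                # effective address: base + displacement
    elif op == "BEQZ":
        alu_output = state["pc"] + x      # target relative to the incremented PC
    else:  # ADD
        alu_output = a + b                # assumed: ALU result is produced here

    # P4: memory access or branch completion.
    lmd = None
    if op == "LOAD":
        lmd = mem[alu_output]             # data returns into LMD
    elif op == "STORE":
        mem[alu_output] = b
    elif op == "BEQZ" and a == 0:
        state["pc"] = alu_output          # branch taken: replace the PC

    # P5: write back to the register file from LMD or the ALU output register.
    if op == "LOAD":
        regs[rd] = lmd
    elif op == "ADD":
        regs[rd] = alu_output


def run_program(state, steps):
    """Execute `steps` instructions one after another (no overlap)."""
    for _ in range(steps):
        run_instruction(state)
    return state


if __name__ == "__main__":
    state = {
        "pc": 0,
        "regs": [0, 0, 0, 0],
        "mem": [0, 0, 21, 0],
        "prog": [
            ("LOAD", 1, 0, 2),      # R1 <- MEM(R0 + 2)
            ("ADD", 2, 1, 1),       # R2 <- R1 + R1
            ("STORE", 2, 0, 3),     # MEM(R0 + 3) <- R2
            ("BEQZ", None, 0, -4),  # if R0 == 0: PC <- PC + (-4)
        ],
    }
    run_program(state, steps=4)
    print(state["regs"], state["mem"], state["pc"])
```

The local variables `ir`, `a`, `b`, `alu_output` and `lmd` play the role of the temporary registers that hold values between clock cycles for the current instruction, while `regs`, `mem` and `pc` carry values from one instruction to the next.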
The pipelining approach increases the CPU instruction throughput, meaning the number of instructions completed per unit of time, but it does not reduce the execution time of an individual instruction as a whole. In fact, it usually increases the actual execution time slightly for each instruction due to the overheads being paid for controlling the pipeline. The increase in instruction throughput signifies that a program runs relatively faster and has lower total execution time, even though no single instruction runs faster.
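A small calculation makes both claims concrete. The sketch below assumes an idealised pipeline with `k` equal segments and no stalls, in which `n` instructions complete in `k + n - 1` clock cycles instead of `n * k`. The function names, the per-segment time `stage_time`, and the per-cycle control overhead `overhead` are hypothetical parameters introduced only for this example.

```python
def unpipelined_time(n, k, stage_time):
    """Total time when each instruction runs all k steps before the next starts."""
    return n * k * stage_time


def pipelined_time(n, k, stage_time, overhead):
    """Total time in an ideal k-segment pipeline with no stalls.

    The first instruction needs k cycles; each later one completes one cycle
    after its predecessor, giving k + (n - 1) cycles in all. Each cycle is
    lengthened by `overhead` to pay for pipeline control (assumed constant).
    """
    return (k + n - 1) * (stage_time + overhead)


def instruction_latency(k, stage_time, overhead=0):
    """Time for a single instruction to pass through all k steps."""
    return k * (stage_time + overhead)


if __name__ == "__main__":
    n, k = 100, 5              # hypothetical workload and pipeline depth
    stage_time, overhead = 10, 1   # hypothetical times in nanoseconds

    slow = unpipelined_time(n, k, stage_time)
    fast = pipelined_time(n, k, stage_time, overhead)
    print("unpipelined total:", slow)
    print("pipelined total:  ", fast)
    print("speedup:          ", slow / fast)
    print("latency without pipeline:", instruction_latency(k, stage_time))
    print("latency with pipeline:   ", instruction_latency(k, stage_time, overhead))
```

With these assumed numbers the program as a whole runs about 4.37 times faster (5000 ns against 1144 ns), while each individual instruction takes 55 ns rather than 50 ns. As `n` grows, the speedup of the ideal model approaches `k` scaled down by the overhead.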