Over the last couple of decades, enormous advances have been made in many areas of computer technology, with an immense impact on processor design and system development. Processor implementation technology has also evolved rapidly as a natural consequence of steady advances in VLSI technology. As a result, successive processor families have been introduced, with a continuous, exponential increase in execution performance. To achieve this improvement, major architectural upgrades as well as important organisational enhancements in processor design have continued to progress, with no indication of any slowdown.
In regard to architectural development, the underlying processor design principles and implementation strategies have emphasised two dominant parameters, namely the clock rate (speed) and the CPI (cycles per instruction). Clock rates have moved from modest values to considerably higher speeds and have already reached a few gigahertz. The other trend is that processor designers have constantly striven to lower the CPI further, using numerous innovative hardware and system software design approaches.
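The interplay of these two parameters can be made concrete with the classic performance equation T = IC × CPI / f (execution time = instruction count × cycles per instruction ÷ clock rate). A minimal sketch, using purely illustrative figures rather than measurements of any real processor:

```python
# Classic CPU performance equation: T = IC * CPI / f
# (IC = instruction count, CPI = cycles per instruction, f = clock rate).
# The figures below are illustrative, not data for any real chip.

def execution_time(instruction_count, cpi, clock_hz):
    """Return execution time in seconds."""
    return instruction_count * cpi / clock_hz

ic = 1_000_000_000  # one billion executed instructions

# A CISC-style design: higher clock, but more cycles per instruction.
t_cisc = execution_time(ic, cpi=6.0, clock_hz=3.0e9)
# A RISC-style design: lower clock, but CPI close to one.
t_risc = execution_time(ic, cpi=1.2, clock_hz=2.0e9)

print(f"CISC-style: {t_cisc:.3f} s")  # 1e9 * 6.0 / 3e9 = 2.000 s
print(f"RISC-style: {t_risc:.3f} s")  # 1e9 * 1.2 / 2e9 = 0.600 s
```

A higher clock does not guarantee a shorter execution time if the CPI is correspondingly higher, which is exactly the CISC-versus-RISC tension described in the text.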
On the other hand, several organisational enhancements in processor design have taken place by this time, chiefly refinements to on-chip organisation such as pipelining and related techniques.
A broad spectrum of clock speed versus CPI characteristics of the major categories of contemporary processors implemented during the past decade or so is shown in Figure 8.29. Architecturally, all these processors belong mainly to one of two distinct classes, RISC and CISC, or fall somewhere in between. While the CISC design principle relies on increased clock speed, the RISC design philosophy strives to attain a lower CPI.
In the CISC camp, at present there is the dominant presence of top-of-the-line conventional processors such as the Intel Pentium, Motorola M68060, VAX 8600, and IBM 390. With advanced implementation techniques, the clock rate of today's CISC processors has increased significantly, to the tune of a few GHz. The CPI varies from one CISC processor to another, however, and may be as high as 20. That is why CISC processors are located in the upper part of the design space (Figure 8.29).
In the RISC class, there are several examples of remarkably fast processors, such as ARM, SPARC, Alpha, PA-RISC, MIPS, and the Power series. With the use of efficient, advanced pipelining approaches, the effective CPIs of RISC processors have, on average, been reduced to one or two cycles.
The organisational changes that have taken place until now in processor design, in both RISC and CISC architectures, have primarily targeted an increase in instruction-level parallelism (ILP) so as to accomplish more work in each clock cycle. Several such changes that have been implemented most successfully include the following, in chronological order:
Figure 8.29: CPI versus clock speed of major categories of processors.
Pipelining: This approach in processor design (already described in detail) in fact gives rise to the concept known as ILP, which enhances processor performance and thereby improves throughput significantly without using any faster hardware components.
Superscalar: In order to increase further the ILP obtained from a traditional pipelined design, designers have constructed multiple pipelines within a processor by replicating functional resources (multiple functional units). This has been made possible largely by the superior contemporary VLSI technology. As a result, more instructions can be executed in parallel per clock cycle using the available parallel pipelines, as long as pipeline hazards are avoided. Superscalar processors of both RISC and CISC classes allow multiple instructions to be issued simultaneously during each clock cycle. Thus, the effective CPI of a superscalar processor should be lower than that of a scalar RISC processor, while its clock rate matches that of a scalar RISC processor.
The processors in vector supercomputers use multiple functional units for concurrent scalar and vector processing. The effective CPI of a processor used in a supercomputer should be very low, and hence the processor is positioned at the lower right corner of the design space (Figure 8.29).
SHMT: The introduction of the thread concept, and subsequently its successful implementation in processor design, once again lowers the effective CPI. The use of replicated register banks in the pipeline organisation enables multiple threads to be executed concurrently, sharing pipeline resources during each clock cycle.
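The effect of issue width and pipeline-slot utilisation on the effective CPI of the scalar, superscalar, and multithreaded designs above can be sketched with a toy model; the utilisation figures below are hypothetical, chosen only to illustrate the trend:

```python
# Toy model of effective CPI: a processor that issues `issue_width`
# instructions per cycle, but keeps only a fraction of those issue
# slots usefully filled, completes issue_width * utilisation
# instructions per cycle on average. Utilisation values are hypothetical.

def effective_cpi(issue_width, slot_utilisation):
    """CPI = 1 / (average instructions completed per cycle)."""
    return 1.0 / (issue_width * slot_utilisation)

# Scalar pipeline: stalls waste 20% of cycles.
print(effective_cpi(1, 0.80))
# 4-issue superscalar: hazards and dependencies leave half the slots empty.
print(effective_cpi(4, 0.50))
# 4-issue with multithreading: independent threads fill more slots.
print(effective_cpi(4, 0.75))
```

A wider issue width lowers the ideal CPI, but only to the extent that hazards are avoided or multithreading keeps the issue slots filled, which is why each successive technique in the list pushes the effective CPI lower.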
All these attempts at advancing processor design, both architecturally and organisationally, have ultimately been targeted at increasing the performance of the system, but of course at the cost of increased complexity. Implementation of pipelining in a processor's organisation to obtain improved performance invites more complexity than is present in the design of a conventional nonpipelined processor. Including more stages in a pipelined design (superpipelining) offers better performance but increases the underlying design complexity, which in turn challenges its viability. There always exists a practical limit to how far this trend of adding ever more pipeline stages can be continued.
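The practical limit on pipeline depth can be illustrated with a simple timing model: each added stage contributes a fixed latch overhead, so the achievable clock rate saturates no matter how many stages are used. The delay values below are hypothetical:

```python
# Toy model of superpipelining: cycle time = (logic delay / stages) plus
# a fixed latch/register overhead per stage. Delay values are hypothetical.

def clock_rate_hz(total_logic_delay_ns, latch_overhead_ns, stages):
    cycle_ns = total_logic_delay_ns / stages + latch_overhead_ns
    return 1.0 / (cycle_ns * 1e-9)

for k in (5, 10, 20, 40, 80):
    ghz = clock_rate_hz(10.0, 0.25, k) / 1e9
    print(f"{k:2d} stages: {ghz:.2f} GHz")
# Each doubling of the stage count buys a smaller relative gain; with a
# 0.25 ns overhead the clock can never exceed 1/0.25 ns = 4 GHz,
# regardless of pipeline depth.
```

This saturation (before even counting the growing hazard penalties of deeper pipelines) is one concrete form of the practical limit described above.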
In the case of superscalar organisation, performance enhancement has been achieved primarily by increasing the number of parallel pipelines. This necessarily requires additional logic for managing regular multipipeline operation and avoiding hazards so as to extract as much output as possible. Still, full use of the multiple pipelines cannot be realised, owing to the presence of numerous types of hazards as well as resource dependencies. Consequently, there are diminishing returns as the number of parallel pipelines increases.
With SHMT organisation, managing multiple threads and scheduling them appropriately over the set of available pipelines requires additional logic, which often limits the number of threads, as well as the number of pipelines, that can be effectively utilised. This too yields diminishing returns as the number of concurrently executing threads and parallel pipelines is gradually increased.
After the successful introduction of ILP into processor design sometime in the late 1980s, first by exploiting pipelining and then superscalar techniques, the SHMT approach and its subsequent, far more fine-tuned implementations ultimately resulted in a steep rise in the performance of processors of all kinds.
This was observed until about the year 2000. After that, no appreciable improvement in processor performance has been attained. This is because the difficulty of designing, fabricating, and debugging chips with contemporary technology, in the face of ever-increasing logical complexity, has largely reached its limit. Most of the chip area is now occupied by coordination and signal-transfer logic. As a result, the effective implementation of ILP and machine-level parallelism appears to have reached a point beyond which it is not practically profitable to extend it any further.
Power Consumption Considerations: Another important aspect of processor design following the ILP philosophy is power density. While constantly striving for improved performance, designers were bound to use more transistors on ever more densely packed chips to realise greater functionality and higher clock frequencies. Denser packing shortens electrical path lengths, which considerably increases operating speed. The negative impact is that power requirements have risen steeply as chip density and clock frequency have gone up, and beyond a certain point the power consumption of the chip rises disproportionately fast with clock speed. Such increases in power density (W/cm²) eventually generate enormous heat. The problem of dissipating this heat on such high-density, high-speed chips poses a serious design issue that firmly limits chip density from going further beyond a certain point.
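The relationship this paragraph describes follows from the standard expression for dynamic CMOS power, P ≈ aCV²f (activity factor × switched capacitance × supply voltage squared × clock frequency). A sketch with placeholder values, not data for any real chip:

```python
# Dynamic power of CMOS logic: P ~ a * C * V^2 * f, where a is the
# activity factor, C the switched capacitance, V the supply voltage,
# and f the clock frequency. All values below are placeholders chosen
# only to show the scaling behaviour.

def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Return dynamic power in watts."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

base = dynamic_power(0.1, 1e-9, 1.2, 2.0e9)
# Doubling the clock alone doubles the power...
fast = dynamic_power(0.1, 1e-9, 1.2, 4.0e9)
# ...but higher clocks have historically also demanded higher supply
# voltage, and the V^2 term then makes power grow much faster than f.
fast_hot = dynamic_power(0.1, 1e-9, 1.4, 4.0e9)
print(base, fast, fast_hot)
```

This super-linear growth of power with clock speed, concentrated on an ever-smaller die area, is what drives the power-density (W/cm²) problem described above.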
One possible way to regulate the high power density on a chip is to use more of the available chip area for cache memory. The reason is that memory transistors are relatively smaller and have a power density an order of magnitude lower than that of logic. This is illustrated in Figure 8.30a. Figure 8.30b shows that the percentage of chip area devoted to memory has grown to exceed 50% as the chip transistor density has increased (Borkar 2003).
With increasing chip density, the power consumption trend is constantly rising. As shown in Figure 8.31 (Borkar 2007), it is expected that within a few years a microprocessor chip will carry 100 billion transistors on a 300 mm² die. Under the usual assumption that about 50%-60% of the chip area is devoted to memory, such a chip would support about 100 MB of on-chip cache and still leave over 1 billion transistors available for logic.
Figure 8.30: Power and memory considerations on a processor chip: (a) power density and (b) chip area.
Figure 8.31: Utilisation of transistors in a processor chip (Borkar, 2007).
In spite of all these conflicting factors, processor clock speeds have reached as high as 4 GHz in recent years. But it has also been observed that processor performance does not scale with clock speed. One of the main reasons is that the relative cost of a cache miss is greater at higher processor speeds.
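The cache-miss argument can be made quantitative with the usual memory-stall model: a miss to main memory takes a roughly fixed time in nanoseconds, so its cost in cycles grows with the clock rate. The miss rate and memory latency below are illustrative assumptions:

```python
# Effective CPI = base CPI + (misses per instruction * miss penalty in
# cycles). The penalty in cycles scales with clock rate because main
# memory latency is roughly fixed in nanoseconds. Illustrative values.

def effective_cpi(base_cpi, misses_per_instr, mem_latency_ns, clock_hz):
    miss_penalty_cycles = mem_latency_ns * 1e-9 * clock_hz
    return base_cpi + misses_per_instr * miss_penalty_cycles

# Same workload (2% of instructions miss, 80 ns memory) on two clocks:
cpi_1ghz = effective_cpi(1.0, 0.02, 80, 1.0e9)  # 80-cycle penalty
cpi_4ghz = effective_cpi(1.0, 0.02, 80, 4.0e9)  # 320-cycle penalty
print(cpi_1ghz, cpi_4ghz)

# Time per instruction in seconds: quadrupling the clock helps far
# less than 4x, because the higher CPI eats most of the gain.
print(cpi_1ghz / 1.0e9, cpi_4ghz / 4.0e9)
```

Under these assumptions, quadrupling the clock speeds the workload up by only about 1.4 times, matching the observation that performance does not scale with clock speed.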
In view of all these factors, processor clock speeds have levelled off in recent years, while greater attention is being paid to how best to design the chip to utilise the enormous number of transistors available on it most effectively. As already mentioned, there are critical limits to the effective use of techniques such as superscalar execution and SHMT. Given that such a large number of transistors can be fabricated on a chip, huge performance benefits can be derived by integrating system functions on a chip, even if it is not possible to continue to push clock speeds higher. Another outcome of these technological factors is that system performance can be enhanced more easily by employing multicore processors, systems-on-a-chip (SoCs), stream processors, and larger two-level on-chip cache memories than by pushing a single processor to its technological performance limits. An example of an SoC is Tilera's TILE64, a 64-core processor for embedded applications in which each chip consists of a regular 8 × 8 grid of tiles; each tile has its own general-purpose processor core, L2 cache, and a nonblocking mesh router that provides communication with the other tiles on the chip and off-chip data traffic with main memory, I/O devices, and networks. These are only a few of the many possibilities arising from ongoing architectural developments.
In general terms, all this experience of recent decades has been encapsulated in a rule of thumb known as Pollack's rule, which states that the performance increase is roughly proportional to the square root of the increase in complexity. In other words, if the logic in a processor core is doubled, it delivers only about 40% more performance. In principle, the use of multiple cores has the potential to provide near-linear performance improvement as the number of cores increases. That is why the introduction of multicore architecture in the evolution of processor design became, in fact, inevitable.
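Pollack's rule, and its contrast with multicore scaling, can be checked with a line of arithmetic each; the "complexity" units here are abstract transistor-budget ratios:

```python
import math

# Pollack's rule: single-core performance grows roughly with the square
# root of core complexity (logic/transistor budget), while adding cores
# scales performance near-linearly for parallel workloads (ignoring
# Amdahl's-law and interconnect overheads).

def pollack_perf(complexity):
    """Relative performance of one core of the given relative complexity."""
    return math.sqrt(complexity)

# Doubling the logic in a single core:
print(pollack_perf(2.0) / pollack_perf(1.0))  # ~1.41, i.e. ~40% gain

# Spending the same doubled budget on two simple cores instead:
print(2 * pollack_perf(1.0))  # 2.0, near-linear for parallel work
```

The gap between √2 ≈ 1.41 and 2.0 for the same transistor budget is precisely the argument that made multicore designs inevitable.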
Multicore organisation is also preferred when the power consumption of the chip is taken into active consideration. As already mentioned, around 50%-60% of the chip area would be devoted to cache memory for the sake of controlled power consumption. This, in turn, gives rise to such a large amount of on-chip cache that it becomes unlikely that any one thread of execution could ever use it all effectively; a major portion of the cache then remains underutilised. Even with SHMT, multithreading is carried out in a relatively limited fashion, so such an enormous cache cannot be fully exploited, whereas a number of relatively independent threads or processes running on multiple cores have an excellent opportunity to take full advantage of such an extremely large on-chip cache.