Multicore with Hardware Multithreading

With continuous advancement in VLSI technology, the aggregate functionality that can be built into a single chip has been growing steadily with time. Multiple processor cores, as well as hardware support for multithreading, on a single chip can thus both be seen as natural consequences of the steady advances in VLSI technology. Both developments address the needs of important segments of modern computer applications and workloads. Depending on the specific strategy adopted for switching between threads, hardware support for multithreading may be classified under one of the following:

• Coarse-grain multithreading refers to switching between threads only on the occurrence of a major pipeline stall, which may be caused by, say, an access to main memory, with latencies of the order of a hundred processor clock cycles.

• Fine-grain multithreading, also known as interleaved multithreading, refers to switching between threads on the occurrence of any pipeline stall, which may be caused by, say, an L1 cache miss. The term also applies to designs in which processor clock cycles are regularly shared (i.e. switched) amongst executing threads, even in the absence of a pipeline stall.

• Simultaneous multithreading refers to issuing machine instructions from two (or more) threads in parallel in each processor clock cycle. [1]

The recent trend in processor design has thus been a shift towards multicore chips with hardware multithreading. This organisational approach yields two main types of performance benefit:

• Multicore chips with hardware multithreading can exploit a broader range of structural parallelism inherent in applications. The processor cores in a multicore chip operate mainly in a shared-memory mode. However, message passing, which works independently of the physical locations of processes or threads, also provides a natural software model to exploit the structural parallelism present in an application.

• A multicore system with hardware multithreading also supports the natural parallelism that is always present between two or more independent programs running on a system. Even two or more operating systems can share common hardware, in effect providing multiple virtual computing environments to users. Such virtualisation makes it possible for the system to support more complex and composite workloads, thereby resulting in better system utilisation and an appreciable return on investment.
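The thread-switching strategies described earlier can be sketched with a toy single-issue scheduler. The thread workloads, stall latencies, and cycle count below are invented for illustration; real hardware switches threads with far more machinery than this.

```python
# Toy model of coarse-grain vs fine-grain thread switching.
# Each thread is a list of per-instruction stall latencies (0 = no stall).

def schedule(threads, policy, total_cycles=10):
    """Return which thread issued in each cycle (None = pipeline bubble)."""
    pc = [0] * len(threads)          # next instruction index per thread
    ready_at = [0] * len(threads)    # cycle at which each thread can issue
    cur = 0
    trace = []
    for cycle in range(total_cycles):
        if policy == "fine":
            # interleaved: rotate to the next ready thread every cycle
            for _ in range(len(threads)):
                cur = (cur + 1) % len(threads)
                if ready_at[cur] <= cycle:
                    break
        elif policy == "coarse" and ready_at[cur] > cycle:
            # switch only because the current thread has stalled
            for cand in range(len(threads)):
                if ready_at[cand] <= cycle:
                    cur = cand
                    break
        if ready_at[cur] > cycle:
            trace.append(None)       # every thread is stalled
            continue
        stall = threads[cur][pc[cur] % len(threads[cur])]
        pc[cur] += 1
        ready_at[cur] = cycle + 1 + stall
        trace.append(cur)
    return trace

# Thread 0 stalls 3 cycles on every second instruction; thread 1 never stalls.
workload = [[0, 3], [0, 0]]
print("coarse:", schedule(workload, "coarse"))
print("fine:  ", schedule(workload, "fine"))
```

Under the coarse-grain policy the core stays on one thread until it stalls; under the fine-grain policy issue slots rotate amongst the ready threads every cycle, hiding thread 0's long stall.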

With the continuously increasing power of VLSI technology, the development of multicore SoCs, even in the presence of ample multithreading, also became inevitable, since there are practical limits to the number of threads a single processor core can usefully support. Each core on the Sun UltraSPARC T2, a multicore SoC processor, for example, supports eight-way fine-grained multithreading, and the chip has eight such cores. Even so, much of the available aggregate power of a multicore processor often remains unutilised. Multicore chips have, however, so far kept their promise of providing higher net processing performance per watt of power consumption.

SoCs are a fascinating example of design trade-offs. For any practical processor design task, it is necessary to make many design choices and trade-offs, validate the chosen design using simulations, and finally complete the design in detail down to the level of logic circuits.

IBM Power 5

One notable release in the line of IBM's Power architecture was the Power5 processor chip, launched in 2004. It was primarily intended to compete against the Intel Itanium 2 and, to a lesser extent, the Sun Microsystems UltraSPARC IV and the Fujitsu SPARC64 in the high-end enterprise server market. It was superseded in 2005 by an improved iteration, the Power5+.

Power5 is a second-generation dual-core (multicore) processor which uses the SHMT approach on both of its two separate processor cores. Interestingly, at design time the designers simulated various alternative design approaches against their objectives. They observed that two two-way SHMT processor cores on a single chip yielded performance superior to that of a single four-way SHMT processor. The simulations also revealed that pushing multithreading further, with hardware resources sized for two threads, might degrade the processor's performance, mainly because of cache thrashing: data from one thread repeatedly displaces data needed by another thread. That is why the Power5 is built with SHMT limited to two concurrent threads on each of its two processor cores. The Power5 is equipped with a power-management facility that can reduce power consumption below the standard single-thread level with no appreciable performance impact; for low-priority threads, it can switch to a low-power mode. Moreover, SHMT can even be disabled to suit the current workload. Several Power5 processors in high-end systems can also be coupled together to act as a single vector processor through a technology called ViVA (Virtual Vector Architecture).
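The cache-thrashing argument above can be illustrated with a toy direct-mapped cache model. The cache geometry, base addresses, and access patterns below are invented for illustration and are not Power5 parameters.

```python
# Toy illustration of cache thrashing: two threads sharing one small
# direct-mapped cache evict each other's lines when their working sets
# alias to the same cache sets.

SETS, LINE = 64, 64              # 64 sets of 64-byte lines (a 4 KB cache)

def hit_rate(access_streams):
    """Interleave the given address streams through one direct-mapped
    cache (round-robin, as a multithreaded core would) and return the
    overall hit rate."""
    tags = [None] * SETS
    hits = total = 0
    for accesses in zip(*access_streams):
        for addr in accesses:
            line = addr // LINE
            s, tag = line % SETS, line // SETS
            total += 1
            if tags[s] == tag:
                hits += 1
            else:
                tags[s] = tag    # evict whatever was resident
    return hits / total

# Each thread sweeps a 4 KB buffer with sequential 8-byte loads; the two
# buffers alias to the same sets because their bases differ by 64 KB.
sweep = [i * 8 for i in range(512)] * 4
t0 = [0x00000 + a for a in sweep]
t1 = [0x10000 + a for a in sweep]

print("one thread :", hit_rate([t0]))      # high hit rate
print("two threads:", hit_rate([t0, t1]))  # collapses: mutual eviction
```

A single thread enjoys near-perfect reuse, while the interleaved pair evict each other on every access, which is exactly the effect the Power5 designers sought to avoid by capping SHMT at two threads.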

Power5+: This processor is an improved upgrade of the Power5 fabricated with a newer 90 nm process. It was initially aimed at lowering power consumption, and its die size decreased from 389 mm² to 243 mm². Clock frequency was maintained between 1.5 and 1.9 GHz. Subsequent versions of the Power5+, however, raised the clock frequency to 2.2 GHz and then to 2.3 GHz in 2006. The processor was packaged in the same way as the Power5, but it was also available in a quad-chip module (QCM) containing two Power5+ dies and two L3 cache dies, one for each Power5+ die. These QCM parts ran at clock frequencies between 1.5 and 1.8 GHz.

A brief description of Power5 architecture with relevant figures is given in the website:

Intel Core i7

The Intel Core i7 processor codenamed Bloomfield was introduced in November 2008 on a 45 nm fabrication process. It was followed by Gulftown on a 32 nm process in July 2010, Sandy Bridge on a 32 nm process in January 2011, and finally Ivy Bridge on a 22 nm process in April 2012. Each product in this succession brought new features, enabled largely by enhanced cores, increased bus and clock frequencies, and larger caches at the different levels, to support the continuously growing demands of current applications. All these chips have four x86 SHMT processor cores, each with a dedicated split L1 cache (a 32 KB I-cache and a 32 KB D-cache) and a dedicated unified 256 KB L2 cache, plus a large 8 MB L3 cache shared by all four cores.
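As a quick sanity check, the total on-chip cache of such a four-core part can be tallied from the sizes quoted above (a trivial sketch; the figures are those stated in the text):

```python
# Totalling the Core i7 cache hierarchy described above.
cores = 4
l1_per_core = 32 + 32        # KB: split I-cache + D-cache per core
l2_per_core = 256            # KB: dedicated, unified L2 per core
l3_shared = 8 * 1024         # KB: L3 shared by all cores
total_kb = cores * (l1_per_core + l2_per_core) + l3_shared
print(total_kb, "KB")        # 9472 KB of cache on the chip
```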

One elegant mechanism that Intel exploits to make cache usage more effective is prefetching: the caches are filled speculatively with data likely to be required soon. The Core i7 improves on the L2 cache performance of the Core 2 Quad's shared L2 by giving each core a dedicated L2 cache backed by a relatively fast shared L3 cache. The Core i7 chip supports two forms of external communication with other chips. One is the DDR3 memory controller, which brings the interface to DDR main memory onto the chip. The interface supports three channels, each 8 bytes (64 bits) wide, for a total bus width of 192 bits and an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the front-side bus is no longer needed and is eliminated. The other is the QuickPath Interconnect (QPI), a cache-coherent, point-to-point, link-based electrical interconnect specification for Intel processors and chipsets. It provides high-speed communication between connected processor chips. The QPI links operate at 6.4 GT/s (gigatransfers per second). At 16 bits per transfer, that adds up to 12.8 GB/s, and since QPI links involve dedicated bidirectional pairs, the total bandwidth is 25.6 GB/s.
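The bandwidth figures quoted above can be checked with simple arithmetic. The per-channel DDR3 transfer rate used below (~1.33 GT/s) is inferred from the 32 GB/s aggregate figure rather than stated in the text:

```python
# Back-of-the-envelope check of the quoted bandwidth figures.

# DDR3 memory interface: three 64-bit (8-byte) channels.
channels, channel_width_bytes = 3, 8
transfer_rate = 1.333e9                      # ~1.33 GT/s per channel (inferred)
mem_bw = channels * channel_width_bytes * transfer_rate
print(f"memory bandwidth ≈ {mem_bw / 1e9:.0f} GB/s")

# QPI: 6.4 GT/s, 16 bits (2 bytes) per transfer, dedicated pairs each way.
qpi_rate, qpi_width_bytes = 6.4e9, 2
per_direction = qpi_rate * qpi_width_bytes   # 12.8 GB/s each direction
total = 2 * per_direction                    # 25.6 GB/s both directions
print(f"QPI: {per_direction / 1e9:.1f} GB/s each way, {total / 1e9:.1f} GB/s total")
```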

Distinctive Features of Intel Core i7 900-Series Processors

  • Intel launched three Core i7 900-series processors in succession, differing mostly in processor frequency: the Core i7 920 (2.66 GHz), Core i7 950 (3.06 GHz), and Core i7 975 (3.33 GHz). The major features of the three processors are otherwise much the same. Each provides four complete execution cores in a single processor package.
  • Each processor, using Intel Hyper-Threading (HT) technology, delivers two processing threads per physical core for a total of eight simultaneously executing threads, offering massive computational throughput.
  • Intel Turbo Boost technology in each processor dynamically increases the processor frequency as needed, taking advantage of thermal and power headroom when operating below specified limits. This automatically yields improved performance when it is needed most. Turbo Boost provides single-core performance of up to 2.93 GHz for the Core i7 920, 3.33 GHz for the Core i7 950, and 3.6 GHz for the Core i7 975.
  • The large last-level L3 (8 MB) Intel Smart Cache enables dynamic and efficient allocation of shared cache space to all four cores, to closely match the needs of each.
  • An on-chip integrated memory controller supporting three DDR3 channels operating at 1066 MHz offers dramatic memory read/write performance through an efficient prefetching algorithm, lower latency, and higher memory bandwidth, making the Intel Core i7 processor family well suited to data-intensive applications.
  • All these processors include the full SSE4 (128-bit) instruction set; each instruction can be issued at a throughput rate of one per clock cycle, improving the performance of a broad range of multimedia and computation-intensive applications.
  • The Intel QuickPath Interconnect (Intel QPI) used with the Intel Core i7 900 processor series increases bandwidth and lowers latency, achieving data transfer speeds as high as 25.6 GB/s.
  • The socket type used with each of these processors is LGA 1366.

A brief description of Intel Core i7 architecture with a relevant figure is given in the website:

Sun UltraSPARC T2 Processor

Since the mid-1980s, many powerful RISC processors, including Sun Microsystems' SPARC family, have been introduced with many innovative ideas, using relatively simple designs with effective instruction pipelines and efficient use of on-chip cache memory. The original SPARC processor was a 32-bit RISC processor with a load-store architecture, relatively simple addressing modes, and register-to-register arithmetic-logic machine instructions in three-address format (see Chapter 9, Sun SPARC, for more details of the SPARC processors).

The UltraSPARC T2, a 64-bit enhanced version of its predecessor the UltraSPARC T1, is a multicore SHMT processor and SoC version of the UltraSPARC with extensive on-chip support for multithreading, networking, I/O, and other key functions. The T2 chip has an area of just under 3.5 cm², is fabricated with a 65 nm line width using about 500 million transistors, and can operate at 1.4 GHz from a 1.1 V supply. The chip normally consumes about 95 W; on a per-thread basis, however, this power consumption works out to be quite low.

The T2 processor consists of eight processor cores, each supporting eight-way fine-grained multithreading. Each core has its own data paths, register sets to support the execution of multiple threads, two integer units, and a floating-point unit. In addition, each core has hardware support for cryptography and graphics. Overall, the chip supports 64 parallel threads and exhibits a comparatively high level of thread-level parallelism (TLP), though not necessarily much ILP. Besides the usual dedicated L1 instruction and data caches in each core, shared by the eight threads executing on that core, the processor also contains a crossbar switch, a 4 MB shared L2 cache organised as eight parallel banks for faster access, and extensive support for I/O and networking; it is therefore, in fact, an SoC. Since the threads run independently of each other and share all hardware resources, each thread behaves as a virtual processor in its own right. Thus the single chip can be considered able to support 64 virtual systems.
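The thread and L2 organisation just described can be sketched as follows. The flat thread numbering and the bank-selection hash shown (line-address bits just above the 64-byte line offset) are assumptions for illustration, not the documented T2 scheme.

```python
# Sketch of the T2 organisation: 8 cores × 8 hardware threads ("strands")
# = 64 virtual processors, plus an 8-way banked shared L2 cache.

CORES, STRANDS, BANKS, LINE_BYTES = 8, 8, 8, 64

def thread_id(core, strand):
    """Flat virtual-processor number for a (core, strand) pair."""
    return core * STRANDS + strand

def l2_bank(addr):
    """Which of the eight parallel L2 banks services this address
    (assumed hash: line-address bits above the 64-byte offset)."""
    return (addr // LINE_BYTES) % BANKS

print("virtual processors:", thread_id(CORES - 1, STRANDS - 1) + 1)
# Consecutive cache lines spread across all banks, so streaming accesses
# proceed in parallel rather than serialising on one bank.
print([l2_bank(line * LINE_BYTES) for line in range(8)])
```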

The design of the UltraSPARC T2 is primarily targeted at computation-intensive applications with a high degree of multithreading. Apart from back-end servers, its target uses include network devices such as packet routers and switches for local area networks (LANs), graphics and imaging applications, and similar workloads.

Last but not least, one unique feature is that the complete design of the UltraSPARC T2 chip has been made available on the web to researchers and developers under an open-source agreement. Sun Microsystems' stated objective in this decision is to encourage further innovation in processor design and its applications around the world.

A brief architectural description of the UltraSPARC T2 with a relevant figure is given in the website:

  • [1] Simultaneous multithreading refers to issuing machine instructions from two (or more) threads in parallel in each processor clock cycle. This corresponds to a multiple-issue processor in which the multiple instructions issued in a clock cycle come from an equal number of independent execution threads.