Multiprocessors are systems with multiple CPUs capable of independently executing different tasks in parallel. Whether they use a common shared memory or unshared distributed memories, these processors also share resources such as communication facilities, I/O devices, system utilities, program libraries, and databases. They operate under the control of an integrated operating system that provides interaction between processors and their programs at the job, task, file, and even data-element level. Multiprocessors can be classified in a number of ways: (i) the number of CPUs present in the system (modestly parallel systems contain 2 to about 30 processors, while massively parallel systems can contain thousands); (ii) the pattern of interconnection between the CPUs and the memory modules, including whether the memory modules are centrally shared or distributed shared; (iii) the way the CPUs themselves are interconnected with one another; and (iv) the form of interconnection network used (i.e. static or dynamic). Many other aspects must also be considered when a multiprocessor is implemented.
Multiprocessor systems are best suited for general-purpose multi-user applications where the major thrust is on programmability. Shared-memory multiprocessors can be a very cost-effective approach, but poor latency tolerance when accessing remote memory is considered a major shortcoming, and lack of scalability is another key limitation. Distributed shared memory (DSM) multiprocessors address these issues and resolve most of these drawbacks to a considerable extent by providing an extended form of the strict shared-memory multiprocessor architecture.
A multiprocessor architecture in which the primary memory is shared is usually called a shared-memory multiprocessor. Here there is a single, centrally shared global memory with a single virtual address space that is accessed and shared by all processors. In addition, each processor may have its own private cache (local memory) to further speed up operation. Peripherals can be attached in some other form of sharing. These multiprocessors generally do not scale well; they are sometimes referred to as tightly-coupled, since high-bandwidth communication networks are used to achieve a high degree of sharing of all types of resources (Catanzaro, B.).
Shared memory does not always mean that there is only a single centralized memory to be shared. From the mid-1980s onwards, increasingly powerful VLSI technologies offered an abundance of hardware capabilities and numerous options at affordable cost, and the conventional definition and traditional architecture of the multiprocessor changed radically as a result. It became feasible to build a powerful multiprocessor on a single chip (multicore, also known as chip multiprocessor, see Section 8.12.2) and larger-capacity RAM at reasonable cost. Large-scale multiprocessor architectures then started to evolve with multiple memories distributed among the processors. Here, each CPU can quickly access its local memory; accesses to the memories attached to other CPUs are also possible but are relatively slower. These physically separate memories can be addressed as one logically shared address space, meaning that any memory location can be addressed by any processor, provided it has the required access rights. This does not discard the basic shared-memory concept of multiprocessors; rather, it extends that concept in a broader sense into what is known as distributed shared memory. Multiprocessors with distributed shared memory (DSM) are relatively scalable and are sometimes referred to as loosely-coupled multiprocessors. Here also, peripherals can be attached in some other suitable form of sharing (Vranesic Z.G., et al.).
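The idea of one logical address space laid over physically separate memories can be illustrated with a minimal sketch. It maps a global address to the node that holds it and charges a higher relative cost when the requesting CPU is not on that node. The block-partitioned layout, the function names `home_node` and `access_cost`, and all numeric constants are assumptions made purely for illustration; they do not describe any particular machine.

```python
# Hypothetical DSM layout: each node owns one contiguous slice of the
# global address space. Every constant below is an illustrative assumption.
NODE_MEMORY_SIZE = 1024   # words of memory local to each node (assumed)
LOCAL_LATENCY = 1         # relative cost of a local access (assumed)
REMOTE_LATENCY = 10       # relative cost of a remote access (assumed)


def home_node(global_address):
    """Return (node, offset) for the memory that holds a global address."""
    if global_address < 0:
        raise ValueError("address must be non-negative")
    return divmod(global_address, NODE_MEMORY_SIZE)


def access_cost(requesting_node, global_address):
    """Relative cost for a CPU on requesting_node to reach the address."""
    node, _offset = home_node(global_address)
    return LOCAL_LATENCY if node == requesting_node else REMOTE_LATENCY
```

Any CPU can name any address, which is the shared-memory property; only the cost of the access depends on where the address physically lives.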
These two types of multiprocessors, due to their inherent architectural differences, can also be differentiated primarily in terms of the speed and ease with which their processors can interact on common tasks. Irrespective of the organisation of the multiprocessor, there are two primary points of contention: the shared memory, and the shared communication network through which all interactions take place. A common cache memory (in addition to the private cache, or local memory, that each processor uses to speed up its own activities) is often employed to reduce such contention. The two types of multiprocessors also differ in the type of interconnection network used, which has a significant impact on the bandwidth and saturation of system communications and thereby directly influences the performance of the system as a whole (Mak, R, et al.). Other important issues, such as cost, complexity, interprocessor communication, and above all the scalability of the architecture, also need to be considered. The interconnection network used in either type of multiprocessor may take one of several forms, including shared-bus systems, crossbar systems, hypercubes, and multilevel switches.
Shared-bus systems are relatively simple and popular, but their scalability is limited by bus and memory contention. Crossbar systems allow fully parallel connections between processors and different memory modules, but their cost and complexity grow quadratically with the increase in the number of nodes (see Crossbar switch). Hypercubes and multilevel switches are scalable and their complexities grow only logarithmically with the increase in number of nodes.
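The difference between quadratic and logarithmic growth can be made concrete with a small calculation. The sketch below assumes two simple complexity measures that the text does not itself specify: the number of crosspoints in a crossbar, and the number of links attached to each node of a binary hypercube. The function names are hypothetical.

```python
def crossbar_crosspoints(processors, memories):
    """Crosspoint switches needed to connect every processor to every memory."""
    return processors * memories


def hypercube_links_per_node(nodes):
    """Links per node in a binary hypercube; nodes must be a power of two."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a binary hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1   # log2(nodes) for a power of two


def hypercube_total_links(nodes):
    """Total links: each node has log2(nodes) links, each shared by two nodes."""
    return nodes * hypercube_links_per_node(nodes) // 2


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, crossbar_crosspoints(n, n), hypercube_links_per_node(n))
```

Under these assumed measures, going from 16 to 256 nodes multiplies the crossbar crosspoint count by 256, while the number of links per hypercube node only doubles from 4 to 8.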
Multiprocessors that use shared memory or distributed shared memory give rise to three different models, which differ mainly in how the memory and peripheral resources are connected: shared or distributed. The three common models are:
(i) Uniform Memory Access (UMA), (ii) Non-Uniform Memory Access (NUMA), (iii) No-Remote Memory Access (NORMA) (Yew R C, et al.).
Symmetric Multiprocessors (SMP): UMA Model
In a centralized shared-memory multiprocessor known as UMA (Uniform Memory Access), each of n processors can uniformly access any of the common m memory modules at any point of time. In Figure 10.16 a, the CPUs, the memory, and the I/O subsystems are connected using an interconnection network in the form of a common (hierarchical) bus. Here, each CPU chip contains an on-chip level 1 (L1) private cache; in addition, a CPU may have an on-chip dedicated or shared L2 cache and may also use an off-chip L3 cache. Whatever architectural improvements are made within the confines of this basic concept to speed up execution, the shared bus can service only one request at a time out of the many that have arrived. Hence, the bus itself eventually becomes a hot spot, creating a severe bottleneck that limits the performance of the entire system. The CPUs then face an unpredictable delay while accessing the shared memory.
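The hot-spot effect can be seen in a deliberately simplified model: several CPUs request the bus in the same cycle, and the bus grants exactly one request per cycle. The first-come, first-served arbitration order, the one-cycle service time, and the names `bus_wait_times` and `average_wait` are assumptions made for this sketch; real bus arbitration is more involved.

```python
from collections import deque


def bus_wait_times(num_requests):
    """Cycles each simultaneous request waits before the bus serves it."""
    pending = deque(range(num_requests))  # requesting CPU ids, FIFO order (assumed)
    waits = {}
    cycle = 0
    while pending:
        cpu = pending.popleft()   # the bus services exactly one request per cycle
        waits[cpu] = cycle        # cycles this CPU spent waiting for the bus
        cycle += 1
    return waits


def average_wait(num_requests):
    """Mean waiting time, in cycles, over all simultaneous requests."""
    waits = bus_wait_times(num_requests)
    return sum(waits.values()) / len(waits)
```

In this model the last of n simultaneous requests waits n - 1 cycles, so the average wait grows linearly with the number of contending CPUs.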
In order to minimize this hot-spot problem, Figure 10.16 b shows an alternative design that uses a crossbar network to connect the CPUs and the I/O subsystems with the memory units. Here, the CPUs and the I/O subsystems face relatively shorter delays in accessing memory, since the crossbar network provides several alternative connections that can be used in parallel as long as they do not conflict in their source or destination entities. The delays caused in the crossbar switch are also more predictable than those of the bus. System performance is therefore appreciably better than with a bus as the interconnection network.
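The rule that crossbar connections can proceed in parallel as long as they do not conflict in source or destination can be sketched as follows. The greedy, first-come arbitration and the function name `grant_crossbar_requests` are assumptions for illustration only; the text does not describe a specific arbitration scheme.

```python
def grant_crossbar_requests(requests):
    """Split (source, destination) requests into granted and deferred lists.

    A request is granted in the current cycle only if neither its source
    nor its destination is already used by a previously granted request.
    """
    busy_sources = set()
    busy_destinations = set()
    granted, deferred = [], []
    for source, destination in requests:
        if source in busy_sources or destination in busy_destinations:
            deferred.append((source, destination))   # conflict: retry later
        else:
            busy_sources.add(source)
            busy_destinations.add(destination)
            granted.append((source, destination))    # can proceed in parallel
    return granted, deferred
```

For example, requests from `cpu0` to `m0`, `cpu1` to `m1`, and `io0` to `m2` can all be granted together, while a simultaneous request from `cpu2` to `m0` is deferred because `m0` is already in use.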
Figure 10.16: A scheme of UMA model (shared-memory multiprocessor) (a) System bus and (b) Crossbar Switch.
The UMA model of multiprocessors can be further divided into two categories: symmetric and asymmetric. When all the processors in the system share equal access to the m shared memory modules as well as to all shared I/O devices, either through the same channels or through different channels that provide paths to the same devices, the multiprocessor is called a symmetric multiprocessor (SMP). This is illustrated in Figure 10.16a and b, which use an interconnection network in the form of a common (hierarchical) bus and a crossbar network, respectively. Other types of interconnection network can also be employed in place of the crossbar network. In this category, all the processors are allowed to run all interrupt service routines and other supervisor-related (kernel) programs. In the asymmetric category, only one processor, or a selected number of processors, in the multiprocessor system is permitted to additionally handle all I/O and supervisor-related (kernel) activities. These are treated as master processors that supervise the execution activities of the remaining processors, known as attached processors. All the processors here still have uniform access to any of the m shared memory modules.
The UMA model is easy to implement and is suitable for general-purpose multi-user applications in a time-sharing environment, as well as for parallel processing applications. Parallel processes here must communicate by software, either using some form of message passing in which messages are placed into a buffer in the shared memory, or using lock variables in the shared memory. Interprocessor communication and synchronization are normally carried out using shared variables located in the common memory.
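Synchronization through a lock variable guarding a shared variable can be sketched in a few lines. In this illustration Python threads stand in for processors and an ordinary dictionary stands in for the common memory; both substitutions, and the function name `run_workers`, are assumptions made to keep the example self-contained.

```python
import threading


def run_workers(num_workers, increments_per_worker):
    """Have several workers update one shared counter under a lock."""
    shared = {"counter": 0}     # stands in for a variable in common memory
    lock = threading.Lock()     # the lock variable guarding the counter

    def worker():
        for _ in range(increments_per_worker):
            with lock:          # only one worker may update at a time
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()           # wait until every worker has finished
    return shared["counter"]
```

Because each update happens while the lock is held, the final count always equals the number of workers times the number of increments per worker.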
One distinct advantage of SMP is that it continues to operate even when some CPUs fail, with only a graceful degradation in the performance of the entire system. In most situations the failure of a processor is not severe for the other processors in the system, provided the failed processor was not executing kernel code at the time of failure. At worst, only the process(es) being served by the failed processor are affected; other processes can no longer be scheduled on that processor, which may reduce the total performance of the system to some extent.
However, one of the major drawbacks of this model lies in the speed disparity between the processors and the interconnection network. While the fast processors quickly complete their execution, the interconnection network cannot complete its prescribed task within the same period. As a result, a comparatively long delay is needed to synchronize these operations, which results in an overall degradation in system performance. Interconnection networks with speeds comparable to that of the processors are possible, but they are costly and equally complex to implement.
The presence of caches at different levels (such as L1, L2, and L3) attached to more than one CPU substantially improves overall system performance, but may lead to a critical cache coherence problem, which in turn requires an additional mechanism (a cache coherence protocol) to resolve. Consequently, the cost of the architecture increases. The traffic load on the network also increases while cache-coherence signals are exchanged, which may eventually slow down memory accesses and lead to a perceptible degradation in overall system performance.
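A minimal sketch can show both the coherence problem and the extra traffic that solving it generates. The text does not name a particular protocol, so the sketch assumes a simple write-through, write-invalidate scheme; the class name `CoherentSystem` and its counter of invalidation messages are hypothetical.

```python
class CoherentSystem:
    """Toy model: private caches kept coherent by write-invalidate (assumed)."""

    def __init__(self, num_cpus):
        self.memory = {}                                  # shared main memory
        self.caches = [dict() for _ in range(num_cpus)]   # one private cache per CPU
        self.invalidations_sent = 0                       # coherence traffic counter

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:
            # Cache miss: fetch the current value from shared memory.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        # Invalidate every other cached copy so no CPU can read a stale value.
        for other, cache in enumerate(self.caches):
            if other != cpu and address in cache:
                del cache[address]
                self.invalidations_sent += 1
        self.caches[cpu][address] = value
        self.memory[address] = value    # write-through to memory (assumed policy)
```

Without the invalidation loop, a CPU that had already cached the address would keep returning its old copy after another CPU wrote a new value. With it, each write to shared data may send messages to other caches, which is the added network load described above.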
Scalability is also a major problem inherent in this architecture, irrespective of the type of interconnection network used. The UMA model does not scale well beyond a fairly small number of CPUs, usually somewhere between 16 and 64 processors. For example, in the Power Challenge SMP system introduced by Silicon Graphics, the number of processors is limited to 64 MIPS R10000 CPUs in a single system; beyond this number the performance of the system is observed to degrade significantly.
The scalability issue that persists with an SMP eventually became one of the driving motivations behind the development of large-scale multiprocessing systems that retain most of the flavour of SMP. The outcome is a new architecture known as NUMA. The basic objective of NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes in the architecture; each such node is itself a multiprocessor consisting of multiple processors (CPUs) equipped with its own resources and local interconnect system.