Home Computer Science
Implementation of Multistage Networks
Table of Contents:
A high-speed 128-port Omega network have been used by IBM in its RP3 multiprocessor having 512 processors (nodes) that offers a bandwidth of 13 Gbytes/sec at 50 MHz clock rate. Cedar multiprocessor also used multistage Omega networks in their systems to build up the interconnection networks. The Alliant FXI2800 uses crossbar interconnects to communicate between seven 4-processor (Intel 860) boards plus one I/O board, and eight interleaved shared cache boards. Multilevel crossbar switch have been used to connect larger configuration in Fujitsu VPP500 (vector parallel processor), Hitachi SR 8000 and NEC SX-5 large systems. In all Crap multiprocessors, crossbar networks are used between the processors and memory banks. The banks use 64-, 128-, or 256-way of (in Cray Y-MP) interleaving that can be accessed via four ports. The BBN Butterfly processors (TC 2000 series) from BBN Advanced Computers use 8x8 crossbar switch modules to build a three- stage 512 x 512 butterfly network for a 512-processor model that offers a maximum interprocessor bandwidth of 2.4 Gbytes/sec at 38 MHz clock rate with a 1-byte data path. Using Motorola RISC processor 88100, the crossbar switch modules have been used to design a NUMA machine for real-time applications. IBM RS/6000 SP multiprocessor also encouraged the use of multistage network as one of its several options for interconnecting the groups of processors. IBM ESI9000 large system uses crossbar network to connect I/O channels and shared memory.
Comparison of Dynamic Networks
Shared-memory multiprocessors can be realized either by using multiple buses or multistage networks or crossbar switches in building dynamic networks. Obviously, the bus- based system is the cheapest to build, and gets most of its performance from snooping caches running complex cache coherence protocols. As a result, low' bandwidth is offered to each processor. Multistage networks, by virtue of having multiple paths and switches offers increased bandwidth. Crossbar switches allow speedy one-to-one communication. bkwever, each of these systems has its owm several distinct advantages as well as certain severe drawbacks, primarily depending on the environments where they will be actually used.
A brief detail of this topic is given on the website: http://routledge.com/9780367255732.
Static Connection Networks (Message-Passing Approach)
Static networks are easier to build and are cost-effective, while being able to control many processors involved in the design. Interprocessor communications are made by w'ay of message passing using distributed buses or I/O communication channels as links. Nearby processors can then interact at the maximum possible rate with little or almost no interference from other processors. In the simplest form, each processor has a direct link to every other processor, giving rise to a full interconnection pattern. It offers a bandwidth proportional to n2 (n is the number of nodes) and a minimal delay from any processor to any other. Unfortunately, the cost also grows with n2, it is, hence, found not suitable to be employed for large systems. The hypercube structures in this regard achieve a good balance as far as cost / performance (communication speed) ratio is considered. This topology results in an average delay that does not grow' proportionally to n.
Hypercube is a binary «-cube architecture in which an «-cube consists of N = 2" nodes that are connected in an «-dimensional cube with twro nodes per dimension. Each node is represented by small circle that consists of a processor, local memory and communication circuits. Each processor P; has bi-directional links to « other neighbouring processors; these links actually form the edges of the hypercube. (Hayes, J. P., et al.). In an «-dimensional hypercube, each node is directly connected to immediately adjacent « neighbours. The cube's side is of length 1, so each co-ordinate is either a 0 or a 1. Moving from any arbitrary vertex to any of its adjacent vertices changes exactly one co-ordinate, so adjacent vertices differ in exactly one position. This is illustrated in Figure 10.12a. A set of 2" distinct «-bit binary addresses can be assigned to the processors in such a way that the address of P, differs from that of each of its neighbours in exactly 1 bit. Figure 10.12b illustrates a 3-cube hypercube with 8 nodes. A 4-cube can be formed by interconnecting the corresponding nodes of twro 3-cubes as showm in Figure 10.12c. The node degree of an и-cube equals «.
The diameter of the и-cube hypercube is the distance betw'een the two furthest points w'hich is also equal to «. The diameter of an «-node hypercube is log2« compared to 2« for grid or mesh (discussed later) network. For example, when « = 1024, the diameter drops from 2048 (2и = 2 x 1024 = 2048) for grid or mesh to 10 (log, 1024 = 10) for hypercubes, thereby resulting a sharp decrease in the mean delay that appears to be significant in favour of hypercube topology.
(a) 2-cube two-dimensional; (b) 3-cube three-dimensional; (c) 4-cube formed by interconnecting two 3-cubes (four-dimensional).
Since alternate paths are available at each node, routing messages through the hypercube is particularly easy. If the node Nx wishes to communicate with node Ny it can proceeds in many ways. In general, an «-dimensional hypercube has n routing functions, defined by each bit of the «-bit addresses. One of them is as follows; the binary addresses of the source, x, and the destination y, are compared from least to most significant bits. Suppose that the nodes differ first in position p. Node Nx then sends the message to its neighbour whose address k, differs from x in bit position p. Node Nk then forwards the message to the appropriate neighbour using the same address comparison scheme. The maximum distance that any message needs to travel in an «-dimensional hypercube is « hops.
Hypercube has then constantly enhanced making use of contemporary technological developments, incorporating many useful features to make it more attractive, and thus passed through different generations with passing days. It has been used in Intel iPSCs with a seven-dimensional cube to connect up to 128 (= 27) nodes. NCUBE machines in NCUBE/10 had up to 1024 (= 210) nodes in ten-dimensional cube. Thinking Machine Corporation implemented hypercube networks in their popular CM-2 machine. But, due to poor scalability and difficulty in packaging higher-dimensional hypercubes, and above all, with the emergence of more attractive alternatives, the hypercube networks gradually lost much of their popularity in the early 1990s, and were finally replaced by other emerging better architectures.
A brief detail of the advantages of hypercube is given on the website: http://routledge. com/9780367255732.
Mesh and Torus
One of the most natural elegant ways of interconnecting a large number of nodes is by means of a mesh. A 4 x 4 mesh network with 16 nodes is shown in Figure 10.13a. The links between the nodes here also are bi-directional. Routing of a message in a mesh network can
(a) Mesh, (b) Illiac 4 x 4mesh, (c) Torus, and (d) Systolic array.
be carried out in several different ways. One of the simplest and most effective approaches is to choose the path between any source node Nx and a destination node Nv such that the transfer first takes place in the horizontal direction from Nx towards N„. When the column in which Nlf resides is arrived at, the transfer then starts to proceed vertically along this column. Examples of mesh-based multiprocessors are Intel Paragon (2-dimensional) replacing its hypercube predecessors, MPP, CM-2, DAP, Tandem Himalaya Series (1993) etc., and many other experimental machines.
Mesh-based architectures have a lot of variations by way of allowing wraparound connections. The Illiac IV assumed a 4 x 4 Illiac mesh with a constant node degree of 4, and a diameter of 7 as shown in Figure 10.13b. The Illiac mesh is topologically equivalent to a chordal ring of degree 4.
If the wraparound connection is made between the nodes at the opposite ends as shown in Figure 10.15c, the result is a network that consists of a set of bi-directional rings in the X-direction, connected by a similar set of rings in the Y-direction. This is called a Torus. This topology combines the ring and mesh, and can be extended to higher dimensions. In general, an nxn binary Torus has a node degree of 4, and a diameter of 2 Ln/2j which is one half of that of the mesh. This results in a considerable reduction in the average latency while information is transferred, but at the cost of greater complexity. Such a network has been found in use in Fujitsu's AP 3000 machines. Torus of higher dimension such as a three-dimensional Torus is observed in use in Cray MPP T3E multiprocessor.
This is especially a class of multidimensional pipelined array architectures designed for implementing fixed algorithms. In Figure 10.13d what is shown is a systolic array specially designed for performing matrix multiplication. In this example, the interior node degree is 5. Static systolic arrays, in general, are pipelined with multidirectional flow of data streams. With fixed interconnections and synchronous operation, a systolic array matches the communication of the underlined algorithm. For special applications like signal/image processing, the systolic arrays may offer a better performance / cost ratio. However, its structure has limited applicability, and can be very difficult to program. Still, systolic array topology may be used in building VLSI array processors, and also have been found in use in commercial machine Intel iWarp system.
One of the simplest topologies of interconnection networks is a ring which is obtained by connecting the two terminal nodes of a linear array with one extra link. This is illustrated in Figure 10.14a. The links used in a ring of N nodes may be unidirectional, and the diameter would be then N. When the link being used is bi-directional, the diameter is [N/2j. Ring is symmetric in topology with a constant node degree of 2. Ring is easy to construct and implement. The link of a node is kept sufficiently wide to accommodate a complete packet in parallel for both of its neighbours.
The token ring of IBM has a topology in which a message circulates along the ring through the nodes until it reaches the destination with a matching token. Whether pipelined or packet-switched, rings have been used as an interconnect in CDC multiprocessors and in the KSR-1 and KSR-2 computer systems from Kendal Square Research. Exemplar V2600 from Hewlett-Packard also uses ring networks.
It is not desirable to construct a long ring to connect many nodes because the latency in transfer would be unacceptably large. Rings can also be used as building blocks in other topologies such as hypercubes, meshes, trees and fat trees. When rings are used at any level of these topologies, the result is a hierarchy of rings that looks lucrative for communication but there may be a severe bottleneck at the highest-level ring.
Various forms of rings can be obtained by increasing the node degree. If the node degree can be increased from 2 to 3,4, and even more, we obtain different chordal rings. One such chordal ring with node degree 3 is shown in Figure 10.14b. If the extra links of a node in a ring go to those nodes having a distance equal to an integer power of 2, we obtain a ring known as barrel shifter. This implies that a node i is connected to j, if j-i =2k for к = 0, 1,2, ...,«-1 in a network having nodes N = 2". The node degree of a barrel shifter moderately increases with a notable decrease in diameter when compared to a chordal ring, and
Ring-based networks, (a) Ring, and (b) Chordal ring.
is not too complex to implement. When more links are added, the complexity increases, but higher the node degree, shorter the network diameter and that ultimately results in reduction of latency. In the extreme, in the ring having a completely connected network (each node is directly connected to all other nodes in the ring) consisting of N nodes, the node degree is the maximum, that is N - 1, with the shortest possible diameter of 1 but it would be most complex and equally costly to implement.
Tree and Star
Tree topology is another form of interconnection networks that is hierarchically structured. Tree may be of various types depending on the degree of its intermediate nodes but only one path at a time can be established through a given node in the tree. Binary tree in this regard has special importance when the communication aspect is concerned. A complete binary balance tree of /с-level can accommodate a total number of nodes N = 2k - 1. The maximum node degree is 3 and the diameter is 2(к - 1). Figure 10.15a illustrates a specimen of a binary tree. Figure 10.15b, however, shows a multi-way tree. Networks realized in the form of a tree work satisfactorily if most of the communications occur with nodes having close proximity, otherwise the root node may become a bottleneck. A tree structure having constant node degree can be attributed as scalable architecture; binary tree is such a tree, although its diameter is comparatively long, but at the same time easier to implement.
The star is basically a tree of two-level as shown in Figure 10.15c. The node degree is very high which is d = N - 1 for a star having N nodes. The star structure is best suited in systems having centralized architecture with a supervisor (master) node.
Tree-based networks, (a) Binary tree, (b) Multiway tree, (c) Star, (d) Binary fat tree, and (e) Multiway fat tree.
The drawbacks of a (binary) tree having the possibility of a bottleneck at the upper levels have been reduced by way of increasing the number of links in the upper levels of a tree. The modified tree is known as fat tree, as introduced by Leiserson in 1985 in which each node in the tree (except at the top level) has more than one parent. Two types of fat tree are shown in Figure 10.15d and e that indicates the channel width of a fat tree increases as one traverse from leaves to the root. The fat tree structure-wise resembles a natural tree in which branches get thicker towards the root. A fat tree structure has been applied in Connection Machine CM-5 from Thinking Machine Corporation replacing the previous hypercube implementation which was used in its earlier version CM-2.
Transputer is a tiny RISC-type microprocessor chip used as a building block to realize interconnection networks for parallel computers. It consists of a small processor having an extremely reduced instruction set consisting of only 16 instructions, like LOAD, STORE, ADD etc., and all instructions are 1 byte long having 4 bits opcode for each of 16 (24 = 16) instructions. It contains a total of six 32-bit registers, namely, the program counter, a stack pointer, an operand register, and a register file of 3 registers, an on-chip RAM of 4K bytes (depending on model), and four pairs of simplex (i.e. unidirectional) I/O channel interfaces. External memory can also be added to increase its capacity. Simple message-passing hardware protocol is used for interprocessor communications that provides a very small delay of about 6 ps for short messages of 32-bits, and a fairly high bandwidth (megabytes / sec) for long messages. The INMOS Transputer chip is such a compute-communication microprocessor. This design is much attractive from a packaging point of view due to its requirement of only a small number of pins (8 pins for 8 links) that can be packed tightly on a printed circuit board. Since most networks have a node degree less than 4, networks which have a constant node degree of 4 or less can fairly select a transputer (such as T800) as its building block.
A brief detail of the transputer with the figure is given on the website: http://routledge. com/9780367255732.
Comparison of Static Networks
Although most of the merits and drawbacks of each topology have been mentioned in their respective discussions, a few points still have been summarized here for a better understanding of static networks.
Low-dimensional networks such as mesh and ring being characterized by point-to-point links connecting adjacent nodes can operate better under non-uniform loads because they can have more resource (adjacent wires) sharing. These nodes can be driven at high clock rates due to having lower average block latency yielding a higher maximum throughput. Packet switching mechanism is employed using store-and-forward method. Ring network, by virtue of its topology, possesses broadcast capability, while broadcasting in a mesh network is more difficult due to its inherent topological limitations. Both work well in multiprocessors of smaller configuration, and have almost uniform scalability without difficulty. However, incremental expansion is comparatively limited in a ring than in a mesh network, but a ring is simpler to implement. Mesh systems are, however, found to be suitable for use in both small and in very large systems, while ring can support at best a moderate size.
In a high-dimensional network such as in binary n-cube (hypercube), all connecting wires are dedicated to a particular dimension, and cannot be shared between dimensions. This may give rise to uneven traffic load on certain wires when the others lying in different dimensions are relatively idle. Circuit switching mechanism is employed using a wormhole method (a hardware routing) in order to minimize communication delay caused by the increase in its diameter. The presence of a higher number of links affects the network cost, and creates difficulty in packaging. However, for larger multiprocessors, this network can be used when each of its nodes can be connected to a smaller multiprocessor having dedicated I/Os and / or using shared I/Os that can operate in parallel to satisfy massive computation demand. This gives rise to an architecture known as multi-multiprocessor.
Hybrid (Mixed Topology) Networks
Several different interconnection network topologies have been already discussed along with their merits and drawbacks. In quest of more enhanced performance than what a particular topology could offer, and that too at a competitive price, the designers of multiprocessors and of MPP (massively parallel processing) machines were inclined to develop a mixed topology to obtain a relatively superior performance. This mixed topology would be realized taking useful features and characteristics of different topologies with an appropriate blending in implementation. For low-end multiprocessors, bus and crossbar could be individually chosen as one member in the mixed topology to build an interconnection network. A cluster of processors (2-8) could be connected with each other using a bus or a crossbar to form a node; these nodes can then be interconnected using another suitable topology to form a larger system.
Hewlett-Packard's Exemplar V2600 system uses nodes where each node is made up of multiple processors being connected by a bus. These nodes are then interconnected using a ring network. Data General's AV2500 system follows the same approach using the same type of topology. Alliant FXI2800 system follows the same approach using a different type of topology. Compac's AlphaServer SC uses nodes of multiple processors connected by a crossbar switch and these nodes are then interconnected using a fat tree network.
Various sizes of multiprocessors ranging from relatively smaller to comparatively larger are found in market, demanded by users engaged in processing diverse spectrum of applications. Many machines in the market have about 4-128 processors. For very small systems, say, up to 16 processors, the most effective and reasonable choices are a shared bus or a crossbar switch (dynamic interconnection networks). For moderate size of multiprocessors, the obvious choice is mesh networks. For handling massively parallel processing (MPP) systems, some of the static topologies (n-cube) are more scalable in some specific applications. For general-purpose computing, large-scale MINs or crossbar networks are suitable, and may become more viable depending on the advancement in electronics technology (Siegel, H. J.).
Important Characteristics of a Network
Several different network topologies with their own merits and drawbacks are involved in building various types of interconnection networks used in a multiple-processor system. The selection of a particular interconnection network is very critical to decide, mostly depending on the environment where it will be used, along with fulfilment of certain fundamental objectives and requirements. The other salient features present in a particular topology determine its superiority over others in practical situation. Different topologies, however, attain different levels of certain underlying characteristics (Feng, T. Y., et al.). Some of the important characteristics are: (i) network speed and its throughput; (ii) the routing mechanism including broadcast and multicast facilities; (iii) scalability; (iv) fault tolerance; (v) reliability; (vi) ease of implementation and related costs.
However, topological equivalence has been established among a number of network architectures. The most important for an architecture to survive in future systems is mostly its packaging, efficiency, and scalability to allow modular growth.
A brief explanation of each of these characteristics is given on the website: http:// routledge.com/9780367255732.