
Floating-Point Arithmetic

Let X and Y be two floating-point numbers expressed as (X_S, X_E) and (Y_S, Y_E), respectively, so that the numerical value of X is X_S × B^(X_E) and that of Y is Y_S × B^(Y_E). The discussion rests on a few realistic assumptions:

  • X_S is an n_S-bit two's complement or sign-magnitude binary fraction;
  • X_E is an n_E-bit integer in excess-2^(n_E−1) code, implying an exponent bias of 2^(n_E−1). [1]
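To make the (significand, exponent) pair concrete, the following minimal sketch splits a number into its fields and rebuilds its value. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits in excess-127 code, 23 fraction bits with a hidden leading 1), which matches the excess-127 rules used later in this section rather than the generic excess-2^(n_E−1) code above. The function names are illustrative only.

```python
import struct

def decode_binary32(x):
    """Round x to single precision and return (sign, biased_exp, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    biased_exp = (bits >> 23) & 0xFF  # 8 bits, excess-127 (assumed format)
    fraction = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, biased_exp, fraction

def value_of(sign, biased_exp, fraction):
    """Value of a normalized number: (-1)^sign * 1.fraction * 2^(biased_exp - 127)."""
    significand = 1 + fraction / 2**23   # make the hidden bit explicit
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127)

fields = decode_binary32(-6.5)   # -6.5 = -1.625 * 2^2
print(fields)
print(value_of(*fields))
```

Zero, subnormal numbers, infinities, and NaNs use reserved exponent codes and are deliberately not handled by `value_of`.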


Rules for Basic Operations Involving Floating-Point Operands


Addition and subtraction:

  • 1. Check both numbers X and Y for zero.
  • 2. To equalize the exponents of X and Y, take the number with the smaller exponent and shift its significand right by a number of positions equal to the difference between the two exponents.
  • 3. Set the exponent of the shifted number equal to the larger exponent.
  • 4. Add or subtract the significands and determine the sign of the result.
  • 5. Normalize the result, if necessary.


In multiplication and division operations, no alignment of mantissas is needed.

Multiplication:

  • 1. Add the exponents and subtract 127.*
  • 2. Multiply the significands and determine the sign of the result.
  • 3. Normalize the result, if necessary.


Division:

  • 1. Subtract the exponents and add 127.*
  • 2. Divide the mantissas and determine the sign of the result.
  • 3. Normalize the result, if necessary.

* In the multiply and divide rules, 127 is added or subtracted because the exponents are stored in excess-127 notation.

Basic operations: General methods for floating-point addition, subtraction, multiplication, and division are given in Table 7.1. For addition and subtraction, both operands must have the same exponent value, which may require shifting the radix point of one operand to achieve alignment. Multiplication and division are relatively simple, because the significands (mantissas) and exponents can be processed independently. Floating-point operations normally produce expressible results, but at times they give rise to one of the following situations:

i. Significand underflow occurs when, while significands are being aligned, digits flow off the right end of the significand. Some form of rounding is then required, as explained later;

ii. Significand overflow occurs when the addition of two significands of the same sign produces a carry out of the most significant bit. This can be fixed by realignment, as explained later;

iii. Exponent underflow occurs when a negative exponent becomes less than the minimum exponent value allowed by the format (e.g. -145 is less than -127). The number is then too small to be represented and may be treated as 0; and

iv. Exponent overflow occurs when a positive exponent exceeds the maximum exponent value allowed by the format. Some systems designate the result as +∞ or -∞.
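Three of these situations can be observed directly with ordinary Python floats. This is a small illustration only, and it assumes that Python floats are IEEE 754 double-precision values (53-bit significand), as they are on common platforms; the variable names are illustrative.

```python
import math
import sys

# Significand underflow during alignment: 1.0 must be shifted right past the
# 53-bit significand of 2**53, so it is lost when the result is rounded.
big = 2.0 ** 53
lost = (big + 1.0) == big

# Exponent overflow: the product exceeds the largest finite double,
# so the result is designated as +infinity.
overflowed = sys.float_info.max * 2.0

# Exponent underflow: the product is far below the smallest representable
# magnitude, so the result is treated as 0.
underflowed = sys.float_info.min * 2.0 ** -60

print(lost, math.isinf(overflowed), underflowed)
```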

Addition and Subtraction

Floating-point addition and subtraction are relatively complex, since the exponents of the two input operands must be made equal before the corresponding significands can be added or subtracted. Following the floating-point format already described, the two operands are first placed in the respective registers within the ALU. The floating-point format includes an implicit bit in the significand, and that bit must be made explicit when the operation is executed. The procedure for addition and subtraction is summarized in Table 7.1.

During addition/subtraction, if the signs of the two numbers are the same, significand overflow is possible, and correcting it may in turn cause exponent overflow. In that case the condition must be reported, and the operation may have to be halted so that corrective action can be taken. After addition/subtraction, the result may need to be normalized, which may cause exponent underflow. Again, suitable action must be taken to resolve the situation.
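The addition/subtraction steps can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes a sign-magnitude triple (sign, biased exponent, significand) with an excess-127 exponent and a 24-bit significand whose hidden bit is already explicit, it truncates instead of rounding, it treats exponent codes 1 to 254 as the valid range, and it flushes underflow to zero. The names `fp_add`, `BIAS`, and `P` are hypothetical.

```python
BIAS = 127   # excess-127 exponent (assumed)
P = 24       # significand width with the hidden bit made explicit (assumed)
ZERO = (0, 0, 0)

def fp_add(x, y):
    """Add two (sign, biased_exp, significand) triples; truncates, no rounding."""
    (sx, ex, mx), (sy, ey, my) = x, y
    # Step 1: check for zero operands.
    if mx == 0:
        return y
    if my == 0:
        return x
    # Steps 2-3: shift the significand with the smaller exponent to the right;
    # bits that fall off the right end are lost (significand underflow).
    if ex < ey:
        (sx, ex, mx), (sy, ey, my) = (sy, ey, my), (sx, ex, mx)
    my >>= ex - ey
    # Step 4: add the signed significands and determine the sign.
    total = (-mx if sx else mx) + (-my if sy else my)
    if total == 0:
        return ZERO
    sign, m, e = (1 if total < 0 else 0), abs(total), ex
    # Step 5: normalize.
    while m >= 1 << P:          # significand overflow: shift right, raise exponent
        m >>= 1
        e += 1
    while m < 1 << (P - 1):     # leading zeros: shift left, lower exponent
        m <<= 1
        e -= 1
    if e > 254:
        raise OverflowError("exponent overflow")
    if e < 1:
        return ZERO             # exponent underflow: treat the result as 0
    return sign, e, m

a = (0, 127, 0xC00000)   # +1.5
b = (0, 126, 0xC00000)   # +0.75
print(fp_add(a, b))      # significand overflow is corrected during normalization
```

Subtraction is obtained by flipping the sign bit of the second operand before calling `fp_add`.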

A typical flowchart for performing addition/subtraction, incorporating all the activities listed in Table 7.1, along with a solved example, is given on the website: http://

Implementation: Floating-Point Unit

A floating-point arithmetic unit can be built by connecting two loosely coupled fixed-point arithmetic circuits, one used as an exponent unit and the other as a significand (mantissa) unit. Because the significand unit must perform all four basic arithmetic operations on the significands, a conventional fixed-point arithmetic circuit (described earlier) can be used for this purpose. The exponent unit is a simpler circuit that only needs to add, subtract, and compare the exponents of the input operands. Exponents can be compared either with a comparator or by subtracting them. With this idea, a schematic structure of a floating-point unit can be built along the lines of the illustration shown in Figure 7.18.

The exponents of the input operands are loaded into registers E1 and E2, which are connected to an adder that computes E1 + E2. The comparison of exponents required for addition and subtraction is made by computing E1 − E2 (that is, E1 + (−E2), which is itself an addition) and placing the result in a counter E. The larger exponent is then determined from the sign of E. The shift of one of the significands (mantissas) required before the significands are added or subtracted is controlled by E: the magnitude of E is decremented step by step to zero, and after each decrement the corresponding significand in the significand unit is shifted by one digit position. Once the significands are aligned (that is, when E reaches 0 and the exponents of X and Y are equal), they are processed in the usual manner for the arithmetic operation required. The exponent of the result is also computed and placed in E.

Figure 7.18: Schematic block diagram of a floating-point arithmetic unit.
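The counter-driven alignment described above can be sketched in software. This is only a behavioral model under stated assumptions: exponents and significands are plain integers, the shift is one binary position per step, and the function name `align` is hypothetical.

```python
def align(e1, m1, e2, m2):
    """Align two significands by counting |E1 - E2| down to zero."""
    counter = e1 - e2                          # E := E1 - E2
    result_exp = e1 if counter >= 0 else e2    # sign of E selects the larger exponent
    steps = abs(counter)
    while steps > 0:
        if counter >= 0:
            m2 >>= 1      # operand 2 has the smaller exponent: shift it right
        else:
            m1 >>= 1      # operand 1 has the smaller exponent: shift it right
        steps -= 1        # one shift per decrement of the counter
    return result_exp, m1, m2

print(align(130, 0xC00000, 127, 0xC00000))
```

A hardware unit would typically replace the loop with a barrel shifter when speed matters; the loop mirrors the sequential scheme in the text.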

All computers provide fixed-point as well as floating-point arithmetic instructions, so a single unit within the ALU that executes both types is desirable in principle. However, since faster and cheaper electronic technology is now readily available, most computer systems incorporate separate units: one dedicated to fixed-point integer operations (FXU) and another to floating-point operations (FPU). Keeping these two units separate within the ALU allows fixed-point and floating-point instructions to execute in parallel.

Multiplication and Division

Multiplication and division are simpler than addition and subtraction, in that no alignment of significands (equalization of the exponents) is needed. As usual, the input operands here are represented in two's complement form.

In multiplication, if either operand is 0, the result is declared to be 0. The next step is to add the exponents. If the exponents are stored in biased form, their sum contains twice the bias value, so the bias must be subtracted from the sum. The result may give rise to exponent overflow or underflow, which must be reported and ends the operation. If the exponent of the product lies within the specified range, the next step is to multiply the significands of the input operands, taking their signs into account, as is done for integer multiplication (described earlier). The product is twice the length of the multiplier or the multiplicand, whichever is longer, and the extra bits may be lost when the result is rounded. The product is then normalized and rounded, if required. Normalization may sometimes lead to exponent underflow, in which case appropriate action must be taken to resolve the situation.
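The multiplication procedure can be sketched as follows. As a simplifying assumption, operands are sign-magnitude triples (sign, biased exponent, significand) with an excess-127 exponent and a 24-bit significand with the hidden bit explicit, rather than two's complement values; the extra product bits are truncated instead of rounded, and the exponent range is checked once, after normalization. The names `fp_mul`, `BIAS`, and `P` are hypothetical.

```python
BIAS = 127   # excess-127 exponent (assumed)
P = 24       # significand width with the hidden bit made explicit (assumed)

def fp_mul(x, y):
    """Multiply two (sign, biased_exp, significand) triples; truncates, no rounding."""
    (sx, ex, mx), (sy, ey, my) = x, y
    # A zero operand gives a zero result immediately.
    if mx == 0 or my == 0:
        return (0, 0, 0)
    # The sum of two biased exponents carries the bias twice: subtract it once.
    e = ex + ey - BIAS
    sign = sx ^ sy
    # The product of two P-bit significands is up to 2P bits long.
    product = mx * my
    # Normalize: the product of two values in [1, 2) lies in [1, 4).
    if product >= 1 << (2 * P - 1):
        m = product >> P          # value was in [2, 4): shift one extra position
        e += 1
    else:
        m = product >> (P - 1)    # value was already in [1, 2)
    if e > 254:
        raise OverflowError("exponent overflow")
    if e < 1:
        return (0, 0, 0)          # exponent underflow: treat the result as 0
    return sign, e, m

print(fp_mul((0, 127, 0xC00000), (0, 128, 0xA00000)))   # 1.5 * 2.5
```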

Division proceeds along much the same lines as multiplication. Here too, the test for 0 is carried out first. If the divisor is 0, an error is declared, or the result may be set to infinity, depending on the particular implementation. If the dividend is 0, the final result is 0. The next step is to subtract the divisor exponent from the dividend exponent. This subtraction removes the bias, which must be added back in, and that addition may cause exponent overflow. The exponent is therefore tested for underflow or overflow, and any such condition is reported. The next step is to divide the significands of the two input operands. Finally, the result goes through the usual process of normalization and, if needed, rounding.
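A matching sketch for division is given below, under the same assumptions as before: sign-magnitude triples (sign, biased exponent, significand), excess-127 exponent, 24-bit significand with the hidden bit explicit, and truncation instead of rounding. Division by zero is reported as an error here, which is one of the two options mentioned above. The names `fp_div`, `BIAS`, and `P` are hypothetical.

```python
BIAS = 127   # excess-127 exponent (assumed)
P = 24       # significand width with the hidden bit made explicit (assumed)

def fp_div(x, y):
    """Divide x by y, both (sign, biased_exp, significand); truncates, no rounding."""
    (sx, ex, mx), (sy, ey, my) = x, y
    # Zero tests come first.
    if my == 0:
        raise ZeroDivisionError("floating-point division by zero")
    if mx == 0:
        return (0, 0, 0)
    # Subtracting biased exponents cancels the bias: add it back in.
    e = ex - ey + BIAS
    sign = sx ^ sy
    # Divide the significands; pre-shift the dividend to keep P - 1 fraction bits.
    q = (mx << (P - 1)) // my
    # Normalize: the quotient of two values in [1, 2) lies in (0.5, 2).
    if q < 1 << (P - 1):
        q = (mx << P) // my       # quotient was below 1: keep one more bit
        e -= 1
    if e > 254:
        raise OverflowError("exponent overflow")
    if e < 1:
        return (0, 0, 0)          # exponent underflow: treat the result as 0
    return sign, e, q

print(fp_div((0, 128, 0xF00000), (0, 128, 0xA00000)))   # 3.75 / 2.5
```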

Two typical flowcharts, one for multiplication and one for division, incorporating all the activities listed in Table 7.1, are shown on the website:

Implementation: Floating-Point Multiplication

A multiplier circuit can be implemented using a multistage CSA circuit (described earlier). This circuit is popularly known as a Wallace tree, after its inventor (Wallace 1964). The inputs to the adder tree are n terms of the form M_i = x_i Y 2^i. Here, M_i represents the multiplicand Y multiplied by the i-th multiplier bit x_i, weighted by the appropriate power of 2. Suppose each M_i is 2n bits long and a full double-length product is required; the desired product is then

P = Σ_{i=0}^{n−1} M_i.

This sum is computed by the CSA tree, which produces a 2n-bit sum word and a 2n-bit carry word. The final carry assimilation is then usually performed by a fast adder, a CLA, for instance, with normal internal carry propagation.
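The carry-save reduction can be modeled in a few lines. This is a behavioral sketch for unsigned operands only: it groups the partial products M_i three at a time in list order, which is one possible arrangement and not necessarily the exact wiring of a Wallace tree, and it uses ordinary integer addition in place of the final CLA. The function names are illustrative.

```python
def carry_save_add(a, b, c):
    """3-to-2 reduction: a + b + c == sum_word + carry_word, with no carry propagation."""
    sum_word = a ^ b ^ c                             # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # carries, moved one position left
    return sum_word, carry_word

def csa_multiply(x, y, n):
    """Multiply two n-bit unsigned integers by carry-save reduction of partial products."""
    # Partial products M_i = x_i * Y * 2^i for i = 0 .. n-1.
    terms = [(y << i) if (x >> i) & 1 else 0 for i in range(n)]
    # Each level turns every group of three words into two words.
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3
        next_level = []
        for i in range(0, full, 3):
            next_level.extend(carry_save_add(*terms[i:i + 3]))
        next_level.extend(terms[full:])   # one or two leftover words pass through
        terms = next_level
    # Final carry assimilation: a single carry-propagating addition.
    return sum(terms)

print(csa_multiply(13, 11, 4))
```

Because each level only performs bitwise operations, its delay does not depend on the word length; carry propagation happens once, in the final addition.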

A brief description of this topic, along with a figure, is given on the website: http://

  • [1] B is the base, which here is equal to 2. The above assumptions also hold for Y. In addition, it is assumed that floating-point numbers are stored in normalized form only, so the final result of each floating-point operation should also be normalized.