


Precision Considerations: Guard Bits
Several important aspects are associated with the implementation of floating-point operations and their representations. Although the significands (mantissas) of the initial operands and the final result are limited to a specified number of bits (e.g. 24 bits for single format, including the implicit leading 1), it is often necessary to have some extra bits to accommodate the results of the intermediate steps of an arithmetic operation. Fortunately, the ALU registers that hold the exponent and significand of each operand before and after a floating-point operation are longer than the significand plus its implied bit. The register thus automatically provides additional bits for the operands as well as for the final result; these retained extra bits are called guard bits, and they help to realize the maximum accuracy in the final result.

Truncation

When generating a result, it is often required to remove the guard bits from the extended significand by chopping it off to a specified length that approximates the longer version. Removing these bits while making no changes to the retained bits is commonly known as truncation. Truncation and its significant impact on the final result are discussed on the website: http://routledge.com/9780367255732.

Rounding: IEEE Standard

Rounding is essentially a variant of truncation that also disposes of the guard bits (extra bits) when a result is put into a specified format. Several rounding methods are in common use, including Von Neumann rounding, ordinary rounding, and the rounding specified in the IEEE 754 floating-point standard as the default mode. The truncation methods specified in the IEEE standard are referred to here as rounding modes; four alternative modes are defined, namely round to nearest (even), round toward +∞, round toward −∞, and round toward zero.
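The four IEEE rounding modes listed above have direct analogues in Python's decimal module, which can be used to see how they differ on the same value. This is an illustrative sketch in decimal arithmetic, not the binary hardware path itself; the value 2.345 is our own example, chosen because it is an exact halfway (tie) case.

```python
# Demonstrating four rounding modes analogous to the IEEE 754 modes
# (round to nearest/even, toward +inf, toward -inf, toward zero)
# using Python's decimal module on an exact halfway case.
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

modes = {
    "round to nearest (even)": ROUND_HALF_EVEN,
    "round toward +inf":       ROUND_CEILING,
    "round toward -inf":       ROUND_FLOOR,
    "round toward zero":       ROUND_DOWN,
}

# A positive and a negative tie case show how the modes disagree.
for value in (Decimal("2.345"), Decimal("-2.345")):
    for name, mode in modes.items():
        # Keep two fractional digits; the third digit (5) must be disposed of.
        print(value, name, value.quantize(Decimal("0.01"), rounding=mode))
```

Note that round-to-nearest-even sends both ties to the even last digit (2.34 and −2.34), while the directed modes differ in sign-dependent ways.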
When the guard bits are not present, or are removed by truncation (or rounding) at each intermediate step of a computation, the error that creeps into the final result may be appreciably high. The guard bits and the truncation (rounding) scheme specified in the IEEE floating-point standard are therefore designed to keep the error within half a unit in the LSB position of the final result. In general, this requires a rounding scheme in which only three guard bits are carried along during the computation of each intermediate step. The first two of these three bits are the two most significant bits of the portion of the significand to be removed. The third bit is the logical OR of all the bits beyond these first two in the full representation of the significand. From an implementation point of view, this bit is relatively easy to maintain during the intermediate computational steps: it is initially set to 0, and if a 1 is ever shifted out through this position, the bit becomes 1 and retains that value. That is why this bit is called the sticky bit. Details of the different types of rounding stated above, including the truncation methods specified in the IEEE standard, are given on the website: http://routledge.com/9780367255732.

Infinity, NaNs, and Denormalized Numbers: IEEE Standards

IEEE 754 has not only defined and described the various rounding modes already explained, but has also formulated many other aspects of floating-point arithmetic, including the procedures to be followed, so that a uniform and predictable result can be obtained whatever hardware platform is used for execution. The present focus is on three such important aspects introduced by the IEEE standard, namely infinity, NaNs, and denormalized numbers.
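The three-bit rounding scheme described above (two guard bits plus a sticky bit) can be sketched as follows. The function name and the bit-string interface are our own illustration, not something defined in the standard; the decision logic, however, is exactly the round-to-nearest-even rule: round up if the dropped portion exceeds half a unit, round to the even retained value on an exact tie, and truncate otherwise.

```python
def round_nearest_even(significand_bits: str, keep: int) -> int:
    """Round a binary significand (a string of '0'/'1') to `keep` bits
    using two guard bits and a sticky bit, per round-to-nearest-even.
    Returns the rounded retained bits as an integer.
    (Illustrative sketch; interface is our own, not from IEEE 754.)"""
    retained = int(significand_bits[:keep], 2)
    dropped = significand_bits[keep:]
    guard  = len(dropped) > 0 and dropped[0] == "1"   # 1st bit of removed portion
    rnd    = len(dropped) > 1 and dropped[1] == "1"   # 2nd bit of removed portion
    sticky = "1" in dropped[2:]                       # OR of all remaining bits
    if guard and (rnd or sticky):
        retained += 1                 # dropped part > half a ulp: round up
    elif guard and not rnd and not sticky:
        retained += retained & 1      # exact tie: round to the even value
    return retained                   # else: truncate (dropped part < half a ulp)
```

For example, `round_nearest_even("10111000", 4)` is a tie on an odd retained value 1011, so it rounds up to 1100 (decimal 12), while `"10101000"` ties on the even value 1010 and stays at 10.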
Infinity: Real arithmetic treats infinity as a limiting case that arises as the result in certain situations, obeying the ordering:

−∞ < (any finite number) < +∞

Any arithmetic operation involving infinity, except for some special cases discussed later, yields the obvious result; for example, if x is any finitely expressible number, then x + (+∞) = +∞ and x ÷ (+∞) = 0. The other cases in this category involving ∞ are essentially NaNs.

NaN: A NaN is a special value, encoded in floating-point format, that is generated as the result of an invalid operation. NaNs are of two types in general: (i) signalling NaNs and (ii) quiet NaNs. A signalling NaN conveys (signals) an exception whenever an invalid operation involving an operand is attempted. Signalling NaNs are accompanied by values that lie beyond the domain prescribed by the IEEE standard. A quiet NaN, on the other hand, propagates smoothly through almost every arithmetic operation without raising any exception; operations such as (+∞) − (+∞), 0 × ∞, and 0 ÷ 0 produce a quiet NaN. Because a quiet NaN moves along unnoticed, it can be hazardous: it may give rise to situations that are sometimes fatal. Although the IEEE 754 standard provides the same general format for both types of NaNs, the precise representations of the two kinds are implementation-specific, so that the system can uniquely identify them and appropriately handle the numerous exception conditions.

Denormalized numbers: The normalization process is executed compulsorily to generate normalized numbers in any floating-point arithmetic operation following the IEEE 754 standard. However, if only normalized numbers are used, there exists a sizeable gap between the smallest normalized number and 0 (Figure 7.17). In the single (32-bit) format under IEEE 754, there are 2^{23} representable numbers in each interval, and the smallest representable positive normalized number is 2^{-126}.
If denormalized numbers are included in this format, an additional 2^{23} − 1 numbers can be uniformly added between 0 and 2^{-126}. A denormalized representation has an exponent field of 0, and there is no assumed leading 1 before the binary point in the fractional part f. For denormalized numbers, the exponent value is 1 − bias instead of 0 − bias, where bias = 2^{k-1} − 1 and k is the number of bits in the exponent field. Therefore, the value of a denormalized positive single-precision number is f × 2^{-126}. For example, the largest denormalized 32-bit (single-precision) number is (1 − 2^{-23}) × 2^{-126}, and the smallest denormalized 32-bit (single-precision) number is 2^{-23} × 2^{-126} = 2^{-149}.
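The special values discussed in this section can be checked directly in Python: on CPython, floats follow IEEE 754 double precision, and the standard struct module can decode 32-bit single-precision bit patterns. This is a sketch; the helper name `f32_from_bits` is our own.

```python
# Checking infinity, NaN, and the denormalized (subnormal) single-precision
# extremes described above, using Python floats and the struct module.
import math
import struct

inf = math.inf
x = 5.0                               # any finite number

print(x + inf)                        # inf: finite + infinity -> infinity
print(-inf < x < inf)                 # True: -inf < (any finite number) < +inf
print(math.isnan(inf - inf))          # True: inf - inf is invalid -> NaN
print(math.isnan(0.0 * inf))          # True: 0 x inf -> NaN
nan = float("nan")
print(nan == nan)                     # False: a NaN compares unequal even to itself

def f32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Smallest subnormal: f = 2^-23, value = 2^-23 * 2^-126 = 2^-149.
print(f32_from_bits(0x00000001) == 2.0 ** -149)
# Largest subnormal: f = 1 - 2^-23, value = (1 - 2^-23) * 2^-126.
print(f32_from_bits(0x007FFFFF) == (1 - 2.0 ** -23) * 2.0 ** -126)
# Smallest normalized single, for comparison: 1.0 * 2^-126.
print(f32_from_bits(0x00800000) == 2.0 ** -126)
```

Note that 0.0 / 0.0 on Python floats raises ZeroDivisionError rather than returning a quiet NaN, because Python traps that invalid operation at the language level.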
Denormalized numbers are therefore equally useful, and are included in this standard mainly to handle cases of exponent underflow. When the exponent of a result becomes too small (a large negative exponent), the result is denormalized by right-shifting the fraction (significand) and incrementing the exponent for each such shift until the exponent comes within the representable range. The inclusion of denormalized numbers in the IEEE 754 standard fills the gap between 0 and the smallest representable normalized number with uniformly spaced values, so that the density of representable numbers tapers off gradually rather than dropping abruptly to zero as one moves from the smallest representable normalized number towards 0; that is why this scheme is referred to as gradual underflow. In effect, it reduces the gap between the smallest representable nonzero number and zero, and softens the effect of exponent underflow to a level almost comparable to rounding off among normalized numbers.

Summary of Floating-Point Numbers

The significance of the usual bit patterns in the IEEE 754 standard formats and their interpretations, including some unusual bit patterns that represent special values, have already been described. The extreme exponent values of all 0s and all 1s (255 in single format and 2047 in double format) define special values of several different types, as already explained.

Value = (−1)^{S} x M x 2^{E}, where S = sign, M = significand (mantissa), and E = exponent
Bias = 2^{k-1} − 1, where k = number of bits in the exponent field
For 32-bit (single precision), bias = 2^{8-1} − 1 = 127. For 64-bit (double precision), bias = 2^{11-1} − 1 = 1023.
Normalized: the significand field has an implied (hidden) leading 1; the exponent field contains at least one 1; E = (unsigned value of the exponent field) − bias.
Denormalized: the significand field has an implied leading 0; all exponent field bits are 0; E = 1 − bias.
Special cases: NaN, infinity, and denormalized numbers (already described).

Summary

Numerous arithmetical and logical (non-numerical) operations on various types of operands, including fixed-point and floating-point numbers, are carried out by the data-processing part of a CPU, a major constituent of which is the ALU. Most modern processors incorporate numerous types of instructions in their instruction sets to enable the ALU to carry out all these operations, in many cases accompanied by the hardware required to process floating-point instructions as well. Computer arithmetic circuit designs today exhibit several interesting, well-developed logic designs, including high-performance adder designs, and sophisticated multiplication and division units using the Booth algorithm and the restoring/non-restoring division algorithms, respectively. Floating-point and other complex operations are implemented by an autonomous execution unit within the CPU or by a supporting coprocessor, which is a program-transparent extension to the CPU. A floating-point processor is typically composed of a pair of fixed-point ALUs: one to process exponents and the other to process mantissas. Special circuits are also needed for normalization, and for exponent comparison and mantissa alignment in the case of floating-point addition and subtraction. The floating-point number representation standard proposed by IEEE has been described, and a set of rules under this specification for performing all four basic arithmetic operations has been given, including the handling of special values and exceptions.

Exercises
a. How many hex digits are required?
b. What is the range of addresses in hex?
c. How many memory locations are there?
a. signed-magnitude method
b. signed 1's complement method
c. signed 2's complement method
7.8 Represent the number 21 in 8-bit format using
a. signed-magnitude method
b. signed 1's complement method
c. signed 2's complement method
i. X = 110101 and Y = 011011
ii. X = 010111 and Y = 110110
iii. X = +14 and Y = 13
a. Give the representation of 2762.5 x 10^{-2}.
b. Compute the value represented by 1 001010 011000000.
a. A base-8 exponent (B = 8) in a 5-bit field?
b. A base-16 exponent (B = 16) in a 6-bit field?
a. 7
b. 1.75
c. 389
d. 245.625
e. 1/16
f. 1/32
7.27 The following numbers use the IEEE 32-bit floating-point format. What is the equivalent decimal value?
a. 0101 0101 0110 0000 0000 0000 0000 0000
b. 1100 0011 0110 0000 0000 0000 0000 0000
c. 0011 1111 1010 1100 1000 0000 0000 0000
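Answers to bit-pattern conversions like those in 7.27 can be checked mechanically with Python's struct module. The pattern used below is our own example (not one of the exercise patterns), encoding sign 0, exponent field 127 (so E = 0), and significand 1.1 in binary, i.e. the value 1.5.

```python
# Decode a 32-bit IEEE 754 bit pattern into its floating-point value:
# pack the pattern as an unsigned 32-bit integer, unpack it as a float.
import struct

def f32_value(bits: int) -> float:
    """Decode a 32-bit integer bit pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(f32_value(0x3FC00000))   # 1.5 (sign 0, exponent 127 -> E = 0, M = 1.5)
```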
a. 549
b. 0.645
7.30 Show how the following floating-point additions are performed in which significands are truncated to 4 decimal digits.
a. 6.487 x 10^{2} + 5.693 x 10^{2}
b. 8.546 x 10^{2} + 7.425 x 10^{-2}
Show the results also in normalized form.
7.31 Show how the following floating-point calculations are made in which significands are truncated to 4 decimal digits.
a. 8.748 x 10^{-3} − 6.593 x 10^{-3}
b. 7.756 x 10^{3} − 2.259 x 10^{-1}
Show the results also in normalized form.
7.32 Show how the following floating-point computations are performed in which significands are truncated to 4 decimal digits.
a. (6.432 x 10^{2}) x (2.154 x 10^{0})
b. (7.756 x 10^{3}) x (2.259 x 10^{2})
Show the results also in normalized form.
i. Chopping
ii. Rounding
iii. Von Neumann rounding
iv. Both (ii) and (iii)
a. What are the relative errors for X' and Y'?
b. What is the relative error for Z' = X' − Y'?
7.37 Explain how NaN and infinity are represented in the IEEE 754 standard.

Suggested References and Websites

Hamacher, C., Vranesic, Z. G., and Zaky, S. G. Computer Organization, 5th ed., Int'l ed. McGraw-Hill Higher Education, 2002.
Hayes, J. P. Computer Architecture and Organization, Int'l ed. WCB/McGraw-Hill, 1998.
Mano, M. Logic and Computer Design Fundamentals. Upper Saddle River, NJ: Prentice-Hall, 2004.
IEEE 754: The IEEE 754 documents, related publications and papers, and a useful set of links related to computer arithmetic.
