Computer Failure Classifications, Hardware and Software Error Sources, and Computer Reliability Measures
Computer-related failures may be categorized under the following five classifications:
Hardware and software errors arise from many sources, including inherited errors, data preparation errors, handwriting errors, keying errors, and optical character readers. In a computer-based system, inherited errors can account for over 50% of the errors. Data preparation tasks can also generate a significant proportion of errors: according to Bailey, at least 40% of all errors come from manipulating the data (i.e., data preparation) prior to writing it down or entering it into the computer system.
Additional information on computer failure classifications and hardware and software error sources is available in Refs. [15,16].
There are many measures used in the area of computer system reliability. They may be grouped under the following two categories [14,17]:
Computer Hardware Reliability versus Software Reliability
As it is very important to have a clear comprehension of the differences between hardware and software reliability, a number of comparisons of important areas are presented in Table 6.1 [12,18,19].
The term fault masking is used in the area of fault-tolerant computing, in the sense that a system with redundancy can tolerate a number of failures/malfunctions prior to its own failure. More clearly, the implication of the term is that some kind of problem has surfaced somewhere within the framework of a digital system, but because of design, the problem does not affect the overall operation of the system under consideration.
Probably the best-known fault masking method is modular redundancy, which is presented in the following sections.
Triple Modular Redundancy (TMR)
In this case, three identical modules/units perform the same task simultaneously, and the voter compares the outputs of the modules/units and sides with the majority [12,20]. More clearly, the TMR system fails only when more than one module/unit fails or the voter fails. In other words, the TMR system can tolerate failure of a single module/unit. An important example of the TMR system’s application is the Saturn V launch vehicle computer [12,20]. The vehicle computer used TMR with voters in the central processor and duplication in the main memory [12,21].
TABLE 6.1 Hardware and software reliability comparisons
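The majority-voting idea described above can be illustrated with a minimal Python sketch. The function name `majority_vote` and the sample outputs are hypothetical and chosen only for illustration; the sketch assumes module outputs are hashable values that can be compared for equality, and it models only the voting logic, not a real hardware voter.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, i.e. when
    too many modules disagree for the fault to be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value

# The second module is assumed faulty; the other two outvote it.
print(majority_vote([42, 17, 42]))
```

With three modules, a single disagreeing output is masked, while three mutually different outputs raise an error, which mirrors the statement that TMR tolerates the failure of only one module/unit.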
The block diagram of the TMR scheme is shown in Figure 6.1; the blocks in the diagram denote modules/units and the circle denotes the voter.
For independently failing modules/units and the voter, the reliability of the system in Figure 6.1 is given by [12]
$$R_{tmv} = (3R^2 - 2R^3) R_v \tag{6.1}$$
where
$R_{tmv}$ is the reliability of the TMR system with voter.
$R$ is the reliability of the module/unit.
$R_v$ is the reliability of the voter.
FIGURE 6.1 Block diagram for TMR system with voter.
With a perfect voter (i.e., 100% reliable), Equation (6.1) becomes
$$R_{tmp} = 3R^2 - 2R^3 \tag{6.2}$$
where
$R_{tmp}$ is the reliability of the TMR system with perfect voter.
Note that the voter reliability and the single unit’s reliability together determine the improvement in reliability of the TMR system over a single unit system. For the perfect voter (i.e., $R_v = 1$), the TMR system reliability given by Equation (6.2) is better than that of the single unit system only when the reliability of the single unit is greater than 0.5.
At $R_v = 0.8$, the TMR system’s reliability is always less than the single unit’s reliability. Furthermore, when the voter reliability is 0.9 (i.e., $R_v = 0.9$), the TMR system’s reliability is only marginally better than the single unit/module reliability, and only when the single unit/module reliability is approximately between 0.667 and 0.833.
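The comparisons above can be checked numerically. The following is a minimal sketch, assuming that Equation (6.1) has the standard TMR form $(3R^2 - 2R^3) R_v$; the function name `tmr_reliability` and the sample values of $R$ are illustrative choices, not taken from the source.

```python
def tmr_reliability(r, r_v=1.0):
    """Reliability of a TMR system with voter: (3R^2 - 2R^3) * Rv.

    r   -- reliability of one module/unit
    r_v -- reliability of the voter (1.0 means a perfect voter)
    """
    return (3 * r**2 - 2 * r**3) * r_v

# Compare the TMR system against a single unit for several voter reliabilities.
for r_v in (1.0, 0.9, 0.8):
    for r in (0.4, 0.5, 0.75, 0.9):
        system = tmr_reliability(r, r_v)
        verdict = "better" if system > r else "not better"
        print(f"Rv={r_v:.1f}  R={r:.2f}  TMR={system:.4f}  ({verdict} than one unit)")
```

Under these assumptions, the rows for a voter reliability of 0.8 should never report an improvement, in line with the statement above.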
TMR System Maximum Reliability with Perfect Voter
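For a perfect voter, the TMR system reliability is expressed by Equation (6.2). Under this scenario, the ratio of $R_{tmp}$ to a single unit’s reliability, $R$, is given by
$$\gamma = \frac{R_{tmp}}{R} = \frac{3R^2 - 2R^3}{R} = 3R - 2R^2 \tag{6.3}$$
By differentiating Equation (6.3) with respect to $R$ and equating it to zero, we get
$$\frac{d\gamma}{dR} = 3 - 4R = 0 \tag{6.4}$$
Thus, from Equation (6.4), we obtain $R = 0.75$. This means that the maximum value of the reliability improvement ratio, $\gamma$, and the corresponding reliability of the TMR system, $R_{tmp}$, are, respectively,
$$\gamma = 3(0.75) - 2(0.75)^2 = 1.125$$
and
$$R_{tmp} = 3(0.75)^2 - 2(0.75)^3 = 0.84375$$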
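As a numerical cross-check of the result $R = 0.75$, the ratio can simply be scanned over a grid of unit reliabilities. This is a sketch under the assumption that the perfect-voter TMR reliability is $3R^2 - 2R^3$; the function name `improvement_ratio` and the grid resolution are arbitrary illustrative choices.

```python
def improvement_ratio(r):
    """Ratio of perfect-voter TMR reliability to single-unit reliability."""
    return (3 * r**2 - 2 * r**3) / r

# Scan R over (0, 1] on a uniform grid; 1000 steps is an arbitrary resolution.
grid = [i / 1000 for i in range(1, 1001)]
best_r = max(grid, key=improvement_ratio)

print(best_r, improvement_ratio(best_r))
```

A grid search is used here instead of symbolic differentiation only to keep the example dependency-free; it should land on the same point as the derivative condition.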
Assume that a TMR system’s reliability with a perfect voter is expressed by Equation (6.2). Determine the points where the single-unit and the TMR system reliabilities are equal.
To determine the points, we equate a single unit’s reliability with Equation (6.2) to obtain
$$R = 3R^2 - 2R^3 \tag{6.5}$$
By rearranging Equation (6.5) for $R \neq 0$, we get
$$2R^2 - 3R + 1 = 0 \tag{6.6}$$
The above equation (i.e., Equation (6.6)) is a quadratic equation and its roots are
$$R = \frac{3 + \sqrt{9 - 8}}{4} = 1 \tag{6.7}$$
and
$$R = \frac{3 - \sqrt{9 - 8}}{4} = \frac{1}{2} \tag{6.8}$$
This means the reliabilities of the TMR system with perfect voter and the single unit are equal at $R = 1/2$ or $R = 1$. Furthermore, the reliability of the TMR system with perfect voter will only be greater than the single unit’s reliability when the value of $R$ is higher than 0.5 (and lower than 1).
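The same break-even calculation can be sketched in Python and, as an extension not present in the worked example, generalized to an imperfect voter. The sketch assumes the system reliability $(3R^2 - 2R^3) R_v$; setting it equal to $R$ and dividing by $R R_v$ gives $2R^2 - 3R + 1/R_v = 0$. The function name `break_even_points` is hypothetical.

```python
import math

def break_even_points(r_v=1.0):
    """Unit reliabilities R where TMR and single-unit reliability are equal.

    Solves 2R^2 - 3R + 1/Rv = 0, obtained from (3R^2 - 2R^3) * Rv = R
    after dividing by R * Rv (so the trivial solution R = 0 is excluded).
    Returns an empty list when the quadratic has no real roots.
    """
    disc = 9 - 8 / r_v
    if disc < 0:
        return []  # TMR never matches the single unit for this voter
    root = math.sqrt(disc)
    return sorted({(3 - root) / 4, (3 + root) / 4})

for r_v in (1.0, 0.9, 0.8):
    print(r_v, break_even_points(r_v))
```

For a perfect voter the sketch reproduces the two roots of the quadratic; for a voter reliability of 0.8 the discriminant is negative, so no break-even point exists.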
TMR System with Voter Time-Dependent Reliability and Mean Time to Failure
With the aid of material presented in Chapter 3 and Equation (6.1), for constant failure rates of the TMR system units and the voter unit, the reliability of the TMR system with voter is expressed by [12,24]
$$R_{tmv}(t) = \left[3e^{-2\lambda t} - 2e^{-3\lambda t}\right] e^{-\lambda_{vr} t} \tag{6.9}$$
where
$R_{tmv}(t)$ is the TMR system with voter reliability at time $t$.
$\lambda$ is the unit/module constant failure rate.
$\lambda_{vr}$ is the voter unit constant failure rate.
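By integrating Equation (6.9) over the time interval from 0 to $\infty$, we get the following equation for the TMR system with voter mean time to failure [12,14]:
$$MTTF_{tmv} = \int_0^{\infty} \left[3e^{-2\lambda t} - 2e^{-3\lambda t}\right] e^{-\lambda_{vr} t}\, dt = \frac{3}{2\lambda + \lambda_{vr}} - \frac{2}{3\lambda + \lambda_{vr}} \tag{6.10}$$
where
$MTTF_{tmv}$ is the mean time to failure of the TMR system with voter.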
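For a perfect voter (i.e., $\lambda_{vr} = 0$), Equation (6.10) reduces to
$$MTTF_{tmp} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$$
where
$MTTF_{tmp}$ is the TMR system with perfect voter mean time to failure.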
Assume that the constant failure rate of a unit/module belonging to a TMR system with voter is $\lambda = 0.0004$ failures per hour. Calculate the system reliability for a 500-hour mission if the voter unit constant failure rate is $\lambda_{vr} = 0.0002$ failures per hour. In addition, calculate the TMR system mean time to failure.
By substituting the specified data values into Equation (6.9), we get
$$R_{tmv}(500) = \left[3e^{-2(0.0004)(500)} - 2e^{-3(0.0004)(500)}\right] e^{-(0.0002)(500)} = 0.8264$$
Similarly, by inserting the specified data values into Equation (6.10), we get
$$MTTF_{tmv} = \frac{3}{2(0.0004) + 0.0002} - \frac{2}{3(0.0004) + 0.0002} = 1571.43 \text{ hours}$$
Thus, the TMR system with voter reliability and mean time to failure are 0.8264 and 1571.43 hours, respectively.
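The example calculation can be reproduced with a short Python sketch. It assumes exponentially distributed times to failure, so that the unit reliability is $e^{-\lambda t}$ and the voter reliability is $e^{-\lambda_{vr} t}$, combined in the standard TMR form; the function names `tmr_reliability_at` and `tmr_mttf` are hypothetical.

```python
import math

def tmr_reliability_at(t, lam, lam_v=0.0):
    """TMR-with-voter reliability at time t for constant failure rates.

    lam   -- unit/module failure rate (failures per hour)
    lam_v -- voter failure rate (failures per hour); 0 means a perfect voter
    """
    unit_part = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
    return unit_part * math.exp(-lam_v * t)

def tmr_mttf(lam, lam_v=0.0):
    """Mean time to failure: integral of the reliability from 0 to infinity."""
    return 3 / (2 * lam + lam_v) - 2 / (3 * lam + lam_v)

# Data from the example: 500-hour mission, unit rate 0.0004, voter rate 0.0002.
print(round(tmr_reliability_at(500, 0.0004, 0.0002), 4))
print(round(tmr_mttf(0.0004, 0.0002), 2))
```

Calling `tmr_mttf(0.0004)` with the default voter rate gives the perfect-voter case, which under these assumptions equals $5/(6\lambda)$.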
N-Modular Redundancy (NMR)
This is the general form of the TMR (i.e., it contains N identical modules/units instead of only three units).
The number N is any odd number, and the NMR system can tolerate a maximum of n module/unit failures if the value of N is equal to (2n + 1). As the voter acts in series with the N-module system, the complete system malfunctions whenever a voter unit failure occurs.
The reliability of the NMR system with independent modules/units is given by [12,25]
$$R_{nmv} = \left[\sum_{i=0}^{n} \binom{N}{i} R^{N-i} (1 - R)^{i}\right] R_v$$
where
$R_{nmv}$ is the reliability of the NMR system with voter.
$R_v$ is the voter reliability.
$R$ is the module/unit reliability.
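A minimal Python sketch of the NMR reliability follows. It assumes the usual binomial form, in which the system works when at most n of the N = 2n + 1 independent, identical modules have failed and the series voter works; the function name `nmr_reliability` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def nmr_reliability(n_modules, r, r_v=1.0):
    """Reliability of an N-modular redundant system with a series voter.

    n_modules -- number of identical modules N (must be odd, N = 2n + 1)
    r         -- reliability of one module/unit
    r_v       -- reliability of the voter
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("N must be a positive odd number")
    n = (n_modules - 1) // 2  # maximum number of tolerated module failures
    # Probability that at most n of the N modules have failed.
    modules_ok = sum(
        comb(n_modules, i) * r ** (n_modules - i) * (1 - r) ** i
        for i in range(n + 1)
    )
    return modules_ok * r_v

for n_modules in (1, 3, 5):
    print(n_modules, nmr_reliability(n_modules, 0.9))
```

For N = 3 the sum collapses to $3R^2 - 2R^3$, so the sketch agrees with the TMR case discussed earlier.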
Finally, the time-dependent reliability analysis of an NMR system can be performed in a manner similar to the TMR system reliability analysis. Additional information on redundancy schemes is available in Nerber.