Computer and Internet Reliability
Nowadays, billions of dollars are spent annually around the globe to produce computers for applications ranging from personal use to the control of space and other systems. Computers are composed of both hardware and software components, and for their successful operation the reliability of both is equally important. The history of computer hardware reliability may be traced back to the late 1940s and 1950s [1-4]. For example, the triple modular redundancy scheme for improving computer hardware reliability was proposed by von Neumann in 1956. It appears that the first serious effort on software reliability started in 1964 at Bell Laboratories. Some of the important works on software reliability that appeared in the 1960s are provided in Refs. [5-7].
The history of the Internet may be traced back to 1969 with the development of the Advanced Research Projects Agency Network (ARPANET). It grew from 4 hosts in 1969 to about 147 million hosts and 38 million sites in 2002, and nowadays billions of people around the globe use Internet services. In 2001, there were over 52,000 Internet-associated failures and incidents. Needless to say, the reliability and stability of the Internet have become very important to the global economy and other areas, because Internet-related failures can cause millions of dollars in losses and interrupt the day-to-day routines of millions of end users around the globe. This chapter presents various important aspects of computer hardware, software, and Internet reliability.
Computer Failure Causes and Issues in Computer System Reliability
There are many causes of computer failures. The important ones are as follows [10-12]:
- • Processor and memory failures/errors
- • Communication network failures
- • Peripheral device failures
- • Environmental and power failures
- • Human errors
- • Mysterious failures
- • Saturation
- • Gradual erosion of the database
The first six of the above causes of computer failures are described below [10-12]:
- • Processor and memory failures/errors: These failures/errors are generally catastrophic, but their occurrence is quite rare. Occasionally the central processor malfunctions and fails to execute instructions correctly because of a “dropped bit”. Memory parity errors are now very rare because of improvements in hardware reliability, and they are not necessarily fatal.
- • Communication network failures: These failures are concerned with intermodule communication, and many of them are of a transient nature. Note that around two-thirds of errors in communication lines can be detected with the use of “vertical parity” logic.
- • Peripheral device failures: These failures are quite important but they rarely lead to a system shutdown. The commonly occurring errors in peripheral devices are transient or intermittent, and the peripheral devices’ electromechanical nature is the usual reason for their failure.
- • Environmental and power failures: Environmental failures take place due to causes such as air conditioning equipment failure, fires, electromagnetic interference, and earthquakes. Power failures are caused by factors such as transient fluctuations in voltage or frequency and total power loss from the local utility company.
- • Human errors: These errors usually occur due to operator oversights and mistakes. Operator errors often take place while starting up, running, and shutting down the computer system.
- • Mysterious failures: These failures are never categorized properly in real-time systems because they take place unexpectedly. For example, when a normally functioning system suddenly stops operating without indicating any problem at all (e.g., in software or hardware), the failure is called a mysterious failure.
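The remark on “vertical parity” logic can be illustrated with a minimal sketch. It assumes that vertical parity means one even-parity bit appended to each 7-bit character; the source does not define the term, and the function names below are hypothetical. The sketch does not reproduce the two-thirds figure quoted above. It only shows why a per-character parity check catches some line errors and misses others: an odd number of flipped bits is detected, an even number is not.

```python
def parity_bit(value: int) -> int:
    """Return 1 if value has an odd number of 1 bits, else 0 (even parity)."""
    return bin(value).count("1") % 2


def encode(char7: int) -> int:
    """Attach an even-parity bit as the 8th (most significant) bit.

    Assumption: characters are 7 bits wide, as in plain ASCII.
    """
    if not 0 <= char7 < 128:
        raise ValueError("expected a 7-bit character code")
    return (parity_bit(char7) << 7) | char7


def is_valid(word8: int) -> bool:
    """A received 8-bit word passes the check if its total count of 1 bits is even."""
    return bin(word8).count("1") % 2 == 0


def flip_bits(word: int, *positions: int) -> int:
    """Simulate line errors by inverting the bits at the given positions."""
    for pos in positions:
        word ^= 1 << pos
    return word


if __name__ == "__main__":
    sent = encode(ord("A"))
    print("no error passes check:      ", is_valid(sent))
    print("single-bit error passes:    ", is_valid(flip_bits(sent, 3)))
    print("double-bit error passes:    ", is_valid(flip_bits(sent, 3, 5)))
```

In this sketch the single-bit error fails the check and is therefore detected, whereas the double-bit error passes unnoticed, which is consistent with parity logic detecting only a fraction of communication-line errors.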
There are many issues concerned with computer system reliability, and some of the important ones are presented below [8, 13, 14].
- • Computers’ main components/parts are the logic elements, which have quite troublesome reliability-associated features. In many situations, it is impossible to determine the reliability of such elements properly, and their defects cannot be healed properly.
- • Prior to the installation and production phases, it can be quite difficult to detect errors associated with hardware design at the lowest system levels. Oversights in hardware design may lead to situations where the resulting operational errors are impossible to distinguish from those due to transient physical faults.
- • Usually, the most powerful type of self-repair in computer systems is dynamic fault tolerance, but it is quite difficult to analyze. However, for certain applications it is quite important and cannot be ignored.