Normal Accident Theory (NAT) and High Reliability Organisations
The concept of the ‘normal accident’ was developed in the 1980s, largely in response to concerns about safety in the nuclear power industry (Perrow, 1999). The complexity of power plants meant that the technology involved could sometimes behave in unpredicted ways. The main contention of NAT is that accidents in highly complex systems should be expected. High reliability organisation (HRO) theorists, on the other hand, have tried to identify why critical industries and work contexts - like aircraft carrier deck operations - have remarkably fewer accidents than we might expect (Roberts, 1990). Perrow suggested that the interactive complexity of systems makes the task of the human operator more difficult when components fail. Tight coupling between subsystems makes them highly interdependent and, because of this, malfunctions will quickly propagate in unpredictable ways.
This idea of unpredictability is echoed in Reason’s (1990) latent conditions and resident pathogen concepts. Reason was proposing that certain aspects of the design of a system or process are fallible although their effects may never be discovered until a particular, probably unique, set of conditions applies. Thus, potential sources of failure can exist undiscovered in a system for years. For example, on 27 December 1991, an MD-81 took off from Stockholm Arlanda airport bound for Warsaw (SHK, 1993). The aircraft had been standing overnight, and super-cooled fuel in the wing tanks, resting in contact with the upper surface of the wing, had resulted in ice formation. A ground technician failed to notice the condition of the wing and, when the aircraft took off, the ice broke away and went into the tail-mounted engines, damaging some of the compressor blades. This resulted in both engines starting to surge, disrupting the flow of air through the engines. The normal response to such a condition is to reduce power to restore a normal airflow. However, the aircraft had been fitted with a safety device called automatic thrust restoration (ATR). This system was originally intended for reduced thrust take-offs required under airport noise abatement procedures. In such a situation, should one engine be lost after take-off, the power on the other engine was automatically increased to the maximum. The airline in question, however, never used reduced thrust take-offs and, over time, the ATR system had dropped off the training programme. Although two disconnect buttons were located on the power levers, pilots were not told what they were for. As the pilots tried to reduce power to control the surging, the ATR automatically moved the thrust levers forward. The very system that was designed for safety in the event of an engine failure after take-off probably resulted in the destruction of both engines.
During the subsequent accident investigation, the FAA, who approved the modification in the first place, said that they never envisaged a situation where both engines would fail after take-off.
In the Beech 1900 incident at Williamstown described earlier, the FO believed the engine to be on fire when he saw flames coming from the nacelle. Attempts to discharge the engine fire extinguisher failed because the bottle was electrically operated and the wiring to the extinguisher had been destroyed. The design of the system failed to accommodate the possibility of a serious fire in the engine nacelle that, in turn, would disable the system itself. On the one hand, we can identify a fallible design decision here but, at the same time, we can see the tight coupling of events in NAT terms. The anchoring of cables to fuel pipes in a location containing important control devices resulted in one malfunction disabling critical, unrelated, systems.
The accident at Pelee Island has some, limited, characteristics of a normal accident. The Cessna Caravan is a relatively simple, robust aircraft type. It has a high wing configuration and a fixed tricycle undercarriage. Although approved for flight into icing conditions, the aircraft, because of its design, has lots of exposed surfaces on which ice can form, and the manufacturer had provided protective devices in accordance with airworthiness requirements. Electrically heated panels on the propeller and pneumatically operated ‘boots’ on other critical structures provide the means to remove ice that might form in flight. The particular aerofoil section used in the design of the Caravan wing has been found to be susceptible to performance degradation once the ice begins to build up. In fact, global Cessna 208 accidents do show a marked seasonality, with more in the winter months. The most critical location for ice formation is at 12% chord but the de-ice boots only extend as far as 5% chord.
Ice removal by boot activation can be improved by accelerating the aircraft. However, ice increases drag, which can sometimes result in the aircraft having insufficient surplus power to allow it to accelerate in level flight. Similarly, a good rule for pilots once they encounter ice is to change altitude in order to get out of icing conditions. This can sometimes require climbing but ice adds to the weight of the aircraft and, again, performance limitations prevent the pilot from taking the best course of action. Some of the comments made by Caravan pilots in Chapter 1 alluded to these problems.
Aircraft design, in terms of aerodynamic properties and performance, and meteorological conditions exist in a tightly coupled state, and the manner in which these factors interact can be surprising, especially when we also consider the nature of normal work, as this example illustrates.
On 29 April 1993, an Embraer 120 encountered icing conditions in the climb and subsequently stalled (NTSB, 1993). The captain had been asked by the flight attendant if it was possible to climb faster in order to reach the cruising altitude sooner: she wanted to start the cabin service but did not want to have to push the service cart along an inclined cabin floor. The captain used the ‘pitch hold’ mode on the autopilot to increase the angle of climb and, thereby, increase the rate of climb. The aircraft power was left at the 90% ‘climb power’ setting. The aircraft pitch angle in the climb was initially 3.2°. The pitch was increased, first, to 5.2° and then 6.4°. The rate of climb reduced to zero, the speed decayed and the stick shaker activated at 141 knots. The aircraft lost 12,000 ft in the subsequent stall.
The aircraft had probably started to accumulate ice once it had climbed into the freezing conditions. Ice starts to form at the stagnation point on the leading edge of the wing where the local airflow velocity is zero. When the captain increased the pitch angle, the stagnation point would have moved down the leading edge and the ice that had previously formed would now be in such a position to disrupt the airflow over the upper surface, leading to premature boundary layer separation and an increased stall speed. In this case, icing, aerodynamics and a desire to be helpful resulted in severe damage to an aircraft.
NAT generally applies to systems that can also be considered highly reliable. An HRO, as mentioned earlier, is one where the number of failures occurring is much less than the total number of accidents - or accident opportunities - that could occur. Amalberti (2001) identifies three bands of organisations based on safety:
An HRO would fall into the final category. Amalberti’s analysis reveals some of the paradoxes present in the HRO concept. First, because accidents are extremely rare in this cluster of systems, the response to failure can often be irrational. Media-driven, short-term responses to accidents give the appearance of taking action while the real cause of the accident is often not addressed. Furthermore, the timescale needed for change to take effect often falls outside the attention span of media and the tenure of the decision-makers. Responses to infrequent failures usually take the form of increased regulation. This, in turn, makes the job of work more difficult and usually results in increased rates of rule violation simply to get the job done on a daily basis. In an examination of a military friendly fire tragedy in Northern Iraq in 1994, Snook (2000) demonstrated that ‘normal accidents’ can easily occur in ‘HROs’. Snook showed how organisations drift towards an unsafe condition despite efforts to control risk, and we will return to this idea later in this chapter.
Perrow’s NAT focuses on technology and the unpredictable nature of combinations of systems. The HRO school does look at social systems and, indeed, uncertainty is introduced into systems largely through the variability of human operators (Marais, Dulac & Leveson, 2004). Perrow argued for greater redundancy to allow for failure, but redundant systems themselves fail. In the case of HRO theory, it is argued that a ‘culture of reliability’ contributes to safety. However, as both Leveson (2011) and Marais et al. (2004) observe, reliability and safety are different qualities. Accidents can occur as a result of unexpected interactions between normally functioning components in a system: reliable elements induce unsafe conditions. Ironically, in some situations, redundancy has been achieved by introducing more humans. Air taxi companies flying aircraft certified for single-pilot operation are often required, either by clients or their insurance companies, to use two pilots. Where the company has failed to implement effective two-pilot procedures, incidents have been caused because of uncertainty over roles and responsibilities.
Part of the problem we face is that much of our analysis of failure relies on hindsight (Dekker, 2002). We look back on events and see clearly where things went wrong. Hindsight couples cause and effect, even in situations where there is, in fact, only a weak causal relationship, if any at all. By concentrating on the technology of systems, and not human operators in systems, NAT and HRO theory lack the ability to address the issue of why operators act in an apparently irrational manner. In order to achieve our work goals, we act in ways that make sense to us at the time. The manner in which we construct a local, bounded, rationality will be covered in more detail in Chapter 5.