Systems and Scale Effects
The Learjet example we have just explored ultimately hinged on behaviour in the near term: a phone call used to expedite an administrative process. Systems, though, work across multiple time scales. In the last chapter we saw that fallible decisions occur at the level of strategic management and can give rise to latent errors. We also saw that complexity in technology results in tight-coupling and the associated unpredictability in fault propagation. In a systems context, the relationship between aircraft manufacturers and airworthiness authorities represents an interface condition. The airworthiness requirements - controls - are published and the manufacturer then tries to satisfy those requirements in a cost-effective manner. The robustness of the solution might not be immediately apparent, as this next example illustrates.
On 5 December 2001 a Saab 340B was climbing through FL 190 when the two electronic flight information system (EFIS) screens on the right-hand (FO’s) side failed (ATSB, 2003). The aircraft had been dispatched, in accordance with the operator’s minimum equipment list (MEL), with the left-hand (captain’s) upper screen unserviceable. In addition, the right-hand starter generator was annotated as requiring removal for overhaul. A 10% extension had been applied to the life of the generator to allow the flight to be accomplished. The crew applied the EFIS failure/disturbance checklist and tried to restore the screens using the EFIS drive transfer system. The aircraft then experienced the following cascading progression of system failures:
The right engine ice protection annunciator illuminated (failure)
Cabin pressure annunciator illuminated (spurious)
The flight attendant call bell activated (spurious)
The stall warning clacker activated (spurious)
The right stall fail annunciator illuminated (failure)
The rudder limit annunciator illuminated (failure)
The global positioning system (GPS) failed
VHF 2 radio and intercom inoperative (failure)
The FO’s communications system failed
The radio magnetic indicators failed
The automatic direction finder 2 failed
The right DC generator out light failed
The right engine instrumentation failed
The right hot battery warning illuminated (correct indication)
It is likely that other services powered from the right-hand electrical system DC bus were not available but the crew did not know of their loss. Once the voltage dropped below 18 V, Inverter 2 failed. As a result, the flight data recorder was inoperative for a period of 18 minutes until the crew selected Inverter 1. The ground proximity warning system (GPWS) was unavailable, as were the right fuel gauge and fuel shut off valve. No heading reference information was available for the 18-minute period. Once the generator out light came on, the crew used the appropriate electrical failure drill to restore power. They landed safely.
The immediate cause of the event was quite straightforward. Having started the engines, the crew completed the after-start checklist. This included a check of generator output and the indications were normal. At some point soon after the check was completed, the wear on the generator brushes reached a point at which the right-hand generator was no longer fully functioning. The generator control units (GCU) of this production standard of Saab 340B were not capable of detecting a low voltage condition and so no direct warning was available. The crew could only have discovered the deteriorating condition of the generator if they had chosen to randomly check the output. There were no further mandated generator checks in the normal procedures. Once the output from the generator began to drop, power was supplemented by the battery. Once the battery became the sole supply of electrical power - that is, there was now no electrical power coming from the right-hand generator - it started to drain. A low voltage light operated when the output from the battery dropped below that required to operate the GCU. Throughout the episode, the bus tie relay, which would have connected the left- and right-side electrical distribution systems, failed to close because, with the switch set to AUTO, it only operated when the generator out light illuminated, and that did not happen for some 18 minutes.
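The dependency just described, in which the bus tie only closes once the generator out light has illuminated, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model of the actual Saab 340B electrical system. The 18 V inverter threshold comes from the account above. The GCU drop-out voltage, the class and function names, and the idea that a GCU with low voltage detection would raise the generator out light directly are all hypothetical and chosen only to make the logic visible.

```python
from dataclasses import dataclass

INVERTER_2_MIN_VOLTS = 18.0  # from the text: Inverter 2 failed below 18 V
GCU_DROPOUT_VOLTS = 14.0     # hypothetical value, for illustration only


@dataclass
class RightSideState:
    generator_healthy: bool        # is the right-hand generator delivering full output?
    bus_volts: float               # voltage on the right-hand DC bus
    gcu_detects_low_voltage: bool  # False for the original GCU, True for the modified one


def generator_out_light(state: RightSideState) -> bool:
    """Assumed logic for when the generator out light illuminates."""
    if state.generator_healthy:
        return False
    if state.gcu_detects_low_voltage:
        # Assumption: a GCU with low voltage detection flags the fault at once.
        return True
    # Original GCU: no warning until the bus voltage is too low to run the GCU.
    return state.bus_volts < GCU_DROPOUT_VOLTS


def bus_tie_closed(state: RightSideState, switch_auto: bool = True) -> bool:
    """With the switch in AUTO, the tie closes only once the light is on."""
    return switch_auto and generator_out_light(state)


def inverter_2_powered(state: RightSideState) -> bool:
    """Inverter 2 needs either the healthy side (via the tie) or 18 V on its own bus."""
    return bus_tie_closed(state) or state.bus_volts >= INVERTER_2_MIN_VOLTS


# Generator degraded, battery draining, original GCU fitted.
draining = RightSideState(generator_healthy=False, bus_volts=17.0,
                          gcu_detects_low_voltage=False)
print(generator_out_light(draining), bus_tie_closed(draining),
      inverter_2_powered(draining))  # False False False
```

In this sketch there is a window, between 18 V and the assumed drop-out value, in which services are already being lost but nothing has yet told the system to connect the healthy side. Setting gcu_detects_low_voltage to True closes that window.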
I want to examine this scenario in three parts. First, I want to look at aspects of the original design of the aircraft. Next, I want to look at the maintenance event that precipitated the failure and then, finally, I want to look at how the crew dealt with the situation. Each of these elements represents a slice in time. The Saab 340 first flew in January 1983 and the second-generation 340B entered service in 1989. Production ceased in 1999. By the time of the event, the Saab 340B was already a relatively mature product. I want to explore how latency - Reason’s latent error - and cross-scale interactions shape the behaviour of systems.
Establishing the fundamental airworthiness of the aircraft design is a superior goal with many subordinate goals that must be met. Saab, the manufacturer, needed to demonstrate that the decisions its engineers had taken in relation to the relevant airworthiness requirements were valid. Saab was originally a manufacturer of military aircraft and the 340 was its first venture into the commercial market, the company having identified a need for a 30-seat regional airliner. The boundary of the system might be considered the point at which the effort needed to gain the type certificate did not exceed the available design resources or reduce the potential future returns, rendering the project no longer economically feasible. The UK Government’s attempt to modify the Nimrod maritime patrol aircraft, itself a development of the 1950s Comet commercial airliner, to provide an airborne early warning capability is an example of development costs resulting in a system boundary being breached (Connor, 1986).
The aircraft was approved by the Swedish Luftfartsverket, which had satisfied itself that the design was compliant with the European and United States requirements. In this specific instance, in the absence of a GCU able to detect an under-voltage condition, the manufacturer relied on the failure of the EFIS screens as an indirect indication of a problem rather than develop a direct warning. This was a design compromise but one that was considered sufficient to meet the requirement. Once the aircraft entered service, the risk of an undetected low voltage condition became apparent through operational experience and redesigned starter generator brushes were introduced in 1995 in response, together with a new checklist. A modified GCU, with a low voltage detection capability, was also produced but its installation was optional. The Luftfartsverket was satisfied that the new generator brushes and checklist afforded adequate protection but the investigation report could not, subsequently, establish if the original design of the electrical system actually met airworthiness requirements. The aircraft involved in this incident was number 328 in the production sequence and the new GCU with low voltage detection was fitted to all aircraft as standard after serial number 367. The margin, in this case, extended over many years, from the point at which a design solution had crystallised to the point at which it was finally confirmed as being adequate. Signals relating to the adequacy of the design compromise did emerge through operational experience and these were known to the actors involved.
Buffering, as we have seen, represents the range of behaviours the system can accommodate. In this case, we see the original plan to use the EFIS screens as a surrogate for a dedicated low voltage warning, modifications to generator brushes, checklist revisions and, finally, a redesigned GCU. There might also have been others considered by the manufacturer at the design phase. If tolerance describes the nature of failure then, given that the investigation could not decide if the electrical system met the required standards, it seems that we have another example of an inert state.
The significance of this element of the event is that it is suggestive of Reason’s ‘latent conditions’. A decision that satisfied one set of criteria created the opportunity for failure elsewhere but that risk only became apparent once other factors came into play. The solution to protection against a low voltage condition also illustrates how cross-scale interactions shape system behaviour. Design decisions taken probably 20 years prior to the event created the context for the actions of both the maintenance engineers and the pilots. It is maintenance that we will look at next.
The awarding of a release to service represents an actor at Level 4 - the authority - granting permission to one at Level 3 - the manufacturer. Once in service, as we saw with the Learjet, there is a need for ongoing maintenance to maintain the specific airframe in an airworthy condition. The operator used a third party to provide maintenance - just as in the Learjet example - but, under the terms of its AOC, it was required to have its own approved maintenance manager, together with a formal management process. Specific communication relating to the maintenance of the generator was contained in the manufacturer’s maintenance review board (MRB) report. The MRB is an approved document that describes a minimum number of tasks necessary to keep the aircraft in an airworthy condition. The manufacturer also provided detailed maintenance task descriptions as individual Job Cards. The MRB and the Job Cards are communication artefacts that act as forms of control. The MRB specified that the aircraft generators were to be removed for overhaul every 1200 in-service hours. The life of the generator could be extended to 1600 hours if the brushes were replaced at 800 hours. The Job Card relating to this task stated that the brushes need only be replaced at 800 hours if they exceeded a certain level of wear. We have a discrepancy here. One document stated that brushes must be replaced to extend the service life while the other document stated that brushes only need replacing if they exceed a certain limit of wear. The manufacturer’s guidance was that the MRB took priority over Job Card information. The operator reported that it had been led to believe that the MRB incorporated maintenance manual procedures and, so, the Job Card was an adequate reference.
The left-hand generator had been maintained in accordance with the MRB. The right-hand generator had been inspected at about 800 hours and, although a note to the effect that the brushes were 40% worn was recorded in the aircraft maintenance log, the brushes were not replaced. On the day of the flight, it was noted that the generator was due for servicing but the maintenance log had been annotated to the effect that a 10% extension to the generator life was authorised. No company maintenance procedures allowed for such action. The generator had been installed for 1601.9 hours when the EFIS screens failed and so was already at its maximum installed life when the aircraft departed.
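The arithmetic behind the two readings of the generator’s life limit can be made explicit with a short Python sketch. The 1200-hour, 1600-hour and 1601.9-hour figures and the 40% wear figure are taken from the account above. The function names are illustrative, and the Job Card wear limit used in the example is a hypothetical value because the actual limit is not given.

```python
MRB_BASE_LIFE_HOURS = 1200.0      # overhaul interval stated in the MRB
MRB_EXTENDED_LIFE_HOURS = 1600.0  # available only if brushes are replaced at 800 hours


def mrb_life_limit(brushes_replaced_at_800h: bool) -> float:
    """Life limit under the MRB reading: the extension depends on brush replacement."""
    return MRB_EXTENDED_LIFE_HOURS if brushes_replaced_at_800h else MRB_BASE_LIFE_HOURS


def job_card_requires_replacement(wear_fraction: float, wear_limit: float) -> bool:
    """Job Card reading: replace at 800 hours only if wear exceeds the limit."""
    return wear_fraction > wear_limit


def hours_beyond_limit(installed_hours: float, limit_hours: float) -> float:
    """Hours by which a generator has overrun a given life limit (0 if within it)."""
    return max(0.0, installed_hours - limit_hours)


installed = 1601.9  # hours installed when the EFIS screens failed

# MRB reading: brushes were not replaced, so the 1200-hour limit applies.
print(round(hours_beyond_limit(installed, mrb_life_limit(False)), 1))    # 401.9

# Reading that treats 1600 hours as available without brush replacement.
print(round(hours_beyond_limit(installed, MRB_EXTENDED_LIFE_HOURS), 1))  # 1.9

# Job Card check with 40% wear against a HYPOTHETICAL 50% wear limit.
print(job_card_requires_replacement(0.40, wear_limit=0.50))              # False
```

On the stricter reading the generator was several hundred hours past its limit. On the more permissive reading it had only just reached it, which is the gap the conflicting documents left open.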
The maintenance ‘system’ we see here has similarities to the Learjet example we saw earlier. The system comprises those components brought together to provide maintenance support. Control was exercised by mandating a system of maintenance oversight and by documenting the processes by which maintenance tasks were to be conducted. Servicing the generator was a subset of the broader process of conducting periodic maintenance, contributing to achieving the goal of sustaining aircraft serviceability. The boundary was the point at which the aircraft maintenance state must be declared valid prior to further use. If a task was not completed or an item was beyond its notional life then that state would be false. The margin was the notional space between the first task demanded by the MRB and full compliance. In effect, the boundary and its margin were forming and reforming as demanded by the cyclical nature of maintenance.
Buffering is represented by the range and scale of maintenance actions associated with, in this case, the generator. Maintenance tasks can be done correctly, part finished, done using the wrong tools or parts or completed using inappropriate techniques. For example, in a study of Piper PA-31-350 Navajo Chieftain accidents, 35% were the result of maintenance, primarily the use of the wrong part or the wrong maintenance technique (over-torquing retaining bolts or misaligning turbocharger section pipes being commonplace). One aspect of buffering is the fact that inconsistencies can be codified into processes. The discrepancies between the MRB document and the Job Card allowed for interpretations that increased risk. At a practical level, buffering could accommodate the fact that one generator could be maintained using the MRB process while the other followed the Job Card. In the latter case, if the brushes were not replaced at 800 hours, the generator would need to be replaced after 1200 hours. Buffering also needs to capture non-standard activities. It reflects the variability that is a part of normal operations. We saw that, after the 800-hour inspection, the extent of the wear on the brushes was simply annotated in the log, which brings us to the extension of the generator’s life. The report did not explain what was meant by the notional ‘10% life extension’ nor when it was entered into the logbook. It was probably a local fix to allow the aircraft to remain in service for a few more days but with no context we only have recourse to conjecture. With hindsight, it appears that an ad hoc act of creativity finally exceeded the system’s buffering capacity. In this instance we see a variety of behaviours, all but one being accommodated by the system. Buffering needs to cope with latent errors (the discrepancy between the MRB and the Job Card) and local activity. The system’s tolerance, in this case, was brittle. Breaching the boundary resulted in a significant failure.
Electrical generator brush wear is widely understood and periodic inspection and replacement is the standard control process. These are mitigations as the issue is a constant requirement and simply a function of the design of generators. Saab selected a generator produced by Goodrich to satisfy the requirements of the Swedish Authority. In so doing, it created a need on the part of its customers to enact solutions to ongoing maintenance. These cross-scale interactions, events happening removed in space and time and at different levels in the hierarchy, are additional sources of failure.
Turning, now, to the performance of the crew, although the aircraft remained controllable and standby flight instruments were available throughout, the tight-coupling inherent in the technology and the subsequent cascading failures - some spurious and others real - presented a challenge to the crew, who were already having to cope with one EFIS out of commission. For a period of around 18 minutes, the exact status of the aircraft was not understood by the crew. The weather conditions were relatively benign and so the loss of the GPS, GPWS and heading reference was not as significant as it might have been. The crew on the day comprised a captain conducting supervised line flying with an FO on his second operational flight. The captain was very experienced, in terms of total hours, and had been employed as a supervisory pilot, involved in the training of FOs, since March 1999. Although also an experienced pilot, the FO had completed the 5-hour Saab 340 type conversion course on 1-2 December 2001, just 3 days before the event. There was, therefore, a significant imbalance between the two crew in terms of on-type experience. Ground training did address a range of malfunctions pilots could encounter and both crew commented that they were aware of some aspects of a low voltage condition, such as the difficulty of detecting the condition prior to the EFIS screen blanking and the possible cascading sequences of failures.
The aircraft had been dispatched with one EFIS screen inoperative, a condition permitted under the terms of the MEL. When the next two EFIS screens failed the crew believed that this was a related problem, which was why their initial response was to action the EFIS checklist. The captain also reported that he had been reading about the EFIS system in the weeks prior to the flight, in particular how the EFIS drive transfer system worked. It was his opinion that the MEL item coupled with his recent reading led him to conclude that the problem lay with the drive transfer system. He handed control of the aircraft to the FO while he dealt with the ‘EFIS failure/disturbance’ checklist. The first step on the checklist was to check the generator voltage. This step was missed. Several senior pilots in the company later said that Saab checklists were poorly designed and it was not uncommon for crew to miss steps in checklists. Even after the checklist had been completed, the EFIS screens were not restored. By now, the pilots were wearing oxygen masks and in a precautionary descent because of a spurious cabin pressurisation warning.
The goal of the crew was to maintain control of the aircraft and manoeuvre it to a safe landing place. To be successful, they had to prevent any further degradation and to restore the aircraft to its highest level of technical capability. The system boundary was the point at which further degradation would place the aircraft in jeopardy. This would mean that either aircraft controllability was impaired or the opportunity to make a safe landing was reduced.
The margin associated with the boundary was marked by the point at which the generator brushes became degraded and output was reduced. Although, prior to the failure, there was no formal requirement to check generator output, a signal relating to system status was, nonetheless, accessible. The margin became tangible once the EFIS screens blanked.
Buffering, in this case, was reflected in the actions of the crew. From an organisational perspective, the blanking of an EFIS screen was considered sufficient to trigger the crew to execute the low voltage checklist. In reality, crews could implement a correct or an incorrect checklist and the chosen checklist could be actioned accurately or inaccurately. In this case, the crew implemented the wrong checklist incorrectly. What we see here is that the system was able to absorb irrelevant activity with no further degradation until a less ambiguous cue directed the crew to more constructive behaviours.
Fundamental to system functioning is the efficacy of the crew’s behaviour. At this point we need to consider the other boundaries in the system, the interfaces. These are the notional internal partitions within the system that need to be negotiated. Interfaces will shape efficacy. In this case, the cockpit controls and displays represented an interface through which the pilots were trying to make sense of the situation and restore control. As we saw, the blanking of the two EFIS screens was misdiagnosed, in part because it occurred in association with an already disabled screen. This was followed by a sequence of warnings and systems failures, some real but many of which were spurious. This added to the challenge the crew faced. I have suggested that efficacy describes the probability of actions satisfying the constraints of the active goal and is shaped by Neerincx’s cognitive task load model. The initial diagnosis by the crew was shaped by the existing malfunction, the pre-existing disabled EFIS screen, but was also influenced by the captain’s recent study of the EFIS drive transfer system. We cannot discount the inexperience of the FO, a potential source of corroboration or alternative interpretation. Their response was to implement what seemed the most appropriate checklist, which was the one that matched their construction of the apparent problem. It takes time to complete a checklist. However, once they had started, the crew were presented with a series of unrelated indications, some real and some false, all of which were competing for their attention. This possibly explains why the first item on the checklist being actioned was missed. They would have been faced with the problem of trying to fit the new data into their active schema, that of an EFIS problem. The alternative was to break out into an alternative mode of action but information processing consumes cognitive resource and task switching is effortful. This will be discussed in more detail in Chapter 5.
The incident report notes that the crew donned their oxygen masks and initiated an emergency descent. This was, presumably, in response to the spurious illumination of the cabin pressure indication. This warning was powered by the right-hand electrical busbar. The actual pressurisation system was controlled by the functioning left-hand busbar. It is not clear from the report if the aircraft had actually lost pressurisation but the significance of the warning was sufficient to trigger the crew’s response. An emergency descent would have added to workload and further inhibited information processing. The problem was resolved by a new cue: the generator out light. It seems that the efficacy of the crew’s actions in relation to the specific problem was low but the fact that the aircraft remained controllable suggests that the mode of failure, the tolerance, was graceful rather than brittle.
This example illustrates the effects of cross-scale interactions in systems. Decisions about the design of the aircraft, in this case the selection of a source of electrical power and, importantly, the provision of a warning of reduced power output, had two significant, interrelated, consequences. First, the design of routine maintenance schedules afforded opportunities for unintended activity. Second, the solution to the need to warn of a reduced output, when coupled with incorrect maintenance, created an opportunity for the crew of an aircraft to follow the wrong checklist. Systems interact across scales with events removed in time, and undertaken in completely different contexts, interacting in unintended ways. The arrangement of components in systems affords opportunities for control of processes to be degraded or even lost. Developing and sustaining complex systems is problematical. The way in which a system is configured represents a solution to a specific problem achieved within specific resource availability. The way in which the structure of the system influences its behaviour is part of the problem. The Learjet and Saab 340 examples illustrate a property of all systems, which is to migrate from a designed, intended state to an extant unintended operational condition. This process is known as drift.