Systems, Drift and the Distortion of Buffering
Complex systems exhibit self-organised criticality (Bak, Tang & Wiesenfeld, 1987), the classic illustration of which is the sand table demonstration. Imagine a table where we continually add single grains of sand to a pile in the centre. The pile will build and maintain a specific shape. Occasionally, the addition of the next single grain will trigger slides of grains down the sides of the pile until an equilibrium is restored. The pile has sufficient integrity to maintain its shape but the addition of just a single additional grain will render the pile temporarily unstable until the new stable state is achieved. The idea is that systems migrate to states of sufficient stability but inherent criticality.
This dynamic, fluid nature of systems behaviour is reflected in the concept of ‘drift’. Dekker (2004) points out that most commercial enterprises function in conditions of resource scarcity and competition. In order to generate a sufficient return on investment in the face of competition for customers, organisations have to survive in highly marginal production environments. System resilience can be degraded as modes of production become risky and safety begins to drift. To illustrate what we mean by drift, consider this example from the Pelee Island accident. When the charter was initiated it was expected that each sector, the outbound leg and the return, would be treated as two distinct flights. As such, each sector would be subject to the same standards of flight planning and pre-flight preparation. From the evidence of minimal turn-round times and load sheet production, it seems that crews viewed the task not as two separate sectors but as a single flight with an en route stop. This conceptualisation of the flight process allowed important safeguards for the return flight to be breached.
In a similar manner, on 6 December 1999 a Piper PA31-350 Navajo Chieftain crashed after take-off from Johannesburg in South Africa (SACAA, 1999). The aircraft, engaged in a weekly charter transporting computer programmers to Namibia, was overloaded. Of nine flights flown as part of the contract, an examination of load sheets revealed that eight had been overweight for take-off. When the operator bid for the contract, the plan was to make the flight in two stages, stopping to refuel en route. After the initial proving flight, all subsequent flights were made with a full fuel load and continued direct without stopping. When fully loaded with passengers and their luggage for a week, the aircraft now had inadequate performance margins should an engine fail after take-off, as happened on this day. Drift, then, describes the manner in which working practices become aberrant and activity now takes place in ways that erode margins.
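The sand table demonstration that opens this section lends itself to a short simulation. What follows is a minimal sketch of a sandpile model in the spirit of Bak, Tang and Wiesenfeld, not a reproduction of their published experiments. The grid size, the toppling threshold of four grains and the function names are all assumptions made for illustration. Each call drops one grain in the centre and counts how many cells topple before the pile settles again.

```python
def drop_grain(grid, row, col, threshold=4):
    """Add one grain at (row, col), let the pile settle, return the number of topplings.

    Assumes a square grid. A cell holding `threshold` or more grains sheds
    `threshold` grains, passing one to each neighbour; grains pushed past
    the edge fall off the table and are lost.
    """
    size = len(grid)
    grid[row][col] += 1
    topplings = 0
    unstable = [(row, col)]
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue  # already settled (or a stale entry)
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))  # still too tall, revisit later
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings


def run_sand_table(size=11, grains=2000):
    """Drop `grains` grains one at a time on the centre of an empty table."""
    grid = [[0] * size for _ in range(size)]
    centre = size // 2
    avalanches = [drop_grain(grid, centre, centre) for _ in range(grains)]
    return grid, avalanches


if __name__ == "__main__":
    grid, avalanches = run_sand_table()
    quiet = sum(1 for a in avalanches if a == 0)
    print("grains that caused no slide:", quiet)
    print("largest slide (topplings):", max(avalanches))
```

In this sketch most grains settle without incident, while an occasional grain, no different from the others, sets off a slide. The pile is stable after every drop and yet always close to the next slide, which is the sense in which the text speaks of sufficient stability but inherent criticality.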
There is a grey area between drift and the concept of violation. The application of the ‘10% extension’ in the Saab 340 event described earlier might be considered a violation. Violations will be considered in more detail when we look at human error but, in the context of this chapter, I will use ‘drift’ to mean the gradual modification of work processes in ways that carry a higher level of risk where, importantly, the change in risk exposure is not detected by the organisation.
I want to illustrate three modes of drift at the level of the system. The first we can call ‘rational change’, the second ‘normalisation of deviance’ and the third, ‘asynchronicity’. Dekker (2004; see also NTSB, 2002) illustrates the process of rational change in relation to the failure of the Alaska Airlines MD-83 elevator trim jackscrew, which resulted in loss of control of the aircraft over the Pacific Ocean, off the coast of California. The original aircraft design called for maintenance at set intervals, and the maintenance task was codified in a set of schedules of work. In Table 3.1, we show how the interval between maintenance events inexorably increased over time. The figures in parentheses represent the actual time, in flying hours, between servicing events.
The table shows how, over a prolonged period, the interval between lubrications grew by almost an order of magnitude. Within a year of the original aircraft type, the DC-9, entering service, jackscrew thread wear in excess of the predicted value was being reported and so a safety constraint was put in place. The extent of the wear was to be tested at intervals. Table 3.2 shows how the interval between safety checks experienced the same rational change process.
Table 3.2 Jackscrew End Play Check Intervals
Rational change occurs when decisions are made based on factors that fail to capture the risks present in a system. In the case of aircraft maintenance, checking regimes are based on predicted values for wear and component failure. Maintenance aims to replace components before they fail or to control the rate of degradation. To achieve this goal, managers of maintenance need reliable information about mean times between failures and rates of wear. Given limited resources, one way of improving efficiency is to increase intervals between maintenance events such that interventions are more closely aligned with need. Decision-making is often based on the absence of failure events. So, as the servicing intervals increased and no adverse effects were reported, the rational approach assumes that the new interval is no more risky than the old schedule. When the safety check is locked into the same interval-driven checking process, the constraint in the system similarly becomes more risky. Rational change is predicated on modifications to processes not giving rise to anomalies.
However, systems can approach boundary conditions. The last jackscrew end play check made before the accident occurred was done in 1997. When the check was instituted, acceptable tolerance limits were defined for the measured play. These were set at between 0.003 and 0.040 inch. The final check on the accident aircraft measured the play to be 0.040 inch. After much discussion, it was decided that this was acceptable in that it did not exceed the limit. It took 6 days of discussion before the aircraft was released and a further 3 years before the trim system finally failed. The activity that underpinned the decision to release the aircraft characterises the second drift process, the normalisation of deviance.
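The end play decision described above can be restated as a small worked example. This is only a sketch: the two limits and the measured value are those quoted in the text, while the function names and the idea of expressing the result as a fraction of the tolerance band are assumptions added for illustration. It contrasts a pass/fail reading of the limits with a reading that asks how much of the band is left.

```python
LOWER_LIMIT_IN = 0.003  # lower end play limit quoted in the text, inches
UPPER_LIMIT_IN = 0.040  # upper end play limit quoted in the text, inches


def within_limits(end_play_in):
    """Pass/fail reading: is the measured play inside the tolerance band?"""
    return LOWER_LIMIT_IN <= end_play_in <= UPPER_LIMIT_IN


def remaining_margin(end_play_in):
    """Fraction of the tolerance band still unused.

    1.0 at the lower limit, 0.0 at the upper limit, negative beyond it.
    This measure is a hypothetical illustration, not a maintenance standard.
    """
    return (UPPER_LIMIT_IN - end_play_in) / (UPPER_LIMIT_IN - LOWER_LIMIT_IN)


if __name__ == "__main__":
    final_check_in = 0.040  # value measured at the last check, per the text
    print("within limits:", within_limits(final_check_in))
    print("remaining margin:", remaining_margin(final_check_in))
```

A measurement of 0.040 inch passes the first test and leaves nothing on the second. Both readings are consistent with the same data; the difference lies in whether the question asked is ‘has the limit been exceeded?’ or ‘how close to the boundary is the system?’.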
In her study of the loss of the Challenger Space Shuttle in 1986, Vaughan (1996) describes the process by which exceptional events become rationalised. In the case of the space shuttle, tolerance limits were exceeded but with no subsequent failure. As a result, the tolerance limits were increased. Events that were deemed unacceptable at the design stage became acceptable once data was gathered to show that excursions from expected performance had no apparent ill effect. In the case of rational change, no evidence existed to indicate that the system was unsafe. With the normalisation of deviance, anomalies are a feature of the operational environment but data is used to minimise the significance of those anomalies. One of the ways anomalies become ‘institutionalised’ is through the embedding of warning signals in other information. Vaughan reports that danger signs can be interpreted as ‘weak’ or ‘mixed’ so that their significance is lost. Within hierarchical systems, information is transmitted upwards in abbreviated, generic forms and unfavourable information is lost. What people know and understand depends upon their position in the hierarchy. Finally, deviation is normalised when oversight, such as is offered by safety departments or quality management systems, lacks independence. Where the responsibility for oversight is physically removed from the site of operations, the level of awareness is degraded. Where oversight is co-located, it becomes absorbed into the construction of understanding and is, therefore, complicit in the organisational view of safety.
The final source of drift is asynchronicity, in which changes in one part of the system are not made with reference to the rest of the system. Examples of asynchronicity include a change in procedure without reference to all agencies involved or, say, a change in a component supplier. One factor in the Pelee Island accident was the fact that the standard passenger weights used for load planning no longer reflected the average weights of actual passengers carried. Society at large has grown heavier faster than regulations have been updated. Societal change has resulted in the planned and the actual loads becoming uncoupled. In this case, asynchronicity is a function of a change at Level 5 not being reflected in guidance issued by lower levels. The discrepancy between the MRB and the Job Card in the Saab 340 example suggests asynchronicity.
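The uncoupling of planned and actual loads is simple arithmetic, and a short sketch makes the mechanism concrete. The passenger count and both weights below are hypothetical round numbers chosen purely for illustration. They are not taken from the Pelee Island investigation or from any regulation, and the function name is likewise an assumption.

```python
def load_discrepancy_kg(passengers, standard_weight_kg, actual_average_kg):
    """Weight carried but not accounted for when planning uses a standard weight."""
    planned_load = passengers * standard_weight_kg
    actual_load = passengers * actual_average_kg
    return actual_load - planned_load


if __name__ == "__main__":
    # Hypothetical values: 9 passengers, planned at 84 kg each, averaging 95 kg.
    hidden = load_discrepancy_kg(9, 84, 95)
    print("unplanned load (kg):", hidden)
```

With these assumed figures the load sheet understates the payload by 99 kg while showing no error, because the standard weight it relies on was never updated to match the people actually being carried.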
In a systems context, drift distorts the efficacy of interventions by arbitrarily restricting buffering: permissible solutions are engineered out of the range of possible responses to a problem. As a result, tolerance is modified as modes of failure shift towards brittleness. Drift is another illustration of cross-scale interaction in a system.