Assigning a Value to the Performance
Having gathered the evidence, the next step in assessment is to assign a value to the performance. Evaluating pilot performance is not the same as measuring their height or weight. As soon as we start talking about a ‘scale’, we automatically think in terms of intervals of standard, knowable magnitude. My subject who is 1.82 m tall is exactly 10 cm taller than someone who stands 1.72 m tall. They are twice as tall as someone who is only 91 cm. The intervals on the scale are of a fixed magnitude, and
FIGURE 12.1 Mapping markers on to competencies.
the distance between intervals is known. When we grade a pilot’s competence, we are actually assigning that pilot to a category of performance. The categories are arranged in a sequence and may have a numerical label, but it is not a scale in the same sense as a ruler is a scale. The non-technical skills (NOTECHS) markers initially used a five-point grading scale:
Very Poor - Poor - Acceptable - Good - Very Good
The problem with this approach is that we have neither a clear understanding of the magnitude of each interval nor the distance between intervals. Is an ‘acceptable’ performance three times better than a ‘very poor’ performance, for example?
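The distinction between measuring and categorising can be made concrete. The sketch below (my illustration, not from the source; the helper name `median_grade` is invented) treats the NOTECHS labels as an ordered set of categories. Ordering justifies a median, because a median only uses rank; it does not justify means or ratios, because the intervals between categories have no defined size.

```python
# Sketch: NOTECHS-style grades are ordinal categories, not interval measurements.
# The five labels come from the text; the helper below is illustrative only.
from statistics import median_low

GRADES = ["Very Poor", "Poor", "Acceptable", "Good", "Very Good"]
RANK = {label: i for i, label in enumerate(GRADES)}  # order known, magnitude not

def median_grade(observations):
    """Median is legitimate for ordinal data: it relies only on ordering."""
    ranks = sorted(RANK[g] for g in observations)
    return GRADES[median_low(ranks)]

obs = ["Poor", "Acceptable", "Acceptable", "Good", "Very Good"]
print(median_grade(obs))  # -> Acceptable

# What we must NOT do: treat the rank codes as quantities. Even if we coded
# the grades 1-5, a claim like "Acceptable is three times better than
# Very Poor" would be meaningless, because the size of each interval and
# the distance between intervals are unknown.
```

The same reasoning is why averaging grade numbers across a pilot's check rides, while common, quietly assumes an interval scale that the grading system does not actually possess.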
The problems we face at this stage are how many intervals we want and how to describe them. If we go back to the question of stakeholders, the aviation authority might just want two intervals: compliant with all requirements or not compliant. The training system might want a finer-grained analysis to identify specific classes of endemic weakness so that the training system can be tweaked. The airline might want to know who can be trusted to do the job, who needs watching and who needs further training. Each interval on a scale represents data about performance. The problem is that with more intervals we suffer an increasing loss of standardisation, or consistency, between assessors. An acceptable compromise in terms of capturing the most information for the least loss of consistency is four intervals (Flin, personal communication). Unfortunately, settling on the number of intervals does not tell us what the intervals represent. That decision will flow from your answer to why you want to assess in the first place. Table 12.2 offers a suggested grade scale for use in assessing operational line pilots; a grade scale for use in initial training might need more intervals or different category descriptions. I have deliberately chosen not to assign letters or numbers to each of the intervals, simply because to do so reinforces the misperception that assessment is measurement rather than categorisation.

TABLE 12.2 Example Grade Scale
It should be immediately apparent that the intervals are not of the same size. Given that we are typically assessing line pilots who hold a type rating and many of whom will have years of experience, the ‘unfit’ interval ought to represent a small proportion of the distribution of performance. The ‘unfit’ grade is probably the only interval that can be considered ‘objective’ because it should be anchored to a regulation or a procedure (making it both summative and criterion-referenced). Conversely, the ‘operational standard’ interval will capture the bulk of the pilots in a company and, thus, will be very broad. The boundaries between the first three intervals are quite clear, but my experience is that the boundary between ‘operational standard’ and ‘resilient’ can be problematic. To understand why, we need to look at the sources of unreliability in assessment that flow from the act of assigning a value to performance.
A common problem in all assessment systems is that of 'central tendency': everyone is average. In cases where the scale has five intervals, everyone gets '3'. If it is a seven-point scale, then everyone gets '4'. Attempts to combat this using even-numbered scales simply offset the 'centre': everyone is still a '3'. Central tendency is often a manifestation of the lack of a clear performance standard. If I am not sure of the expected standard, then I cannot go wrong with 'average'. Equally, if I give a low score, then I have to justify to the trainee why I viewed the performance the way I did, and that can be uncomfortable. If I give a high score, then I might have to justify my decision to management. So, a '3' is the easy way out.

Another problem is 'scale clipping'. In this case, assessors adopt the attitude that no pilot is bad enough to warrant a '1' or else we would not have given them a job. At the other end of the scale, 'there is always room for improvement' and, so, the best you can get is a '4' on a five-point scale.

Grade manipulation can sometimes result from experience levels or a belief that pilot ability should improve over time. In the case of experience, it is only natural that a new-hire pilot might not be as competent as an experienced line pilot (summative v formative assessment). The idea that pilots should get better with experience, logically, means that more than four intervals would be needed to capture performance across a career in an airline. Status can shape grading, and I have seen examples where management pilots always score 'straight 5s' on their assessment, training captains get all '4's and the rest of the pilots fight it out for the remaining numbers. Following on from this idea is the problem of, for want of a better word, vanity. I have seen situations where very senior captains expect top scores not just because of their performance (which is, more often than not, good) but simply because of their position in the airline.
I have heard the argument made to retain a ‘top score’ on the scale simply as a way of rewarding longevity. That said, I have also seen high, but unrealistic, scores awarded to trainees as a way of motivating them. The grading of performance is a social act, but the more the social factors intrude in the process, the less valuable the activity becomes.
Poorly designed grade scales will affect reliability in that the data collected across time will be inconsistent. We need to establish what level of sensitivity (the number of intervals) will provide the data needed to satisfy the measurement goals (the reason for assessing). The intervals or categories must be defined in meaningful terms, using agreed benchmarks if possible. The reasons for categorising performance must be made clear to the pilot group, and the assessors must be disciplined in their use of the system.
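Consistency between assessors can itself be quantified. Cohen's kappa is a standard statistic for agreement between two raters corrected for chance; applying it to grade data is my suggestion, not something the source prescribes, and the data set below is invented purely for illustration.

```python
# Sketch: Cohen's kappa for two assessors grading the same performances.
def cohens_kappa(rater_a, rater_b, categories):
    """Observed agreement between two raters, corrected for the
    agreement that category frequencies alone would produce by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Generic grade codes on a four-interval scale; invented example data.
a = [1, 1, 2, 2, 3, 3]
b = [1, 1, 2, 3, 3, 3]
print(round(cohens_kappa(a, b, [1, 2, 3, 4]), 2))  # -> 0.75
```

Raw percent agreement overstates consistency when one category (such as 'operational standard') dominates, because two assessors who both grade almost everyone in the middle will agree often purely by chance; kappa corrects for exactly that.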