Assessors as Sources of Bias
In the last section, we saw examples of how the act of grading a performance can become inconsistent. The attitude of the rater to the process of performance appraisal is just one of many forms of bias that affect the reliability of the system. A fundamental problem is that, being a social animal, I generally like people who are like me. On first encounter, I start sizing someone up. If I find something about them I like, I start to build a positive reaction to them. Then confirmation bias kicks in. The more I see what I like, the stronger the positive evaluation. The problem is that I now discount any signals that might be telling me that my initial assessment was wrong. Goldin and Rouse (2000) found that the percentage of female musicians in the five highest-ranked orchestras in the USA increased from 6% in 1970 to 21% in 1993 after ‘blind’ auditions were introduced. Candidates played behind a screen and decisions were made on skill alone. ‘Unconscious bias’ reflects the fact that we cannot help but make judgements about others; the ability to rapidly categorise strangers, while possibly conferring fitness in an evolutionary sense, creates problems in the modern world. Airlines, like any organisation, are social entities and people will get reputations. Crew reading this will identify with that sensation of getting their flying roster and, once they see who they have been paired with, will already have an expectation of what that particular day will be like. When an individual arrives for assessment, the groundwork has often been prepared, unconsciously, in advance.
Another source of bias is what is known as the ‘horns or halo’ effect. If someone creates a bad impression at the start or arrives with a poor reputation, then everything they do will be seen as bad. We are primed to expect the worst. The halo effect is the opposite. Someone who has a good reputation is seen in a positive light and even a poor performance is explained away: ‘they were having a bad day’. Primacy and recency describe attributes of memory. In an encounter, our first and last experiences tend to be more memorable, whereas the events in between can blend into a generic memory of what happened. This matters because an evidence-based assessment scheme requires assessors to attend to the performance and capture samples that can be categorised and evaluated; reliance on memory will distort the evaluation.
We all have expectations that we bring to the workplace. I have played a game on CRM courses where I divide the class into captains and FOs and get each group to imagine that they are approaching the desk in Dispatch and they are about to join up with their other crew member. What do they expect from that person? The lists can be illuminating, but the activity reflects the fact that we are primed to expect certain responses in specific settings and a failure to meet expectations creates a negative impression. Individuals are activating their mental script for the situation ‘arrive at Dispatch’ and discrepancies create dissonance. The problem is that these expectations are not a formal specification; they reflect personal preference and, as such, we all differ. Where a power differential exists, as in an assessment situation, the preferences of the rater become the standard: individual bias drives interpretation. One final problem we have to contend with is that attitudes change over time. Usually, assessors become harder to please. As one trainer put it to me: ‘I just couldn’t face seeing the same mistakes being made over and over again’.
These various forms of bias can be covered in an assessor training course, but awareness of these issues is not mitigation. Returning to unconscious bias, evidence suggests that specific training courses are of questionable value (see Atewologun, Cornish & Tresh, 2018, for a review). In fact, some studies show that training merely provides a way to rationalise, and deny, biased behaviour. The fact that giving information about bias does not remove bias has long been recognised.
Assessors need to be recalibrated at intervals, and their work needs to be checked and standardised. A failure to recognise this will result in unreliable data being generated.
Establishing Reliability and Validity
In any domain, testing regimes must meet the requirements of reliability and validity. Although earlier I used the term ‘reliability’ in a generic sense in relation to the data collection, in measurement it has a stricter meaning and refers to the stability of a testing instrument over time. In short, two observations or samples of the same item taken at different times should return the same score unless the target has changed in some way between sampling. In a quality system, measurement devices, such as scales or chronometers, must be subject to independent testing to verify that they are consistently accurate. This is reliability. Validity, on the other hand, is the degree to which the evidence gathered actually reflects the underlying construct being measured. For example, if I am interested in height, I can take a tape measure and check the height of an individual. The value I get will have a strong relationship with the ‘real’ height of that individual, within tolerable limits. If the measure shows 1.82 m, then their height is probably 1.82 m. Now, let us assume I want to compare a group of comedians on a scale of ‘funniness’. I watch each comedian in turn for a period of, say, 15 minutes in a room containing 50 people. Using the decibel app on my smartphone I track the noise from the audience and I record the peak decibel value recorded during the 15 minutes. My hypothesis is that the funniest comedian will produce the loudest laughs. I think the flaws in my approach are clear. Decibels would not be a valid measure of comedic competence, whereas height taken with a tape measure would be. The behaviours we identify as candidates for measurement must pass the test of validity. What we observe must relate to the underlying construct. This book has shown that human behaviour is messy and does not fall into neat categories. Furthermore, as has been said, outputs flow from hidden internal processes.
The shorter the distance between the observed behaviours and the attribute being measured, the less need there is for interpretation and the more valid the data will be.
McMullan et al. (2020) conducted a large-scale, systematic review of observational tools used in hospital operating room contexts in order to report on their psychometric properties. Thirty-one tools were identified in the literature, most of which were derived from the original aviation NOTECHS scheme (Table 11.2) but with modified grade scales of between 3 and 8 intervals. Reliability was established for three of the tools using a test-retest method. Various aspects of validity were also assessed. In our case, the range and complexity of statistical testing needed to establish the psychometric properties of a marker scheme are probably unnecessary, but a familiarity with some basic concepts is appropriate.
To illustrate the problem, consider Table 12.3. Two experienced training captains were asked to assess ten newly graduated student pilots to estimate their probability of successfully passing the company’s type rating course. The candidates had two attempts at flying a short profile in a flight simulator. To avoid training effect, each assessor saw half the group on their first attempt to fly the profile and half on their second attempt. Assessors were asked to assess the probability of success on a five-point scale:
The simplest method of calculating inter-rater agreement is to look at those occasions where both raters gave the same score and then divide the number of agreed scores by the total number of candidates rated. The assessors agreed on four out of ten occasions (0.4), suggesting that the process was no better than chance.
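As a sketch, the calculation is easily scripted. The scores below are invented for illustration, not taken from Table 12.3; they are arranged so that the two raters agree on four of the ten candidates, as in the example.

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of candidates to whom both raters gave the same score."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same candidates")
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# Hypothetical five-point scores for ten candidates (illustrative only)
rater_a = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3]
rater_b = [3, 2, 2, 4, 1, 4, 4, 3, 4, 2]

print(percent_agreement(rater_a, rater_b))  # 0.4
```

The weakness of this figure, as the next section discusses, is that it makes no allowance for agreement arising by chance.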
Assessment of performance comprises two separate activities. First, assessors allocate performance elements to a category, or marker. Second, assessors evaluate overall performance in each marker and assign the individuals to another category, or grade. In the example we have just looked at, the training captains were assigning candidates to categories - likelihood of success - not grading the performance against a benchmark. We need to be sure that the cadre of assessors is consistent both in allocating observations to markers and then in assigning a value to the performance.
TABLE 12.3 Assessor Scores

There are two areas of interest to us: inter-rater agreement or reliability (IRR) and inter-rater variability. In fact, these terms are used flexibly in the literature, but I will differentiate between them by using the former when assessors are looking at the same candidate and the latter in cases where multiple assessors are grading a range of candidates. If we look at IRR first, many of the most common statistical tests available were developed in a specific context, such as clinicians looking at cancerous tissue samples and assigning them to a limited number of categories based on the tumour type. Quite often, tests simply compared two raters and the choice was dichotomous: ‘is this an example of x, yes or no?’.
The simplest way to test IRR is the method used in the example above: how often do two assessors assign the same score to an individual? To be reliable, tests also need to accommodate measurement error. Any score assigned to a performance includes a component that reflects the true score and a component that reflects measurement error, one example of which is ‘luck’. The fact that two raters assign the same score to a performance could simply be chance, especially when a grade scale only has a few intervals. If we consider the case of an exam using multiple-choice questions (MCQs) each with four responses, there is a 25% chance of getting the right answer simply by guessing. It is for this reason that many MCQ exams have penalty scoring for wrong answers. Commonly used tests that allow for measurement error include Cohen’s and Fleiss’ kappa. IRR is an important tool used during the initial design of psychometric tests to explore their accuracy or discriminatory powers.
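To show how a chance correction works in practice, here is a minimal sketch of Cohen’s kappa for two raters; the scores are invented for illustration.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Chance agreement is estimated from each rater's marginal category
    frequencies; kappa is undefined when expected agreement equals 1.
    """
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two raters, four candidates: raw agreement is 0.5, but that is exactly
# what chance predicts from the marginals, so kappa falls to zero
print(cohens_kappa([1, 2, 1, 2], [1, 1, 2, 2]))  # 0.0
```

A raw agreement score of 0.5 looks respectable; kappa exposes it as no better than guessing.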
The simple calculation of agreement illustrated here is probably adequate for use on an assessor training course to focus attention on group consistency. If we assume that the grade awarded in an exercise is a reliable surrogate for the nature of the evidence collected by the trainee assessor, the output from an IRR calculation can be used as a starting point for a conversation about observational performance. The score is a catalyst for discussion.
To determine the accuracy of the performance of a group of assessors we need a ‘gold standard’ against which the performance of raters can be compared. While this is feasible in some domains, in assessing competence in an operational aviation context it is remarkably difficult to achieve. Consider the data in Table 12.4, which shows the distribution of grades by six assessors observing the same candidate.
On the four-point scale depicted in Table 12.1, the upper two categories are acceptable - above the cut-off - while the lower two require management effort. From an organisational perspective, we want to minimise the risk of assessing someone as competent when they are not (a false-positive error). Equally, we want to prevent unnecessary retraining if the candidate is, in fact, of an acceptable standard (a false-negative error). The absence of a ‘gold standard’ places an additional burden on the design of the grade scale. Agreement between assessors, in the sense we have been discussing, would not be an acceptable metric simply because, although there might be congruence, they could simply all be wrong! On an initial assessor course, again, data presented as in Table 12.4 highlights any lack of standardisation and can be used to promote a discussion.
TABLE 12.4 Grade Calibration
Finally, once we have trained and standardised our cadre of assessors, we now have to check on the level of inter-rater agreement. For this, we use the statistic rwg (Smith-Crowe et al., 2014). For each marker, we calculate the statistic using the formula rwg = 1 − (Sx² / ((A² − 1)/12)), where Sx² is the observed variance of the single item, in this case the scores per marker, and A is the number of response options. A result of >0.8 shows an acceptable level of agreement. This test should be applied to the output from a final standardisation exercise conducted during an initial assessor training course.
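The calculation itself is straightforward to sketch. The grades below are invented; note, as an assumption worth flagging, that treatments in the literature differ on whether the sample or population variance is used for Sx², which shifts the result slightly for small groups.

```python
from statistics import variance  # sample variance; some treatments use pvariance

def r_wg(scores, num_options):
    """Within-group agreement r_wg for a single item (marker).

    Compares the observed variance of the assessors' scores against the
    variance of a uniform "no agreement" null distribution over
    num_options response options, which is (A**2 - 1) / 12.
    """
    expected_null_variance = (num_options ** 2 - 1) / 12
    return 1 - variance(scores) / expected_null_variance

# Six hypothetical assessors grading one marker on a five-point scale
print(r_wg([4, 4, 5, 5, 4, 4], 5))  # above the 0.8 threshold
```

Identical scores give rwg = 1; scores scattered more widely than the uniform null can push the statistic below zero.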
All of the examples used so far have involved small groups of assessors looking at the same performance in a training context. To monitor the health of the operational system, we need a test that can cope with multiple assessors looking at different candidates with each candidate being graded by more than one assessor. The training record system must be capable of generating a table of scores awarded by each assessor for the specific candidates. The result can be analysed using Gwet’s AC2 statistic (Gwet, 2014).
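Gwet’s AC2 handles weighted (ordinal) agreement across incomplete multi-rater designs, which is more than can be sketched briefly here; its unweighted special case for two raters, AC1, conveys the underlying idea. The data are invented, and categories are assumed to be numbered 1..q.

```python
def gwet_ac1(rater_a, rater_b, num_categories):
    """Gwet's AC1 for two raters (the unweighted special case of AC2).

    Chance agreement is based on the average marginal proportion of each
    category across both raters, which makes AC1 less sensitive than
    kappa to heavily skewed marginal distributions.
    """
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Pool both raters' scores to estimate each category's prevalence
    all_ratings = list(rater_a) + list(rater_b)
    expected = sum(
        (all_ratings.count(c) / (2 * n)) * (1 - all_ratings.count(c) / (2 * n))
        for c in range(1, num_categories + 1)
    ) / (num_categories - 1)
    return (observed - expected) / (1 - expected)
```

For production use, an established implementation of the full AC2 (with ordinal weights and support for missing ratings) would be preferable to hand-rolled code.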
I want to end this section with a slight caveat. When we assess performance and assign a sample of behaviour to a marker, we are creating a piece of categorical data. I am simply saying that I have x samples of the behaviour we have chosen to call category y. Because we have labelled the categories, calling them markers, the data are classed as nominal. Similarly, a grade is also categorical data but now we can consider it to be ‘ordinal’ because the category we have used is on a progression from low to high. The categories themselves have no specific value; they simply represent ways of ordering data. As such, they represent types of non-parametric data. Because we are dealing with non-parametric data, mathematical operations, such as averaging the grade scores to produce a single value for an event, are, strictly speaking, illegal. It is commonplace to see parametric statistical techniques, such as calculating means and standard deviations, applied to competence assessment scores. In the same way that Dekker et al. questioned epistemological overconfidence, I would guard against too much faith in misapplied statistics in assessment.
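By way of illustration, summaries that do respect the level of measurement are frequency counts for nominal marker data and the median or mode for ordinal grades. The data below are invented for the purpose.

```python
from collections import Counter
from statistics import median_low, mode

# Hypothetical marker observations (nominal data) from one assessed event
markers = ["communication", "workload management", "communication",
           "situation awareness", "communication"]
print(Counter(markers).most_common(1))  # [('communication', 3)]

# Hypothetical grades (ordinal, four-point scale) across four markers.
# Averaging these to 3.0 would treat category labels as measurements;
# the median and mode stay within the categories actually awarded.
grades = [2, 3, 3, 4]
print(median_low(grades), mode(grades))  # 3 3
```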