Conducting the Assessment
A fundamental problem with assessment is that the assessor is not invisible: the assessor changes the dynamic of the situation. This is known as the ‘observer effect’, and a suggested advantage of the line operations safety audit (LOSA) methodology is that, because the observer is deliberately unobtrusive, the performance of the crew is closer to ‘real life’. Another problem is that assessors are sometimes involved in the conduct of the exercise. This is most obvious in the simulator context, where the trainer also manages the scenario, often provides external inputs such as ATC, but still has to assess performance. I have witnessed debriefings unravel because the trainer was distracted when the crew completed a specific action, and a comment about a procedural omission became a contest over the truth. For line checks, it sometimes happens that the assessor forms part of the operating crew, and this is certainly the case with line flying under supervision for new hires and upgrades. Combining operating with assessing increases the risk of poor data collection. The mechanics of the assessment situation, then, militate against the process.
A recognised framework for the conduct of assessment is captured in the mnemonic ORCE. First, we need to ‘observe’. Assessment is an evidence-based process, and the evidence is the observed behaviour captured during the event. Next, we need to ‘record’ the evidence: because of the problems of primacy and recency, a failure to record will result in a skewed evaluation. These first two stages are conducted during the event but, once the session is over, we need to ‘classify’ the evidence. This involves looking at the samples of behaviour and assigning them to the relevant marker. The final step is to ‘evaluate’, which is to assign a value to the performance based on the grade scale.
Evaluating the evidence will be shaped by the design of the assessment system. There are three broad approaches to evaluation. The first can be described as the ‘global’ approach, involving a single grade for each marker across the whole event. Another approach would be to segment the profile and collect grades for each phase (taxi, take-off and climb, approach and landing). A third, tailored, method is to constrain the data collected to the most important behaviour for each phase. For example, a complex non-normal scenario might only be assessed against markers for, say, problem solving and communication.
Another method of evaluation is known as the VENN method (EASA, 2019). This approach suggests that an individual is graded according to:
How well the candidate demonstrated the required behaviours (quality).
How often the behaviours were demonstrated (frequency).
How many behaviours were demonstrated (quantity).
What the outcome was (result).
The VENN approach uses a marker framework with specific exemplars of observable behaviour for each marker (see Table 11.3). There are problems with this method (see Roth & Mavin, 2015, for a discussion of some of the issues). First, it is impossible to elaborate all of the behavioural events that are representative of a specific marker; it is the role of the assessor, after training and standardisation, to use their expertise to analyse the performance. Second, competence for a line pilot in productive service is typically assessed during a line check or in a simulator, and, under normal line conditions, a crew might not have an opportunity to fully demonstrate their capability. The situation simply did not demand any great virtuoso performance. The frequency and quantity criteria are, thus, fallacies.
Another problem is that elaborations of markers using exemplars encourage assessors to use the marker as a checklist. Performance is graded according to the number of exemplars observed. To a degree, we enter a circular loop here: assessors are encouraged to look for frequency and quantity as part of the recommended process, yet the fact that the marker is simply a subset of all possible exemplars means that the process is arbitrary and constrained.
Gontar and Hoermann (2015) found that ‘non-technical skills are rated more reliably under high workload conditions than under low workload conditions, and social aspects of non-technical skills are rated more reliably than cognitive aspects’. The stress of high workload exposes fundamental competence and the demands of a line check might simply be insufficient to expose weaknesses in crew competence. Equally, whereas the social aspects of performance are tangible, the cognitive aspects are not. This brings us back to the question about how best to ensure that we have a full understanding of competence.
Data collection opportunities should be reviewed in order to take full advantage of their potential. Rather than line checks being passively driven by events on the day, they could be designed to target specific competencies, especially analysis and planning. The simulator activity typically comprises a set of standard non-normal and emergency events as part of licence renewal, but it may also include a LOFT event. The problem is that the complexity of the LOFT scenario may vary from one training cycle to another, and the scenario may be conducted differently according to the trainer assigned. Equally, the scenario might not tap into all the competencies we have decided are of interest. The assessment situation, then, can constrain the data we collect. Because the assessment framework might not map onto the data collection opportunities, training managers must have the skills to recognise the problem and find ways to mitigate it, either through effective design of LOFT scenarios or through innovative data collection methods, such as self-assessment.

Over an annual simulator training cycle, LOFT scenarios can be coordinated to provide coverage across all competencies. Trainers, including simulator instructors, are chosen for their skill and breadth of experience. The training event should be sufficiently standardised to ensure that the data captured is trustworthy while still affording space for the trainers to share their expertise. This can be a challenge. The conduct of assessment, however, must be accomplished to the same standard by all raters. The context of assessment, then, is the final source of bias in the data collection process.