The past two decades have seen rapid growth in the deployment of employment tests via the internet, together with the increasing use of unproctored administration (Bartram, 2008b). This practice has raised concerns about the security of cognitive ability tests in particular, and more generally about the validity of scores from all types of tests administered in UIT conditions (Tippins et al., 2006). The early debate over whether this should happen has given way to an acceptance of this mode of administration. Now the focus is on how best to ensure it is safe, secure and valid (Bartram & Burke, 2013; Burke, 2006, 2009; Burke, Mahoney-Phillips, Bowler & Downey, 2011; Lievens & Burke, 2011; Tippins, 2008).
Cheating occurs and always has (Cizek, 1999). What concerns us here is whether cheating is more of an issue for technology-based testing or whether the technology provides a means of mitigating some of the risks associated with testing. Tate and Hughes (2007) reported results from a survey of 319 university undergraduates and postgraduates in 51 British universities on their perceptions of UIT. The majority (76%) had taken tests at home, with the next most frequently used testing location being a university computer room (27%). Taking a test at home was the preferred location (selected by 81% of undergraduates). Respondents were asked to report the frequency of different actions. Thirty-seven (12%) reported actions that could be considered to be cheating: of these, 15 reported colluding with friends, 15 reported obtaining the questions in advance and 6 reported circumventing the technology in some way. This survey indicates that the administration of traditional tests under UIT conditions may be subject to a degree of cheating. The challenge is how to counter this while retaining the logistical advantages of UIT.
We can divide instruments used in employment testing into two types: measures of maximum performance and measures of typical performance. The former, which include cognitive ability tests, have a right answer to each question. In this respect, they are similar to knowledge tests and other achievement-related measures in the way they are scored and normed. Typical performance measures, on the other hand, focus on how candidates typically behave in work settings. These are largely self-report measures (e.g., personality questionnaires), which do not have ‘right’ answers.
These two types of measure, maximum and typical performance, entail very different issues for ensuring the quality and validity of the data obtained. Measures of maximum performance are potentially open to various forms of cheating, as candidates may find a way to obtain access to the correct answers or take the test with the assistance of another. Measures of typical performance are open to more subtle forms of distortion, such as ‘faking good’.
It is natural for applicants to attempt to create a good impression when applying for a job, but a line should be drawn between putting forward a positive but honest view of oneself and pretending to be something one is not. Faking on self-report measures not only reduces the construct validity of the test (Tett, Anderson, Ho, Yang, Huang & Hanvongse, 2006) but also skews the rank ordering of applicants, which in turn can cause false-positive errors in selection decisions (Griffith, Chmielowski & Yoshita, 2007). The same concerns arise with ability tests; inflated false-positive rates can occur when a candidate’s responses to a test represent either unfair access to the correct answers, a proxy sitting the test on behalf of the candidate or a candidate colluding to obtain a score that is higher than his or her true level of ability.
Controlling cheating often depends on making it difficult for candidates to obtain access to the questions and checking that the candidate’s score was obtained from the candidate in question rather than someone else and that the score was achieved without assistance. These two issues are often dealt with by item banking and verification testing.
Item banking provides the means to construct tests ‘on the fly’ that differ for each candidate, using item-response theory as the basis for test construction. Tests may be assembled either as fixed-length, fixed-difficulty linear on-the-fly tests (LOFT) or as computer-adaptive tests (CAT), in which the selection of the next question is based on the current estimate of the candidate’s ability, derived from responses to the previous items. In either case, the candidate cannot know which items will be presented or the order in which they will appear. It is important to have a large item bank and to control item exposure levels to maintain security (Davey & Nering, 2002). Over time, test producers should monitor the questions in the item bank to check for changes in their parameters that might indicate they have been compromised through over-exposure or theft. They can also ‘patrol’ the internet to search for sites that offer illicit access to items from the bank and take action to close them down.
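The adaptive selection logic described above can be sketched in a few lines. The following is a minimal illustration, not any vendor's actual algorithm, of maximum-information item selection under a two-parameter logistic (2PL) IRT model, with a simple randomised exposure control; the item-bank structure and pool size are assumptions made for the example.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL IRT model: probability of a correct response at ability theta,
    given item discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item provides at ability theta; CATs pick
    items that are most informative at the current ability estimate."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, bank, administered, exposure_pool=5):
    """Select the next item: rank unused items by information at the
    current ability estimate, then choose randomly among the top few
    (a crude exposure control, so the same 'best' item is not
    over-exposed across candidates)."""
    candidates = [item for item in bank if item["id"] not in administered]
    candidates.sort(key=lambda i: item_information(theta, i["a"], i["b"]),
                    reverse=True)
    return random.choice(candidates[:exposure_pool])
```

A real CAT would also re-estimate ability after each response (e.g., by maximum likelihood) and apply more formal exposure controls; this sketch shows only the selection step.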
Verification testing involves the administration of a proctored test to shortlisted candidates who have previously been screened using a UIT. Scores on the UIT and the verification test can be compared to identify people with inconsistent scores. There are various ways of doing this. In one example - CEB’s Verify 2 - CAT is used to provide an ability estimate under UIT conditions and then, if the candidate passes the selection sift, this estimate is used as the starting value for the CAT verification test. If the final ability level estimate falls below the original cut-score, the candidate can be rejected.
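A verification check of this kind reduces to two questions: does the proctored estimate clear the cut-score, and is it consistent with the earlier unproctored estimate? The sketch below is a hypothetical decision rule, not CEB's scoring method; the standard errors of measurement and the 1.96 consistency threshold are illustrative assumptions.

```python
import math

def verification_decision(uit_theta, verified_theta, cut_score,
                          sem_uit=0.30, sem_verified=0.30, z=1.96):
    """Illustrative verification rule (all default values are assumptions):
    reject if the proctored (verified) ability estimate falls below the
    cut-score, and flag for review if the two estimates differ by more
    than measurement error alone would predict."""
    # Standard error of the difference between two independent estimates.
    se_diff = math.sqrt(sem_uit ** 2 + sem_verified ** 2)
    return {
        "rejected": verified_theta < cut_score,
        "flag_inconsistent": abs(uit_theta - verified_theta) > z * se_diff,
    }
```

A large drop between the UIT and verification estimates does not prove cheating, so in practice an inconsistency flag would trigger review rather than automatic rejection.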
In addition, it is now possible to use remote proctoring. This involves the use of technology such as webcams to monitor and record the test-taker during the test. In addition, behaviour is recorded in terms of response times, typing patterns and other measures. Data forensics software (Maynes, 2009) can alert a proctor to atypical behaviour and then the video record can be checked. Proctors can remotely stop a test or issue a warning (Foster, 2013; Foster, Mattoon & Shearer, 2009).
Faking on self-report measures is different from cheating on an ability test. Faking cannot be observed by a test proctor, and few people ask a colleague to complete a personality questionnaire or other self-description measure on their behalf. However, if people differ in the degree to which they bias their scores by faking, this may confer an unfair advantage on those who fake more. Most of the research on ‘faking good’ (see Griffith & Peterson, 2006, for a comprehensive review) has been laboratory-based, with students being asked to role-play applicants in ‘faking-good’ conditions or non-applicants in so-called ‘honest’ conditions. In such situations, we find that people can fake on self-report inventories. In a laboratory setting, people are not only willing to adopt false roles but are indeed instructed to do so. There are no negative consequences associated with them lying about themselves. Indeed, the demand characteristics of the situation encourage people to ‘fake’ as much as they can with no real adverse consequences.
If, and to what extent, job applicants ‘fake good’ in real situations is a much more complex issue, and the research on faking in real selection situations is far more ambiguous than that for laboratory studies (Levashina, Morgeson & Campion, 2009). Many have challenged the view that because people can fake in simulated settings, they will fake in real ones when the demand characteristics of the situation are very different (Arthur, Woehr & Graziano, 2000; Hough & Schneider, 1996; Ones & Viswesveran, 1998; Viswesveran & Ones, 1999).
There are several ways to control ‘faking good’ behaviour. Some evidence suggests that lie scales, warnings and honour statements, among others, deter candidates from inflating their self-ratings. A common way to control faking is by making it difficult to do. Students in simulated selection situations can typically raise their scores by around 1 SD on instruments using single-stimulus (Likert-type) response formats. However, they raise them by only around a third of that when forced-choice item formats are used (Christiansen, Burns & Montgomery, 2005; Jackson, Wroblewski & Ashton, 2000; Martin, Bowen & Hunt, 2002; Vasilopoulos et al., 2006). Most of the studies with forced-choice item formats have used forced-choice item pairs. Those with high ability are able to raise their scores more than others (Levashina et al., 2009). However, faking becomes increasingly difficult as the number of alternatives increases.
The challenge in constructing forced-choice items used to be in the process of matching statements in pairs (or triples or quads, depending on the format used) such that they provided good information about each of the scales involved and were equally desirable options. There was also an issue of ipsativity associated with how forced-choice format instruments were scored. Traditional methods of scoring forced-choice items result in the sum of the points given to the various scales being a constant. This means that if you know the scores obtained on all but one of the scales, the score on the last scale is determined. Ipsative items pose problems for a range of psychometric analyses and impose a constraint on the central location of score profiles (see Baron, 1996; Bartram, 1996). Recent developments have provided solutions to both problems. IRT scoring models have been developed (Brown & Bartram, 2009; McCloy, Heggestad & Reeve, 2005) and can be applied to forced-choice item data to recover the latent trait scores that determined the pattern of choice. These recovered scores are not ipsative. The IRT parameters of the items also provide a basis for assembling them into sets for forced-choice format use (Brown & Maydeu-Olivares, 2011).
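The ipsativity problem is easy to demonstrate. Under traditional scoring, each forced-choice pair awards one point to the scale of the chosen statement, so the scale totals always sum to the number of items: knowing all but one scale score fixes the last. A minimal sketch (the scale indices and response patterns are invented for illustration):

```python
def score_forced_choice(choices, n_scales):
    """Traditional scoring of forced-choice pairs: each response awards
    one point to the scale of the statement the candidate chose.
    `choices` is the list of chosen scale indices, one per item pair."""
    scores = [0] * n_scales
    for chosen_scale in choices:
        scores[chosen_scale] += 1
    return scores

# Two candidates answering six pairs across three scales: whatever the
# pattern of choices, the totals are forced to sum to six (ipsativity).
candidate_1 = score_forced_choice([0, 1, 0, 2, 1, 0], n_scales=3)
candidate_2 = score_forced_choice([2, 2, 2, 1, 1, 0], n_scales=3)
```

The Thurstonian IRT approaches cited above avoid this constraint by modelling the pattern of choices directly, recovering latent trait scores that are not forced to sum to a constant.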
Bartram and Burke (2013) argue that the degree to which people might cheat or ‘fake good’ depends on a combination of five factors:
- 1 People have to perceive a need to fake, which is determined by the level of the candidate’s investment in the outcome of the process. The higher the personal consequences of ‘failure’ on a test, the more someone might be driven to find a way to avoid failure that involves some degree of dishonesty. If the stakes are low or if the situation is not one in which one can either ‘pass’ or ‘fail’, then the motivation to cheat or fake will be low. We can identify this aspect of the situation as being ‘the perceived cost of failure’. The higher the stakes, the higher the perceived cost of failure and the more likely someone is to fake.
- 2 People differ in their willingness to fake or cheat, which relates to the strength of an individual’s moral or ethical stance on cheating or faking. Some are more willing to be dishonest in a given set of circumstances than others. Others may perceive collusion or cheating as fair play and believe that they are only doing what everyone else does.
- 3 People differ in their ability to cheat or fake. It is argued that 30-50% of applicants may inflate their scores by faking (Griffith & McDaniel, 2006). Applicants can buy books such as Ace the Corporate Personality Test (Hoffman, 2000), and there is plenty of advice on the internet on how to fake personality questionnaires. While we should not ignore the assistance that is available, the fact that people try to ‘ace the test’ does not necessarily imply that they will be successful. Applicants do not always get it right when they fake: as many as 20% of those who attempt to fake do so in the wrong direction (Griffith & McDaniel, 2006).
- 4 Test-takers have to believe that the benefits of cheating outweigh the risks associated with being caught. Although candidates might be willing and able to cheat in a laboratory study, in a real-life selection setting they might regard the situation as one in which the risks of being caught are too high.
- 5 Most important of all is the opportunity to fake or cheat, which the test designer can control. However, it is impossible to control the willingness or ability to behave dishonestly as these are attributes of the candidate.
Candidates can only cheat on a test if they have access to the answers or assistance, or use a proxy of higher ability. The opportunity to fake or cheat can be managed in several ways. Methods of test construction, including the use of forced-choice item formats, and test administration procedures (e.g., verification testing or remote proctoring) can mitigate the risks of cheating. The use of LOFT and CAT with verification testing limits foreknowledge of items and indicates who may have used a proxy or received some form of illicit assistance. Multiple test forms were a key recommendation made by Tippins and colleagues (2006), and Hollinger and Lanza-Kaduce (1996) report a study in which 80% of students surveyed stated that the ‘scrambling’ of items was the most effective anti-cheating strategy of those included in the study. In addition, faking will become increasingly difficult as the complexity of what one is trying to fake increases.
Faking on a personality instrument is an invisible process that requires no prior knowledge of the content. A single-scale instrument (e.g., a conscientiousness scale or an extraversion scale) will be much easier to fake than a profile on a multi-scale instrument. It is fairly easy to role-play a given persona in terms of the Big Five, but role-playing a person defined by the 30 facets of the NEO-PI-R or the 32 scales of the OPQ32r is much more demanding. The use of multi-scale instruments with detailed levels of measurement not only makes faking harder, it also optimizes validity. The validity of personality instruments increases if relatively narrow bandwidth criterion-focused scales are aligned with specific criteria (Bartram, Warr & Brown, 2010; Ones & Viswesveran, 2001; Warr, Bartram & Martin, 2005).
Optimal test security puts in place multiple layers of control so that cheating the system becomes too complicated, too risky and too costly for the candidate. For self-report measures, measuring relatively large numbers of scales and using forced-choice item formats and non-transparent scoring algorithms and scale combination rules can curtail cheating. Systems of checks and balances can detect people whose results from one method of assessment at one stage of the process are inconsistent with those from another method at a later stage.
Finally, cheating can be deterred by informing candidates that they are expected to respond honestly and openly, and that there are consequences if they are found not to have done so. In many testing programs (e.g., CEB’s), tests are introduced with a simple honesty contract which states that the candidate will undertake the test honestly and in the spirit of fairness to all candidates; this also serves as a reminder that the content of the tests is protected by copyright and covered by law. Ariely (2008) cites studies showing that subjects will cheat if given the opportunity to do so. However, when participants were asked to sign a simple honesty statement, the level of cheating dropped substantially. Honesty contracts have been shown to have a positive impact on the quality of information obtained from biographical questionnaires (Stokes, Mumford & Owens, 1994). Reminding candidates that they are expected to be honest is an easy way to make the ‘rules of the game’ clear to them.
Detailed guidelines on how to ensure test security and good practice in technology-based testing are contained in the International Test Commission’s Guidelines on Computer-based and Internet Delivered Testing (International Test Commission, 2006) and Guidelines on the Security of Tests, Examinations and Other Assessments (International Test Commission, 2014).