Criterion-related and incremental validity
Considerable evidence has accumulated over the years supporting the relation between SJTs and job performance. McDaniel and colleagues conducted two meta-analyses (McDaniel et al., 2001, 2007) and reported corrected estimated population correlations of 0.26 and 0.34, respectively (uncorrected correlations 0.20 and 0.26). The more recent analysis included data on over 24,000 respondents. The criterion used in most studies is a composite score of job performance ratings. However, as evidenced by Christian and colleagues’ meta-analytic findings, criterion-related validity can increase when predictor and criterion are more carefully matched (Christian, Edwards & Bradley, 2010). These authors divided the job performance criterion into three facets: task performance (i.e., job-specific skills), contextual performance (i.e., soft skills and job dedication), and managerial performance (i.e., management skills). SJTs were then sorted into a typology of construct domains. The authors hypothesized that criterion-related validity would increase if particular criterion facets were closely matched with the content domains of the SJTs (e.g., contextual performance predicted by SJTs from the domains of interpersonal and teamwork skills). Overall, the authors found support for their content-based matching approach: relatively homogeneous SJTs saturated with a particular construct domain evidenced higher criterion-related validity with the criterion component they were designed to predict than heterogeneous composite SJTs.
In addition to moderation by criterion facet (a content-based moderator), the criterion-related validity of an SJT can be influenced by method-based moderators. We highlight three moderators identified in the literature, relating to (1) test development procedure, (2) item stem format and (3) test delivery format. Meta-analytic evidence established that SJTs yield higher validities (r = 0.38 vs. r = 0.29) when they are based on a careful job analysis than when they are based on intuition or theory (McDaniel et al., 2001). A second moderator is the level of detail in the item stem; less detailed questions show a slightly larger validity than highly detailed questions (r = 0.35 vs. r = 0.33). This runs somewhat counter to the premise of contextualized SJTs that context and level of detail increase the criterion-related validity of the test scores. Third, the test delivery format has been found to differentially affect validity: video-based SJTs show higher criterion-related validity for predicting interpersonal skills than the traditional paper-and-pencil format, with corrected population correlations of 0.36 for video-based SJTs and 0.25 for paper-and-pencil formats (Christian et al., 2010). This finding supports the contextualized perspective on SJTs because contextual information (e.g., about environmental cues, nonverbal behaviour) seems to be necessary to adequately apply interpersonal skills.
An interesting strand of research investigates the incremental validity of SJTs relative to other predictors of performance. McDaniel and colleagues (2007) found that SJTs explained 6-7% additional variance above the Big Five personality factors and 3-5% additional variance above cognitive ability, depending on the type of response instruction (knowledge instructions vs. behavioural tendency instructions). Further, SJTs explained 1-2% of variance above cognitive ability and the Big Five factor scores combined. More recently, SJTs as low-fidelity simulations have been contrasted with assessment centre exercises in a high-stakes selection context. Lievens and Patterson (2011) found that criterion-related validity was similar for the SJT and the assessment centre exercises. Subsequent incremental validity analyses revealed that the assessment centre exercises explained 3% additional variance in the criterion (job performance) over the SJT. However, path analysis showed that assessment centre performance only partially mediated the effect of procedural knowledge as measured by the SJT on job performance, indicating that scores obtained from these two types of simulations should not be viewed as redundant.
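The hierarchical regression logic behind these incremental validity figures can be sketched in a few lines: fit a baseline model with the established predictors, add the SJT, and inspect the gain in R². The data below are simulated purely for illustration; the coefficients, sample size and variable names are invented, not taken from the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulated predictors: the SJT deliberately overlaps with both
# cognitive ability and personality, as the literature suggests.
cognitive = rng.normal(size=n)
personality = rng.normal(size=n)
sjt = 0.5 * cognitive + 0.3 * personality + rng.normal(size=n)
performance = 0.4 * cognitive + 0.2 * personality + 0.3 * sjt + rng.normal(size=n)

def r_squared(predictors, y):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared([cognitive, personality], performance)
r2_full = r_squared([cognitive, personality, sjt], performance)
print(f"Delta R^2 for the SJT: {r2_full - r2_base:.3f}")
```

The difference `r2_full - r2_base` is the "additional variance explained" reported in the meta-analytic work, here expressed as a proportion rather than a percentage.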
In sum, contextualized SJTs predict variance in job-related criteria to an extent that is comparable to other frequently used selection tools (see Schmidt & Hunter, 1998). Importantly, contextualized SJTs contribute incrementally above and beyond Big Five personality factors and general mental ability.
Construct-related validity
Just as item heterogeneity makes it difficult to estimate the internal consistency reliability of SJT scores, it also makes it challenging to delineate which construct(s) are being measured by an SJT. Next to decisions pertaining to the actual test content, the method of measurement can also influence which constructs SJTs measure. Concerning measurement method, McDaniel and colleagues (2007) obtained a differential pattern of construct-related validity coefficients when SJTs with knowledge instructions (‘What should you do in a given situation?’) were compared to SJTs with behavioural tendency instructions (‘What would you do in a given situation?’). Correlations between SJTs with behavioural tendency instructions and three Big Five personality factors were higher than for SJTs with knowledge instructions (agreeableness 0.37 vs. 0.19, conscientiousness 0.34 vs. 0.24, and emotional stability 0.35 vs. 0.12, respectively). Conversely, SJTs with knowledge instructions correlated more strongly with measures of cognitive ability than SJTs with behavioural tendency instructions (0.35 vs. 0.19, respectively).
Subgroup differences
Although SJTs generally result in smaller subgroup differences than cognitive ability tests, such differences are not absent in SJTs (Lievens et al., 2008). Whetzel, McDaniel and Nguyen (2008) meta-analytically investigated race and gender as two demographic variables that can lead to subgroup differences in SJT scores. Regarding gender, females in general performed slightly better than males (d = 0.11). Concerning race, they found that Whites performed better than Blacks (d = 0.38), Hispanics (d = 0.24) and Asians (d = 0.29). Subgroup differences were not invariant across SJTs, however, because several moderators influence the relation between group membership and SJT performance. Racial differences, for example, could be explained by the cognitive loading of the SJT: SJTs that correlated more strongly with general mental ability showed larger racial differences than SJTs that correlated more strongly with personality constructs (Whetzel et al., 2008). Reduced racial differences were also observed when behavioural tendency instructions were used instead of knowledge instructions (White-Black differences of d = 0.39 for knowledge instructions and d = 0.34 for behavioural tendency instructions; Whetzel et al., 2008), and when video-based SJTs were used (d = 0.21, compared to a paper-and-pencil SJT; Chan & Schmitt, 1997). In contrast to racial differences, gender differences seemed to increase only when the personality loading of the SJT increased, thereby favouring women (d = -0.37 and -0.49 relative to men for conscientiousness and agreeableness, respectively), and remained invariant when the cognitive loading increased (Whetzel et al., 2008).
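For readers less familiar with the effect sizes quoted throughout this section, a minimal sketch of how a d value is computed: the mean difference between two groups divided by their pooled standard deviation. The group statistics below are invented for illustration only.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Two hypothetical groups whose means differ by 0.4 raw scale points,
# each with a standard deviation of 1.0.
d = cohens_d(mean1=3.9, sd1=1.0, n1=200, mean2=3.5, sd2=1.0, n2=200)
print(f"d = {d:.2f}")  # a difference of about two-fifths of an SD
```

By convention, d values around 0.2, 0.5 and 0.8 are often described as small, medium and large, which puts the subgroup differences above into perspective.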
Beyond the cognitive loading of SJTs, McDaniel and colleagues (2011) suggested that more extreme response tendencies might also explain Black-White subgroup differences in SJT scores and proposed controlling for these response tendencies in SJT scoring. They administered SJTs with Likert-type scales in two concurrent designs and subsequently adjusted the scale scores for elevation and scatter (i.e., respondents’ item means and deviations). Their strategies reduced Black-White mean score differences across the two measurement occasions, with effect sizes dropping from around half an SD (d = 0.43-0.56) to about a third of an SD (d = 0.29-0.36) for the standardized scores, and to less than a fifth of an SD (d = 0.12-0.18) for the dichotomous scoring. Roth, Bobko and Buster (2013) highlighted a caveat in this subgroup differences research, namely that the studies have nearly always been conducted with concurrent designs (i.e., samples consisting of job incumbents rather than applicants). A sole focus on concurrent designs could lead to range restriction attenuating the obtained effect sizes and thus to an underestimation of effect sizes in the population (see also Bobko & Roth, 2013). These authors argue that, to reduce the potential issue of range restriction, subgroup differences should also be studied in samples of applicants who are assessed with the SJT at the earliest possible selection stage (and before any other measures have been deployed). In such applicant samples, findings pointed towards Black-White differences of d = 0.63 for SJTs that were mainly saturated with cognitive ability, d = 0.29 for SJTs saturated with job knowledge and d = 0.21 for SJTs that mainly tapped interpersonal skills. These results further confirm previous findings that racial differences increase with the cognitive loading of the SJT.
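The elevation-and-scatter adjustment described above can be sketched under its simplest interpretation: each respondent's Likert ratings are re-expressed as within-person z-scores, removing that person's own mean level (elevation) and spread (scatter). The function name and the rating matrix below are invented for illustration, not taken from McDaniel and colleagues' materials.

```python
import numpy as np

def adjust_for_elevation_and_scatter(ratings):
    """Within-person z-scores for a (respondents x items) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    elevation = ratings.mean(axis=1, keepdims=True)       # per-person mean level
    scatter = ratings.std(axis=1, ddof=1, keepdims=True)  # per-person spread
    return (ratings - elevation) / scatter

raw = np.array([
    [1, 7, 7, 1, 4, 7],  # an extreme responder on a 1-7 Likert scale
    [3, 5, 5, 3, 4, 5],  # a moderate responder with the same response pattern
])
adjusted = adjust_for_elevation_and_scatter(raw)
# After adjustment the two respondents receive identical item scores,
# because their raw ratings differ only in elevation and scatter.
```

This illustrates the rationale for the adjustment: once response extremity is removed, mean score differences that were driven by response style rather than by judgment shrink accordingly.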
Applicant reactions
In general, research has demonstrated that applicants prefer selection tools that they perceive as job-related, that provide opportunities to show their capabilities and that are interactive (e.g., face-to-face interviews) (Hausknecht, Day & Thomas, 2004;
Lievens & De Soete, 2012; Potosky, 2008). High-fidelity simulations typically contain many of these aspects. Several studies have shown that applicant reactions to low-fidelity SJTs also tend to be favourable, and even more so when fidelity is increased and interactivity is added. Chan and Schmitt (1997) showed that a video-based SJT received higher face validity ratings than a written SJT. Richman-Hirsch and colleagues (2000) found that interactive video-based formats were preferred to computerized and paper-and-pencil formats. In an interactive (branched or nonlinear) SJT, the test-taker’s previous answer is taken into account and determines how the situation develops. Kanning, Grewe, Hollenberg and Hadouch (2006) went a step further and varied not only stimulus fidelity (situation depicted in a video vs. written format) but also response fidelity (response options shown in a video vs. written format) and interactivity of SJTs. In line with the previously mentioned studies, applicants reacted more favourably towards interactive video-based formats, in this case for both the stimulus and the response format.
Faking, retesting and coaching
Hooper, Cullen and Sackett (2006) compiled the research findings on faking and found substantial variation in the relation between faking and SJT performance: effect sizes ranged from d = 0.08 to 0.89, suggesting the presence of moderators. One such moderator proposed by the authors is the cognitive or g loading of the items. Although based on just a handful of studies, the trend is that SJTs with higher cognitive loadings are less easy to fake (Hooper et al., 2006; Peeters & Lievens, 2005). Similarly, the degree of faking can vary depending on the response instructions, with knowledge instructions being less easy to fake than behavioural tendency instructions (Nguyen, Biderman & McDaniel, 2005).
As SJTs are often part of large-scale, high-stakes selection programmes, it is also important to examine whether retest and coaching effects influence test scores and their psychometric properties. Concerning retest or practice effects, Lievens, Buyse and Sackett (2005) reported effects of d = 0.29 (0.49 after controlling for measurement error). A similar result was found by Dunlop, Morrison and Cordery (2011), who reported an effect size of d = 0.20. Importantly, in both studies retest effects were smaller for SJTs than for cognitive ability tests. Dunlop and colleagues further noted that practice effects decreased at a third measurement occasion for both the SJT and the cognitive ability tests. As far as coaching is concerned, only two studies have tackled this issue to date. Cullen, Sackett and Lievens (2006) investigated the coachability of two college admission SJTs and found that coaching increased the scores on one of the SJTs (d = 0.24) but not on the other. In contrast to Cullen and colleagues’ study, which took place in a laboratory setting, Lievens, Buyse, Sackett and Connelly (2012) investigated coaching effects on SJT scores in a high-stakes setting. Moreover, the latter study included pretest and propensity score covariates to control for self-selection and thereby reduce the non-equivalence of the groups. Using this more sophisticated analysis, they found that coaching raised SJT scores by 0.53 SDs. Finally, a recent study (Stemig, Sackett & Lievens, 2015) found that organizationally endorsed coaching (i.e., coaching provided by the organization rather than commercial coaching) also enabled people to raise their SJT scores, but did not reduce the criterion-related validities of the SJT scores.
In sum, contextualized SJTs seem to be less prone to faking and retest effects than other selection methods. Such effects may be further reduced by using knowledge-based response instructions and developing SJTs with higher g loadings. Coaching effects can be reduced by enabling all candidates to practise on SJTs in advance of high-stakes assessments.