Strengths and Weaknesses of Each Assessment Method
There are three sorts of data with which to investigate the strengths and weaknesses of each assessment method. The first is evaluation by academic experts. They are interested primarily in validity, but in other factors too. There is no consensus, but clear trends can be seen. Taken across the two sets of criteria, these reviews lead to the following conclusions. First, assessment centres and peer ratings are arguably the best selection methods; the former are expensive and the latter inexpensive. Second, many well-known methods (interviews, references) are of very limited validity. Third, surprisingly little is known about the potential bias of these tests. Fourth, although this table was published over 15 years ago, few would disagree with the overall trends.
Schmitt (1989) argued that employment selection should be fair as well as valid. Subgroup mean difference refers to the difference in average test results between groups of people (male vs. female, Black vs. White, old vs. young). This is an important area of bias (see Table 10.1). The larger the subgroup mean difference, the greater the potential bias in tests that differentiate between groups on the basis of gender, age, race, etc.
Anderson and Cunningham-Snell (2000) make an interesting and important distinction between validity (i.e. predictive accuracy; see Table 10.2) and popularity (see Table 10.3). Cook (2009, pp. 283-287) lists six criteria for judging selection tests:
- 1 Validity is the most important criterion. Unless a test can predict productivity, there is little point in using it.
- 2 Cost tends to be accorded far too much weight. Cost is not an important consideration if the test has validity. A valid test, even the most elaborate and expensive, is almost always worth using.
Table 10.1 Level of validity and subgroup mean difference for various predictors.
| Predictor | Validity | Subgroup Mean Difference |
| --- | --- | --- |
| Cognitive ability and special aptitude | Moderate | Moderate |
| Personality | Low | Small |
| Interest | Low | ? [a] |
| Physical ability | Moderate-high | Large [b] |
| Biographical information | Moderate | ? |
| Interviews | Low | Small (?) |
| Work samples | High | Small |
| Seniority | Low | Large (?) |
| Peer evaluations | High | ? |
| Reference checks | Low | ? |
| Academic performance | Low | ? |
| Self-assessments | Moderate | Small |
| Assessment centres | High | Small |
a = a lack of data or inconsistent data; b = mean differences largely between male and female subgroups.
Table 10.2 Predictive accuracy.
| Method | Predictive accuracy (range 0-1) |
| --- | --- |
| Perfect prediction | 1 |
| Assessment centres - promotion | 0.68 |
| Work samples | 0.54 |
| Ability tests | 0.54 |
| Structured interviews | 0.44 |
| Integrity tests | 0.41 |
| Assessment centres - performance | 0.41 |
| Biodata | 0.37 |
| Personality tests | 0.38 |
| Unstructured interviews | 0.33 |
| Self-assessment | 0.15 |
| Reference | 0.13 |
| Astrology | 0 |
| Graphology | 0 |
Table 10.3 Popularity of assessment methods.
| Assessment method | Popularity |
| --- | --- |
| Interviews | 97% |
| References | 96% |
| Application forms | 93% |
| Ability tests | 91% |
| Personality tests | 80% |
| Assessment centres | 59% |
| Biodata | 19% |
| Graphology | 2.6% |
| Astrology | 0% |
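
The distinction between validity and popularity can be made concrete by putting the two tables side by side. The sketch below is illustrative only: it uses the figures from Tables 10.2 and 10.3 for the six methods that appear under the same name in both tables, and the two cut-off values (`VALIDITY_CUTOFF`, `POPULARITY_CUTOFF`) and the function names are assumptions introduced here, not thresholds taken from the source.

```python
# Validity (Table 10.2) and popularity in % (Table 10.3) for the methods
# that appear under the same name in both tables.
METHODS = {
    "References": (0.13, 96.0),
    "Ability tests": (0.54, 91.0),
    "Personality tests": (0.38, 80.0),
    "Biodata": (0.37, 19.0),
    "Graphology": (0.0, 2.6),
    "Astrology": (0.0, 0.0),
}

# Hypothetical cut-offs chosen only for illustration; not from the source.
VALIDITY_CUTOFF = 0.30
POPULARITY_CUTOFF = 50.0


def classify(validity, popularity):
    """Label one method by whether it clears each cut-off."""
    valid = validity >= VALIDITY_CUTOFF
    popular = popularity >= POPULARITY_CUTOFF
    if valid and popular:
        return "valid and widely used"
    if valid:
        return "valid but underused"
    if popular:
        return "popular but weak"
    return "weak and rarely used"


def summarize(methods=METHODS):
    """Return a {method: label} mapping for every method in the table."""
    return {name: classify(v, p) for name, (v, p) in methods.items()}


if __name__ == "__main__":
    for name, label in summarize().items():
        print(f"{name:18s} {label}")
```

Under these assumed cut-offs, references fall into the "popular but weak" group and biodata into the "valid but underused" group, which is the mismatch the two tables are meant to highlight. Different cut-offs would move the boundaries.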
Table 10.4 Summary of 12 selection tests by six criteria.
| Selection Test | VAL | COST | PRAC | GEN | ACC | LEGAL |
| --- | --- | --- | --- | --- | --- | --- |
| Interview | Low | Medium/Low | High | High | High | Uncertain |
| Structured interview | High | High | ?Limited | High | Untested | No problems |
| References | Moderate | Very low | High | High | Medium | Some doubts |
| Peer rating | High | Very low | Very limited | Very limited | Low | Untested |
| Biodata | High | High/Low | High | High | Low | Some doubts |
| Ability | High | Low | High | High | Low | Major problems |
| Psychomotor test | High | Low | Moderate | Limited | Untested | Untested |
| Job knowledge | High | Low | High | Limited | Untested | Some doubts |
| Personality | Variable | Low | High | High | Low | Some doubts |
| Assessment centre | High | Very high | Fair | Fair | High | No problems |
| Work sample | High | High | Limited | Limited | High | No problems |
| Education | Moderate | Nil | High | High | Untested | Major doubts |
VAL = validity, COST = cost, PRAC = practicality, GEN = generality, ACC = acceptability, LEGAL = legality. Source: Adapted from Cook (2009, p. 386).
- 3 Practicality is a negative criterion - a reason for not using a test.
- 4 Generality simply means how many types of employees the test can be used for.
- 5 Acceptability on the part of candidates is important, especially in periods of full employment.
- 6 Legality is a negative criterion - a reason for not using something. It is often hard to evaluate, as the legal position on many tests is obscure or confused.
Two points follow from these criteria. First, many organizations have to make a trade-off - cost for validity, practicality for generality. Second, while some methods perform well on some criteria and poorly on others, very few succeed on all criteria. Assessment centres are probably the most successful (see Table 10.4).
The six criteria raise some interesting issues for those using these methods. A key criterion is cost. Cook notes that interview costs are generally graded as low to medium because interviews vary widely and because the costs are taken for granted as part of the process. In contrast, structured interview costs are high because the system has to be tailor-made and requires a full job analysis. Biodata costs are viewed as low or high, as their categorization depends on how they are used - the cost is high if the inventory has to be specially written for the employer, but it might be low if ‘ready-made’ consortium biodata could be used. The cost of using educational qualifications is given as zero because the information is routinely collected from application forms, and limited analysis is used, save to confirm that the data supplied match the requirements of the role. A further check of qualification certificates may be made at the interview or on appointment, but even with this additional administration the costs remain low.
A second criterion is practicality. This means that the test is not difficult to introduce because it fits easily into the selection process. Ability and personality tests are very practical because they can be given when candidates come for interview, and they generally permit group testing. References are very practical because everyone is used to giving them. Employers may consider assessment centres as only fairly practical, because they need detailed organizing and do not fit into the conventional timetable of selection procedures.
Peer assessments are highly impractical because they require applicants to spend a long time with each other and may require briefings or pre-training to explain the process. Structured interviews may be seen as having limited practicality because managers may resist the loss of autonomy, preferring to use their own questions and questioning style. Finally, work-sample and psychomotor tests are seen as being of limited practicality because candidates have to be tested individually, rather than in groups.
The third criterion is generality. Most selection tests can be used for any category of worker, but Cook notes that true work samples and job knowledge tests can only be used where there is a specific body of knowledge or skill to test. This means they are restricted to skilled manual work. He notes that psychomotor tests are only useful for jobs that require dexterity or good motor control. Peer ratings can probably only be used in uniformed disciplined services, due to issues of attendance, and the possible need for training or at least an understanding of the competences required. Assessment centres too tend to be restricted to managers, probably on grounds of cost, although they have been used for more junior posts.
The fourth criterion reviewed is legality. While this varies between countries or states, much of the legislation has common origins relating to a desire to prevent discrimination on the grounds of gender, colour or ethnicity. Assessment centres, work samples and structured interviews do not usually cause legal problems, but educational qualifications and mental ability tests most certainly do. Cook notes that in some areas, such as biodata, the position remains uncertain.
Cook notes that:
Taking validity as the overriding consideration, there are seven classes of test with high validity, namely peer ratings, biodata, structured interviews, ability tests, assessment centres, work-sample tests and job-knowledge tests. Three of these tests have very limited generality, which leaves biodata, structured interviews, ability tests and assessment centres.
- Biodata do not achieve such good validity as ability tests and are not as transportable, which makes them more expensive.
- Structured interviews have excellent validity but limited transportability, and are expensive to set up.
- Ability tests have excellent validity, can be used for all types of jobs, are readily transportable and are cheap and easy to use, but fall foul of the law in the US.
- Assessment centres have excellent validity, can be used for most grades of staff and are legally fairly safe, but are difficult to install and are expensive.
- Work samples have excellent validity, are easy to use and are generally quite safe legally, but are expensive, because they are specific to the job.
- Job-knowledge tests have good validity, are easy to use and are inexpensive because they are commercially available, but they are more likely to give rise to legal problems because they are usually paper-and-pencil tests.
- Personality inventories achieve poor validity for predicting job proficiency, but can prove more useful for predicting how well the individual will conform to the job’s norms and rules.
- References have only moderate validity, but are cheap to use. However, legal cautions are tending to limit their value (Cook, 2009, pp. 386-387).
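
The elimination described in the quoted passage (keep the tests rated high on validity, then drop those with restricted generality) can be sketched as a two-step filter over the validity and generality ratings in Table 10.4. This is a minimal illustration, not Cook's own procedure: the names `TABLE_10_4`, `RESTRICTED` and `shortlist` are hypothetical, and treating "Limited" and "Very limited" as the restricted ratings is an assumption made here. Note that the table also rates psychomotor tests high on validity, although the quoted passage does not list them; the generality step removes them in any case.

```python
# Rows of Table 10.4 reduced to the two criteria used in the elimination:
# (selection test, validity rating, generality rating).
TABLE_10_4 = [
    ("Interview", "Low", "High"),
    ("Structured interview", "High", "High"),
    ("References", "Moderate", "High"),
    ("Peer rating", "High", "Very limited"),
    ("Biodata", "High", "High"),
    ("Ability", "High", "High"),
    ("Psychomotor test", "High", "Limited"),
    ("Job knowledge", "High", "Limited"),
    ("Personality", "Variable", "High"),
    ("Assessment centre", "High", "Fair"),  # the assessment (centre) row
    ("Work sample", "High", "Limited"),
    ("Education", "Moderate", "High"),
]

# Assumption: these two ratings count as restricted generality.
RESTRICTED = {"Limited", "Very limited"}


def shortlist(rows):
    """Keep high-validity tests, then drop those with restricted generality."""
    high_validity = [row for row in rows if row[1] == "High"]
    return [name for name, _, generality in high_validity
            if generality not in RESTRICTED]


if __name__ == "__main__":
    print(shortlist(TABLE_10_4))
```

Applied to the ratings above, the filter leaves structured interviews, biodata, ability tests and assessment centres, the same four methods the passage arrives at.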
Arnold, Silvester, Patterson, Robertson, Cooper and Burnes (2005) provided a similar analysis of the literature. This is summarized in Table 10.5.
Table 10.5 A summary of studies on the validity of selection procedures.
| Selection Method | Evidence for Criterion-Related Validity | Applicant Reactions | Extent of Use |
| --- | --- | --- | --- |
| Structured interviews | High | Moderate to positive | High |
| Cognitive ability | High | Negative to moderate | Moderate |
| Personality tests | Moderate | Negative to moderate | Moderate |
| Biodata | Can be high | Moderate | Moderate |
| Work sample tests | High | Positive | Low |
| Assessment centres | Can be high | Positive | Moderate |
| Handwriting | Low | Negative to moderate | Low |
| References | Low | Positive | High |
What stands out from Tables 10.1-10.5 is their similarity, even though they may be based on different databases. Occasionally, an individual technique, such as a structured interview, is judged as fair to average (in terms of validity) by one reviewer and as good to excellent by another, but overall the results are robust. Assessment centres, work-sample tests and cognitive ability tests are usually judged most valid in all reviews. This is not surprising, as many reviewers base their assessments on the same data. What we can say, therefore, is that among academic reviewers there remains good consensus as to the efficacy of different assessment methods.