Psychiatric Diagnosis Revisited: From DSM to Clinical Case Formulation
Diagnostic Reliability Since the DSM-III
We started this chapter with the observation that the DSM is often praised for its reliability: the DSM-III and its successors are widely presumed to be much more reliable than earlier diagnostic systems. However, the question as to what evidence substantiates such a belief remains open. In this section I review this presumption, focusing on key reliability studies concerning the DSM-III as well as subsequent editions of the manual. Given our focus on the statistical kappa coefficient, the following section is quite technical. However, a careful understanding of the important psychometric studies published over the past 25 years is key to evaluating the claim that great progress has been made at the level of reliability since the DSM-III. We will start by looking at the initial reliability study that was conducted for the DSM-III and then look more closely at the DSM-5 field trials.
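As a reminder of what the coefficient measures, here is a minimal sketch of Cohen's kappa for two raters; the diagnostic labels and ratings below are invented for illustration only.

```python
# Minimal sketch of Cohen's kappa for two raters.
# The rating data below are invented for illustration only.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected if both raters assigned labels independently,
    # each at their own observed base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

rater_1 = ["depression", "depression", "anxiety", "depression", "anxiety", "depression"]
rater_2 = ["depression", "anxiety", "anxiety", "depression", "anxiety", "depression"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.67
```

The point of kappa, as opposed to raw percentage agreement, is the correction for agreement that would occur by chance alone: here the raters agree on five of six cases (83 %), but after chance correction the coefficient drops to 0.67.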
As the DSM-III was being developed, Spitzer and associates (1979) examined 25 out of 265 diagnostic categories with a test-retest design in a field trial. In total, 131 patients were evaluated by two volunteering psychiatrists, who selected participants from their own work settings. This time a kappa value of 0.70 or more was considered to be “high,” and indicative of “good agreement” (Spitzer et al. 1979, pp. 816–817). Importantly, the study did not evaluate the reliability of exact diagnostic categories, but examined clusters of disorders, assuming that diagnoses are identical if they belong to the same cluster. Given that all DSM-III disorders were presumed to be discrete entities, the choice to test only the clusters in which disorder categories were grouped is strange. In other words, the field trial didn’t actually test the disorder categories that the DSM-III offered as an answer to presumably unscientific pre-DSM-III diagnostic practices.
The results of the field trial can be found in Table 2.4. We again interpret the kappa coefficient starting from the norms formulated in Table 2.1. Spitzer and associates (1979, p. 818) concluded that for most
Table 2.4 Kappa coefficients from the DSM-III field trials, adapted from Spitzer et al. (1979), and three sets of interpretations of the kappa coefficient using different norms
of the diagnoses reliability is “quite good and, in general is higher than that previously achieved using DSM-I and DSM-II.” Given that for 11 of the 25 categories the observed kappa falls below their own standard of 0.70, this conclusion seems overly optimistic. The overall kappa coefficient they observed for Axis-I disorders (clinical syndromes) was 0.66, and for Axis-II disorders (personality disorders and specific developmental disorders) it was 0.54. Generally, the results of this study are better than those of the review study by Spitzer and Fleiss (1974), yet the small sample size renders meaningful comparison impossible. Indeed, the DSM-III field trial started from a sample that was substantially smaller than the 1726 patients Spitzer and Fleiss (1974) included in their review of pre-DSM-III diagnosis. Spitzer and his associates (1979) calculated kappa values for 25 conditions, but used only 131 patients, which implies that some of the kappa coefficients were based on a very low number of patients. Unfortunately, they didn’t discuss the distribution of patients over the different conditions. As a result, the kappa coefficients that were calculated cannot be considered good estimates.
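The sample-size problem can be made concrete with the standard large-sample approximation of kappa's standard error, SE(κ) ≈ √(p_o(1 − p_o)) / ((1 − p_e)√n). The agreement values used below (p_o = 0.85, p_e = 0.5) are invented for illustration, not taken from the field trial.

```python
# Why a kappa estimated from only a handful of patients is unstable:
# a simple large-sample approximation of kappa's standard error.
# The agreement values (p_o = 0.85, p_e = 0.5) are invented for illustration.
import math

def kappa_se(p_observed, p_expected, n):
    """Approximate standard error of the kappa estimate for n rated cases."""
    return math.sqrt(p_observed * (1 - p_observed)) / ((1 - p_expected) * math.sqrt(n))

for n in (10, 50, 500):
    half_width = 1.96 * kappa_se(0.85, 0.5, n)
    print(f"n = {n:3d}: 95% CI half-width around kappa is roughly ±{half_width:.2f}")
```

With these illustrative numbers, the confidence-interval half-width shrinks from about ±0.44 at n = 10 to about ±0.06 at n = 500: a kappa based on ten or so patients barely constrains the true value, which is exactly why coefficients computed from 131 patients spread over 25 conditions are poor estimates.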
Other test-retest reliability studies focusing on the DSM-III or the revised DSM-III (DSM-III-R) that included sufficient numbers of patients per tested condition yielded mixed results. For example, a DSM-III-R test-retest reliability study in Germany that examined 5 disorder categories with 100 patients and 4 raters (Hiller et al. 1993) observed a mean kappa value of 0.53, as well as substantial differences between disorders: schizophrenia: kappa = 0.51; schizophreniform/acute and transient psychotic disorder: kappa = 0.50; schizoaffective disorder: kappa = 0.08; major depression: kappa = 0.80; bipolar disorder: kappa = 0.65.
High-quality test-retest reliability studies on the DSM-III were not performed until the Structured Clinical Interview for DSM (SCID) was published. This diagnostic interview was developed to provide a standardized assessment of DSM disorders (Spitzer et al. 1984). The use of a structured interview has a heightening effect on reliability: because the questions clinicians ask are uniform and clinical improvisation is restricted, variability in patients’ answers diminishes, which facilitates consistency in diagnostic classification across judges. The first edition of the SCID was developed for the DSM-III, but the version developed for the DSM-III-R was more frequently used in research. Different versions of the SCID were developed for examining clinical syndromes (Axis I in DSM-III until DSM-IV-TR) and for diagnosing personality disorders (Axis II in DSM-III until DSM-IV-TR). A version of the SCID for DSM-5 (SCID-5) has also been published, but reliability studies are not yet available (First et al. 2016).
In 1992 Janet Williams and ten other collaborators, including Robert Spitzer, published a large multisite test-retest study that examined the reliability of the SCID for DSM-III-R Axis-I disorders. Six clinics in the USA and one in Germany were involved. The study examined how well clinicians agree when diagnosing patients from psychiatric clinics. In addition, they examined inter-rater agreement when testing mentally
Table 2.5 Kappa coefficients from SCID-I test-retest study in psychiatric patients by Williams et al. (1992), and three sets of interpretations of the kappa coefficient using different norms
distressed people from the general community. Twenty-five trained raters were involved, and in total 390 patients and 202 non-patients were tested. Kappa coefficients were calculated for categories that were diagnosed at least ten times. Table 2.5 contains interpreted results for current diagnoses in psychiatric patients. Compared to the results obtained in pre-DSM-III reliability studies (Table 2.2) these reliability coefficients are only slightly better, which is also what Kutchins and Kirk (1997) suggest
The study also includes kappa coefficients calculated for lifetime diagnoses. For some diagnostic categories these were slightly better than those for current diagnosis.
2 Dynamics of Decision-Making: The Issue of Reliability...
Table 2.6 Kappa coefficients from SCID-I test-retest study in non-patients by Williams et al. (1992), and three sets of interpretations of the kappa coefficient using different norms
in their critical evaluation of the DSM-III. Indeed, considered from the Spitzer and Fleiss (1974) norms, 13 categories have an unacceptable reliability, 5 have a satisfactory reliability, and none is highly reliable, which is remarkable given that the structured nature of the SCID should heighten reliability scores. The Landis and Koch (1977) norms and the norms used by Clarke and associates (2013) lead to more positive interpretations, yet only the results for mood disorders are substantially better than those obtained 20 years earlier.
The results from the non-patient group (Table 2.6), in turn, indicate that the correspondence between raters is lower than in the clinical group. Actually, these kappa values are at the level of the reliabilities observed by Spitzer and Fleiss (1974). Considered from the Spitzer and Fleiss (1974) norms, they cast doubt on the typical neo-Kraepelinian presumption (Klerman 1978) that mental disorders make up illness conditions that are clearly delineated from normal states of mind.
In a subsequent study, these researchers also tested the inter-rater reliability of the SCID for personality disorders (First et al. 1995): on a pairwise basis, 25 trained judges evaluated 103 psychiatric patients and 181 non-clinical participants. The results (Tables 2.7 and 2.8) indicate that the kappa values are somewhat lower than for the SCID assessment of clinical disorders (Axis I). Raters disagreed substantially in the non-clinical group in particular, bringing the level of agreement below that observed in pre-DSM-III diagnosis.
Table 2.7 Kappa coefficients from SCID-II test-retest study in psychiatric patients by First et al. (1995), and three sets of interpretations of the kappa coefficient using different norms
Table 2.8 Kappa coefficients from SCID-II test-retest study in non-patients by First et al. (1995), and three sets of interpretations of the kappa coefficient using different norms
The DSM-IV and DSM-IV-TR, published in 1994 and 2000 respectively, did not bring about major changes in the system. Overall, criteria that reflected patients’ experience of distress were added; several other diagnostic criteria were altered; some disorder categories were removed while others were introduced. This resulted in a total of 297 diagnoses. The field trials that accompanied the DSM-IV didn’t specifically focus on test-retest reliability. Concentrating on specific disorders, they mainly examined how changes in diagnostic criteria altered prevalence rates (e.g., Keller et al. 1996). Studies using the SCID for DSM-IV diagnoses, in turn, yielded results similar to those obtained with the SCID for the DSM-III-R. For example, in a small multisite reliability study, Zanarini and associates (2000) evaluated 52 psychiatric cases with multiple trained judges in a test-retest design. Of the nine personality disorders tested, eight had fair to good reliability according to the Landis and Koch (1977) standard, while one (paranoid personality disorder) had poor reliability.
Notwithstanding all the rhetoric on the good reliability of the DSM since 1980, it was only with the DSM-5 field trials that a major test-retest reliability study was conducted with a sample size comparable to the 1726 patients that Spitzer and Fleiss (1974) included in their review. The DSM-5 field trials tested the reliability of 27 diagnostic categories in adults as well as in children and adolescents (Clarke et al. 2013; Regier et al. 2013). Two hundred and eighty-six clinicians participated, and a total of 1466 adult patients and 616 pediatric patients were each evaluated separately by two trained clinicians. Only patients with problems that were relevant in terms of the 27 tested conditions were included. The obtained kappa coefficients, as well as our interpretation of these values, can be found in Table 2.9 for adults and Table 2.10 for children.
Despite the claim in the introduction of the DSM-5 (p. 5) that “DSM has been the cornerstone of substantial progress in reliability,” the results are simply no better than those found by Spitzer and Fleiss (1974) (Table 2.2). Of the 18 categories examined by Spitzer and Fleiss (1974), 15 had an unacceptable reliability according to their norms. Applying these norms to the DSM-5 field trial makes clear that 14 of the 15 diagnoses in adults and all seven pediatric diagnoses have an unacceptable reliability. In terms of the kappa evaluation norms formulated by Clarke et al. (2013), four pre-DSM-III categories had a very good reliability, eight a good reliability, and six a questionable reliability. In terms of these norms three categories from the DSM-5 field trial for adult patients had a very good reliability, seven a good reliability, four a questionable reliability,
Table 2.9 Kappa coefficients from the DSM-5 field trials in adult patients (Regier et al. 2013) and three sets of interpretations of the kappa coefficient using different norms
and one an unacceptable reliability. In the DSM-5 field trial for children, two diagnoses had a very good reliability, two a good reliability, two a questionable reliability, and two an unacceptable reliability. Whereas those categories with unacceptable reliability have been omitted from the
Table 2.10 Kappa coefficients from the DSM-5 field trials in pediatric patients (Regier et al. 2013) and three sets of interpretations of the kappa coefficient using different norms
final version of the DSM-5, it remains that the kappa values obtained in 2013 are comparable to those obtained in 1974. One caveat in making such comparisons is that across studies the statistical formulae used for estimating reliability indices differ slightly. For example, the DSM-5 field trial took into account the population prevalence of the tested conditions, while the studies from the 1970s did not. These differences reflect the evolution of statistical methodology and the availability of accurate prevalence data. However, throughout the years the interpretation of the kappa coefficient has largely remained the same, which is why comparisons can indeed be made.

What is more, an important problem coming to the fore in the DSM-5 field trial is that the diagnosis of mood disorders (the most frequently made psychiatric diagnosis) is no more reliable than it was found to be in the review study of Spitzer and Fleiss (1974). In the 1970s affective disorders in particular had a questionable reliability; 40 years later the reliability coefficients are even worse. This might be a result of defining depression too broadly, including diagnostic criteria that are far too vague, and neglecting context variables (Maj 2012). Contra this criticism it could be argued that over the years a number of standardized interviews with good reliability were developed for assessing depression (e.g., Trajkovic et al. 2011; Williams and Kobak 2008). Yet these were developed apart from the DSM, and were not integrated into subsequent editions of the manual.
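The point about prevalence matters because kappa itself is sensitive to base rates. A minimal sketch (with invented counts) shows that two rater pairs with identical raw agreement can obtain very different kappa values depending on how common the diagnosis is in the sample:

```python
# Kappa from a 2x2 agreement table; all counts below are invented.
def kappa_2x2(a, b, c, d):
    """a = both raters say 'case', d = both say 'non-case',
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_observed - p_expected) / (1 - p_expected)

# Two scenarios, both with 90 % raw agreement between the raters:
balanced = kappa_2x2(45, 5, 5, 45)  # diagnosis present in ~half the sample
rare = kappa_2x2(5, 5, 5, 85)       # diagnosis present in ~10 % of the sample
print(round(balanced, 2), round(rare, 2))  # 0.8 0.44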
Interestingly, the interpretation of the DSM-5 field trials by the research team (Regier et al. 2013), as well as by the editors of the American Journal of Psychiatry (Freedman et al. 2013), particularly focuses on the “new blooms” the field trials produced, leaving aside some of the “old thorns” the study brought to the fore.
One problem that both interpretations acknowledge concerns the poor results for depression- and anxiety-related conditions. While these poor results are mainly attributed to comorbidity between these conditions (including major depression, generalized anxiety, alcohol use, and posttraumatic stress disorder [PTSD]), there is as yet no reflection on the system that produces such high comorbidity rates. Moreover, these interpretations (Freedman et al. 2013; Regier et al. 2013) have done nothing to call into question the overrated claim of good diagnostic reliability since the DSM-III. A further issue that remains wholly neglected is that whereas the field trial was able to estimate reliabilities for some disorders, the majority of DSM-5 diagnostic categories were not tested at all: the DSM-5 counts 347 disorder categories, but kappa coefficients could be calculated for only 20 conditions (6 %). Moreover, of those categories only 14 had a good or very good reliability, which means that only 4 % of the DSM-5 categories have been shown to have sufficient reliability. Indeed, since the inter-rater reliability of the majority of DSM-5 categories remains untested, the idea that the DSM is a reliable instrument is simply wrong. In their DSM-5 editorial, by contrast, Freedman et al. (2013, p. 3) conclude: “For a general psychiatric practice, the diagnostic reliability data suggest that two-thirds of patients will receive a reliable DSM-5 principal diagnosis at the first visit.” Given that acceptable reliability coefficients have been observed for only 4 % of the DSM categories, and given the poor reliabilities observed for anxiety- and depression-related psychopathology, this claim is seriously overstated. In other words, just because the field trials indicate that two out of three tested conditions had good kappa coefficients, it does not follow that the same is true for all other DSM-5 categories (94 % of the manual).
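The percentages above follow directly from the counts cited in the text:

```python
# Checking the coverage arithmetic cited in the text.
total_categories = 347  # DSM-5 disorder categories
tested = 20             # categories for which kappa could be calculated
adequate = 14           # of those, with good or very good reliability

print(round(100 * tested / total_categories))    # 6  -> "6 %" of categories tested
print(round(100 * adequate / total_categories))  # 4  -> "4 %" shown sufficiently reliable
```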
Moreover, there is no discussion of the norms used for interpreting the kappa coefficient in any of the publications by those responsible for the DSM-5. Compared to frequently used standards in psychiatric research (Baer and Blais 2010), the norms used in the DSM-5 field trials were exceptionally low (Frances 2012; Vanheule et al. 2014). Such use of low standards might reflect a tactical maneuver to dress up poor results.
The main achievements Regier et al. (2013) and Freedman and associates focus on are the very good reliabilities observed for a number of conditions, like PTSD in adults and ADHD in children. Considered from the norms of Clarke et al. (2013), these disorders can indeed be classified in reliable ways. However, these disorders have been contested at the level of validity. For example, in an analysis of the criteria that make up PTSD, Gerald Rosen and Scott Lilienfeld (2008) conclude “that virtually all core assumptions and hypothesized mechanisms lack compelling or consistent empirical support.” Similarly, critical examinations of ADHD reveal that fundamental questions about the nature and meaning of the ADHD construct remain unanswered (Batstra and Thoutenhoofd 2012; Parens and Johnston 2011; Rafalovich 2004; Timimi and Leo 2009). Another success the authors highlight is the observation that schizophrenia, bipolar disorder, and schizoaffective disorder can be distinguished reasonably well. This observation is important, but perhaps not that surprising, since the SCID already distinguishes these conditions reasonably well (Williams et al. 1992). Nevertheless, whereas the field trial underlines the differentiation between these psychosis-related conditions, research pointing to the overlap between them must not be ignored (Angst 2002; Hyman 2010; McNally 2011; Van Os 2016), as it suggests that categorical distinctions between these disorders might not be valid.
In a later reliability study, Chmielewski et al. (2015) pointed out that the disappointing kappa coefficients in the DSM-5 field trials might be an effect of the method the researchers used. Just like the studies synthesized by Spitzer and Fleiss (1974), the DSM-5 field trial used a test-retest design: two or more diagnosticians evaluate one patient, with a small time interval between the interviews. Another method for evaluating diagnostic reliability is audio- or video-recording: “In this method, one clinician conducts the interview and provides diagnoses; a second ‘blinded’ clinician then provides an independent set of diagnoses based on recordings of the interview” (Chmielewski et al. 2015, p. 765). Such a method usually yields higher reliability estimates. In their study, Chmielewski and colleagues evaluated 12 DSM-IV disorders in 339 psychiatric patients. All were diagnosed with the test-retest method, using the SCID interview for DSM-IV and a one-week interval between the two interviews. For 49 patients the audio-recording method was used as well.
The obtained kappa coefficients, and our interpretation of these values, can be found in Tables 2.11 and 2.12. In the test-retest condition the mean kappa was 0.47. Starting from the strict kappa evaluation standards of Spitzer and Fleiss (1974), all diagnoses have unacceptably low reliabilities (see Table 2.11). Using the norms of Clarke et al. (2013), four disorders have very good diagnostic reliability: major depressive disorder, panic disorder, psychosis, and substance use disorder. Five conditions were diagnosed with good reliability (obsessive-compulsive disorder, PTSD, bipolar I disorder, specific phobia, generalized anxiety disorder), and three have questionable reliabilities (social phobia, dysthymic disorder, other bipolar disorders). Overall, this study yields reliability indices similar to those observed in the other studies we reviewed, although the kappa values are lower than in the study by Williams and associates (1992), which also used the standardized SCID interview. Remarkably, just like in the other SCID study (Williams et al. 1992), the diagnostic reliability of major depressive disorder and generalized anxiety disorder is much better than in the DSM-5 field trial. In the test condition where a second diagnostician used audiotaped interviews conducted by a first diagnostician, by contrast, reliability indices are much better, reaching a mean kappa value of 0.80 (see Table 2.12). Of the 12 conditions tested in the study by Chmielewski et al. (2015), 11 had excellent or very good reliabilities, as evaluated with the DSM-5 thresholds (Clarke et al. 2013).
Most remarkably, the two test conditions yielded markedly different results. This outcome indicates that perhaps the problem is not
Table 2.11 Kappa coefficients from SCID test-retest study in psychiatric patients by Chmielewski et al. (2015), and three sets of interpretations of the kappa coefficient using different norms
only a matter of diagnosticians’ divergent opinions about psychological distress, and of their idiosyncratic ways of interpreting mental health symptoms. After all, the test condition with audiotapes demonstrates that when using exactly the same patient accounts, evaluators make very similar classificatory decisions. A conclusion we can draw from these results is that perhaps the accounts individuals give about mental distress are not very stable. The entire enterprise of evaluating reliability presupposes stability both at the side of diagnosticians’ evaluation and at the side of the patients’ account. Yet if patient accounts of distress have a different outline and character across interviews, the search for good reliability will always be disappointing unless the nature and
Table 2.12 Kappa coefficients from audio-recording reliability study in psychiatric patients by Chmielewski et al. (2015), and three sets of interpretations of the kappa coefficient using different norms
organization of this variability are taken into account. Indeed, usually reliability research starts from the basic assumption that mental distress is expressed via symptoms and signs that univocally refer to underlying disorders. Within this line of reasoning, diagnosticians screen for specific indicators. For example, loss of appetite and frequent crying are usually seen as indicators of major depression. However, it might be that in patients’ experience of distress and in the accounts they give about their mental suffering, varying symptoms come to the fore. For example, at one moment an individual might feel sad, while at another concerns about physical health or anxious preoccupation with people’s opinions come to the fore. Obviously, symptoms and complaints don’t simply appear and disappear on a daily basis, but the weight that is subjectively attributed to specific symptoms, and the narratives along which distress is communicated, might indeed vary quite substantially.
If this is the case, it is not so surprising that in the study of Chmielewski et al. (2015), the evaluations of interviews with a one-week interval, and a second interviewer who creates a different interactional context, yield results that are so dissimilar from the research condition in which a single audiotaped interview is used.
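This interpretation can be illustrated with a deliberately toy simulation, in which every parameter is invented: each simulated patient has a stable underlying distress level, each interview elicits a noisy account of it, and the two raters apply slightly different diagnostic thresholds (0.48 and 0.52).

```python
# Toy simulation: stable underlying distress, but noisy interview-to-interview
# accounts. All parameters (noise level, rater thresholds) are invented.
import random

def diagnose(reported_level, threshold):
    """Trivial 'diagnostician': classify as a case above a fixed threshold."""
    return reported_level > threshold

def simulate(n_patients=20_000, report_noise=0.3, seed=0):
    rng = random.Random(seed)
    retest_agree = recording_agree = 0
    for _ in range(n_patients):
        true_level = rng.random()  # stable underlying distress
        # Each interview elicits a fresh, noisy account of the same distress.
        account_week1 = true_level + rng.gauss(0, report_noise)
        account_week2 = true_level + rng.gauss(0, report_noise)
        # Test-retest design: the two raters hear different accounts.
        retest_agree += diagnose(account_week1, 0.48) == diagnose(account_week2, 0.52)
        # Audio-recording design: both raters hear the same account.
        recording_agree += diagnose(account_week1, 0.48) == diagnose(account_week1, 0.52)
    return retest_agree / n_patients, recording_agree / n_patients

retest_rate, recording_rate = simulate()
print(f"raw agreement -- test-retest: {retest_rate:.2f}, recording: {recording_rate:.2f}")
```

Even with identical "diagnosticians," agreement on a shared recording stays high while test-retest agreement drops substantially, mirroring the contrast reported by Chmielewski et al. (2015): the instability of the account, not only rater idiosyncrasy, drives the gap.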
Thus, research into diagnostic reliability has inadvertently opened another field of inquiry: what psychiatric disorders exactly are and how mental health symptoms can be conceptualized. In discussions about diagnosis and assessment it is often argued that reliability is a necessary condition for validity. Indeed, before we can ever conclude that an instrument measures the phenomenon we want it to measure, we must already be convinced that the instrument yields stable results. Reliability research on the DSM casts doubt on its scientific credibility, but this should not make us shy away from the issue of validity and from examining how psychopathology is best conceptualized. The next chapter deals with this topic.