Psychiatric Diagnosis Revisited: From DSM to Clinical Case Formulation
Commotion in Psychiatry, Part I: Diagnosis Under Fire in the 1970s
Why psychiatry was contested in the late 1960s and 1970s is illustrated by a series of critical empirical studies that documented the poor reliability and validity of psychiatric diagnosis. Below I discuss three iconic studies.
The first study was conducted by Maurice Temerlin (1968, 1970). In 1968 Temerlin published a paper on a naturalistic experimental study in which he asked 25 psychiatrists, 25 clinical psychologists, and 45 graduate students in clinical psychology to diagnose an individual. Participants listened to an audiotaped interview with an actor who followed a script portraying a normal, healthy man. Before they listened to the recording, however, an eminent colleague remarked that the individual they were about to hear was “a very interesting man because he looked neurotic but actually was quite psychotic” (Temerlin 1968, p. 349). After listening to the recording, participants were asked to indicate their diagnosis on a data sheet containing multiple diagnostic categories: “10 psychoses, 10 neuroses and 10 miscellaneous personality types, one of which was ‘normal or healthy personality’” (Temerlin 1968, p. 350). Moreover, they were asked to substantiate their diagnosis by noting down the behavioral characteristics that had guided their decision.
The result was quite shocking: 15 psychiatrists gave a diagnosis of psychosis, while the other 10 concluded that they were dealing with a case of neurosis. The clinical psychologists were somewhat more accurate in their decisions: 7 diagnosed psychosis, 15 neurosis, and 3 indicated that the man was normal. The graduate students, in their turn, were the most accurate: 5 indicated that the interviewee was psychotic, 35 pointed to neurosis, and 5 believed he was in good mental health.
In the next step, the study was replicated in three control conditions. First, the interview was presented with reverse suggestion, that is, with an eminent colleague saying that the man on the tape was normal. All 20 mental health professionals participating in this condition agreed that he was normal. In the second control condition, the interview was presented without suggestion to a group of 21 mental health professionals. This time no one indicated that the interviewee was psychotic: 9 believed he was neurotic and 12 that he was normal. In the third control condition, the tape was presented to 12 laypersons. Again, no suggestion was made, and all of them agreed that the interviewee was in good mental health. When Temerlin finally studied the participants’ written notes, he observed that only the participants who arrived at a correct diagnosis had written good summaries. Indeed, only those who characterized the interviewee as normal provided accurate paraphrases of what the actor had actually said. Others either made very poor summaries or noted personal inferences rather than observations.
Overall, Temerlin concluded that diagnostic appraisals made by psy-professionals were highly influenced by prejudice and suggestion. This influence is situated at two levels. First, the huge impact of the eminent colleague’s informal comment shows that trained professionals’ decision-making was not guided by strong internal standards: they were exceedingly suggestible and proved to be biased by a variety of contextual influences. Second, the difference between the professionals and the laypersons in the non-suggestion condition illustrates that specialized knowledge did not aid them in drawing valid conclusions, but instead engendered a certain prejudice. After all, while the script was highly balanced and did not contain references to symptoms, mental suffering, or underlying psychopathological processes, almost half of the professionals in the non-suggestion condition believed that the interviewee had some kind of neurosis. Finally, the participants’ observation notes were far from accurate. They contained elements of prejudice and misrepresentation that subsequently distorted their evaluation. Indeed, the major impact of prejudice and suggestion not only implies that the reliability of psychiatric decision-making was low, but more importantly indicates that the validity of psychiatric diagnosis was fundamentally questionable.
The Temerlin study came as a bombshell: highly qualified professionals enveloped by an aura of seriousness suddenly proved to be uncritical and biased in their lines of reasoning: emperors without clothes. Without explicitly referring to the Temerlin study, clinical psychologist Paul Meehl further substantiated this problem in 1973. In his influential essay “Why I Do Not Attend Case Conferences,” Meehl listed several weaknesses in how clinical decision-making often took shape in psychiatric settings at that time. His list is impressive and further demonstrates why the then-common practices of informal data gathering and diagnostic decision-making were not only inadequate, but sometimes blatantly erroneous. Among the problems he reviews are the “sick-sick fallacy” and the “ad hoc fallacy,” both of which return in the discussion of Rosenhan’s study below.
Taken together, these problems, which Meehl (1973) described as so common that attending clinical case meetings was no longer worth the effort, painted a shocking picture of clinical psychiatric practice: rather than being informed by scientific methods and insights, diagnostic decision-making seemed governed by all kinds of flaws.
The second iconic study, which further demonstrated a number of the problems mentioned by Meehl, was published by David Rosenhan in the highly prestigious journal Science (1973, 1975). In order to test how professionals appraise minimal psychiatric symptoms, Rosenhan recruited eight individuals without psychiatric problems, “sane people” (1973, p. 179), and sent them as pseudo-patients to 12 hospitals in total. The pseudo-patients contacted the hospitals with the complaint that they were hearing voices that communicated ambiguous messages, like “empty,” “hollow,” and “thud.” Aside from providing information about these voices, the pseudo-patients did not give any other information related to (psycho)pathology and answered all further questions truthfully: “Beyond alleging the symptoms and falsifying name, vocation, and employment, no further alterations of person, history, or circumstances were made” (Rosenhan 1973, p. 180). All participants were admitted to the hospital, at which point they ceased feigning any characteristics of abnormality. Rosenhan was interested in how the pseudo-patients would be diagnosed, and observed that after an average of 19 days of hospitalization the participants were discharged with the following diagnoses: 11 with a diagnosis of schizophrenia and 1 with a diagnosis of manic-depressive psychosis, all of them with the added specification “in remission.” Rosenhan concluded that the diagnoses were not based on the relative normality of the pseudo-patients or their behavior in the ward. Rather, the fact of staying in a hospital provided a context in which minor symptoms were magnified and normal behaviors were pathologized, illustrating the sick-sick fallacy coined by Meehl (1973). Rosenhan anecdotally substantiated this conclusion by referring to a case summary prepared by one hospital as its pseudo-patient was discharged. In line with Meehl’s (1973) observation that clinicians tend to rapidly resort to complex theoretical speculations, often getting trapped in the ad hoc fallacy, the case report “explained” the hallucinatory symptom by referring to affective instability and ambivalence in the pseudo-patient’s family constellation; this, it was concluded, had provoked a schizophrenic reaction. With this experiment, Rosenhan put the problem of false-positive diagnosis (i.e., giving a diagnosis when it is unwarranted) on the agenda of scholars in psychiatry.
When Rosenhan first presented his results, the general response in the psychiatric establishment was disbelief. One hospital reacted that such frank errors would never occur in their institution, and to test this brave assertion, Rosenhan (1973, 1975) introduced a second phase to his experiment. He told the hospital that in the next three months one or more pseudo-patients would again attempt to get admitted to the hospital. Each staff member was asked to rate incoming patients according to the likelihood that they were pseudo-patients. Judgments were obtained on 193 admitted patients. The results seemed promising: 42 admitted patients were identified, with unflinching confidence, as pseudo-patients by at least one staff member; 23 were alleged to be pseudo-patients by their psychiatrist; and 19 by the psychiatrist and at least one other member of staff. Hospital staff were now meticulous in their detection of what could be false-positive diagnoses. The only problem in all of this is that Rosenhan had not actually sent anyone to the hospital. Indeed, the second phase of the experiment cast even more doubt on the validity of psychiatric diagnosis, and put the credibility of psychiatric practice into question.
Rosenhan’s second experiment opened up at least two possible interpretations. On the one hand, the results demonstrated the danger of false-negative diagnosis (giving no diagnosis when it seems justified). On the other hand, Rosenhan’s findings could be interpreted as a fundamental attack on psychiatry as such: if staff had serious doubts about the authenticity of the complaints of 10–20% of their patients, why did they decide to treat them in a residential setting? What is more, questions about the quality of psychiatric treatment also came to the fore in the notes made by the participants in the first phase of the experiment. Based on the pseudo-patients’ daily observations in the ward, staff were characterized as uninterested in the patients’ problems. Occasionally, staff members displayed outright misconduct toward patients, and the relationships that developed overall left patients feeling powerless and treated impersonally.
Rosenhan’s study aroused debate about the discipline of psychiatry, as well as criticism from colleagues. Robert Spitzer, the chair of the DSM-III task force, expressed one such criticism, using Rosenhan’s study to actually promote diagnosis based on standardized checklists (Spitzer 1975, pp. 450-452). In his view, the Rosenhan experiments were mere pseudo-science in Science. Contrary to the common interpretation of the results, Spitzer (1975) suggested that the psychiatrists had actually been correct in their consideration of psychosis. After all, the hallucinations might have indicated an acute schizophrenic episode, concerning which the DSM-II (p. 34) specifies that “in many cases the patient recovers within weeks.” The psychiatrists were correct when they described the condition as “in remission”: “The meaning of ‘in remission’ is clear: it means without signs of illness. Thus, all of the psychiatrists apparently recognized that all of the pseudo-patients were, to use Rosenhan’s term, ‘sane’” (Spitzer 1975, p. 444). Examining actual patient files from hospitals, Spitzer observed that the diagnosis “in remission” was used very rarely. The fact that it was used in Rosenhan’s study thus demonstrated, in his view, that psychiatrists were actually very accurate in their evaluations. Meanwhile, however, the damage to the image of psychiatric diagnosis had been done.
Actually, Spitzer himself was a key figure in pointing to severe methodological problems with psychiatric diagnosis. Together with the statistician Joseph Fleiss, he published a third iconic study, in which he argued that the problems with inter-rater reliability in habitual prototype-based diagnosis were insurmountable (Spitzer and Fleiss 1974). With the aim of providing a comprehensive overview of diagnostic reliability, they aggregated data from six major studies that examined how well diagnosticians typically reach agreement in their evaluation of the same patient. All studies included were published between the late 1950s and 1970s. The accumulated data covered 1726 patients, each diagnosed by two psychiatrists. The diagnostic systems used in the six studies differed slightly, but allowed aggregation in terms of the DSM-II disorder categories. Spitzer and Fleiss checked inter-rater agreement by calculating a kappa statistic for each DSM-II disorder category. This statistic estimates agreement between judges, but incorporates a correction for agreement based on mere chance. As Diana E. Clarke et al. (2013, p. 47) indicate, kappa coefficients reflect “the difference between the probabilities of getting a second positive diagnosis between those with a first positive and those with a first negative diagnosis.”
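For readers who want to see the chance correction at work, the logic of the kappa statistic can be sketched in a few lines of code. This is an illustrative sketch of Cohen’s kappa for two raters, not Spitzer and Fleiss’s actual computation, and the diagnoses below are invented:

```python
# Cohen's kappa: observed agreement between two raters, corrected for the
# agreement expected by chance given each rater's marginal frequencies.
# The two rating lists are hypothetical, for illustration only.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters independently pick the
    # same category, given how often each rater uses each category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical diagnoses of the same 10 patients by two psychiatrists.
a = ["psychosis", "neurosis", "neurosis", "psychosis", "normal",
     "neurosis", "psychosis", "normal", "neurosis", "psychosis"]
b = ["psychosis", "neurosis", "psychosis", "psychosis", "normal",
     "neurosis", "neurosis", "normal", "neurosis", "psychosis"]
print(round(cohen_kappa(a, b), 2))  # prints 0.69
```

Note how the correction matters: the two raters agree on 8 of 10 patients (raw agreement 0.80), but because chance alone would produce agreement of 0.36 given their category frequencies, the kappa coefficient drops to about 0.69.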
The next paragraphs in this section provide a methodological discussion of the study by Spitzer and Fleiss. Some parts might be hard to follow for those who are not particularly interested in numbers. However, a detailed discussion of reliability studies dating from the 1970s is important for an accurate evaluation of the idea that with the publication of DSM-III the reliability of psychiatric diagnoses strongly improved.
In a nutshell, Spitzer and Fleiss attempted to argue that psychiatric diagnosis in the 1960s and 1970s was not reliable. However, as I outline below, their way of reaching this conclusion started from rather high evaluation thresholds for the kappa statistic, which made a verdict of poor reliability more likely. Indeed, it was on the basis of these high standards for the kappa statistic that poor reliability was established. This appears to have given the authors reasonable grounds to denounce the classic prototype-based method of diagnosis, which, as we saw, characterized diagnostic practice in the preceding decades, and ultimately to argue for a switch to the checklist format that characterizes the DSM-III and subsequent editions of the manual. As we will see, reliability studies on the DSM-III (and its successors) indicate that diagnostic reliability improved substantially over the years. However, as I hope to demonstrate, this conclusion is spurious: whereas the magnitude of the kappa coefficients largely remained the same over the years, what changed were the thresholds against which they were interpreted.
While strict norms for interpreting the kappa coefficient do not exist, several authors have presented standards for doing so. In an early paper on the kappa statistic, Fleiss and Cohen (1973) devoted a section to the interpretation of the kappa coefficient. At that moment no fixed standards for using the statistic seem to have been established, yet the authors suggested that the kappa coefficient could be interpreted as an intra-class correlation coefficient. In another paper, Spitzer and colleagues argued that numerical scales of psychopathology “typically have reliabilities in the interval 0.70 to 0.90,” which motivated Spitzer and Fleiss (1974) to use this interval to evaluate the kappa statistics they collected: reliability is high if the coefficient is above 0.90, satisfactory if it falls in the 0.70–0.90 range, and unacceptable if it is lower than 0.70.
However, over the years, commonly accepted standards for interpreting the kappa coefficient were reformulated, and the norms for evaluating it relaxed considerably. For example, J.R. Landis and G.G. Koch (1977) proposed a frequently used standard: they cautiously indicate that a kappa value above 0.75 indicates excellent reliability, a value between 0.40 and 0.75 fair to good reliability, and a value below 0.40 poor reliability. In 1981 Spitzer’s collaborator Joseph Fleiss endorsed these more flexible norms. He proposed using them to interpret kappa coefficients (Fleiss et al. 1981, p. 609), thus leaving aside the previously formulated 0.70 to 0.90 standard. Later on, the interpretation of the kappa coefficient was refined further. Indeed, the standards used to interpret the DSM-5 field trial data (Clarke et al. 2013), which we discuss later, delineate kappa values of 0.80 and above as excellent, indicating almost perfect agreement; values from 0.60 to 0.79 as very good, indicating substantial agreement; values from 0.40 to 0.59 as good, indicating moderate agreement; values from 0.20 to 0.39 as questionable, indicating fair agreement; and values below 0.20 as unacceptable, indicating slight agreement (see also Kraemer et al. 2002; Viera and Garrett 2005). An overview of these three sets of kappa evaluation standards can be found in Table 2.1.

Table 2.1 Overview of three sets of norms for evaluating the kappa statistic

Spitzer and Fleiss (1974): above 0.90 high; 0.70–0.90 satisfactory; below 0.70 unacceptable
Landis and Koch (1977): above 0.75 excellent; 0.40–0.75 fair to good; below 0.40 poor
DSM-5 field trials (Clarke et al. 2013): 0.80 and above excellent; 0.60–0.79 very good; 0.40–0.59 good; 0.20–0.39 questionable; below 0.20 unacceptable

Balanced evaluations of results from inter-rater reliability studies conducted throughout the years should take these changing norms into account. This important evolution in statistical standards of evaluation seems, however, to have been largely ignored. Over time, the standards by which statistical evaluations were made relaxed substantially; if these changes are not taken into account, the results of different studies simply cannot be compared.
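To make the contrast concrete, here is a minimal sketch, with thresholds taken from the three sets of norms just reviewed, of how one and the same kappa coefficient receives different verdicts under each standard:

```python
# Each function maps a kappa coefficient to a verbal verdict under one of the
# three sets of norms discussed in the text.

def spitzer_fleiss_1974(k):
    if k > 0.90: return "high"
    if k >= 0.70: return "satisfactory"
    return "unacceptable"

def landis_koch_1977(k):
    if k > 0.75: return "excellent"
    if k >= 0.40: return "fair to good"
    return "poor"

def dsm5_field_trials(k):  # Clarke et al. 2013
    if k >= 0.80: return "excellent"
    if k >= 0.60: return "very good"
    if k >= 0.40: return "good"
    if k >= 0.20: return "questionable"
    return "unacceptable"

# The same coefficient, three different verdicts: e.g. kappa = 0.65 is
# "unacceptable" in 1974, "fair to good" in 1977, "very good" in 2013.
for kappa in (0.85, 0.65, 0.45, 0.30):
    print(kappa, "|", spitzer_fleiss_1974(kappa), "|",
          landis_koch_1977(kappa), "|", dsm5_field_trials(kappa))
```

A coefficient of 0.65, unacceptable by Spitzer and Fleiss’s benchmark, counts as fair to good for Landis and Koch and as very good in the DSM-5 field trials: precisely the shift the argument below turns on.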
In interpreting their aggregated data from reliability studies published between the late 1950s and early 1970s, Spitzer and Fleiss (1974) used the strict standard they themselves had formulated (see Table 2.2). This brought them to the conclusion that the reliability of pre-DSM-III diagnosis was disappointing. Indeed, when one applies their norms, none of the diagnostic categories has high reliability: no coefficients exceed 0.90. Only three diagnostic categories fall into the 0.70 to 0.90 interval of acceptable reliability (i.e., “mental deficiency, organic brain syndrome (but not its subtypes), and alcoholism” (Spitzer and Fleiss 1974, p. 344)). For 15 of the 18 studied diagnostic categories, by contrast, kappa values were below 0.70, meaning that for these conditions reliabilities were unacceptable.

Table 2.2 Kappa coefficients for DSM-II categories, adapted from Spitzer and Fleiss (1974, p. 344), and three interpretations of the kappa coefficients using the different sets of norms

This brought the authors to the following conclusion: “The reliability of psychiatric diagnosis as it has been practiced since at least the late 1950s is not good” (Spitzer and Fleiss 1974, p. 345). What is remarkable about Spitzer and Fleiss’s (1974) paper is that they act as if the norms they applied were commonly accepted, while in reality this was not the case. This might indicate either that they were very naive in their use of the kappa coefficient, or that they were determined to do away with psychiatric diagnosis as commonly practiced between the late 1950s and the early 1970s. Reading their 1974 paper today, it seems that they aimed, above all, at paving the way for checklist-based diagnosis. After all, following their general conclusion that the quality of psychiatric diagnosis was poor, Spitzer and Fleiss (1974) suggested that an overall change in the practice of diagnosis was needed. Referring to an older study by Aaron Beck et al. (1962), which demonstrated that diagnostic unreliability was mainly due to ambiguities in commonly used nomenclature, they suggested that the descriptions of disorders in the DSM-II were far too imprecise and that a different method of diagnosis was necessary. The alternative they proposed was indeed the checklist approach Spitzer embedded in the DSM-III.
However, if one interprets the kappa coefficients collected by Spitzer and Fleiss (1974) using the norms proposed by Landis and Koch (1977), a different conclusion comes to the fore (see Table 2.2). Based on their standards, reliability is excellent for 1 diagnostic category, organic brain disorder; fair to good for 11 diagnostic categories; and poor for 6 categories, including the subtypes of depression, personality disorder, and psychophysiological reaction. Indeed, bearing in mind Landis and Koch’s (1977) norms, strong assertions like those made by Spitzer and Fleiss (1974) cannot be confirmed, since two-thirds of the diagnostic categories show fair to excellent reliabilities. Applying these norms, one can conclude, for example, that the diagnostic reliability of the classic categories of neurosis and psychosis is fine. In view of the harsh discussions around the diagnostic category of neurosis during the development of the DSM-III, this might sound ironic. Indeed, as Hannah Decker (2013) documented extremely well, the category of neurosis was a divisive element in the late 1970s when the DSM-III was composed: psychoanalysts were convinced that neurosis was a crucial concept in diagnostic work, while biomedically inspired psychiatrists believed it was a useless category because of its link with psychoanalytic theory. The psychoanalysts lost the debate. Among other reasons, the psychoanalysts’ staunch attitude and poor knowledge of statistics and research methods were their undoing: they could not detect the holes in Spitzer’s methodology, and failed to substantiate the credibility of the neurosis concept. Neurosis was thus deleted as a diagnostic category. In the DSM-III the term “neurosis” was relegated to a few disorder category subtitles, and then only as a descriptive label. Eventually the concept disappeared from later editions of the handbook.
When one interprets the kappa coefficients from the Spitzer and Fleiss (1974) study with the kappa norms used in the DSM-5 field trials (Clarke et al. 2013), the picture is even more positive: very good reliability for 4 categories; good reliability for 8 categories, including neurosis, psychosis, and affective disorder; and questionable reliability for the 6 categories that Landis and Koch’s (1977) standard had also evaluated as poor. Indeed, the use of contemporary norms provides a completely different picture of the reliability of pre-DSM-III diagnosis. Starting from Spitzer and Fleiss’s (1974) benchmarks, reliability was unacceptable for 15 of the 18 disorder categories. In terms of contemporary standards, by contrast, none of the diagnostic categories has an unacceptably low reliability: for 12 disorders reliability is good to very good, and for 6 categories inter-rater reliability is questionable.
However, history teaches us that, to this day, it is the interpretation by Spitzer and Fleiss (1974) that has set the tone. Their interpretation substantiated the idea that in the 1970s psychiatric diagnosis was in deep crisis and that drastic reforms were necessary. In the discussion of their paper, Spitzer and Fleiss (1974) not only proposed a switch to checklist-based thinking, but also implied that diagnostic categories with psychoanalytic origins, like neurosis, caused diagnostic unreliability, which is why Spitzer campaigned against these categories and eventually strove to remove them from the DSM-III (Decker 2013). Spitzer himself was initially trained in psychoanalysis, but disliked the method, believing that by switching to a biomedical model psychiatry could make great advances.
In later publications, neither Spitzer nor any of the other officials working on subsequent editions of the DSM pointed to this evolution in the use of the kappa statistic, meaning that the early claims of Spitzer and Fleiss (1974) were never properly put into perspective. Based on their results, Spitzer and Fleiss simply denounced diagnosis based on prototypes and, as we will argue later in this chapter, made a plea for a biomedical approach to psychopathology. Interestingly, recent research demonstrates that the idea that prototype-based diagnosis is unreliable is untenable. For example, Drew Westen and associates developed and tested a prototype-matching approach to diagnosis: “diagnosticians compare a patient’s overall clinical presentation to a set of diagnostic prototypes - for clinical use, paragraph-length descriptions of empirically identified disorders - and rate the ‘goodness of fit’ or extent of match of the patient’s clinical presentation to the prototype” (2012, p. 17). Applied to the diagnosis of mood and anxiety disorders, such prototype-based diagnoses cohered with consistent patterns among criterion variables, and were actually better at predicting self-reported symptoms than DSM-based categorical diagnosis (DeFife et al. 2013). Applied to personality disorders, this method yielded highly reliable judgments: in an inter-rater reliability study including 65 patients a median correlation of 0.72 was observed (Westen et al. 2010), and in a study of 62 adolescent psychiatric patients a mean correlation of 0.70 was observed (Haggerty et al. 2016). In the 1970s, however, researchers in psychiatry concluded that prototype-based diagnosis was unreliable and that a switch to a checklist-based approach was needed.
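The correlational reliability figures just cited can also be made concrete with a small sketch. The scenario and ratings below are hypothetical: two clinicians independently rate, on a 1-5 scale, how well each patient's presentation matches a diagnostic prototype, and agreement is expressed as the Pearson correlation between their ratings (this illustrates the kind of statistic reported, not the actual procedure of the cited studies):

```python
# Pearson correlation between two raters' prototype "goodness of fit" ratings.
# All data are invented for illustration.

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical 1-5 prototype-match ratings for 10 patients by two clinicians.
rater_1 = [5, 4, 2, 1, 3, 4, 2, 5, 1, 3]
rater_2 = [4, 4, 1, 2, 3, 5, 2, 5, 1, 2]
print(round(pearson_r(rater_1, rater_2), 2))  # prints 0.88
```

A correlation in this range would, like the 0.70-0.72 figures reported by Westen and colleagues, indicate that two clinicians rank patients' fit to a prototype in a highly similar way.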