PROMs that are potentially invalid, difficult to interpret, and of questionable sensitivity can be, as Hobart et al. argue, an impediment to accurate estimates of effect size and detection of clinical change (Hobart et al. 2007). Indeed, they suggest that the failures of clinical trials to yield larger numbers of effective treatments may be due to the lack of scientific rigor of their measuring instruments. It is not surprising that pharmaceutical companies keen to demonstrate the effectiveness of their products while using measures that will satisfy the FDA guidelines are eager to explore methods that will improve their success. And it is not only industry that wants to see the acceleration of medical product development. The FDA also shares this goal.
Public-private partnerships, such as the Critical Path Institute (C-Path), created under the auspices of the FDA’s critical path initiative program, aim to create drug development tools (DDTs): new data, measurement, and method standards to accelerate the pace and reduce the cost of medical product development (Critical Path Institute 2015). They do so by coordinating collaborations among scientists from the FDA, industry, and academia. Essentially, C-Path puts industry scientists and academics together to develop tools that will enhance the ability of industry to develop medical products. The FDA then provides iterative feedback on the tools they create, with the hope that the process ends in the approval of the DDT for use in specific product development. PROMs are one of the four types of clinical outcome assessments eligible to qualify as a DDT (Food and Drug Administration 2007).
The FDA’s DDT qualification program, along with C-Path, is one way to build on the FDA guidelines for the use of PROMs in medical product labeling in order to streamline the process for an instrument’s acceptance by the FDA. It is also an opportunity for industry and academics to work together to further their individual ends and, in doing so, flesh out the FDA guidelines, which are not specific about what psychometric methods should be used to establish validity, interpretability, etc. DDTs provide industry and academics with the opportunity to develop new standards for measurement and methods, thus opening up room for new psychometric methods, such as Rasch. Indeed, Hobart et al. explicitly call for such developments in their work.
Partly through the work of Sergio Sismondo, the philosophical and bioethics community has learned to have a healthy skepticism of industry/academic partnerships. Much of Sismondo’s work focuses on violations of publication ethics through ghost-managed research (see Sismondo’s chapter “Hegemony of Knowledge and Pharmaceutical Industry Strategy” in this volume; Sismondo and Doucet 2010; Sismondo and Nicholson 2009; Sismondo 2007). He identifies the entangled nature of ghost management as practically expedient, but ethically troublesome. It is practically expedient because at least at first gloss (almost) everyone involved wins: pharmaceutical companies get more market value out of their publications if well-respected academics put their names on the manuscripts; academics get publications in notable journals; and the journals get well-cited manuscripts, which if published will produce revenue in the form of offprints purchased by industry (Sismondo and Doucet 2010). But ghost management is ethically troublesome (we might even say corrupt) because it reveals how extensively clinical research is driven by market concerns, which in turn raises questions about (1) the justification of subjecting human subjects to research and (2) the integrity of that research. It also intimates a kind of sad desperation among academics for high impact publications and involvement in large clinical trials. As Sismondo points out, what they are doing is unethical, but ambitious academics may have few other options (Sismondo and Doucet 2010).
Although my objective in this last section is not to reveal the kind of widespread corruption that Sismondo does in his work, I do want to suggest that the collaboration of industry and academics to develop DDTs should be critically evaluated. In what follows, I suggest how a PROM developed using Rasch—for the sake of argument, a DDT—could be co-opted to provide evidence of clinical change. Thus, not only should we critically evaluate the collaborative partnerships that C-Path facilitates, but also the use of Rasch as a value-neutral improvement to the scientific rigor of PROMs.
As I discussed earlier, Rasch, unlike CTT, makes use of a more robust measurement theory. As such, it tells us what to make of respondent answers to survey questions, e.g., when respondents answer yes to more difficult items, then they have more ability than those who answer yes to easier items. It also provides us with a ruler with specific item locations. Recall that the Rasch scale runs from plus to minus infinity, with the zero point at the place where the difficulty of the items in the survey is equal to the ability of the sample population. Each item is located on the ruler relative to the point at which there is equal probability of respondents answering “yes” or “no” to that particular item. In sum, Rasch provides a formal theory that tells us where to locate items and where to locate people. But Rasch does not provide an attribute theory that guides us in choosing the content of the scale, i.e., the items or questions.
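The point about item location can be made concrete with the standard dichotomous form of the Rasch model. The notation here is an assumption introduced for illustration and does not appear in the discussion above: θ_n stands for the ability of respondent n, b_i for the difficulty of item i, and X_ni for the response (1 for “yes”, 0 for “no”).

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\theta_n = b_i \;\Longrightarrow\; P(X_{ni} = 1) = \frac{\exp(0)}{1 + \exp(0)} = \frac{1}{2}
```

On this sketch, an item’s location on the ruler is the ability level at which “yes” and “no” are equally likely. Nothing in the formula says what the items should be about, which is the sense in which the model lacks an attribute theory.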
To be sure, there are constraints on the items that are chosen, the most obvious being that the data resulting from them must coincide with the Rasch model. And as I discussed earlier, if the standard error estimates of adjacent items overlap, then those items are taken to be too similar to one another. This latter constraint is not absolute, however, since increasing the sample size will decrease the standard error estimates and possibly preserve the questions under consideration. In any case, I want to put aside these two constraints and instead focus on the lack of an attribute theory within the Rasch model.
Rasch lacks a theory regarding the content of its target construct. Moreover, unlike the measurement of time, these target constructs are not enmeshed within a robust science such as physics. For example, Rasch does not tell us what is important about a particular construct (e.g., mobility) and neither does psychology. Thus, it is up to researchers who develop such scales to try out different questions until the survey data yield a fit with the Rasch model. But without theoretical guidance regarding the content of the construct of interest, how can we determine the adequate sensitivity of a scale? It seems that in this regard, Rasch is no better than CTT and possibly worse.
How might the use of Rasch be worse than CTT when it comes to the sensitivity of a scale? The problem is that Rasch makes it too easy to create a measure that is calibrated to detect clinical change. Consider the following example. It is possible to take survey data from questionnaires such as the European Quality of Life Five Dimensions (EQ-5D) and model it using Rasch. Imagine that when we do so, we find, not surprisingly, that the EQ-5D’s five questions are relatively insensitive to change because they divide wide variables (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression) into only a few levels. Earlier we discussed a similar problem regarding sensitivity in the context of CTT. In Rasch language, the EQ-5D is too easy, i.e., even respondents without a lot of ability can answer all the questions positively. For instance, eye problems, sleep problems, sexual functioning, memory problems, problems communicating poststroke, and fatigue are a few of the deficits to which the EQ-5D is generally insensitive.
Now, suppose that you were looking at the EQ-5D data because you were interested in whether or not it was the appropriate measure to use in a clinical trial to establish the effectiveness of a drug. You have the mean pretreatment scores of your target population and you know that pretreatment they already have more ability than the EQ-5D is able to measure. If you want to show a clinical improvement, then you need a measure that is more sensitive. In the language of Rasch, you need a measure that can target a higher-functioning population, i.e., respondents with more ability. Because you already know the mean pretreatment scores, you have an idea where on the ruler you need to develop the scale in order to measure the change you anticipate. Moreover, the more responsive the ruler (i.e., the closer together each step on the ruler), the more likely you will find a clinically significant change.
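The targeting logic can be illustrated with a minimal sketch, assuming the dichotomous Rasch model and the standard result that the precision of an ability estimate depends on how close the item difficulties lie to the respondent’s ability. Everything in it is hypothetical: the function names, the two sets of item difficulties, and the pretreatment ability value are invented for illustration and are not drawn from EQ-5D data.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a "yes" response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def scale_information(ability, difficulties):
    """Sum of item information p * (1 - p) across all items on the scale."""
    total = 0.0
    for difficulty in difficulties:
        p = rasch_probability(ability, difficulty)
        total += p * (1.0 - p)
    return total


def standard_error(ability, difficulties):
    """Approximate standard error of the ability estimate at this ability."""
    return 1.0 / math.sqrt(scale_information(ability, difficulties))


# Hypothetical item locations in logits (assumptions, not real calibrations).
easy_scale = [-3.0, -2.5, -2.0, -1.5, -1.0]    # items far below the population
targeted_scale = [1.0, 1.5, 2.0, 2.5, 3.0]     # items built around the population

# Hypothetical mean pretreatment ability of a high-functioning population.
pretreatment_ability = 2.0

se_easy = standard_error(pretreatment_ability, easy_scale)
se_targeted = standard_error(pretreatment_ability, targeted_scale)

print(f"SE with easy items:     {se_easy:.2f} logits")
print(f"SE with targeted items: {se_targeted:.2f} logits")
```

Under these assumptions, the scale whose items sit near the known pretreatment ability yields the smaller standard error, so a change of the same size is easier to distinguish from noise. The sketch shows only that knowing where the population sits on the ruler tells a developer where to place items; it says nothing about whether the items so placed capture what matters about the construct.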
I want to be very clear: I am not suggesting that anyone is disingenuously using Rasch to demonstrate clinical change. What I am suggesting is that Rasch represents an opportunity to increase the likelihood of finding clinical benefit, while the choice to use Rasch is presented as a matter of scientific rigor. I am not alone in recognizing that Rasch represents this opportunity. Indeed Hobart et al. admit that one criticism of more sensitive measures is that they will increase type 1 errors (false positives) (Hobart et al. 2007). But while they more or less dismiss this concern since blunt instruments are equally problematic, I think it is worth taking seriously.
One reason to do that is because science is a value-laden enterprise. Indeed, as Heather Douglas writes in Science, Policy, and the Value-Free Ideal, social and ethical values are necessary to any science that has a public role, i.e., any science that has a role in policy, medicine, technology, etc., as health measurement certainly does. Douglas’s argument is twofold. First, she reminds us that our evidence always underdetermines what we should believe (Douglas 2009). We can see her point, if we attend to Rasch measurement scales. Here we see that our knowledge of a construct, including respondent data from items thought to be related to the construct, underdetermines how many questions we ought to ask and at what difficulty level we should target our efforts, including how sensitive the scale should be and if certain areas of the scale should be more sensitive than others. Put another way, there is always an element of uncertainty in the use of scientific evidence. This uncertainty is overcome only when scientists use their judgment to determine which standard, characterization, claim, or theory is indicated (Douglas 2009).
For the second part of her argument, she claims that when science has a public role, when, for instance, a study has the potential to affect public policy or medical treatment options, then the use of expert judgment draws on social and ethical values. When science has the potential to affect others—and it clearly does in the context of health measurement—then the values employed in using one’s judgment should be connected to an individual’s perception of what is at stake should one make a mistake. Scientists ought to evaluate the social and ethical consequences of error (Douglas 2009: 87). In other words, when considering how sensitive a scale should be, researchers should contemplate the social and ethical consequences of creating an overly sensitive scale that increases the likelihood of type 1 errors. Some of the consequences might be loss of public trust, overmedication, rising healthcare costs, and industry (dis?)satisfaction. Equally, researchers should consider the consequences of creating less sensitive scales that increase the likelihood of type 2 errors (false negatives). Some of these consequences might be increased cost and time to medical product development and increased patient suffering due to the delay in medical product development.
For Douglas, the solution to scientific disagreements that stem from differences in social and ethical value orientations is to make the values on which decisions or judgments are based more transparent (Douglas 2009). In the context of health measurement, we might begin by simply acknowledging their existence. When Hobart et al. criticize CTT as unscientific and suggest Rasch as a replacement in the name of scientific rigor, we might soften the critique by recognizing that the choice to use CTT over Rasch reflects not only a lack of sophistication and knowledge, as they sometimes seem to suggest, but also a value choice of prioritizing expediency and simplicity. Moreover, even if Rasch does provide the basis for more scientific measurement scales, as I believe it does, supporters need to recognize the value-laden decisions that still characterize these scales. Without recognition of the values we employ under conditions of uncertainty, we cannot evaluate them. If we do not evaluate them, then I worry that, similar to the case of ghost management, we might find ourselves building measures tailored to the marketing needs of pharmacy.