Thomas J. Moore
The George Washington University School of Public Health and Health Services
Alan Scoboria and Sarah S. Nicholls
University of Connecticut
Although antidepressant medication is widely regarded as efficacious, a
recent meta-analysis of published clinical trials indicates that 75
percent of the response to antidepressants is duplicated by placebo (Kirsch
& Sapirstein, 1998). These data have been challenged on a number
of grounds, including the restriction of the analyses to patients who had
completed the trials, the limited number of clinical trials assessed, the
methodological characteristics of those trials, and the use of
meta-analytic statistical procedures (Klein,
1998).
The present article reports analyses of a data set to which these
objections do not apply, namely, the data submitted to the U.S. Food and
Drug Administration (FDA) for approval of recent antidepressant
medications. We analyzed the efficacy data submitted to the FDA for the
six most widely prescribed antidepressants approved between 1987 and 1999
(RxList:
The Internet Drug Index, 1999): fluoxetine (Prozac), paroxetine
(Paxil), sertraline (Zoloft), venlafaxine (Effexor), nefazodone (Serzone),
and citalopram (Celexa). These represent all but one of the selective
serotonin reuptake inhibitors (SSRIs) approved during the study period. The
FDA data set includes analyses of data from all patients who attended at
least one evaluation visit, even if they subsequently dropped out of the
trial prematurely. Results are reported from all well-controlled efficacy
trials of the use of these medications for the treatment of depression.
FDA medical and statistical reviewers had access to the raw data and
evaluated the trials independently. The findings of the primary medical
and statistical reviewers were verified by at least one other reviewer,
and the analysis was also assessed by an independent advisory panel. More
important, the FDA data constitute the basis on which these medications
were approved. Approval of these medications implies that these particular
data are strong enough and reliable enough to warrant approval. To the
extent that these data are flawed, the medications should not have been
approved.
Khan,
Warner, and Brown (2000) recently reported the results of a concurrent
analysis of the FDA database. Similar to the Kirsch and Sapirstein report,
their analysis revealed that 76% of the response to antidepressants was
duplicated by placebo. In several respects, our analyses of the FDA data
differ from, and supplement, those reported by Khan et al. First, although
information on all efficacy trials for depression is included in the FDA
database, mean change scores were not reported to the FDA for some trials
in which a significant difference between drug and placebo was not
obtained. Thus, the summary data reported by Khan et al. overestimate
drug/placebo differences. In contrast, we provide an estimate of
drug/placebo differences that is based on those medications for which data
from all clinical trials were reported, thus eliminating the bias due to
the exclusion of trials least favorable to the medication.
Second, two methods of accounting for attrition were used in the data
reported to the FDA: last observation carried forward (LOCF) and observed
cases (OC). In LOCF analyses, when a patient drops out of a trial, the
results of the last evaluation visit are carried forward as if the patient
had continued to the completion of the trial without further change. In OC
analyses, the results are reported only for those patients who are still
participating at the end of the time period being assessed. Because
patients who discontinue medication are regarded as treatment failures,
LOCF analyses are widely considered to provide a more conservative test of
drug effects, and the Khan
et al. (2000) analysis was confined to those data. We used the FDA
database to test this hypothesis empirically by comparing LOCF and OC data
for all trials in which both were reported.
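The two attrition rules can be sketched as follows. This is a minimal illustration with hypothetical weekly visit data, not values from the FDA reports:

```python
# Illustrative sketch of the two attrition rules described above.
# Each inner list holds one (hypothetical) patient's weekly HAM-D scores;
# the second patient dropped out after week 2.

def locf_scores(patients):
    """Last observation carried forward: every patient contributes his or
    her final recorded visit, whether or not the trial was completed."""
    return [visits[-1] for visits in patients if visits]

def oc_scores(patients, n_visits):
    """Observed cases: only patients still participating at the assessed
    time point contribute."""
    return [visits[n_visits - 1] for visits in patients if len(visits) >= n_visits]

patients = [[24, 20, 16, 12], [26, 23], [22, 18, 15, 11]]
print(locf_scores(patients))   # [12, 23, 11] -- dropout's week-2 score kept
print(oc_scores(patients, 4))  # [12, 11] -- dropout excluded entirely
```

Note how the dropout's early (higher, i.e., less improved) score is retained under LOCF but discarded under OC, which is why LOCF is usually assumed to be the more conservative rule.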
Using the Freedom of Information Act, we obtained the medical and
statistical reviews of every placebo-controlled clinical trial for
depression reported to the FDA for initial approval of the six most widely
used antidepressant drugs approved within the study period. We received
information about 47 randomized placebo-controlled short-term efficacy
trials conducted for the six drugs in support of an approved indication of
treatment of depression. The breakdown by efficacy trial was as follows:
fluoxetine (5), paroxetine (16), sertraline (7), venlafaxine (6),
nefazodone (8), and citalopram (5). Data on relapse prevention trials were
not analyzed.
In order to generalize the findings of the clinical trial to a larger
patient population, FDA reviewers sought a completion rate of 70% or
better for these typically 6-week trials. Only 4 of 45 trials, however,
reached this objective. Completion rates were not reported for two trials.
Attrition rates were comparable between drug and placebo conditions. Of
those trials for which these rates were reported, 60% of the placebo
patients and 63% of the study drug patients completed a 4-, 5-, 6-, or
8-week trial. Thirty-three of the 47 trials lasted 6 weeks, 6 lasted 4
weeks, 2 lasted 5 weeks, and 6 lasted 8 weeks. Patients were evaluated on
a weekly basis. For the present meta-analysis, the data were taken from
the last visit prior to trial termination.
A shortcoming in the FDA data set is the absence of standard deviations in
many of the reports. This precludes direct calculation of effect
sizes. Calculating effect sizes by dividing mean differences by standard
deviations allows researchers to combine the results of trials on which
different outcome measurement scales had been used. However, when
the same scale is used across studies, it is possible to combine the
results of the studies without first dividing them by the standard
deviation of the scales (Hunter
& Schmidt, 1990). The Hamilton Rating Scale for Depression (HAM-D) was the primary endpoint for all of
the reported trials in this analysis, thereby allowing direct comparisons
of outcome data without conversion into conventional effect size
(D) scores. The HAM-D is a widely used measure of depression, with
interjudge reliability coefficients ranging from r = .84 to
r = .90 (Hamilton,
1960).
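The distinction can be illustrated with a short sketch: effect sizes standardize mean differences by a standard deviation to put different scales on a common metric, whereas same-scale trials allow raw HAM-D differences to be pooled directly. All numbers here are hypothetical:

```python
# Effect sizes put different outcome scales on a common metric by dividing
# the mean difference by a standard deviation; when every trial uses the
# same scale (here, the HAM-D), raw mean differences can be pooled directly.
# All values below are hypothetical, not from the FDA reports.

def cohens_d(mean_drug, mean_placebo, pooled_sd):
    """Standardized mean difference (conventional effect size)."""
    return (mean_drug - mean_placebo) / pooled_sd

raw_difference = 10.1 - 8.3                 # same-scale trials: pool this directly
print(round(raw_difference, 2))             # 1.8 HAM-D points
print(round(cohens_d(10.1, 8.3, 6.0), 2))   # 0.3, if an SD of 6.0 were available
```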
For each clinical trial, we recorded the mean improvement in HAM-D
scores in the drug and placebo groups. Next, improvement in the placebo
group was divided by improvement in the drug group to provide an estimate
of the degree of improvement in the drug-treated patients that was
duplicated in the placebo group. Then, the mean of each of these trials,
weighted for sample size, was calculated within each drug.
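This computation can be sketched as follows, using hypothetical trial tuples of the form (sample size, mean drug change, mean placebo change):

```python
# Sketch of the summary computation described above: each trial's placebo
# improvement is divided by its drug improvement, and the resulting ratios
# are averaged within each drug, weighting each trial by its sample size.
# The trial values below are hypothetical, not taken from the FDA reports.

def weighted_placebo_proportion(trials):
    """trials: list of (n, mean drug change, mean placebo change)."""
    total_n = sum(n for n, _, _ in trials)
    return sum(n * (placebo / drug) for n, drug, placebo in trials) / total_n

trials = [(100, 10.0, 8.0), (200, 9.0, 7.2), (50, 11.0, 9.9)]
print(round(weighted_placebo_proportion(trials), 3))  # 0.814: 81.4% duplicated
```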
Results
The 17-item version of the HAM-D was used in all trials of paroxetine,
sertraline, nefazodone, and citalopram. The 21-item version was used in
trials of fluoxetine and venlafaxine. One citalopram trial reported scores
on both the 17-item scale and the 21-item scale, and another reported
scores on the 17-item scale and a 24-item version of the scale. We used
the 17-item scores for citalopram studies because this version of the
scale was used in all of the clinical trials of that medication.
Calculating the response to drug and placebo for the two studies that used
more than one form of the scale reveals comparable drug/placebo
differences, regardless of which version is used.
Mean improvement scores were not reported in 9 of the 47 trials.
Specifically, four paroxetine trials involving 165 participants, four
sertraline trials involving 486 participants, and one citalopram trial
involving 274 participants were reported as having failed to achieve a
statistically significant drug effect, but the mean HAM-D scores were not
reported. This represents 11% of the patients in paroxetine trials,
38% of the patients in sertraline trials, and 23% of the patients in
citalopram trials. In each case, the statistical or medical reviewers
stated that no drug effect was found.
Including data from the paroxetine, sertraline, and citalopram trials in
summary statistics would produce an inflated estimate of drug effects. Therefore,
to obtain an unbiased estimate of drug and placebo effects across
medications, we calculated weighted means of all medications for which
data on all clinical trials were reported. This included the data for
fluoxetine, venlafaxine, and nefazodone. The weighted mean difference
between the drug and placebo groups across these three medications was
1.80 points on the HAM-D, and 82% of the drug response was duplicated by
the placebo response. A t-test, weighted for sample size, indicated
that the drug/placebo difference was statistically significant,
t(18) = 5.01, p < .001.
On most of the clinical trials, medication dose was titrated
individually for each patient within a specified range. However, in 12
trials involving 1,942 patients, various fixed doses of a medication were
evaluated in separately randomized arms. It is possible that some of the
doses used in these trials were subclinical. If this is the case,
inclusion of these data could result in an underestimate of the drug
effect. To test this possibility, we compared LOCF data at the lowest and
highest doses reported in each study. Across these 12 trials, mean
improvement (weighted for sample size) was 9.57 points on the HAM-D at the
lowest dose evaluated and 9.97 at the highest dose. This difference
between high and low doses of antidepressant medication was not
statistically significant.
Finally, we tested the hypothesis that LOCF analyses provide more
conservative tests of drug effects than do OC analyses. LOCF means were
reported for all 38 of the trials in which means of any kind were
reported. OC means were reported for 27 of these 38 trials. In 22 trials,
the difference between drug and placebo group was not statistically
significant with either LOCF or OC measures. In 12 trials, the difference
was statistically significant with both measures. In 8 trials, the
difference was significant with LOCF but not with OC, and 4 trials were
reported to have shown no difference between drug and placebo without
specifying an attrition rule. For the 27 trials for which both sets of
means were reported, correlated t-tests indicated that mean
improvement scores were significantly greater with OC data than with LOCF
data for both drug, t(26) = 12.46, p < .001, and placebo,
t(26) = 10.56, p < .001, as was the proportion of the
drug response duplicated by placebo, t(26) = 3.36, p <
.01. In the LOCF data, 79% of the drug response was duplicated in the
placebo groups; in the OC data, 85% of the drug response was duplicated by
placebo. Thus, LOCF analyses indicate a greater drug/placebo difference
than do OC analyses.
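The correlated (paired) t-test used for this comparison can be sketched by hand; the trial means below are hypothetical, not the FDA values:

```python
# Sketch of the correlated (paired) t-test comparing OC and LOCF mean
# improvement scores across trials that report both. The trial means
# below are hypothetical, not values from the FDA reports.
import math

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n), n - 1

oc_means = [9.1, 8.7, 10.2, 9.8, 8.9]    # hypothetical OC improvements
locf_means = [8.0, 7.9, 9.1, 8.8, 8.2]   # hypothetical LOCF improvements
t_stat, df = paired_t(oc_means, locf_means)
print(round(t_stat, 2), df)
```

Because each trial contributes both an OC and an LOCF mean, the paired form of the test is appropriate: it operates on the within-trial differences rather than treating the two sets of means as independent samples.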
Discussion
In clinical trials, the effect of the active drug is assumed to be the
difference between the drug response and the placebo response. Thus, the
FDA clinical trials data indicate that 18% of the drug response is due to
the pharmacological effects of the medication. This is based on LOCF data,
in which the drug effect was significantly stronger than in OC data, and
it is obtained after those who show the greatest response to placebo are
excluded from the study. Overall, the drug/placebo difference was less
than 2 points on the HAM-D, a highly reliable physician-rated scale that
has been reported to be more sensitive than patient-rated scales to
drug/placebo differences (Murray,
1989). The range was from a 3-point drug/placebo difference for
venlafaxine to a 1-point difference for fluoxetine, both of which were on
the 21-item (64-point) version of the scale. As intimated in FDA memoranda
(Laughren,
1998; Leber,
1998), the clinical significance of these differences is
questionable.
Of the two widely used methods of coping with attrition in clinical
trials, LOCF analyses are considered the more stringent. The FDA data set
calls this assumption into question. The proportion of the drug effect
duplicated by placebo was significantly larger in the OC data set than in
the corresponding LOCF data set. In addition, the degrees of freedom are
necessarily larger in LOCF analyses, thereby making it more likely that a
mean difference will be statistically significant. In the 47 clinical
trials obtained from the FDA, there were no reported instances in which OC
data yielded significant differences that were not detected in LOCF
analyses. However, in 8 trials, LOCF data yielded significant differences
that were not detected when OC data were analyzed. These data indicate
that, compared with LOCF analyses, OC analyses provide more conservative
tests of drug/placebo differences.
Although mean differences were small, most of them favored the active
drug, and overall, the difference was statistically significant. There
were only 4 trials in which mean improvement scores in the placebo
condition were equal to or higher than those in the drug condition, and in
no case was placebo significantly more effective than active drug. This
may indicate a small but significant drug effect. However, it is also
possible that this difference between drug and placebo is an enhanced
placebo effect due to the breaking of blind. Antidepressant clinical trial
data indicate that the ability of patients and doctors to deduce whether
they have been assigned to the drug or placebo condition exceeds chance
levels (Rabkin
et al., 1986), possibly because of the greater occurrence of side
effects in the drug condition. Knowing that one has been randomized to the
active drug condition is likely to enhance the placebo effect, whereas
knowledge of assignment to the placebo group ought to decrease its effect
(Fisher
& Greenberg, 1993). Enhanced drug effects due to breaking blind in
clinical trials may be small, but evaluation of the FDA database indicates
that the drug/placebo difference is also very small, amounting to about 2
points on the HAM-D.
Although our data suggest that the effects of antidepressant drugs are
very small and of questionable clinical significance, this conclusion
rests on the assumption that drug effects and placebo effects are
additive. However, it is also possible that antidepressant drug and
placebo effects are not additive and that the true drug effect is greater
than the drug/placebo difference. Clinical trials are based on the
assumption of additivity (Kirsch,
2000). That is, the drug is deemed effective only if the response to
it is significantly greater than the response to placebo, and the
magnitude of the drug effect is assumed to be the difference between the
response to the drug and the response to placebo. However, drug and placebo responses are
not always additive. Alcohol and stimulant drugs, for example, produce at
least some drug and placebo effects that are not additive. Placebo alcohol
produces effects that are not observed when alcohol is administered
surreptitiously, and alcohol produces effects that are not duplicated by
placebo alcohol (Hull
& Bond, 1986). The placebo and pharmacological effects of caffeine
are additive for feelings of alertness but not for feelings of tension (Kirsch
& Rosadino, 1993), and similarly mixed results have been reported
for other stimulants (Lyerly,
Ross, Krugman, & Clyde, 1964; Ross,
Krugman, Lyerly, & Clyde, 1962).
If antidepressant drug effects and antidepressant placebo effects are
not additive, the ameliorating effects of antidepressants might be
obtained even if patients did not know the drug was being administered. If
that is the case, then antidepressant drugs have substantial pharmacologic
effects that are duplicated or masked by placebo. In this case,
conventional clinical trials are inappropriate for testing the effects of
these drugs, as they may result in the rejection of effective medications.
Conversely, if drug and placebo effects of antidepressant medication are
additive, then the data clearly show that those effects are small, at
best, and of questionable clinical efficacy. Finally, it is conceivable
that the effects are partially additive, with the true drug effect being
somewhere in between these extremes. The problem is that we do not know
which of these models is most accurate because the assumption of
additivity has never been tested with antidepressant medication.