Statistical Science
2008, Vol. 23, No. 1, 1–22
DOI: 10.1214/07-STS236
© Institute of Mathematical Statistics, 2008
Microarrays, Empirical Bayes and the
Two-Groups Model
Bradley Efron
Abstract. The classic frequentist theory of hypothesis testing developed by
Neyman, Pearson and Fisher has a claim to being the twentieth century’s
most influential piece of applied mathematics. Something new is happening in the twenty-first century: high-throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and
Bayesians alike. The two-groups model is a simple Bayesian construction
that facilitates empirical Bayes analysis. This article concerns the interplay
of Bayesian and frequentist ideas in the two-groups setting, with particular attention focused on Benjamini and Hochberg’s False Discovery Rate method.
Topics include the choice and meaning of the null hypothesis in large-scale
testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray
studies), correlation effects, multiple confidence intervals and Bayesian competitors to the two-groups model.
Key words and phrases: Simultaneous tests, empirical null, false discovery rates.
Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: [email protected]).

Discussed in 10.1214/07-STS236B, 10.1214/07-STS236C, 10.1214/07-STS236D and 10.1214/07-STS236A; rejoinder at 10.1214/08-STS236REJ.

1. INTRODUCTION

Simultaneous hypothesis testing was a lively research topic during my student days, exemplified by Rupert Miller's classic text "Simultaneous Statistical Inference" (1966, 1981). Attention focused on testing N null hypotheses at the same time, where N was typically less than half a dozen, though the requisite tables might go up to N = 20. Modern scientific technology, led by the microarray, has upped the ante in dramatic fashion: my examples here will have N's ranging from 200 to 10,000, while N = 500,000, from SNP analyses, is waiting in the wings. [The astrostatistical applications in Liang et al. (2004) envision N = 10^10 and more!]

Miller's text is relentlessly frequentist, reflecting a classic Neyman–Pearson testing framework, with the main goal being preservation of "α," overall test size, in the face of multiple inference. Most of the current microarray statistics literature shares this goal, and also its frequentist viewpoint, as described in the nice review article by Dudoit and Boldrick (2003).

Something changes, though, when N gets big: with thousands of parallel inference problems to consider simultaneously, Bayesian considerations begin to force themselves even upon dedicated frequentists. The "two-groups model" of the title is a particularly simple Bayesian framework for large-scale testing situations. This article explores the interplay of frequentist and Bayesian ideas in the two-groups setting, with particular attention paid to False Discovery Rates (Benjamini and Hochberg, 1995).

Figure 1 concerns four examples of large-scale simultaneous hypothesis testing. Each example consists of N individual cases, with each case represented by its own z-value "zi," for i = 1, 2, . . . , N. The zi's are based on familiar constructions that, theoretically, should yield standard N(0, 1) normal distributions under a classical null hypothesis,

(1.1) theoretical null: zi ∼ N(0, 1).

FIG. 1. Four examples of large-scale simultaneous inference, each panel indicating N z-values as explained in the text. Panel A, prostate cancer microarray study, N = 6033 genes; panel B, comparison of advantaged versus disadvantaged students passing mathematics competency tests, N = 3748 high schools; panel C, proteomics study, N = 230 ordered peaks in time-of-flight spectroscopy experiment; panel D, imaging study comparing dyslexic versus normal children, showing horizontal slice of 655 voxels out of N = 15,455, coded "−" for zi < 0, "+" for zi ≥ 0 and solid circle for zi > 2.
Here is a brief description of the four examples, with
further information following as needed in the sequel.
EXAMPLE A [Prostate data, Singh et al. (2002)].
N = 6033 genes on 102 microarrays, n1 = 50 healthy
males compared with n2 = 52 prostate cancer patients;
zi ’s based on two-sample t statistics comparing the two
categories.
EXAMPLE B [Education data, Rogosa (2003)]. N =
3748 California high schools; zi ’s based on binomial
test of proportion advantaged versus proportion disadvantaged students passing mathematics competency
tests.
EXAMPLE C [Proteomics data, Turnbull (2006)].
N = 230 ordered peaks in time-of-flight spectroscopy
study of 551 heart disease patients. Each peak’s z-value
was obtained from a Cox regression of the patients’
survival times, with the predictor variable being the
551 observed intensities at that peak.
EXAMPLE D [Imaging data, Schwartzman et al. (2005)]. N = 15,445 voxels in a diffusion tensor imaging (DTI) study comparing six dyslexic with six
normal children; zi ’s based on two-sample t statistics
comparing the two groups. The figure shows only a
single horizontal brain section having 655 voxels, with
“−” indicating zi < 0, “+” for zi ≥ 0, and solid circles
for zi > 2.
Our four examples are enough alike to be usefully
analyzed by the two-groups model of Section 2, but
there are some striking differences, too: the theoretical
N(0, 1) null (1.1) is obviously inappropriate for the education data of panel B; there is a hint of correlation of
z-value with peak number in panel C, especially near
the right limit; and there is substantial spatial correlation appearing in the imaging data of panel D.
My plan here is to discuss a range of inference problems raised by large-scale hypothesis testing, many of
which, it seems to me, have been more or less underemphasized in a literature focused on controlling Type I errors: the choice of a null hypothesis, limitations of permutation methods, the meaning of "null" and "nonnull" in large-scale settings, questions of power, tests of significance for groups of cases (e.g., pathways in
microarray studies), the effects of correlation, multiple
confidence statements and Bayesian competitors to the
two-groups model. The presentation is intended to be
as nontechnical as possible, many of the topics being
discussed more carefully in Efron (2004, 2005, 2006).
References will be provided as we go along, but this is
not intended as a comprehensive review. Microarrays
have stimulated a burst of creativity from the statistics
community, and I apologize in advance for this article’s concentration on my own point of view, which
aims at minimizing the amount of statistical modeling required of the statistician. More model-intensive
techniques, including fully Bayesian approaches, as in
Parmigiani et al. (2002) or Lewin et al. (2006), have
their own virtues, which I hope will emerge in the Discussion.
Section 2 discusses the two-groups model and false
discovery rates in an idealized Bayesian setting. Empirical Bayes methods are needed to carry out these ideas
in practice, as discussed in Section 3. This discussion
assumes a “good” situation, like that of Example A,
where the theoretical null (1.1) fits the data. When it
does not, as in Example B, the empirical null methods
of Section 4 come into play. These raise interpretive
questions of their own, as mentioned above, discussed
in the later sections.
We are living through a scientific revolution powered by the new generation of high-throughput observational devices. This is a wonderful opportunity for
statisticians, to redemonstrate our value to the scientific world, but also to rethink basic topics in statistical
theory. Hypothesis testing is the topic here, a subject
that needs a fresh look in contexts like those of Figure 1.
2. THE TWO-GROUPS MODEL AND FALSE
DISCOVERY RATES
The two-groups model is too simple to have a single
identifiable author, but it plays an important role in the
Bayesian microarray literature, as in Lee et al. (2000),
Newton et al. (2001) and Efron et al. (2001). We suppose that the N cases (“genes” as they will be called
now in deference to microarray studies, though they
are not genes in the last three examples of Figure 1)
are each either null or nonnull with prior probability
p0 or p1 = 1 − p0 , and with z-values having density
either f0 (z) or f1 (z),
(2.1)  p0 = Pr{null},      f0(z) density if null,
       p1 = Pr{nonnull},   f1(z) density if nonnull.
The usual purpose of large-scale simultaneous testing is to reduce a vast set of possibilities to a much
smaller set of scientifically interesting prospects. In
Example A, for instance, the investigators were probably searching for a few genes, or a few hundred at most,
worthy of intensive study for prostate cancer etiology.
I will assume
(2.2)
p0 ≥ 0.90
in what follows, limiting the nonnull genes to no more
than 10%.
False discovery rate (Fdr) methods have developed
in a strict frequentist framework, beginning with Benjamini and Hochberg’s seminal 1995 paper, but they
also have a convincing Bayesian rationale in terms of
the two-groups model. Let F0 (z) and F1 (z) denote
the cumulative distribution functions (cdf) of f0 (z)
and f1 (z) in (2.1), and define the mixture cdf F (z) =
p0 F0 (z) + p1 F1 (z). Then Bayes’ rule yields the a posteriori probability of a gene being in the null group of
(2.1) given that its z-value Z is less than some threshold z, say “Fdr(z),” as
(2.3) Fdr(z) ≡ Pr{null | Z ≤ z} = p0 F0(z)/F(z).
[Here it is notationally convenient to consider the negative end of the z scale, values like z = −3. Definition (2.3) could just as well be changed to Z > z or
Z > |z|.] Benjamini and Hochberg’s (1995) false discovery rate control rule begins by estimating F (z) with
the empirical cdf
(2.4)
F̄ (z) = #{zi ≤ z}/N,
yielding Fdr(z) = p0 F0 (z)/F̄ (z). The rule selects a
control level “q,” say q = 0.1, and then declares as
nonnull those genes having z-values zi satisfying zi ≤
z0 , where z0 is the maximum value of z satisfying
(2.5)
Fdr(z0 ) ≤ q
[usually taking p0 = 1 in (2.3), and F0 the theoretical null, the standard normal cdf Φ(z) of (1.1)].
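As a concrete illustration of rule (2.5), here is a minimal R sketch of the left-tail version just described, taking p0 = 1 and the theoretical null F0 = Φ. The function name is mine, ties in the empirical cdf are handled crudely, and this is not the SAM or locfdr implementation.

    bh.lower.tail <- function(z, q = 0.1, p0 = 1) {
      ## empirical cdf (2.4) and estimated Fdr(z_i) = p0 F0(z_i) / Fbar(z_i)
      N    <- length(z)
      Fbar <- rank(z) / N
      Fdr  <- p0 * pnorm(z) / Fbar
      ok   <- which(Fdr <= q)
      if (length(ok) == 0) return(rep(FALSE, N))
      z0 <- max(z[ok])          # largest threshold z0 with estimated Fdr(z0) <= q
      z <= z0                   # genes declared nonnull
    }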
The striking theorem proved in the 1995 paper
was that the expected proportion of null genes reported by a statistician following rule (2.5) will be
no greater than q. This assumes independence among
the zi ’s, extended later to various dependence models
in Benjamini and Yekutieli (2001). The theorem is a
purely frequentist result, but as pointed out in Storey
(2002) and Efron and Tibshirani (2002), it has a simple Bayesian interpretation via (2.3): rule (2.5) is essentially equivalent to declaring nonnull those genes
whose estimated tail-area posterior probability of being null is no greater than q. It is usually a good sign
when Bayesian and frequentist ideas converge on a single methodology, as they do here.
Densities are more natural than tail areas for Bayesian fdr interpretation. Defining the mixture density
from (2.1),
(2.6)
f (z) = p0 f0 (z) + p1 f1 (z),
Bayes’ rule gives
fdr(z) ≡ Pr{null|Z = z}
(2.7)
= p0 f0 (z)/f (z)
for the probability of a gene being in the null group
given z-score z. Here fdr(z) is the local false discovery
rate (Efron et al., 2001; Efron, 2005).
There is a simple relationship between Fdr(z) and
fdr(z),
(2.8)
Fdr(z) = Ef {fdr(Z)|Z ≤ z},
“Ef ” indicating expectation with respect to the mixture density f (z). That is, Fdr(z) is the mixture average of fdr(Z) for Z ≤ z. In the usual situation where
fdr(z) decreases as |z| gets large, Fdr(z) will be smaller
than fdr(z). Intuitively, if we decide to label all genes
with zi less than some negative value z0 as nonnull,
then fdr(z0 ), the false discovery rate at the boundary point z0 , will be greater than Fdr(z0 ), the average
false discovery rate beyond the boundary. Figure 2 illustrates the geometrical relationship between Fdr(z)
and fdr(z); the Benjamini–Hochberg Fdr control rule
amounts to an upper bound on the secant slope.
For Lehmann alternatives

(2.9) F1(z) = F0(z)^γ,    [γ < 1],

it turns out that

(2.10) log{fdr(z)/(1 − fdr(z))} = log{Fdr(z)/(1 − Fdr(z))} + log(1/γ),

so

(2.11) fdr(z) ≈ Fdr(z)/γ
for small values of Fdr. The prostate data of Figure 1
has γ about 1/2 in each tail, making fdr(z) ∼ 2 Fdr(z)
near the extremes.
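A quick numerical check of (2.10)–(2.11) in R, using a made-up two-groups model with a Lehmann alternative; the values of p0, γ and z below are chosen only for illustration.

    p0 <- 0.9; p1 <- 0.1; gam <- 0.5
    F0 <- pnorm; f0 <- dnorm
    F1 <- function(z) pnorm(z)^gam                        # Lehmann alternative (2.9)
    f1 <- function(z) gam * pnorm(z)^(gam - 1) * dnorm(z)
    z   <- -4
    Fdr <- p0 * F0(z) / (p0 * F0(z) + p1 * F1(z))
    fdr <- p0 * f0(z) / (p0 * f0(z) + p1 * f1(z))
    log(fdr / (1 - fdr)) - log(Fdr / (1 - Fdr))   # equals log(1/gam), as in (2.10)
    fdr / (Fdr / gam)                             # near 1 for small Fdr, as in (2.11)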
The statistics literature has not reached consensus on
the choice of q for the Benjamini–Hochberg control
rule (2.5)—what would be the equivalent of 0.05 for
classical testing—but Bayes factor calculations offer
some insight. Efron (2005, 2006) uses the cutoff point
(2.12)
fdr(z) ≤ 0.20
for reporting nonnull genes, on the admittedly subjective grounds that fdr values much greater than 0.20 are dangerously prone to wasting investigators' resources.

FIG. 2. Relationship of Fdr(z) to fdr(z). Heavy curve plots numerator of Fdr, p0 F0(z), versus denominator F(z); fdr(z) is slope of tangent, Fdr slope of secant.
Then (2.6), (2.7) yield posterior odds ratio
(2.13) Pr{nonnull | z}/Pr{null | z} = (1 − fdr(z))/fdr(z) = p1 f1(z)/p0 f0(z) ≥ 0.8/0.2 = 4.
Since (2.2) implies p1 /p0 ≤ 1/9, (2.13) corresponds to
requiring Bayes factor
(2.14)
f1 (z)/f0 (z) ≥ 36
in favor of nonnull in order to declare significance.
Factor (2.14) requires much stronger evidence
against the null hypothesis than in standard one-at-atime testing, where the critical threshold lies somewhere near 3 (Efron and Gous, 2001). The fdr 0.20
threshold corresponds to q-values in (2.5) between
0.05 and 0.15 for moderate choices of γ ; such q-value
thresholds can be interpreted as providing conservative
Bayes factors for Fdr testing.
Model (2.1) ignores the fact that investigators usually begin with hot prospects in mind, genes that have
high prior probability of being interesting. Suppose
p0 (i) is the prior probability that gene i is null, and define p0 as the average of p0 (i) over all N genes. Then
Bayes' theorem yields this expression for fdri(z) = Pr{gene i null | zi = z}:

(2.15) fdri(z) = fdr(z) · ri/[1 − (1 − ri) fdr(z)],    ri = [p0(i)/(1 − p0(i))] / [p0/(1 − p0)],
where fdr(z) = p0 f0 (z)/f (z) as before. So for a hot
prospect having p0 (i) = 0.50 rather than p0 = 0.90,
(2.15) changes an uninteresting result like fdr(zi ) =
0.40 into fdri (zi ) = 0.069.
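The adjustment (2.15) is easy to apply; the small R function below (the name is mine) reproduces the "hot prospect" numbers just quoted, using only the inputs given in the text.

    fdr.i <- function(fdr, p0i, p0) {
      ri <- (p0i / (1 - p0i)) / (p0 / (1 - p0))    # relative prior odds of being null
      fdr * ri / (1 - (1 - ri) * fdr)              # formula (2.15)
    }
    fdr.i(fdr = 0.40, p0i = 0.50, p0 = 0.90)       # about 0.069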
Wonderfully neat and exact results like the Benjamini–
Hochberg Fdr control rule exert a powerful influence
on statistical theory, sometimes more than is good for
applied work. Much of the microarray statistics literature seems to me to be overly concerned with exact
properties borrowed from classical test theory, at the
expense of ignoring the complications of large-scale
testing. Neatness and exactness are mostly missing in
what follows as I examine an empirical Bayes approach
to the application of two-groups/Fdr ideas to situations
like those in Figure 1.
3. EMPIRICAL BAYES METHODS
In practice, the difference between Bayesian and
frequentist statisticians is their self-confidence in assigning prior distributions to complicated probability
models. Large-scale testing problems certainly look
complicated enough, but this is deceptive; their massively parallel structure, with thousands of similar situations each providing information, allows an appropriate prior distribution to be estimated from the data
without upsetting even timid frequentists like myself.
This is the empirical Bayes approach of Robbins and
Stein, 50 years old but coming into its own in the microarray era; see Efron (2003).
Consider estimating the local false discovery rate
fdr(z) = p0 f0 (z)/f (z), (2.7). I will begin with a
“good” case, like the prostate data of Example A in
Section 1, where it is easy to believe in the theoretical
null distribution (1.1),
(3.1) f0(z) = φ(z) ≡ (1/√(2π)) e^{−(1/2)z²}.
The z-values in Example A were obtained by transforming the usual two-sample t statistic “ti ” comparing cancer and normal patients’ expression levels for
gene i, to a standard normal scale via
(3.2) zi = Φ^{−1}(F100(ti));

here Φ and F100 are the cdf's of the standard normal and t100 distributions. If we had only gene i's data to test,
classic theory would tell us to compare zi with f0 (z) =
ϕ(z) as in (3.1).
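In R the transformation (3.2) is a single line; t.stat below is a placeholder value rather than an actual prostate-data statistic.

    t.stat <- 2.1
    z <- qnorm(pt(t.stat, df = 100))    # z = Phi^{-1}(F_100(t)), as in (3.2)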
For the moment I will take p0 , the prior probability of a gene being null, as known. Section 4 discusses p0 ’s estimation, but in fact its exact value does
not make much difference to Fdr(z) or fdr(z), (2.3)
or (2.7), if p0 is near 1 as in (2.2). Benjamini and
Hochberg (1995) take p0 = 1, providing an upper
bound for Fdr(z).
This leaves us with only the denominator f (z) to estimate in (2.7). By definition (2.6), f (z) is the marginal density of all N zi ’s, so we can use all the data
to estimate f (z). The algorithm locfdr, an R function
available from the CRAN library, does this by means
of standard Poisson GLM software (Efron, 2005). Suppose the z-values have been binned, giving bin counts
(3.3) yk = #{zi in bin k},    k = 1, 2, . . . , K.

The prostate data histogram in panel A of Figure 1 has K = 49 bins of width Δ = 0.2.
We take the yk to be independent Poisson counts,
(3.4) yk ∼ind Po(νk),    k = 1, 2, . . . , K,

with the unknown νk proportional to density f(z) at midpoint "xk" of the kth bin, approximately

(3.5) νk = NΔ f(xk).
Modeling log(νk) as a pth-degree polynomial function of xk makes (3.4)–(3.5) a standard Poisson generalized linear model (GLM). The choice p = 7 used in Figure 3 amounts to estimating f(z) by maximum likelihood within the seven-parameter exponential family

(3.6) f(z) = exp{ Σ_{j=0}^{7} βj z^j }.
Notice that p = 2 would make f (z) normal; the extra
parameters in (3.6) allow flexibility in fitting the tails
of f (z). Here we are employing Lindsey’s method; see
Efron and Tibshirani (1996). Despite its unorthodox
look, it is no more than a convenient way to obtain
maximum likelihood estimates in multiparameter families like (3.6).
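The following R sketch shows the Poisson-regression idea of (3.3)–(3.6) end to end. It is a schematic stand-in for locfdr, not its code: the z-values are simulated, the bin count K and polynomial degree are copied from the text, and p0 is simply taken as given.

    set.seed(1)
    zz   <- c(rnorm(5400), rnorm(600, mean = 3))     # simulated z-values, for illustration
    K    <- 49
    brks <- seq(min(zz), max(zz), length = K + 1)
    xk   <- (brks[-1] + brks[-(K + 1)]) / 2          # bin midpoints
    yk   <- hist(zz, breaks = brks, plot = FALSE)$counts   # bin counts (3.3)
    fit  <- glm(yk ~ poly(xk, 7), family = poisson)  # log(nu_k) = degree-7 polynomial
    Delta <- diff(brks)[1]                           # bin width
    fhat <- fitted(fit) / (length(zz) * Delta)       # f(x_k) = nu_k / (N Delta), from (3.5)
    p0   <- 0.93                                     # taken as known here (see Section 4)
    fdrk <- pmin(1, p0 * dnorm(xk) / fhat)           # estimated local fdr (3.7) at midpoints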
The heavy curve in Figure 3 is an estimate of the
local false discovery rate for the prostate data,
(3.7) fdr(z) = p0 f0(z)/f(z),

with f(z) constructed as above, f0(z) = φ(z) as in (3.1), and p0 = 0.93, as estimated in Section 4; fdr(z) is near 1 for |z| ≤ 2, decreasing to interesting levels for |z| > 3. Fifty-one of the 6033 genes have fdr(zi) ≤ 0.2, 26 on the right and 25 on the left, and these could be reported back to the investigators as likely nonnull candidates. [The standard Benjamini–Hochberg procedure,
(2.5) with q = 0.1, reports 60 nonnull genes, 28 on the
right and 32 on the left.]
At this point the reader might notice an anomaly:
if p0 = 0.93 of the N genes are null, then about
(1 − p0 ) · 6033 = 422 should be nonnull, but only
51 are reported. The trouble is that most of the nonnull genes are located in regions of the z axis where
fdr(zi) exceeds 0.5, and these cannot be reported without also reporting a bevy of null cases. In other words,
the prostate study is underpowered.
The vertical bars in Figure 3 are estimates of the
nonnull counts, the histogram we would see if only
the nonnull genes provided z-values. In terms of (3.3),
(3.7), the nonnull counts “yk(1) ” are
(3.8) yk(1) = [1 − fdrk] yk,

where fdrk = fdr(xk), the estimated fdr value at the center of bin k. Since 1 − fdrk approximates the nonnull probability for a gene in bin k, formula (3.8) is an obvious estimate for the expected number of nonnulls. Power diagnostics are obtained from comparisons of fdr(z) with the nonnull histogram. High power would be indicated if fdrk was small where yk(1) was large.
FIG. 3. Heavy curve is estimated local false discovery rate fdr(z) for prostate data. Fifty-one genes, 26 on the right and 25 on the left, have fdr(zi) < 0.20. Vertical bars estimate histogram of the nonnull counts (plotted negatively, divided by 50). Most of the nonnull genes will not be reported.
That obviously is not the case in Figure 3. A simple
power diagnostic is
(3.9) Efdr(1) = Σ_{k=1}^{K} yk(1) fdrk / Σ_{k=1}^{K} yk(1),

the expected nonnull fdr. We want Efdr(1) to be small, perhaps near 0.2, so that a typical nonnull gene will show up on a list of likely prospects. The prostate data has Efdr(1) = 0.68, indicating low power. If the whole
study were rerun, we could expect a different list of 50
likely nonnull genes, barely overlapping with the first
list. Section 3 of Efron (2006) discusses power calculations for microarray studies, presenting more elaborate
power diagnostics.
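Given the binned counts yk and binned fdr estimates fdrk (as in the sketch following (3.6)), the nonnull counts (3.8) and the diagnostic (3.9) take two lines. This is a hypothetical helper of my own, not locfdr's output.

    power.diag <- function(yk, fdrk) {
      yk1 <- (1 - fdrk) * yk                      # estimated nonnull counts, (3.8)
      c(Efdr1 = sum(fdrk * yk1) / sum(yk1))       # expected nonnull fdr, (3.9); small is good
    }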
Stripped of technicalities, the idea underlying false
discovery rates is appealingly simple, and in fact does
not depend on the literal validity of the two-groups
model (2.1). Consider the bin zi ∈ [3.1, 3.3] in the
prostate data histogram; 17 of the 6033 genes fall
into this bin, compared to expected number 2.68 = p0 NΔφ(3.2) of null genes, giving

(3.10) fdr = 2.68/17 = 0.16

as an estimated false discovery rate. (The smoothed estimate in Figure 3 is fdr = 0.24.) The implication is that
only about one-sixth of the 17 are null genes. This conclusion can be sharpened, as in Lehmann and Romano
(2005), but (3.10) catches the main idea.
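The arithmetic behind (3.10) in R, with bin width Δ = 0.2 and p0 = 0.93 from earlier in the section:

    expected.null <- 0.93 * 6033 * 0.2 * dnorm(3.2)   # p0 * N * Delta * phi(3.2), about 2.68
    expected.null / 17                                # estimated fdr for the bin, about 0.16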
Notice that we do not need all the null genes to have
the same density f0 (z); it is enough to assume that the
average null density is f0 (z), ϕ(z) in this case, in order
to calculate the numerator 2.68. (This is an advantage
of false discovery rate methods, which only control
rates, not individual probabilities.) The nonnull density f1 (z) in (2.1) plays no role at all since the denominator 17 is an observed quantity. Exchangeability is the
key assumption in interpreting (3.10): we expect about
1/6 of the 17 genes to be null, and assign posterior null
probability 1/6 to all 17. Nonexchangeability, in the
form of differing prior information among the 17, can
be incorporated as in (2.15).
Density estimation has a reputation for difficulty,
well-deserved in general situations. However, there are
good theoretical reasons, presented in Section 6 of
Efron (2005), for believing that mixtures of z-values
are quite smooth, and that (3.7) will efficiently estimate fdr(z). Independence of the zi ’s is not required,
only that the estimated f(z) in (3.7) be reasonably close to the true mixture density f(z).
TABLE 1
Boldface, standard errors of log fdr(z) (local fdr) and log Fdr(z) (tail-area), 250 replications of model (3.11), N = 1500. Parentheses, average from formula (5.9), Efron (2006); fdr is true value (2.7). Empirical null results explained in Section 4

                     Theoretical null                Empirical null
  z      fdr      local   (formula)   tail       local   (formula)   tail
 1.5    0.88      0.05    (0.05)      0.05       0.04    (0.04)      0.10
 2.0    0.69      0.08    (0.09)      0.05       0.09    (0.10)      0.15
 2.5    0.38      0.09    (0.10)      0.05       0.16    (0.16)      0.23
 3.0    0.12      0.08    (0.10)      0.06       0.25    (0.25)      0.32
 3.5    0.03      0.10    (0.13)      0.07       0.38    (0.38)      0.42
 4.0    0.005     0.11    (0.15)      0.10       0.50    (0.51)      0.52
A variety of local fdr estimation methods have been
suggested, using parametric, semiparametric, nonparametric and Bayes methods: Pan et al. (2003), Pounds
and Morris (2003), Allison et al. (2002), Heller and
Qing (2003), Broberg (2005), Aubert et al. (2004),
Liao et al. (2004) and Do et al. (2005), all performing
reasonably well. The Poisson GLM methodology of
locfdr has the advantage of easy implementation with
familiar software, and a closed-form error analysis.
Estimation efficiency becomes a more serious problem on the “Empirical Null” side of Table 1, where we
can no longer trust the theoretical null f0 (z) ∼ N(0, 1).
This is the subject of Section 4.
4. THE EMPIRICAL NULL DISTRIBUTION
Table 1 reports on a small simulation study in which
(3.11) zi ∼ind N(μi, 1),  where μi = 0 with probability 0.9, and μi ∼ N(3, 1) with probability 0.1,
for i = 1, 2, . . . , N = 1500. The table shows standard
deviations for log(fdr(z)),
(3.7), from 250 simulations
of (3.11), and also using a delta-method formula derived in Section 5 of Efron (2006), incorporated in the
locfdr algorithm. Rather than (3.6), f (z) was modeled
by a seven-parameter natural spline basis, locfdr’s default, though this gave nearly the same results as (3.6).
Also shown are standard deviations for the corresponding tail-area quantity log(Fdr(z)), obtained by substituting F(z) = ∫_{−∞}^{z} f(z′) dz′ in (2.3). [This is a little less variable than using F̄(z), (2.4).]
The "Theoretical Null" side of the table shows that fdr(z) is more variable than Fdr(z), but both are more than accurate enough for practical use. At z = 3, for example, fdr(z) only errs by about 8%, yielding fdr(z) ≐ 0.12 ± 0.01. Standard errors are roughly proportional to N^{−1/2}, so even reducing N to 250 gives fdr(3) ≐ 0.12 ± 0.025, and similarly for other values of z, accurate enough to make pictures like Figure 3 believable.
Empirical Bayes is a bipolar methodology, with alternating episodes of frequentist and Bayesian activity.
Frequentists may prefer Fdr [or F̄dr, (2.5)] to fdr because of connections with classical tail-area hypothesis testing, or because cdf's are more straightforward to estimate than densities, while Bayesians prefer fdr for its
more apt a posteriori interpretation. Both, though, combine the Bayesian two-groups model with frequentist
estimation methods, and deliver the same basic information.
We have been assuming that f0 (z), the null density in (2.1), is known on theoretical grounds, as in
(3.1). This leads to false discovery estimates such as
fdr(z) = p0 f0(z)/f(z) and Fdr(z) = p0 F0(z)/F(z),
where only denominators need be estimated. Most applications of Benjamini and Hochberg’s control algorithm (2.5) make the same assumption (sometimes augmented with permutation calculations, which usually
produce only minor corrections to the theoretical null,
as discussed in Section 5). Use of the theoretical null is
mandatory in classic one-at-a-time testing, where theory provides the only information available for null behavior. But things change in large-scale simultaneous
testing situations: serious defects in the theoretical null
may become obvious, while empirical Bayes methods
can provide more realistic null distributions.
Figure 4 shows z-value histograms for two additional
microarray studies, described more fully in Efron
(2006). These are of the same form as the prostate
data: n subjects in two disease categories provide expression levels for N genes; two-sample t-statistics ti
comparing the categories are computed for each gene,
and then transformed to z-values zi = −1 (Fn−2 (ti )),
as in (3.2). Unlike panel A of Figure 1, however, neither histogram obeys the theoretical N(0, 1) null near
z = 0. The BRCA data has a much wider central peak,
while the HIV peak is too narrow. The lighter curves
in Figure 4 are empirical null estimates (Efron, 2004),
normal curves fit to the central peak of the z-value histograms. The idea here is simple enough: we make the
"zero assumption,"

ZERO ASSUMPTION (4.1). Most of the z-values near 0 come from null genes

(discussed further below), generalize the N(0, 1) theoretical null to N(δ0, σ0²), and estimate (δ0, σ0²) from the histogram counts near z = 0. Locfdr uses two different estimation methods, analytical and geometric, described next.

FIG. 4. z-values from two microarray studies. BRCA data (Hedenfalk et al., 2001), comparing seven breast cancer patients having BRCA1 mutation to eight with BRCA2 mutation, N = 3226 genes. HIV data (van't Wout et al., 2003), comparing four HIV+ males with four HIV− males, N = 7680 genes. Theoretical N(0, 1) null, heavy curve, is too narrow for BRCA data, too wide for HIV data. Light curves are empirical nulls: normal densities fit to the central histogram counts.
Figure 5 shows the geometric method in action on the HIV data. The heavy solid curve is log f(z), fit from (3.6) using Lindsey's method, as described in Efron and Tibshirani (1996). The two-groups model and the zero assumption suggest that if f0 is normal, f(z) should be well-approximated near z = 0 by p0 φδ0,σ0(z), with

(4.2) φδ0,σ0(z) ≡ (2πσ0²)^{−1/2} exp{−(1/2)[(z − δ0)/σ0]²},

making log f(z) approximately quadratic,

(4.3) log f(z) ≐ log p0 − (1/2)[δ0²/σ0² + log(2πσ0²)] + (δ0/σ0²)z − (1/(2σ0²))z².

The beaded curve shows the best quadratic approximation to log f(z) near 0. Matching its coefficients (β0, β1, β2) to (4.3) yields estimates (δ0, σ0, p0), for instance σ0 = (2β2)^{−1/2},

(4.4) δ0 = −0.107,    σ0 = 0.753,    p0 = 0.931,

for the HIV data. Trying the same method with the theoretical null, that is, taking (δ0, σ0) = (0, 1) in (4.3), gives a very poor fit, and p0 equals the impossible value 1.20.
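A compact version of the geometric calculation in R: fit a quadratic to the estimated log f(z) over a central window and invert (4.3). The function name and window choice are mine, the sign convention follows (4.3) where the z² coefficient is −1/(2σ0²), and this is an illustration, not locfdr's code.

    central.match <- function(xk, logf, window = 1) {
      ctr <- abs(xk) <= window
      b   <- coef(lm(logf[ctr] ~ xk[ctr] + I(xk[ctr]^2)))   # (beta0, beta1, beta2)
      sigma0 <- sqrt(-1 / (2 * unname(b[3])))    # beta2 = -1/(2 sigma0^2)
      delta0 <- unname(b[2]) * sigma0^2          # beta1 = delta0 / sigma0^2
      p0 <- exp(unname(b[1]) + 0.5 * (delta0^2 / sigma0^2 + log(2 * pi * sigma0^2)))
      c(delta0 = delta0, sigma0 = sigma0, p0 = p0)
    }
    ## e.g., central.match(xk, log(fhat)) with xk, fhat from the Poisson GLM sketch above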
The analytic method makes more explicit use of the
zero assumption, stipulating that the nonnull density
f1 (z) in the two-groups model (2.1) is supported outside some given interval [a, b] containing zero (actually chosen by preliminary calculations). Let N0 be the
number of zi in [a, b], and define
(4.5) P0(δ0, σ0) = Φ((b − δ0)/σ0) − Φ((a − δ0)/σ0)    and    θ = p0 P0.
FIG. 5. Geometric estimate of null proportion p0 and empirical null mean and standard deviation (δ0, σ0) for the HIV data. Heavy curve is log f(z), estimated as in (3.3)–(3.6); beaded curve is best quadratic approximation to log f(z) near z = 0.
Then the likelihood function for z0 , the vector of N0
z-values in [a, b], is
(4.6) fδ0,σ0,p0(z0) = [θ^{N0}(1 − θ)^{N−N0}] · ∏_{zi∈z0} [φδ0,σ0(zi)/P0(δ0, σ0)].

This is the product of two exponential family likelihoods, which is numerically easy to solve for the maximum likelihood estimates (δ0, σ0, p0), equaling (−0.120, 0.787, 0.956) for the HIV data.
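A sketch of the analytic method in R: maximize the likelihood (4.6) over (δ0, σ0, p0), treating the z-values in the central interval [a, b] as truncated normal. The function name, parameterization and starting values are my own choices, made only so the example runs; it is not locfdr's implementation.

    analytic.null <- function(z, a = -1, b = 1) {
      z0 <- z[z >= a & z <= b]
      N  <- length(z); N0 <- length(z0)
      nll <- function(par) {                      # par = (delta0, log sigma0, logit p0)
        d0 <- par[1]; s0 <- exp(par[2]); p0 <- plogis(par[3])
        P0 <- pnorm((b - d0) / s0) - pnorm((a - d0) / s0)   # (4.5)
        th <- p0 * P0
        -(N0 * log(th) + (N - N0) * log(1 - th) +
            sum(dnorm(z0, d0, s0, log = TRUE)) - N0 * log(P0))   # minus log (4.6)
      }
      est <- optim(c(0, 0, qlogis(0.9)), nll)$par
      c(delta0 = est[1], sigma0 = exp(est[2]), p0 = plogis(est[3]))
    }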
Both methods are implemented in locfdr. The analytic method is somewhat more stable but can be more
biased than geometric fitting. Efron (2004) shows that
geometric fitting gives nearly unbiased estimates of δ0
and σ0 for p0 ≥ 0.90. Table 2 shows how the two methods fared in the simulation study of Table 1.
A healthy literature has sprung up on the estimation
of p0, as in Pawitan et al. (2005) and Langaas et al.
(2005), all of which assumes the validity of the theoretical null. The zero assumption plays a central role in
this literature [which mostly works with two-sided pvalues rather than z-values, e.g., pi = 2(1 − F100 (|ti |))
in (3.2), making the “zero region” occur near p = 1].
The two-groups model is unidentifiable if f0 is unspecified in (2.1), since we can redefine f0 as f0 + cf1 , and
p1 as p1 − cp0 for any c ≤ p1 /p0 . With p1 small, (2.2),
and f1 supposed to yield zi ’s far from 0 for the most
part, the zero assumption is a reasonable way to impose
identifiability on the two-groups model. Section 6 considers the meaning of the null density more carefully,
among other things explaining the upward bias of p0 seen in Table 2.
The empirical null is an expensive luxury from the
point of view of estimation efficiency. Comparing the
right-hand side of Table 1 with the left reveals factors
of 2 or 3 increase in standard error relative to the theoretical null, near the crucial point where fdr(z) = 0.2.
Section 4 of Efron (2005) pins the increased variability entirely on the estimation of (δ0 , σ0 ); even knowing
the true values of p0 and f(z) would reduce the standard error of log fdr(z) by less than 1%. (Using tail-area Fdr's rather than local fdr's does not help—here the local version is less variable.)

TABLE 2
Comparison of estimates (δ0, σ0, p0), simulation study of Table 1. "Formula" is average from delta-method standard deviation formulas, Section 5 in Efron (2006), as implemented in locfdr

              Geometric                         Analytic
        mean    stdev   (formula)        mean    stdev   (formula)
δ0:     0.02    0.056   (0.062)          0.04    0.031   (0.032)
σ0:     1.02    0.029   (0.033)          1.04    0.031   (0.031)
p0:     0.92    0.013   (0.015)          0.93    0.009   (0.011)

TABLE 3
Number of genes identified as true discoveries by two-sided Benjamini–Hochberg procedure, 0.10 control level; empirical null densities as in Figure 4

               Theoretical null     Empirical null
BRCA data:           107                   0
HIV data:             22                 180
The reason for considering empirical nulls is that the
theoretical N(0, 1) null does not seem to fit the data in
situations like Figure 4. For the BRCA data we can
see that the histogram is overdispersed compared to
N(0, 1) around z = 0; the implication is that there will
be more null counts far from zero than the theoretical
null predicts, making N(0, 1) false discovery rate calculations like (3.10) too optimistic. The opposite happens with the HIV data.
There is a lot at stake here for both Bayesians and
frequentists. Table 3 shows the number of gene discoveries identified by the standard Benjamini–Hochberg
two-sided Fdr procedure, q = 0.10 in (2.5). The HIV
results are much more dramatic using the empirical
null f0(z) ∼ N(−0.11, 0.75²) and in fact we will see
in the next section that σ0 = 0.75 is quite believable
in this case. The BRCA data has been used in the microarray literature to compare analysis techniques, under the presumption that better techniques will produce
more discoveries; recently, for instance, in Storey et al.
(2005) and Pawitan et al. (2005). Table 3 suggests caution in this interpretation, where using the empirical
null negates any discoveries at all.
The z-values in panel C of Figure 1, proteomics
data, were calculated from standard Cox likelihood
tests that should yield N(0, 1) null results asymptotically. A N(−0.02, 1.29²) empirical null was obtained from the analytic method, resulting in only one peak with fdr < 0.2; using the theoretical null gave six such peaks.
In panel B of Figure 1, the z-values were obtained
from familiar binomial calculations, each zi being calculated as
(4.7) zi = (pad − pdis − Δ) [pad(1 − pad)/nad + pdis(1 − pdis)/ndis]^{−1/2},

where nad was the number of advantaged students in the high school, pad the proportion passing the test, and likewise ndis and pdis for the disadvantaged students; Δ = 0.192 was the overall difference, median(pad) − median(pdis). Here the empirical null standard deviation σ0 equals 1.52, half again bigger than the theoretical standard deviation we would use if we had only one school's data. An empirical null fdr analysis yielded 75 schools with fdr < 0.20, 30 on the left and 45 on the right. Example B is discussed a bit further in the next two sections, where its use in the two-groups model is questioned.
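For a single school, (4.7) is straightforward to compute; the pass rates and counts below are hypothetical, with Δ = 0.192 as given in the text.

    p.ad <- 0.75; n.ad <- 120      # advantaged: proportion passing, number tested (made up)
    p.dis <- 0.45; n.dis <- 80     # disadvantaged (made up)
    Delta <- 0.192
    z <- (p.ad - p.dis - Delta) /
         sqrt(p.ad * (1 - p.ad) / n.ad + p.dis * (1 - p.dis) / n.dis)   # (4.7)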
My point here is not that the empirical null is always
the correct choice. The opposite advice, always use the
theoretical null, has been inculcated by a century of
classic one-case-at-a-time testing to the point where
it is almost subliminal, but it exposes the statistician
to obvious criticism in situations like the BRCA and
HIV data. Large-scale simultaneous testing produces
mass information of a Bayesian nature that impinges
on individual decisions. The two-groups model helps
bring this information to bear, after one decides on the
proper choice of f0 in (2.1). Section 5 discusses this
choice, in the form of a list of reasons why the theoretical null, and its close friend the permutation null,
might go astray.
5. THEORETICAL, PERMUTATION AND
EMPIRICAL NULL DISTRIBUTIONS
Like most statisticians, I have spent my professional
life happily testing hypotheses against theoretical null
distributions. It came as somewhat of a shock then,
when pictures like Figure 4 suggested that the theoretical null might be more theoretical than I had supposed. Once suspicious, it becomes easy to think of
reasons why f0(z), the crucial element in the two-groups model (2.1), might not obey classical guidelines. This section presents four reasons why the theoretical null might fail, and also gives me a chance to
say something about the strengths and weaknesses of
permutation null distributions.
REASON 1 (Failed mathematical assumptions). The
usual derivation of the null hypothesis distribution
for a two-sample t-statistic assumes independent and
identically distributed (i.i.d.) normal components. For
the BRCA data of Figure 4, direct inspection of the
3226 by 15 matrix “X” of expression values reveals
markedly nonnormal components, skewed to the right
(even after the columns of X have been standardized
to mean 0 and standard deviation 1, as in all my examples here). Is this causing the failure of the N(0, 1)
theoretical null?
Permutation techniques offer quick relief from such
concerns. The columns of X are randomly permuted,
giving a matrix X ∗ with corresponding t-values ti∗ and
z-values zi* = Φ^{−1}(Fn−2(ti*)). This is done some large
number of times, perhaps 100, and the empirical distribution of the 100 · N zi∗ ’s used as a permutation null.
The well-known SAM algorithm (Tusher, Tibshirani
and Chu, 2001) effectively employs the permutation
null cdf in the numerator of the Fdr formula (2.3).
Applied to the BRCA matrix, the permutation null
came out nearly N(0, 1) (as did simply simulating the
entries of X ∗ by independent draws from all 3226 · 15
entries of X), so nonnormal distributions were not the
cause of BRCA’s overwide histogram. In practice the
permutation null usually approximates the theoretical
null closely, as a long history of research on the permutation t-test demonstrated; see Section 5.9 of Lehmann
and Romano (2005).
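A bare-bones version of the column-permutation null just described, run on a simulated expression matrix rather than the BRCA data; in practice one would use more permutations, and SAM or locfdr rather than this sketch.

    set.seed(2)
    N <- 500; n <- 15; grp <- rep(1:2, c(7, 8))
    X <- matrix(rnorm(N * n), N, n)                     # stand-in expression matrix
    tstat <- function(x, g) t.test(x[g == 1], x[g == 2], var.equal = TRUE)$statistic
    zstar <- replicate(20, {                            # 20 label permutations (use ~100)
      gs <- sample(grp)
      qnorm(pt(apply(X, 1, tstat, g = gs), df = n - 2)) # permutation z-values, as in (3.2)
    })
    c(mean(zstar), sd(zstar))                           # here close to (0, 1)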
REASON 2 (Unobserved covariates). The BRCA
study is observational rather than experimental—the
15 women were observed to be BRCA1 or BRCA2, not
assigned, and likewise with the HIV and prostate studies. There are likely to be covariates—age, race, general health—that affect the microarray expression levels differently for different genes. If these were known
to us, they could be factored out using a separate linear
model on each gene’s data, providing a new and improved zi obtained from the “Treatment” coefficient in
the model. This would reduce the spread of the z-value
histogram, perhaps even restoring the N(0, 1) theoretical null for the BRCA data.
Unobserved covariates act to broaden the null distribution f0 (z). They also broaden the nonnull distribution f1 (z) in (2.1), and the mixture density f (z), but
this does not correct fdr estimates like (3.10), where the
numerator, which depends entirely on f0 , is the only
estimated quantity. Section 4 of Efron (2004) provides
an analysis of a simplified model with unobserved covariates. Permutation techniques cannot recognize unobserved covariates, as the model demonstrates.
REASON 3 (Correlation across arrays). False discovery rate methodology does not require independence among the test statistics zi. However, the theoretical null distribution does require independence of the
expression values used to calculate each zi ; in terms of
the elements xij of the expression matrix X, for gene i
we need independence among xi1 , xi2 , . . . , xin in order
to validate (1.1).
Experimental difficulties can undercut across-microarray independence, while remaining undetectable in a permutation analysis. This happened in both
studies of Figure 4 (Efron, 2004, 2006). The BRCA
data showed strong positive correlations among the
first four BRCA2 arrays, and also among the last
four. This reduces the effective degrees of freedom for
each t-statistic below the nominal 13, making ti and
zi = Φ^{−1}(F13(ti)) overdispersed.
REASON 4 (Correlation across genes). Benjamini
and Hochberg’s 1995 paper verified Fdr control for rule
(2.5) under the assumption of independence among the
N z-values (relaxed a little in Benjamini and Yekutieli, 2001). This seems fatal for microarray applications since we expect genes to be correlated in their actions. A great virtue of the empirical Bayes/two-groups
approach is that independence is not necessary; with
Fdr(z)
= p0 F0 (z)/F(z), for instance, Fdr(z)
can provide a reasonable estimate of Pr{null|Z ≤ z} as long
as F(z) is roughly unbiased for F (z)—in formal terms
requiring consistency but not independence—and like
wise for the local version fdr(z)
= p0 f0 (z)/f(z), (3.7).
There is, however, a black cloud inside the silver lining: the assumption that the null density f0 (z) is known
to the statistician. The empirical null estimation methods of Section 4 do not require z-value independence,
and so disperse the black cloud, at the expense of increased variability in fdr estimates. Do we really need
to use an empirical null? Efron (2007) discusses the
following somewhat disconcerting result: even if the
theoretical null distribution zi ∼ N(0, 1) holds exactly
true for all null genes, Reasons 1–3 above not causing
trouble, correlation among the zi ’s can make the overall null distribution effectively much wider or much
narrower than N(0, 1).
Microarray data sets tend to have substantial z-value
correlations. Consider the BRCA data: there are more
than five million correlations ρij between pairs of gene
z-values zi and zj ; by examining the row-wise correlations in the X matrix we can estimate that the distribution of the ρij ’s has approximately mean 0 and variance
α² = 0.153²,

(5.1) ρ ∼ (0, α²).
(The zero mean is a consequence of standardizing the
columns of X.) This is a lot of correlation—as much as
if the BRCA genes occurred in 10 independent groups,
but with common interclass correlation 0.50 for all
genes within a group.
Section 3 of Efron (2006) shows that under assumptions (1.1)–(5.1), the ensemble of null-gene z-values
will behave roughly as
(5.2) zi ∼ N(0, σ0²)

with

(5.3) σ0² = 1 + √2 A,    A ∼ (0, α²).
If the variable A equaled α = 0.153, for instance, giving σ0 = 1.10, then the expected number of null counts below z = −3 would be about p0 N Φ(−3/1.10) rather than p0 N Φ(−3), more than twice as many. There is even more correlation in the HIV data, α ≐ 0.42, enough so that a moderately negative value of A could cause σ0 = 0.75, as in Figure 4.
The random variable A acts like an observable ancillary in the two-groups situation—observable because
we can estimate σ0 from the central counts of the z-value histogram, as in Section 4; σ0 is essentially the half-width of the central peak.

Figure 6 is a cautionary story on the dangers of ignoring σ0. A simulation model with

(5.4) zi ∼ N(0, 1), i = 1, 2, . . . , 2700,    and    zi ∼ N(2.5, 1.5), i = 2701, . . . , 3000,
was run, in which the null zi ’s, the first 2700, were
correlated to the same degree as in the BRCA data,
α = 0.153. For each of 1000 simulations of (5.4), a
standard Benjamini–Hochberg Fdr analysis (2.5) (i.e.,
using the theoretical null for F0 ) was run at control
level q = 0.10, and used to identify a set of nonnull
genes.
Each of the thousand points in Figure 6 is (σ0, Fdp),
where σ0 is half the distance between the 16th and 86th
percentiles of the 3000 zi ’s, and Fdp is the “False discovery proportion,” the proportion of identified genes
that were actually null. Fdp averaged 0.091, close to the
target value q = 0.10, but with a strong dependence on
σ0 : the lowest 5% of σ0 ’s corresponded to Fdp’s averaging only 0.03, while the upper 5% average was 0.29,
a factor of 9 difference.
The point here is not that the claimed q-value 0.10
is wrong, but that in any one simulation we may be
able to see, from σ0 , that it is probably misleading. Using the empirical null counteracts this fallacy which,
again, is not apparent from the permutation null. (Section 4 of Efron, 2007, discusses more elaborate permutation methods that do bear on Figure 6. See Qiu et al.,
2005, for a gloomier assessment of correlation effects
in microarray analyses.)
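The simulation behind Figure 6 is easy to imitate. The R sketch below does a single replication: 2700 correlated null z-values (10 equicorrelated blocks, roughly matching α = 0.153), 300 nonnull values as in (5.4) (reading the 1.5 there as a standard deviation, my assumption), the Benjamini–Hochberg rule at q = 0.1 with the theoretical null, and the realized Fdp and half-width estimate.

    set.seed(3)
    rho <- 0.5; nblock <- 10; m <- 270
    znull <- unlist(lapply(1:nblock, function(b) {
      shared <- rnorm(1)                              # block-level factor
      sqrt(rho) * shared + sqrt(1 - rho) * rnorm(m)   # equicorrelated N(0,1) values
    }))
    z <- c(znull, rnorm(300, mean = 2.5, sd = 1.5))
    isnull <- rep(c(TRUE, FALSE), c(2700, 300))
    q    <- 0.1
    Fbar <- rank(-z) / length(z)                      # upper-tail empirical cdf
    Fdr  <- (1 - pnorm(z)) / Fbar                     # p0 = 1, theoretical null
    hits <- z >= min(z[Fdr <= q], Inf)                # BH declarations, upper tail
    Fdp  <- mean(isnull[hits])                        # realized false discovery proportion
    sigma0 <- diff(quantile(z, c(0.16, 0.86))) / 2    # half-width estimate, as in the text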
What is causing the overdispersion in the Education data of panel B, (4.7)? Correlation across schools, Reason 4, seems ruled out by the nature of the sampling, leaving Reasons 2 and 3 as likely candidates; unobserved covariates are an obvious threat here, while within-school sampling dependences (Reason 3) are certainly possible. Fdr analysis yields eight times as many "significant" schools based on the theoretical null rather than f0 ∼ N(−0.35, 1.51²), but looks completely untrustworthy to me.

FIG. 6. Benjamini–Hochberg Fdr control procedure (2.5), q = 0.1, run for 1000 simulations of correlated model (5.4); true false discovery proportion Fdp plotted versus half-width estimate σ0. Overall Fdp averaged 0.091, close to q, but with a strong dependence on σ0.
Sometimes the theoretical null distribution is fine, of
course. The prostate data had (δ0, σ0) = (0.00, 1.06)
according to the analytic method of (4.6), close enough
to (0, 1) to make theoretical null calculations believable. However, there are lots of things that can go
wrong with the theoretical null, and lots of data to
check it with in large-scale testing situations, making it
a matter of due diligence for the statistician to do such
checking, even if only by visual inspection of the zvalue histogram. All simultaneous testing procedures,
not just false discovery rates, go wrong if the null distribution is misrepresented.
6. A ONE-GROUP MODEL
Classical one-at-a-time hypothesis testing depends
on having a unique null density f0 (z), such as Student’s t distribution for the normal two-sample situation. The assumption of unique f0 has been carried
over into most of the microarray testing literature, including our definition (2.1) of the two-groups model.
Realistic examples of large-scale inference are apt
to be less clearcut, with true effect sizes ranging continuously from zero or near zero to very large. Here
we consider a “one-group” structural model that allows for a range of effects. We can still usefully apply
fdr methods to data from one-group models; doing so
helps clarify the choice between theoretical and empirical null hypotheses, and explicates the biases inherent
in model (2.1). The discussion in this section, as in Section 2, will be mostly theoretical, involving probability
models rather than collections of observed z-values.
Model (2.1) does not require knowing how the
z-values were generated, a substantial practical advantage of the two-groups formulation. In contrast, onegroup analysis begins with a specific Bayesian structural model. We assume that the ith case has an unobserved true value μi distributed according to some
density g(μ), and that the observed zi is normally distributed around μi ,
(6.1) μ ∼ g(·)    and    z|μ ∼ N(μ, 1).
The density g(μ) is allowed to have discrete atoms. It
might have an atom at zero but this is not required, and
in any case there is no a priori partition of g(μ) into
null and nonnull components.
As an example, suppose g(μ) is a mixture of 90%
N(0, 0.52 ) and 10% N(2.5, 0.52 ),
(6.2) g(μ) = 0.9 · ϕ0,0.5 (μ) + 0.1 · ϕ2.5,0.5 (μ)
in notation (4.2). The histogram in Figure 7 shows N =
3000 draws of μi from (6.2). I am thinking of this as
a situation having a large proportion of uninteresting
cases centered near, but not exactly at, zero, and a small
proportion of interesting cases centered far to the right.
We still want to use the observed zi ’s from (6.2) to flag
cases that are likely to be interesting.
The density of z in model (6.1) is
(6.3) f(z) = ∫_{−∞}^{∞} φ(μ − z) g(μ) dμ,    φ(x) = exp(−x²/2)/√(2π),
shown as the smooth curve in the left-hand panel,
(6.4) f (z) = 0.9 · ϕ0,1.12 (z) + 0.1 · ϕ2.5,1.12 (z).
The effect of noise in going from μi to zi ∼ N(μi , 1)
has blurred the strongly bimodal μ-histogram into a
smoothly unimodal f (z).
We can still employ the tactic of Figure 5, fitting a
quadratic curve to log f (z) around z = 0 to estimate
p0 and the empirical null density f0 (z). Using the formulas described later in this section gives
(6.5) p0 = 0.93    and    f0(z) ∼ N(0.02, 1.14²),
and corresponding fdr curve p0 f0 (z)/f (z), labeled
“Emp null” in the right-hand panel of Figure 7.
Looking at the histogram, it is reasonable to consider
“interesting” those cases with μi ≥ 1.5, and “uninteresting” μi < 1.5. The curve labeled “Bayes” in Figure 7 is the posterior probability Pr{uninteresting|z}
based on full knowledge of (6.1), (6.2). The empirical null fdr curve provides an excellent estimate of the
full Bayes result, without the prior knowledge. [An fdr
based on the theoretical N(0, 1) null is seen to be far
off.]
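The right-hand panel of Figure 7 can be reproduced from the formulas alone. The R sketch below computes, at a given z, the theoretical-null fdr, the empirical-null fdr using the values quoted in (6.5), and the exact Bayes probability Pr{μ < 1.5 | z} under (6.1)–(6.2); within each component of (6.2) the posterior of μ given z is normal with mean shrunk by the factor 0.25/1.25 = 0.2 and variance 0.2. The function names are mine.

    f <- function(z) 0.9 * dnorm(z, 0, 1.12) + 0.1 * dnorm(z, 2.5, 1.12)   # (6.4)
    fdr.theo <- function(z) 0.9 * dnorm(z) / f(z)                # theoretical N(0,1) null
    fdr.emp  <- function(z) 0.93 * dnorm(z, 0.02, 1.14) / f(z)   # empirical null, (6.5)
    bayes.uninteresting <- function(z) {                         # Pr{mu < 1.5 | z}
      w  <- 0.9 * dnorm(z, 0, 1.12) / f(z)      # posterior weight of the N(0, 0.5^2) piece
      m0 <- 0.2 * z; m1 <- 2.5 + 0.2 * (z - 2.5); s <- sqrt(0.2)
      w * pnorm(1.5, m0, s) + (1 - w) * pnorm(1.5, m1, s)
    }
    z <- 3
    c(theo = fdr.theo(z), emp = fdr.emp(z), bayes = bayes.uninteresting(z))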
Unobserved covariates, Reason 2 in Section 4, can
easily produce blurry null hypotheses like that in (6.2).
My point here is that the two-group model will handle
blurry situations if the null hypothesis is empirically
estimated. Or, to put things negatively, theoretical or
permutation null methods are prone to error in such
situations, no matter what kind of analysis technique
is used.
Comparing (6.5) with (6.4) shows that f0 (z) is just
about right, but p0 is substantially larger than the value
0.90 we might expect.

FIG. 7. Left panel: Histogram shows N = 3000 draws of μi from model (6.2); smooth curve is corresponding density f(z), (6.3). Right panel: "Emp null" is fdr(z) based on empirical null; it closely matches full Bayes posterior probability "Bayes" = Pr{μi < 1.5|z} from (6.1)–(6.2); "Theo null" is fdr(z) based on theoretical null, a poor match to Bayes.

The φ2.5,0.5 component of g(μ) puts some of its z-values near zero, weakening the zero
assumption (4.1) and biasing p0 upward. The same
thing happened in Table 2 even though model (3.11)
is “unblurred,” g(μ) having a point mass at μ = 0.
Fortunately, p0 is the least important part of the two-groups model for estimating fdr(z), under assumption
(2.2). “Bias” can be a misleading term in model (6.1)
since it presupposes that each μi is clearly defined as
null or nonnull. This seems clear enough in (3.11). The
null/nonnull distinction is less clear in (6.2), though it
still makes sense to search for cases that have μi unusually far from 0.
The results in (6.5) come from a theoretical analysis of model (6.1). The idea in what follows is to generalize the construction in Figure 5 by approximating l(z) = log f(z) with Taylor series other than quadratic. The Jth Taylor approximation to l(z) is

(6.6) l_J(z) = Σ_{j=0}^{J} l^(j)(0) z^j/j!,

where l^(0)(0) = log f(0) and for j ≥ 1

(6.7) l^(j)(0) = [d^j log f(z)/dz^j]_{z=0}.

Let f0+(z) indicate the subdensity p0 f0(z), the numerator of fdr(z) in (2.7). The choice

(6.8) f0+(z) = e^{l_J(z)}

matches f(z) at z = 0 (a convenient form of the zero assumption) and leads to an fdr expression

(6.9) fdr(z) = e^{l_J(z)}/f(z).

Larger choices of J match f0+(z) more accurately to f(z), increasing ratio (6.9); the interesting z-values, those with small fdr's, are pushed farther away from zero as we allow more of the data structure to be explained by the null density.

Bayesian model (6.1) provides a helpful interpretation of the derivatives l^(j)(0):

LEMMA. The derivative l^(j)(0), (6.7), is the jth cumulant of the posterior distribution of μ given z = 0, except that l^(2)(0) is the second cumulant minus 1.

Thus

(6.10) l^(1)(0) = E0    and    −l^(2)(0) = 1 − V0 ≡ V̄0,

where E0 and V0 are the posterior mean and variance of μ given z = 0.
TABLE 4
Expressions for p0, f0 and fdr, first three choices of J in (6.8), (6.9); V̄0 = 1 − V0; J = 0 gives theoretical null, J = 2 empirical null; f(z) from (6.3)

 J     p0                           f0(z)                  fdr(z)
 0     f(0)√(2π)                    N(0, 1)                f(0) e^{−z²/2} / f(z)
 1     f(0)√(2π) e^{E0²/2}          N(E0, 1)               f(0) e^{E0 z − z²/2} / f(z)
 2     f(0)√(2π/V̄0) e^{E0²/2V̄0}     N(E0/V̄0, 1/V̄0)        f(0) e^{E0 z − V̄0 z²/2} / f(z)
Proof of the lemma appears in Section 7 of Efron
(2005).
For J = 0, 1, 2, formulas (6.8), (6.9) yield simple expressions for p0 and f0 (z) in terms of f (0), E0 and
V̄0 . These are summarized in Table 4, with p0 obtained
from
(6.11) p0 = ∫_{−∞}^{∞} f0+(z) dz.
Formulas are also available for Fdr(z), (2.8).
The choices J = 0, 1, 2 in Table 4 result in a normal
null density f0 (z), the only difference being the means
and variances. Going to J = 3 allows for an asymmetric choice of f0 (z),
(6.12) fdr(z) = f(0) e^{E0 z − V̄0 z²/2 + S0 z³/6} / f(z),
where S0 is the posterior third central moment of μ
given z = 0 in model (6.1). The program locfdr uses a
variant, the “split normal,” to model asymmetric null
densities, with the exponent of (6.12) replaced by a
quadratic spline in z.
The lemma bears on the difference between empirical and theoretical nulls. Suppose that the probability
mass of g(μ) occurring within a few units of the origin
is concentrated in an atom at μ = 0. Then the posterior
mean and variance (E0 , V0 ) of μ given z = 0 will be
near 0, making (E0 , V̄0 )=(0,
˙
1). In this case the empirical null (J = 2) will approximate the theoretical null
(J = 0). Otherwise the two nulls differ; in particular,
any mass of g(μ) near zero increases V0 , swelling the
standard deviation (1 − V0 )−1/2 of the empirical null.
The two-groups model (2.1), (2.2) puts one in a
hypothesis-testing frame of mind: a large group of uninteresting cases is to be statistically separated from
a small interesting group. Even blurry situations like
(6.2) exhibit a clear grouping, as in Figure 7. None
of this is necessary for the one-group model (6.1). We
might, for example, suppose that g(μ) is normal,
(6.13)
μ ∼ N(A, B 2 ),
and proceed in an empirical Bayes way to estimate A
and B and then apply Bayes estimation to the individual cases.
This line of thought leads directly to James–Stein estimation (Efron and Morris, 1975). Estimation, as opposed to testing, is the key word here—with possible
effect sizes μi varying continuously rather than having
a large clump of values near zero. The Education data
of panel B, Figure 1, could reasonably be analyzed this
way, instead of through simultaneous testing. Scientific
context, which says that there is likely to be a large
group of (nearly) unaffected genes, as in (2.2), is what
makes the two-groups model a reasonable Bayes prior
for microarray studies.
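Under (6.13) the empirical Bayes calculation is the familiar normal-normal shrinkage; a minimal sketch of the James-Stein idea referred to above, with moment estimates of A and B² and simulated data (all choices below are for illustration only).

    set.seed(4)
    mu <- rnorm(1000, 0.5, 1)                 # mu_i ~ N(A, B^2) with A = 0.5, B = 1
    z  <- rnorm(1000, mean = mu, sd = 1)      # z_i | mu_i ~ N(mu_i, 1), as in (6.1)
    Ahat <- mean(z)
    B2   <- max(var(z) - 1, 0)                # marginal variance of z is B^2 + 1
    mu.hat <- Ahat + (B2 / (B2 + 1)) * (z - Ahat)   # estimated posterior means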
7. BAYESIAN AND FREQUENTIST CONFIDENCE
STATEMENTS
False discovery rate methods provide a happy marriage between Bayesian and frequentist approaches
to multiple testing, as shown in Section 2. Empirical Bayes techniques based on the two-groups model
seem to give us the best of both statistical philosophies.
Things do not always work out so peaceably; in these
next two sections I want to discuss contentious situations where the divorce court looms as a possibility.
An insightful and ingenious paper by Benjamini and
Yekutieli (2005) discusses the following problem in simultaneous significance testing: having applied false
discovery rate methods to select a set of nonnull cases,
how can confidence intervals be assigned to the true
effect size for each selected case? (The paper and the
ensuing discussion are much more general, but this is
all I need for the illustration here.)
Figure 8 concerns Benjamini and Yekutieli’s solution applied to the following simulated data set: N =
10,000 (μi , zi ) pairs were generated as in (6.1), with
90% of the μi zero, the null cases, and 10% distributed
N(−3, 1),
(7.1)
g(μ) = 0.90 · δ0 (μ) + 0.10 · ϕ−3,1 (μ),
δ0 (μ) a delta function at μ = 0. The Fdr procedure
(2.5) was applied with q0 = 0.05, yielding 566 nonnull
“discoveries,” those having zi ≤ −2.77.
The Benjamini–Yekutieli “false coverage rate”
(FCR) control procedure provides upper and lower
bounds for the true effect size μi corresponding to each
zi less than −2.77; these are indicated by heavy diagonal lines in Figure 8, constructed as described in
BY's Definition 1. This construction guarantees that the expected proportion of the 566 intervals not containing the true μi, the false coverage rate, is bounded by q = 0.05.

FIG. 8. Benjamini–Yekutieli FCR controlling intervals applied to simulated sample of 10,000 cases from (6.1), (7.1). 566 cases have zi ≤ z0 = −2.77, the Fdr (0.05) threshold. Plotted points are (zi, μi) for the 1000 nonnull cases; 14 null cases with zi ≤ z0 indicated by "+." Heavy diagonal lines indicate FCR 95% interval limits; light lines are Bayes 95% posterior intervals given μi ≠ 0. Beaded curve at top is fdr(zi), posterior probability μi = 0.
In a real application only the zi ’s and their BY confidence intervals could be seen, but in a simulation we
can plot the actual (zi , μi ) pairs, and compare them
to the intervals. Figure 8 plots (zi , μi ) for the 1000
nonnull cases, those from μi ∼ N(−3, 1) in (7.1). Of
these, 552 plotted as heavy points, lie to the left of
z0 = −2.77, the Fdr threshold, with the other 448 plotted as light points; 14 null cases, μi = 0, plotted as
“+,” also had zi < z0 .
The first thing to notice is that the FCR property is
satisfied: only 17 of the 566 intervals have failed to
contain μi (14 of these the +'s), giving 3% noncoverage. The second thing, though, is that the intervals are frighteningly wide—zi ± 2.77, about √2 longer than the usual individual 95% intervals zi ± 1.96—and
poorly centered, particularly at the left where all the
μi ’s fall in their intervals’ upper halves.
An interesting comparison is with Bayes’ rule applied to (6.1), (7.1), which yields
(7.2) Pr{μi = 0 | zi} = fdr(zi),
where
(7.3) fdr(z) = 0.9 · ϕ0,1(z) / [0.9 · ϕ0,1(z) + 0.1 · ϕ−3,√2(z)]
as in (2.7), and
(7.4) g(μi | μi ≠ 0, zi) ∼ N((zi − 3)/2, 1/2).
That is, μi is null with probability fdr(zi ), and N((zi −
3)/2, 1/2) with probability 1 − fdr(zi ). The dashed
lines indicate the posterior 95% intervals given that μi is nonnull, (zi − 3)/2 ± 1.96/√2, now √2 shorter than the usual individual intervals; at the top of Figure 8 the beaded curve shows fdr(zi).
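Formulas (7.2)–(7.4) are simple enough to evaluate directly; the sketch below does so. The printed numbers are exact evaluations under (7.1) and so may differ slightly from the estimated quantities quoted in the text.

```python
import numpy as np
from scipy.stats import norm

def local_fdr(z):
    # (7.3): Pr{mu_i = 0 | z_i} under the two-groups model (7.1)
    f0 = 0.9 * norm.pdf(z, loc=0.0, scale=1.0)
    f1 = 0.1 * norm.pdf(z, loc=-3.0, scale=np.sqrt(2.0))
    return f0 / (f0 + f1)

def nonnull_interval(z, level=0.95):
    # (7.4): mu_i | mu_i != 0, z_i  ~  N((z_i - 3)/2, 1/2)
    half = norm.ppf(0.5 + level / 2.0) / np.sqrt(2.0)
    center = (z - 3.0) / 2.0
    return center - half, center + half

z0 = -2.77
print("fdr(z0):", round(local_fdr(z0), 2))
print("95% interval given nonnull:", tuple(round(v, 2) for v in nonnull_interval(z0)))
```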
The frequentist FCR intervals and the Bayes intervals are pursuing the same goal, to include the nonnull scores μi with 95% probability. At zi = −2.77 the
FCR assessment is Pr{μ ∈ [−5.54, 0]} = 0.95; Bayes’
rule states that μi = 0 with probability fdr(−2.77) =
0.25, and if μi ≠ 0, then μi ∈ [−4.27, −1.49] with
probability 0.95. This kind of disconnected description is natural to the two-groups model. A principal
cause of FCR’s oversized intervals (the paper shows that no FCR-controlling intervals can be much narrower) comes from using a single connected set to describe a disconnected situation.

FIG. 9. Computing a p-value for z̄S = 0.842, the average of the 15 z-values in the CTL pathway, p53 data. Solid histogram: 500 row randomizations give p-value 0.002. Line histogram: 500 column permutations give p-value 0.048.
Of course Bayes’ rule will not be easily available
to us in most practical problems. Is there an empirical Bayes solution? Part of the solution certainly is
there: estimating fdr(z) as in Section 3. Estimating
g(μi | μi ≠ 0, zi), (7.4), is more challenging. A straightforward approach uses the nonnull counts (3.8) to estimate the nonnull density f1(z) in (2.1), deconvolutes
f1 (z) to estimate the nonnull component “g1 (μ)” in
(7.1), and applies Bayes’ rule directly to g1 . This works
reasonably well in Figure 8’s example, but deconvolution calculations are notoriously tricky and I have not
been able to produce a stable general algorithm.
Good frequentist methods like the FCR procedure
enjoy the considerable charm of an exact error bound,
without requiring a priori specifications, and of course
there is no law that they have to agree with any particular Bayesian analysis. In large-scale situations, however, empirical Bayes information can overwhelm both
frequentist and Bayesian predilections, hopefully leading to a more satisfactory compromise between the two
sets of intervals appearing in Figure 8.
8. IS A SET OF GENES ENRICHED?
Microarray experiments, through a combination of
insufficient data per gene and massively multiple simultaneous inference, often yield disappointing results. In search of greater detection power, enrichment
analysis considers the combined outcomes of biologically defined sets of genes, such as pathways. As a
hypothetical example, if the 20 z-values in a certain
pathway all were positive, we might infer significance
to the pathway’s effect, whether or not any of the individual zi ’s were deemed nonnull.
Our example here will involve the p53 data, from
Subramanian et al. (2005), N = 10,100 genes on n =
50 microarrays, zi ’s as in (3.2), whose z-value histogram looks like a slightly short-tailed normal distribution having mean 0.04 and standard deviation 1.06.
Fdr analysis (2.5), q = 0.1, yielded just one nonnull
gene, while enrichment analysis indicated seven or
eight significant gene sets, as discussed at length in
Efron and Tibshirani (2006).
Figure 9 concerns the CTL pathway, a set of 15 genes
relating to the development of so-called killer T cells,
#95 in a catalogue of 522 gene-sets provided by Subramanian et al. (2005). For a given gene-set “S” with
m members, let z̄S denote the mean of the m z-values
within S; z̄S is the enrichment statistic suggested in the
Bioconductor R package limma (Smyth, 2004),
(8.1) z̄S = 0.842
for the CTL pathway. How significant is this result? I
will consider assigning an individual p-value to (8.1),
not taking into account multiple inference for a catalogue of possible gene-sets (which we could correct
for later using Fdr methods, for instance, to combine
the individual p-values).
Limma computes p-values by “row randomization,”
that is, by randomizing the order of rows of the N × n
expression matrix X, and recomputing the statistic of
interest. For a simple average like (8.1) this amounts
to choosing random subsets of size m = 15 from the
N = 10,100 zi’s and comparing z̄S to the distribution of the randomized values z̄S∗. Five hundred row randomizations
produced only one z̄S∗ > z̄S , giving p-value 1/500 =
0.002.
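In code, the row-randomization calculation is just repeated sampling of random gene sets of the same size; the sketch below uses simulated stand-ins for the p53 z-values and the CTL membership list, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

def row_randomization_pvalue(z, set_index, n_rand=500, rng=rng):
    """Compare the observed set average to averages of random gene sets of the same size."""
    m = len(set_index)
    observed = z[set_index].mean()
    draws = np.array([rng.choice(z, size=m, replace=False).mean()
                      for _ in range(n_rand)])
    return np.mean(draws >= observed)              # one-sided, as in the text

z = rng.standard_normal(10_100)                    # stand-in for the 10,100 z-values
ctl = rng.choice(len(z), size=15, replace=False)   # stand-in for the CTL membership list
print("row-randomization p-value:", row_randomization_pvalue(z, ctl))
```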
Subramanian et al. calculate p-values by permuting
the columns of X rather than the rows. The permutations yield a much wider distribution than the row randomizations in Figure 9, with corresponding p-value
0.048. The reason is simple: the genes in the CTL pathway have highly correlated expression levels that increase the variance of z̄S∗ ; column-wise permutations
of X preserve the correlations across genes, while row
randomizations destroy them.
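A column-permutation version has to go back to the expression matrix X, permute the patient labels, and recompute the gene scores before averaging; in the sketch below the score is a plain two-sample statistic standing in for the paper's (3.2), and the data are again simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

def gene_scores(X, labels):
    """Two-sample mean difference over a pooled standard error, one score per gene."""
    a, b = X[:, labels == 0], X[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                 b.var(axis=1, ddof=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

def column_permutation_pvalue(X, labels, set_index, n_perm=500, rng=rng):
    observed = gene_scores(X, labels)[set_index].mean()
    draws = np.array([gene_scores(X, rng.permutation(labels))[set_index].mean()
                      for _ in range(n_perm)])
    return np.mean(draws >= observed)

# Toy data: 200 genes x 50 arrays, two classes of 25, one 15-gene "pathway."
X = rng.standard_normal((200, 50))
labels = np.repeat([0, 1], 25)
print("column-permutation p-value:", column_permutation_pvalue(X, labels, np.arange(15)))
```

Because each gene's row is left intact, the correlations among the genes of S carry over to the permuted z̄S∗ values, which is exactly the point made above.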
At this point it looks like column permutations
should always give the right answer. Wrong! For the
BRCA data in Figure 4, the ensemble of z-values
has (mean, standard deviation) about (0, 1.50), compared to (0, 1) for zi∗ ’s from column permutations. This
shrinks the permutation variability of z̄S∗ , compared to
what one would get from a random selection of genes
for S, and can easily reverse the relationship in Figure 9.
The trouble here is that there are two obvious, but
different, null hypotheses for testing enrichment:
Randomization null hypothesis: S has been chosen
by random selection of m genes from the full set of N
genes.
Permutation null hypothesis: The order of the n microarrays has been chosen at random with respect to
the patient characteristics (e.g., with the patient being
in the normal or cancer category in Example A of the
Introduction).
Efron and Tibshirani (2006) suggest a compromise
method, restandardization, that to some degree accommodates both null hypotheses. Instead of permuting
z̄S in (8.1), restandardization permutes (z̄S − μz )/σz ,
where (μz , σz ) are the mean and standard deviation of
all N zi ’s. Subramanian et al. do something similar using a Kolmogorov–Smirnov enrichment statistic.
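One literal reading of this description is sketched below: the observed enrichment score is standardized by the mean and standard deviation of all N observed z-values, and each permuted score by the corresponding permuted quantities. Efron and Tibshirani (2006) give the precise recipe, so this should be taken only as the general idea, with toy stand-in data.

```python
import numpy as np

rng = np.random.default_rng(4)

def restandardized_pvalue(z_obs, z_perms, set_index):
    """Compare (zbar_S - mu_z)/sigma_z for the observed and the permuted z-vectors."""
    t_obs = (z_obs[set_index].mean() - z_obs.mean()) / z_obs.std(ddof=1)
    t_perm = np.array([(zp[set_index].mean() - zp.mean()) / zp.std(ddof=1)
                       for zp in z_perms])
    return np.mean(t_perm >= t_obs)

# Toy stand-ins: an overdispersed observed z-vector and "column-permutation" z-vectors.
z_obs = 1.5 * rng.standard_normal(1000)
z_perms = [rng.standard_normal(1000) for _ in range(500)]
S = rng.choice(1000, size=15, replace=False)
print("restandardized p-value:", restandardized_pvalue(z_obs, z_perms, S))
```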
All of these methods are purely frequentistic. Theoretically we might consider applying the two-groups/
empirical Bayes approach to sets of z-values “zS ,” just
as we did for individual zi ’s in Sections 2 and 3. For at
least three reasons that turns out to be extremely difficult:
• My technique for estimating the mixture density f ,
as in (3.6), becomes exponentially more difficult in
higher dimensions.
• There is not likely to be a satisfactory theoretical null f0 for the correlated components of zS, while estimating an empirical null faces the same “curse of
dimensionality” as for f .
• As discussed following (3.10), false discovery rate
interpretation depends on exchangeability, essentially an equal a priori interest in all N genes. There
may be just one gene-set S of interest to an investigator, or a catalogue of several hundred S’s as in
Subramanian et al., but we certainly are not interested in all possible gene-sets. It would be a daunting exercise in subjective, as opposed to empirical,
Bayesianism to assign prior probabilities to any particular gene-set S.
Having said this, it turns out there is one “gene-set”
situation where the two-groups/empirical Bayes approach is practical (though it does not involve genes).
Looking at panel D of Figure 1, the Imaging data, the
obvious spatial correlation among z-values suggests local averaging to reduce the effects of noise.
This has been carried out in Figure 10: at voxel
i of the N = 15,445 voxels, the average of z-values
for those voxels within city-block distance 2 has been
computed, say “z̄i .” The results for the same horizontal
slice as in panel D are shown using a similar symbol
code. Now that we have a single number z̄i for each voxel, we can compute the empirical null fdr estimates as in Section 4. The voxels labeled “enriched” in Figure 10 are those having fdr(z̄i) ≤ 0.2.
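The local averaging step is easy to sketch; the code below builds the city-block (L1) neighborhood of radius 2 on a toy voxel grid and averages z over it. The brain mask, the actual z-values and the empirical-null fdr estimation of Section 4 are not reproduced.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(5)

# Toy voxel grid of z-values; the real data have N = 15,445 voxels in a brain mask.
z = rng.standard_normal((20, 20, 20))

# Averaging kernel over the city-block ball {|di| + |dj| + |dk| <= 2}.
r = 2
offsets = np.indices((2 * r + 1,) * 3) - r
kernel = (np.abs(offsets).sum(axis=0) <= r).astype(float)
kernel /= kernel.sum()

zbar = convolve(z, kernel, mode="nearest")   # z-bar_i for every voxel
# Voxels with estimated fdr(z-bar_i) <= 0.2 would then be labeled "enriched."
```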
Enrichment analysis looks much more familiar in
this example, being no more than local spatial smoothing. The convenient geometry of three-dimensional
space has come to our rescue, which it emphatically
fails to do in the microarray context.
9. CONCLUSION
Three forces influence the state of statistical science at any one time: mathematics, computation and
applications, by which I mean the type of problems
subject-area scientists bring to us for solution. The
Fisher–Neyman–Pearson theory of hypothesis testing
was fashioned for a scientific world where experimentation was slow and difficult, producing small data sets
focused on answering single questions. It was wonderfully successful within this milieu, combining elegant mathematics and limited computational equipment to produce dependable answers in a wide variety of application areas.

FIG. 10. Enrichment analysis of the Imaging data, panel D of Figure 1; z-values for the original 15,445 voxels have been averaged over “gene-sets” of neighboring voxels within city-block distance ≤ 2. Coded “−” for z̄i < 0 and “+” for z̄i ≥ 0; solid rectangles, labeled “Enriched,” show voxels with fdr(z̄i) ≤ 0.2, using the empirical null.
The three forces have changed relative intensities recently. Computation has become literally millions of
times faster and more powerful, while scientific applications now spout data in fire-hose quantities. (Mathematics, of course, is still mathematics.) Statistics is
changing in response, as it moves to accommodate
massive data sets that aim to answer thousands of questions simultaneously. Hypothesis testing is just one part
of the story, but statistical history suggests that it could
play a central role: its development in the first third of
the twentieth century led directly to confidence intervals, decision theory and the flowering of mathematical
statistics.
I believe, or maybe just hope, that our new scientific
environment will also inspire a new look at old philosophical questions. Neither Bayesians nor frequentists
are immune to the pressures of scientific necessity.
Lurking behind the specific methodology of this paper is the broader, still mainly unanswered, question of
how one should combine evidence from thousands of
parallel but not identical hypothesis testing situations.
What I called “empirical Bayes information” accumulates in a way that is not well understood yet, but still
has to be acknowledged: in the situations of Figure 4,
the frequentist is not free to stick with classical null hypotheses, while the Bayesian cannot use prior (6.13), at
least not without the risk of substantial inferential confusion.
Classical statistics developed in a data-poor environment, as Fisher’s favorite description, “small-sample
theory,” suggests. By contrast, modern-day disciplines
such as machine learning seem to struggle with the difficulties of too much data. Both problems, too little and
too much data, can afflict microarray studies. Massive
data sets like those in Figure 1 are misleadingly comforting in their suggestion of great statistical accuracy.
As I have tried to show here, the power to detect interesting specific cases, genes, may still be quite low. New
methods are needed, perhaps along the lines of “enrichment,” as well as a theory of experimental design
explicitly fashioned for large-scale testing situations.
One floor up from the philosophical basement lives
the untidy family of statistical models. In this paper
I have tried to minimize modeling decisions by working directly with z-values. The combination of the two-groups model and false discovery rates applied to the z-value histogram is notably light on assumptions, more
so when using an empirical null, which does not even
require independence across the columns of X (i.e.,
across microarrays, a dangerous assumption as shown
MICROARRAYS, EMPIRICAL BAYES AND THE TWO-GROUPS MODEL
in Section 6 of Efron, 2004). There will certainly be situations when modeling inside the X matrix, as in Newton et al. (2004) or Kerr, Martin and Churchill (2000),
yields more information than z-value procedures, but I
will leave that for others to discuss.
ACKNOWLEDGMENTS
This research was supported in part by the National
Science Foundation Grant DMS-00-72360 and by National Institute of Health Grant 8R01 EB002784.
REFERENCES
Allison, D., Gadbury, G., Heo, M., Fernandez, J., Lee, C. K., Prolla, T. and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Comput. Statist. Data Anal. 39 1–20. MR1895555
Aubert, J., Bar-Hen, A., Daudin, J. and Robin, S. (2004). Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics 5 125.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate under dependency. Ann. Statist. 29 1165–1188. MR1869245
Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 100 71–93. MR2156820
Broberg, P. (2005). A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 6 199.
Do, K.-A., Mueller, P. and Tang, F. (2005). A Bayesian mixture model for differential gene expression. J. Roy. Statist. Soc. Ser. C 54 627–644. MR2137258
Dudoit, S., Shaffer, J. and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103. MR1997066
Efron, B. (2003). Robbins, empirical Bayes, and microarrays. Ann. Statist. 31 366–378. MR1983533
Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104. MR2054289
Efron, B. (2005). Local false discovery rates. Available at http://www-stat.stanford.edu/~brad/papers/False.pdf.
Efron, B. (2006). Size, power, and false discovery rates. Ann. Appl. Statist. To appear. Available at http://www-stat.stanford.edu/~brad/papers/Size.pdf.
Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103. MR2293302
Efron, B. and Gous, A. (2001). Scales of evidence for model selection: Fisher versus Jeffreys. In Model Selection. IMS Monograph 38 208–256. MR2000754
Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70 311–319. MR0391403
Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. Ann. Statist. 24 2431–2461. MR1425960
Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23 70–86.
Efron, B. and Tibshirani, R. (2006). On testing the significance of sets of genes. Ann. Appl. Statist. To appear. Available at http://www-stat.stanford.edu/~brad/papers/genesetpaper.pdf.
Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160. MR1946571
Hedenfalk, I., Duggen, D., Chen, Y. et al. (2001). Gene expression profiles in hereditary breast cancer. New Engl. J. Medicine 344 539–548.
Heller, G. and Qing, J. (2003). A mixture model approach for finding informative genes in microarray studies. Unpublished manuscript.
Kerr, M., Martin, M. and Churchill, G. (2000). Analysis of variance in microarray data. J. Comp. Biology 7 819–837.
Langass, M., Lindquist, B. and Ferkinstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. Roy. Statist. Soc. Ser. B 67 555–572. MR2168204
Lee, M. L. T., Kuo, F., Whitmore, G. and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. 97 9834–9838.
Lehmann, E. and Romano, J. (2005). Generalizations of the familywise error rate. Ann. Statist. 33 1138–1154. MR2195631
Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York. MR2135927
Lewin, A., Richardson, S., Marshall, C., Glaser, A. and Aitman, Y. (2006). Bayesian modeling of differential gene expression. Biometrics 62 1–9. MR2226550
Liang, C., Rice, J., de Pater, I., Alcock, C., Axelrod, T., Wang, A. and Marshall, S. (2004). Statistical methods for detecting stellar occultations by Kuiper belt objects: The Taiwanese-American occultation survey. Statist. Sci. 19 265–274. MR2146947
Liao, J., Lin, Y., Selvanayagam, Z. and Weichung, J. (2004). A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 20 2694–2701.
Newton, M., Kendziorski, C., Richmond, C., Blattner, F. and Tsui, K. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comp. Biology 8 37–52.
Newton, M., Noveiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture model. Biostatistics 5 155–176.
Pan, W., Lin, J. and Le, C. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional and Integrative Genomics 3 117–124.
Parmigiani, G., Garrett, E., Ambazhagan, R. and Gabrielson, E. (2002). A statistical framework for expression-based molecular classification in cancer. J. Roy. Statist. Soc. Ser. B 64 717–736. MR1979385
Pawitan, Y., Murthy, K., Michiels, J. and Ploner, A. (2005). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 21 3865–3872.
Pounds, S. and Morris, S. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of the p-values. Bioinformatics 19 1236–1242.
Qui, X., Brooks, A., Klebanov, L. and Yakovlev, A. (2005). The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics 6 120.
Rogosa, D. (2003). Accuracy of API index and school base report elements: 2003 Academic Performance Index, California Department of Education. Available at http://www.cde.ca.gov/ta/ac/ap/researchreports.asp.
Schwartzman, A., Dougherty, R. F. and Taylor, J. E. (2005). Cross-subject comparison of principal diffusion direction maps. Magnetic Resonance in Medicine 53 1423–1431.
Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D'Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. and Sellers, R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 302–309.
Smyth, G. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3 1–29. MR2101454
Storey, J. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. Ser. B 64 479–498. MR1924302
Storey, J., Taylor, J. and Siegmund, D. (2005). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. Roy. Statist. Soc. Ser. B 66 187–205.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102 15545–15550.
Turnbull, B. (2006). BEST proteomics data. Available at www.stanford.edu/people/brit.turnbull/BESTproteomics.pdf.
Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proc. Natl. Acad. Sci. USA 98 5116–5121.
van 't Wout, A., Lehrma, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G. and Mullins, J. (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. J. Virology 77 1392–1402.