
Statistical Science
2008, Vol. 23, No. 1, 1–22
DOI: 10.1214/07-STS236
© Institute of Mathematical Statistics, 2008
Microarrays, Empirical Bayes and the
Two-Groups Model
Bradley Efron
Abstract. The classic frequentist theory of hypothesis testing developed by
Neyman, Pearson and Fisher has a claim to being the twentieth century’s
most influential piece of applied mathematics. Something new is happen-
ing in the twenty-first century: high-throughput devices, such as microar-
rays, routinely require simultaneous hypothesis tests for thousands of indi-
vidual cases, not at all what the classical theory had in mind. In these situa-
tions empirical Bayes information begins to force itself upon frequentists and
Bayesians alike. The two-groups model is a simple Bayesian construction
that facilitates empirical Bayes analysis. This article concerns the interplay
of Bayesian and frequentist ideas in the two-groups setting, with particular at-
tention focused on Benjamini and Hochberg’s False Discovery Rate method.
Topics include the choice and meaning of the null hypothesis in large-scale
testing situations, power considerations, the limitations of permutation meth-
ods, significance testing for groups of cases (such as pathways in microarray
studies), correlation effects, multiple confidence intervals and Bayesian com-
petitors to the two-groups model.
Key words and phrases: Simultaneous tests, empirical null, false discovery
rates.
1. INTRODUCTION
Simultaneous hypothesis testing was a lively research topic during my student days, exemplified by Rupert Miller’s classic text “Simultaneous Statistical Inference” (1966, 1981). Attention focused on testing N null hypotheses at the same time, where N was typically less than half a dozen, though the requisite tables might go up to N = 20. Modern scientific technology, led by the microarray, has upped the ante in dramatic fashion: my examples here will have N’s ranging from 200 to 10,000, while N = 500,000, from SNP analyses, is waiting in the wings. [The astrostatistical applications in Liang et al. (2004) envision N = 10^10 and more!]
Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: [email protected]).
1 Discussed in 10.1214/07-STS236B, 10.1214/07-STS236C, 10.1214/07-STS236D and 10.1214/07-STS236A; rejoinder at 10.1214/08-STS236REJ.

Miller’s text is relentlessly frequentist, reflecting a classic Neyman–Pearson testing framework, with the main goal being preservation of “α,” overall test size, in the face of multiple inference. Most of the current microarray statistics literature shares this goal, and also its frequentist viewpoint, as described in the nice review article by Dudoit and Boldrick (2003).
Something changes, though, when N gets big: with
thousands of parallel inference problems to consider si-
multaneously, Bayesian considerations begin to force
themselves even upon dedicated frequentists. The
“two-groups model” of the title is a particularly simple
Bayesian framework for large-scale testing situations.
This article explores the interplay of frequentist and
Bayesian ideas in the two-groups setting, with particu-
lar attention paid to False Discovery Rates (Benjamini
and Hochberg, 1995).
Figure 1 concerns four examples of large-scale simultaneous hypothesis testing. Each example consists of N individual cases, with each case represented by its own z-value “z_i,” for i = 1, 2, ..., N. The z_i’s are based on familiar constructions that, theoretically, should yield standard N(0,1) normal distributions under a classical null hypothesis,

(1.1)  theoretical null: z_i ~ N(0,1).

[FIG. 1. Four examples of large-scale simultaneous inference, each panel indicating N z-values as explained in the text. Panel A, prostate cancer microarray study, N = 6033 genes; panel B, comparison of advantaged versus disadvantaged students passing mathematics competency tests, N = 3748 high schools; panel C, proteomics study, N = 230 ordered peaks in time-of-flight spectroscopy experiment; panel D, imaging study comparing dyslexic versus normal children, showing horizontal slice of 655 voxels out of N = 15,455, coded “−” for z_i < 0, “+” for z_i ≥ 0 and solid circle for z_i > 2.]
Here is a brief description of the four examples, with
further information following as needed in the sequel.
EXAMPLE A [Prostate data, Singh et al. (2002)]. N = 6033 genes on 102 microarrays, n1 = 50 healthy males compared with n2 = 52 prostate cancer patients; z_i’s based on two-sample t statistics comparing the two categories.
EXAMPLE B [Education data, Rogosa (2003)]. N = 3748 California high schools; z_i’s based on binomial test of proportion advantaged versus proportion disadvantaged students passing mathematics competency tests.
EXAMPLE C [Proteomics data, Turnbull (2006)].
N=230 ordered peaks in time-of-flight spectroscopy
study of 551 heart disease patients. Each peak’s z-value
was obtained from a Cox regression of the patients’
survival times, with the predictor variable being the
551 observed intensities at that peak.
EXAMPLE D [Imaging data, Schwartzman et al. (2005)]. N = 15,445 voxels in a diffusion tensor imaging (DTI) study comparing six dyslexic with six normal children; z_i’s based on two-sample t statistics comparing the two groups. The figure shows only a single horizontal brain section having 655 voxels, with “−” indicating z_i < 0, “+” for z_i ≥ 0, and solid circles for z_i > 2.
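In Examples A and D the z-values come from two-sample t statistics mapped to the normal scale, z_i = Φ^{-1}(F_df(t_i)), so that (1.1) holds when the usual t-test assumptions do. Here is a minimal Python sketch of that familiar construction; the simulated data and the helper name t_to_z are purely illustrative, not the study data:

    import numpy as np
    from scipy import stats

    def t_to_z(t, df):
        # z = Phi^{-1}(F_t(t)): under the null, t ~ t_df, so z ~ N(0,1) as in (1.1)
        return stats.norm.ppf(stats.t.cdf(t, df))

    # Illustrative stand-in for Example A: 6033 "genes", 50 vs. 52 arrays
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6033, 50))                  # healthy males
    y = rng.normal(size=(6033, 52))                  # cancer patients
    t = stats.ttest_ind(x, y, axis=1).statistic      # two-sample t per gene
    z = t_to_z(t, df=50 + 52 - 2)                    # df = 100, equal-variance t test

In the extreme right tail the cdf-based transform loses precision; working with survival functions there is the standard numerical remedy.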
Our four examples are enough alike to be usefully
analyzed by the two-groups model of Section 2, but
there are some striking differences, too: the theoretical
N(0,1) null (1.1) is obviously inappropriate for the ed-
ucation data of panel B; there is a hint of correlation of
z-value with peak number in panel C, especially near
the right limit; and there is substantial spatial correla-
tion appearing in the imaging data of panel D.
My plan here is to discuss a range of inference prob-
lems raised by large-scale hypothesis testing, many of
which, it seems to me, have been more or less under-
emphasized in a literature focused on controlling Type-
I errors: the choice of a null hypothesis, limitations of
permutation methods, the meaning of “null” and “non-
null” in large-scale settings, questions of power, test
of significance for groups of cases (e.g., pathways in
microarray studies), the effects of correlation, multiple
confidence statements and Bayesian competitors to the
two-groups model. The presentation is intended to be
as nontechnical as possible, many of the topics being
discussed more carefully in Efron (2004, 2005, 2006).
References will be provided as we go along, but this is
not intended as a comprehensive review. Microarrays
have stimulated a burst of creativity from the statistics
community, and I apologize in advance for this arti-
cle’s concentration on my own point of view, which
aims at minimizing the amount of statistical model-
ing required of the statistician. More model-intensive
techniques, including fully Bayesian approaches, as in
Parmigiani et al. (2002) or Lewin et al. (2006), have
their own virtues, which I hope will emerge in the Dis-
cussion.
Section 2 discusses the two-groups model and false discovery rates in an idealized Bayesian setting. Empirical Bayes methods are needed to carry out these ideas in practice, as discussed in Section 3. This discussion assumes a “good” situation, like that of Example A, where the theoretical null (1.1) fits the data. When it does not, as in Example B, the empirical null methods of Section 4 come into play. These raise interpretive questions of their own, as mentioned above, discussed in the later sections.
We are living through a scientific revolution pow-
ered by the new generation of high-throughput obser-
vational devices. This is a wonderful opportunity for
statisticians, to redemonstrate our value to the scien-
tific world, but also to rethink basic topics in statistical
theory. Hypothesis testing is the topic here, a subject
that needs a fresh look in contexts like those of Fig-
ure 1.
2. THE TWO-GROUPS MODEL AND FALSE
DISCOVERY RATES
The two-groups model is too simple to have a single
identifiable author, but it plays an important role in the
Bayesian microarray literature, as in Lee et al. (2000),
Newton et al. (2001) and Efron et al. (2001). We sup-
pose that the N cases (“genes” as they will be called
now in deference to microarray studies, though they
are not genes in the last three examples of Figure 1)
are each either null or nonnull with prior probability p0 or p1 = 1 − p0, and with z-values having density either f0(z) or f1(z),

(2.1)  p0 = Pr{null},      f0(z) density if null,
       p1 = Pr{nonnull},   f1(z) density if nonnull.
The usual purpose of large-scale simultaneous test-
ing is to reduce a vast set of possibilities to a much
smaller set of scientifically interesting prospects. In
Example A, for instance, the investigators were proba-
bly searching for a few genes, or a few hundred at most,
worthy of intensive study for prostate cancer etiology.
I will assume

(2.2)  p0 ≥ 0.90

in what follows, limiting the nonnull genes to no more than 10%.
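For intuition, here is a minimal simulation from the two-groups model (2.1); p0 = 0.90 matches (2.2), while the nonnull density f1 = N(−3, 1) is an arbitrary illustrative choice, not something the model specifies:

    import numpy as np

    rng = np.random.default_rng(1)
    N, p0 = 6000, 0.90
    is_null = rng.random(N) < p0                 # null indicators, Pr{null} = p0
    z = np.where(is_null,
                 rng.normal(0.0, 1.0, size=N),   # f0: theoretical null N(0,1)
                 rng.normal(-3.0, 1.0, size=N))  # f1: illustrative nonnull density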
False discovery rate (Fdr) methods have developed
in a strict frequentist framework, beginning with Ben-
jamini and Hochberg’s seminal 1995 paper, but they
also have a convincing Bayesian rationale in terms of
the two-groups model. Let F0(z) and F1(z) denote
the cumulative distribution functions (cdf) of f0(z)
and f1(z) in (2.1), and define the mixture cdf F(z)=
p0F0(z) +p1F1(z). Then Bayes’ rule yields the a pos-
teriori probability of a gene being in the null group of
(2.1) given that its z-value Zis less than some thresh-
old z,say“Fdr(z),” as
Fdr(z) Pr{null|Zz}
(2.3)
=p0F0(z)/F (z).
[Here it is notationally convenient to consider the neg-
ative end of the zscale, values like z=−3. Defini-
tion (2.3) could just as well be changed to Z>zor
Z>|z|.] Benjamini and Hochberg’s (1995) false dis-
covery rate control rule begins by estimating F(z)with
the empirical cdf
¯
F(z)=#{ziz}/N,(2.4)
yielding Fdr(z) =p0F0(z)/ ¯
F(z). The rule selects a
control level “q,” say q=0.1, and then declares as
nonnull those genes having z-values zisatisfying zi
z0,wherez0is the maximum value of zsatisfying
Fdr(z0)q(2.5)
[usually taking p0=1in(2.3), and F0the theoretical
null, the standard normal cdf (z) of (1.1)].
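In code the rule takes only a few lines. The sketch below is a plain reading of (2.4)–(2.5) for the negative tail, with p0 = 1 and F0 = Φ as in the bracketed remark; the function name is mine, and in practice one would reach for an established implementation (e.g., R’s p.adjust with method "BH"):

    import numpy as np
    from scipy import stats

    def bh_select(z, q=0.10):
        # Estimate Fdr(z) = F0(z)/Fbar(z) at each observed z-value, with Fbar
        # the empirical cdf (2.4), p0 = 1 and F0 = Phi; then find the largest
        # threshold z0 with estimated Fdr(z0) <= q, as in rule (2.5).
        z = np.asarray(z)
        zs = np.sort(z)                                # candidate thresholds
        Fbar = np.arange(1, z.size + 1) / z.size       # empirical cdf at zs
        Fdr_bar = stats.norm.cdf(zs) / Fbar            # estimated Fdr at zs
        ok = np.nonzero(Fdr_bar <= q)[0]
        if ok.size == 0:                               # nothing declared nonnull
            return None, np.zeros(z.size, dtype=bool)
        z0 = zs[ok.max()]
        return z0, z <= z0                             # declare z_i <= z0 nonnull

Applied to z-values from the two-groups simulation above, bh_select(z, q=0.1) returns the cutoff z0 and a boolean vector marking the cases declared nonnull.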
The striking theorem proved in the 1995 paper
was that the expected proportion of null genes re-
ported by a statistician following rule (2.5) will be
no greater than q. This assumes independence among
the z_i’s, extended later to various dependence models
in Benjamini and Yekutieli (2001). The theorem is a
purely frequentist result, but as pointed out in Storey
(2002) and Efron and Tibshirani (2002), it has a sim-
ple Bayesian interpretation via (2.3): rule (2.5) is essentially equivalent to declaring nonnull those genes
whose estimated tail-area posterior probability of be-
ing null is no greater than q. It is usually a good sign
when Bayesian and frequentist ideas converge on a sin-
gle methodology, as they do here.
Densities are more natural than tail areas for Bayesian fdr interpretation. Defining the mixture density from (2.1),

(2.6)  f(z) = p0 f0(z) + p1 f1(z),

Bayes’ rule gives

(2.7)  fdr(z) ≡ Pr{null | Z = z} = p0 f0(z)/f(z)

for the probability of a gene being in the null group given z-score z. Here fdr(z) is the local false discovery rate (Efron et al., 2001; Efron, 2005).
There is a simple relationship between Fdr(z) and fdr(z),

(2.8)  Fdr(z) = E_f{fdr(Z) | Z ≤ z},

“E_f” indicating expectation with respect to the mixture density f(z). That is, Fdr(z) is the mixture average of fdr(Z) for Z ≤ z. In the usual situation where fdr(z) decreases as |z| gets large, Fdr(z) will be smaller than fdr(z). Intuitively, if we decide to label all genes with z_i less than some negative value z0 as nonnull, then fdr(z0), the false discovery rate at the boundary point z0, will be greater than Fdr(z0), the average false discovery rate beyond the boundary. Figure 2 illustrates the geometrical relationship between Fdr(z) and fdr(z); the Benjamini–Hochberg Fdr control rule amounts to an upper bound on the secant slope.
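Relation (2.8) is an identity, as a quick numerical check confirms for the illustrative mixture used above (p0 = 0.9, f0 = N(0,1), f1 = N(−3,1)); the two printed numbers coincide:

    import numpy as np
    from scipy import stats, integrate

    p0, p1 = 0.90, 0.10
    f0 = lambda z: stats.norm.pdf(z, 0.0, 1.0)
    f1 = lambda z: stats.norm.pdf(z, -3.0, 1.0)
    F0 = lambda z: stats.norm.cdf(z, 0.0, 1.0)
    F1 = lambda z: stats.norm.cdf(z, -3.0, 1.0)

    f   = lambda z: p0 * f0(z) + p1 * f1(z)                  # mixture density (2.6)
    fdr = lambda z: p0 * f0(z) / f(z)                        # local fdr (2.7)
    Fdr = lambda z: p0 * F0(z) / (p0 * F0(z) + p1 * F1(z))   # tail-area Fdr (2.3)

    z0 = -3.0
    num, _ = integrate.quad(lambda z: fdr(z) * f(z), -np.inf, z0)
    den, _ = integrate.quad(f, -np.inf, z0)
    print(Fdr(z0), num / den)    # E_f{fdr(Z) | Z <= z0} equals Fdr(z0)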
For Lehmann alternatives

(2.9)  F1(z) = F0(z)^γ,   [γ < 1],

it turns out that

(2.10)  log{fdr(z)/[1 − fdr(z)]} = log{Fdr(z)/[1 − Fdr(z)]} + log(1/γ),

so

(2.11)  fdr(z) ≈ Fdr(z)/γ

for small values of Fdr. The prostate data of Figure 1 has γ about 1/2 in each tail, making fdr(z) ≈ 2 Fdr(z) near the extremes.
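To see where (2.10) comes from, differentiate (2.9) to get f1(z) = γ F0(z)^(γ−1) f0(z). Then

    fdr(z)/[1 − fdr(z)] = p0 f0(z)/[p1 f1(z)] = (p0/p1) F0(z)^(1−γ)/γ,
    Fdr(z)/[1 − Fdr(z)] = p0 F0(z)/[p1 F1(z)] = (p0/p1) F0(z)^(1−γ),

and taking logarithms gives (2.10). When Fdr(z) is small, the odds fdr/(1 − fdr) and Fdr/(1 − Fdr) are close to fdr and Fdr themselves, which yields (2.11).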
The statistics literature has not reached consensus on the choice of q for the Benjamini–Hochberg control rule (2.5)—what would be the equivalent of 0.05 for classical testing—but Bayes factor calculations offer some insight. Efron (2005, 2006) uses the cutoff point

(2.12)  fdr(z) ≤ 0.20

for reporting nonnull genes, on the admittedly subjective grounds that fdr values much greater than 0.20 are dangerously prone to wasting investigators’ resources.

[FIG. 2. Relationship of Fdr(z) to fdr(z). Heavy curve plots numerator of Fdr, p0 F0(z), versus denominator F(z); fdr(z) is slope of tangent, Fdr slope of secant.]
Then (2.6), (2.7) yield posterior odds ratio

(2.13)  Pr{nonnull | z}/Pr{null | z} = (1 − fdr(z))/fdr(z)
                                     = p1 f1(z)/p0 f0(z)
                                     ≥ 0.8/0.2 = 4.

Since (2.2) implies p1/p0 ≤ 1/9, (2.13) corresponds to requiring Bayes factor

(2.14)  f1(z)/f0(z) ≥ 36

in favor of nonnull in order to declare significance.
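The factor of 36 is just the two bounds combined: f1(z)/f0(z) = (p0/p1) · [p1 f1(z)/p0 f0(z)] ≥ 9 × 4 = 36, using p0/p1 ≥ 9 from (2.2) and the posterior odds bound of 4 from (2.13).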
Factor (2.14) requires much stronger evidence against the null hypothesis than in standard one-at-a-time testing, where the critical threshold lies somewhere near 3 (Efron and Gous, 2001). The fdr ≤ 0.20 threshold corresponds to q-values in (2.5) between 0.05 and 0.15 for moderate choices of γ; such q-value thresholds can be interpreted as providing conservative Bayes factors for Fdr testing.
Model (2.1) ignores the fact that investigators usually begin with hot prospects in mind, genes that have high prior probability of being interesting. Suppose p0(i) is the prior probability that gene i is null, and define p0 as the average of p0(i) over all N genes. Then Bayes’ theorem yields this expression for fdr_i(z) = Pr{gene i null | z_i = z}:

(2.15)  fdr_i(z) = fdr(z) r_i / [1 − (1 − r_i) fdr(z)],   r_i = [p0(i)/(1 − p0(i))] / [p0/(1 − p0)],

where fdr(z) = p0 f0(z)/f(z) as before. So for a hot prospect having p0(i) = 0.50 rather than p0 = 0.90, (2.15) changes an uninteresting result like fdr(z_i) = 0.40 into fdr_i(z_i) = 0.069.
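The hot-prospect number checks out directly; a two-line verification of (2.15) with the values from the text:

    fdr, p0, p0_i = 0.40, 0.90, 0.50
    r_i = (p0_i / (1 - p0_i)) / (p0 / (1 - p0))   # prior odds ratio r_i = 1/9
    fdr_i = fdr * r_i / (1 - (1 - r_i) * fdr)     # (2.15)
    print(round(fdr_i, 3))                        # 0.069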
Wonderfully neat and exact results like the Benjamini–
Hochberg Fdr control rule exert a powerful influence
on statistical theory, sometimes more than is good for
applied work. Much of the microarray statistics liter-
ature seems to me to be overly concerned with exact
properties borrowed from classical test theory, at the
expense of ignoring the complications of large-scale
testing. Neatness and exactness are mostly missing in
what follows as I examine an empirical Bayes approach
to the application of two-groups/Fdr ideas to situations
like those in Figure 1.
3. EMPIRICAL BAYES METHODS
In practice, the difference between Bayesian and
frequentist statisticians is their self-confidence in as-
signing prior distributions to complicated probability
models. Large-scale testing problems certainly look