Statistical Science
2008, Vol. 23, No. 1, 1–22
DOI: 10.1214/07-STS236
© Institute of Mathematical Statistics, 2008

Microarrays, Empirical Bayes and the Two-Groups Model

Bradley Efron

Abstract. The classic frequentist theory of hypothesis testing developed by Neyman, Pearson and Fisher has a claim to being the twentieth century's most influential piece of applied mathematics. Something new is happening in the twenty-first century: high-throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention focused on Benjamini and Hochberg's False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals and Bayesian competitors to the two-groups model.

Key words and phrases: Simultaneous tests, empirical null, false discovery rates.

Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: [email protected]). Discussed in 10.1214/07-STS236B, 10.1214/07-STS236C, 10.1214/07-STS236D and 10.1214/07-STS236A; rejoinder at 10.1214/08-STS236REJ.

1. INTRODUCTION

Simultaneous hypothesis testing was a lively research topic during my student days, exemplified by Rupert Miller's classic text "Simultaneous Statistical Inference" (1966, 1981). Attention focused on testing N null hypotheses at the same time, where N was typically less than half a dozen, though the requisite tables might go up to N = 20. Modern scientific technology, led by the microarray, has upped the ante in dramatic fashion: my examples here will have N's ranging from 200 to 10,000, while N = 500,000, from SNP analyses, is waiting in the wings. [The astrostatistical applications in Liang et al. (2004) envision N = 10^10 and more!]

Miller's text is relentlessly frequentist, reflecting a classic Neyman–Pearson testing framework, with the main goal being preservation of "α," overall test size, in the face of multiple inference. Most of the current microarray statistics literature shares this goal, and also its frequentist viewpoint, as described in the nice review article by Dudoit and Boldrick (2003). Something changes, though, when N gets big: with thousands of parallel inference problems to consider simultaneously, Bayesian considerations begin to force themselves even upon dedicated frequentists.

The "two-groups model" of the title is a particularly simple Bayesian framework for large-scale testing situations. This article explores the interplay of frequentist and Bayesian ideas in the two-groups setting, with particular attention paid to False Discovery Rates (Benjamini and Hochberg, 1995).

Figure 1 concerns four examples of large-scale simultaneous hypothesis testing. Each example consists of N individual cases, with each case represented by its own z-value "zi," for i = 1, 2, . . . , N. The zi's are based on familiar constructions that, theoretically, should yield standard N(0, 1) normal distributions under a classical null hypothesis,

(1.1)  theoretical null: zi ~ N(0, 1).

FIG. 1. Four examples of large-scale simultaneous inference, each panel indicating N z-values as explained in the text. Panel A, prostate cancer microarray study, N = 6033 genes; panel B, comparison of advantaged versus disadvantaged students passing mathematics competency tests, N = 3748 high schools; panel C, proteomics study, N = 230 ordered peaks in time-of-flight spectroscopy experiment; panel D, imaging study comparing dyslexic versus normal children, showing horizontal slice of 655 voxels out of N = 15,455, coded "−" for zi < 0, "+" for zi ≥ 0 and solid circle for zi > 2.

Here is a brief description of the four examples, with further information following as needed in the sequel.

EXAMPLE A [Prostate data, Singh et al. (2002)]. N = 6033 genes on 102 microarrays, n1 = 50 healthy males compared with n2 = 52 prostate cancer patients; zi's based on two-sample t statistics comparing the two categories.

EXAMPLE B [Education data, Rogosa (2003)]. N = 3748 California high schools; zi's based on binomial test of proportion advantaged versus proportion disadvantaged students passing mathematics competency tests.

EXAMPLE C [Proteomics data, Turnbull (2006)]. N = 230 ordered peaks in time-of-flight spectroscopy study of 551 heart disease patients. Each peak's z-value was obtained from a Cox regression of the patients' survival times, with the predictor variable being the 551 observed intensities at that peak.

EXAMPLE D [Imaging data, Schwartzman et al. (2005)]. N = 15,445 voxels in a diffusion tensor imaging (DTI) study comparing six dyslexic with six normal children; zi's based on two-sample t statistics comparing the two groups. The figure shows only a single horizontal brain section having 655 voxels, with "−" indicating zi < 0, "+" for zi ≥ 0, and solid circles for zi > 2.
Our four examples are enough alike to be usefully analyzed by the two-groups model of Section 2, but there are some striking differences, too: the theoretical N(0, 1) null (1.1) is obviously inappropriate for the education data of panel B; there is a hint of correlation of z-value with peak number in panel C, especially near the right limit; and there is substantial spatial correlation appearing in the imaging data of panel D.

My plan here is to discuss a range of inference problems raised by large-scale hypothesis testing, many of which, it seems to me, have been more or less underemphasized in a literature focused on controlling Type I errors: the choice of a null hypothesis, limitations of permutation methods, the meaning of "null" and "nonnull" in large-scale settings, questions of power, tests of significance for groups of cases (e.g., pathways in microarray studies), the effects of correlation, multiple confidence statements and Bayesian competitors to the two-groups model. The presentation is intended to be as nontechnical as possible, many of the topics being discussed more carefully in Efron (2004, 2005, 2006). References will be provided as we go along, but this is not intended as a comprehensive review. Microarrays have stimulated a burst of creativity from the statistics community, and I apologize in advance for this article's concentration on my own point of view, which aims at minimizing the amount of statistical modeling required of the statistician. More model-intensive techniques, including fully Bayesian approaches, as in Parmigiani et al. (2002) or Lewin et al. (2006), have their own virtues, which I hope will emerge in the Discussion.

Section 2 discusses the two-groups model and false discovery rates in an idealized Bayesian setting. Empirical Bayes methods are needed to carry out these ideas in practice, as discussed in Section 3. This discussion assumes a "good" situation, like that of Example A, where the theoretical null (1.1) fits the data.
When it does not, as in Example B, the empirical null methods of Section 4 come into play. These raise interpretive questions of their own, as mentioned above, discussed in the later sections.

We are living through a scientific revolution powered by the new generation of high-throughput observational devices. This is a wonderful opportunity for statisticians, to redemonstrate our value to the scientific world, but also to rethink basic topics in statistical theory. Hypothesis testing is the topic here, a subject that needs a fresh look in contexts like those of Figure 1.

2. THE TWO-GROUPS MODEL AND FALSE DISCOVERY RATES

The two-groups model is too simple to have a single identifiable author, but it plays an important role in the Bayesian microarray literature, as in Lee et al. (2000), Newton et al. (2001) and Efron et al. (2001). We suppose that the N cases ("genes" as they will be called now in deference to microarray studies, though they are not genes in the last three examples of Figure 1) are each either null or nonnull with prior probability p0 or p1 = 1 − p0, and with z-values having density either f0(z) or f1(z),

(2.1)  p0 = Pr{null},     f0(z) = density if null;
       p1 = Pr{nonnull},  f1(z) = density if nonnull.

The usual purpose of large-scale simultaneous testing is to reduce a vast set of possibilities to a much smaller set of scientifically interesting prospects. In Example A, for instance, the investigators were probably searching for a few genes, or a few hundred at most, worthy of intensive study for prostate cancer etiology. I will assume

(2.2)  p0 ≥ 0.90

in what follows, limiting the nonnull genes to no more than 10%.

False discovery rate (Fdr) methods have developed in a strict frequentist framework, beginning with Benjamini and Hochberg's seminal 1995 paper, but they also have a convincing Bayesian rationale in terms of the two-groups model.
Let F0(z) and F1(z) denote the cumulative distribution functions (cdf) of f0(z) and f1(z) in (2.1), and define the mixture cdf F(z) = p0F0(z) + p1F1(z). Then Bayes' rule yields the a posteriori probability of a gene being in the null group of (2.1) given that its z-value Z is less than some threshold z, say "Fdr(z)," as

(2.3)  Fdr(z) ≡ Pr{null | Z ≤ z} = p0F0(z)/F(z).

[Here it is notationally convenient to consider the negative end of the z scale, values like z = −3. Definition (2.3) could just as well be changed to Z > z or Z > |z|.]

Benjamini and Hochberg's (1995) false discovery rate control rule begins by estimating F(z) with the empirical cdf

(2.4)  F̄(z) = #{zi ≤ z}/N,

yielding Fdr̄(z) = p0F0(z)/F̄(z). The rule selects a control level "q," say q = 0.1, and then declares as nonnull those genes having z-values zi satisfying zi ≤ z0, where z0 is the maximum value of z satisfying

(2.5)  Fdr̄(z0) ≤ q

[usually taking p0 = 1 in (2.3), and F0 the theoretical null, the standard normal cdf Φ(z) of (1.1)].

The striking theorem proved in the 1995 paper was that the expected proportion of null genes reported by a statistician following rule (2.5) will be no greater than q. This assumes independence among the zi's, extended later to various dependence models in Benjamini and Yekutieli (2001). The theorem is a purely frequentist result, but as pointed out in Storey (2002) and Efron and Tibshirani (2002), it has a simple Bayesian interpretation via (2.3): rule (2.5) is essentially equivalent to declaring nonnull those genes whose estimated tail-area posterior probability of being null is no greater than q. It is usually a good sign when Bayesian and frequentist ideas converge on a single methodology, as they do here.

Densities are more natural than tail areas for Bayesian fdr interpretation.
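Before turning to densities, rule (2.4)–(2.5) can be put into code in a few lines. The following pure-Python sketch (not the locfdr implementation) takes p0 = 1 and the theoretical null Φ for F0, as in the rule's usual application, and reports the declared nonnull cases on the negative tail:

```python
import math

def Phi(z):
    """Standard normal cdf, the theoretical null F0 of (1.1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bh_negative_tail(z_values, q, p0=1.0):
    """Rule (2.5) on the negative tail: declare nonnull all z_i <= z0,
    where z0 is the largest z with p0 * F0(z) / Fbar(z) <= q and
    Fbar is the empirical cdf (2.4)."""
    N = len(z_values)
    z0 = None
    for k, z in enumerate(sorted(z_values), start=1):
        if p0 * Phi(z) / (k / N) <= q:  # Fbar = k/N at the kth smallest z
            z0 = z  # ascending scan retains the largest qualifying z
    return [] if z0 is None else [z for z in z_values if z <= z0]

nonnull = bh_negative_tail([-5.0, -4.0, 0.0, 1.0, 2.0], q=0.1)
```

With q = 0.1 the two extreme z-values are declared nonnull; equivalently, these are the cases whose estimated tail-area posterior null probability (2.3) is at most 0.1.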
Defining the mixture density from (2.1),

(2.6)  f(z) = p0f0(z) + p1f1(z),

Bayes' rule gives

(2.7)  fdr(z) ≡ Pr{null | Z = z} = p0f0(z)/f(z)

for the probability of a gene being in the null group given z-score z. Here fdr(z) is the local false discovery rate (Efron et al., 2001; Efron, 2005).

There is a simple relationship between Fdr(z) and fdr(z),

(2.8)  Fdr(z) = Ef{fdr(Z) | Z ≤ z},

"Ef" indicating expectation with respect to the mixture density f(z). That is, Fdr(z) is the mixture average of fdr(Z) for Z ≤ z. In the usual situation where fdr(z) decreases as |z| gets large, Fdr(z) will be smaller than fdr(z). Intuitively, if we decide to label all genes with zi less than some negative value z0 as nonnull, then fdr(z0), the false discovery rate at the boundary point z0, will be greater than Fdr(z0), the average false discovery rate beyond the boundary. Figure 2 illustrates the geometrical relationship between Fdr(z) and fdr(z); the Benjamini–Hochberg Fdr control rule amounts to an upper bound on the secant slope.

FIG. 2. Relationship of Fdr(z) to fdr(z). Heavy curve plots numerator of Fdr, p0F0(z), versus denominator F(z); fdr(z) is slope of tangent, Fdr slope of secant.

For Lehmann alternatives

(2.9)  F1(z) = F0(z)^γ   [γ < 1],

it turns out that

(2.10)  log[fdr(z)/(1 − fdr(z))] = log[Fdr(z)/(1 − Fdr(z))] + log(1/γ),

so

(2.11)  fdr(z) ≈ Fdr(z)/γ

for small values of Fdr. The prostate data of Figure 1 has γ about 1/2 in each tail, making fdr(z) ~ 2·Fdr(z) near the extremes.

The statistics literature has not reached consensus on the choice of q for the Benjamini–Hochberg control rule (2.5)—what would be the equivalent of 0.05 for classical testing—but Bayes factor calculations offer some insight. Efron (2005, 2006) uses the cutoff point

(2.12)  fdr(z) ≤ 0.20
for reporting nonnull genes, on the admittedly subjective grounds that fdr values much greater than 0.20 are dangerously prone to wasting investigators' resources. Then (2.6), (2.7) yield posterior odds ratio

(2.13)  Pr{nonnull | z}/Pr{null | z} = (1 − fdr(z))/fdr(z)
        = p1f1(z)/p0f0(z) ≥ 0.8/0.2 = 4.

Since (2.2) implies p1/p0 ≤ 1/9, (2.13) corresponds to requiring Bayes factor

(2.14)  f1(z)/f0(z) ≥ 36

in favor of nonnull in order to declare significance. Factor (2.14) requires much stronger evidence against the null hypothesis than in standard one-at-a-time testing, where the critical threshold lies somewhere near 3 (Efron and Gous, 2001). The fdr 0.20 threshold corresponds to q-values in (2.5) between 0.05 and 0.15 for moderate choices of γ; such q-value thresholds can be interpreted as providing conservative Bayes factors for Fdr testing.

Model (2.1) ignores the fact that investigators usually begin with hot prospects in mind, genes that have high prior probability of being interesting. Suppose p0(i) is the prior probability that gene i is null, and define p0 as the average of p0(i) over all N genes. Then Bayes' theorem yields this expression for fdri(z) = Pr{gene i null | zi = z}:

(2.15)  fdri(z) = fdr(z)·ri / [1 − (1 − ri)·fdr(z)],
        ri = [p0(i)/(1 − p0(i))] / [p0/(1 − p0)],

where fdr(z) = p0f0(z)/f(z) as before. So for a hot prospect having p0(i) = 0.50 rather than p0 = 0.90, (2.15) changes an uninteresting result like fdr(zi) = 0.40 into fdri(zi) = 0.069.

Wonderfully neat and exact results like the Benjamini–Hochberg Fdr control rule exert a powerful influence on statistical theory, sometimes more than is good for applied work. Much of the microarray statistics literature seems to me to be overly concerned with exact properties borrowed from classical test theory, at the expense of ignoring the complications of large-scale testing.
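Two of these formulas are easy to check numerically. The pure-Python sketch below (the Lehmann parameters are illustrative choices; the hot-prospect numbers are the ones from the text) verifies that identity (2.10) holds exactly under a Lehmann alternative, and reproduces the fdri calculation of (2.15):

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# --- Identity (2.10) under the Lehmann alternative (2.9) ---
p0, p1, gamma, z = 0.9, 0.1, 0.5, -3.0
F0 = Phi(z)
# F1 = F0**gamma implies the density ratio f1/f0 = gamma * F0**(gamma - 1)
fdr = p0 / (p0 + p1 * gamma * F0 ** (gamma - 1))  # local fdr (2.7)
Fdr = p0 * F0 / (p0 * F0 + p1 * F0 ** gamma)      # tail-area Fdr (2.3)
lhs = math.log(fdr / (1 - fdr))
rhs = math.log(Fdr / (1 - Fdr)) + math.log(1 / gamma)
# lhs equals rhs to machine precision; when both rates are small,
# exponentiating gives fdr =~ Fdr / gamma, which is (2.11)

# --- Hot-prospect adjustment (2.15) ---
def fdr_i(fdr_z, p0_i, p0_bar):
    """Gene-specific fdr: r is gene i's prior null odds relative to
    the average gene's prior null odds."""
    r = (p0_i / (1 - p0_i)) / (p0_bar / (1 - p0_bar))
    return fdr_z * r / (1 - (1 - r) * fdr_z)

hot = fdr_i(0.40, 0.50, 0.90)  # the text's example: 0.40 becomes 0.069
```

A sanity check on (2.15): setting p0(i) = p0 gives ri = 1 and returns fdr(z) unchanged, as it should.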
Neatness and exactness are mostly missing in what follows as I examine an empirical Bayes approach to the application of two-groups/Fdr ideas to situations like those in Figure 1.

3. EMPIRICAL BAYES METHODS

In practice, the difference between Bayesian and frequentist statisticians is their self-confidence in assigning prior distributions to complicated probability models. Large-scale testing problems certainly look complicated enough, but this is deceptive; their massively parallel structure, with thousands of similar situations each providing information, allows an appropriate prior distribution to be estimated from the data without upsetting even timid frequentists like myself. This is the empirical Bayes approach of Robbins and Stein, 50 years old but coming into its own in the microarray era; see Efron (2003).

Consider estimating the local false discovery rate fdr(z) = p0f0(z)/f(z), (2.7). I will begin with a "good" case, like the prostate data of Example A in Section 1, where it is easy to believe in the theoretical null distribution (1.1),

(3.1)  f0(z) = φ(z) ≡ (2π)^(−1/2) exp{−(1/2)z²}.

The z-values in Example A were obtained by transforming the usual two-sample t statistic "ti" comparing cancer and normal patients' expression levels for gene i, to a standard normal scale via

(3.2)  zi = Φ^(−1)(F100(ti));

here Φ and F100 are the cdfs of the standard normal and t100 distributions. If we had only gene i's data to test, classic theory would tell us to compare zi with f0(z) = φ(z) as in (3.1).

For the moment I will take p0, the prior probability of a gene being null, as known. Section 4 discusses p0's estimation, but in fact its exact value does not make much difference to Fdr(z) or fdr(z), (2.3) or (2.7), if p0 is near 1 as in (2.2). Benjamini and Hochberg (1995) take p0 = 1, providing an upper bound for Fdr(z). This leaves us with only the denominator f(z) to estimate in (2.7).
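Transformation (3.2) is an instance of the probability integral transform: if F is the exact null cdf of the test statistic, then Φ^(−1)(F(t)) is exactly N(0, 1) under the null. The Python standard library has no t100 cdf, so the sketch below illustrates the principle with an exponential statistic as a stand-in (an assumption for demonstration only; the prostate data used ti and F100):

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(0)
ndist = NormalDist()  # standard normal: supplies cdf and inverse cdf

# Stand-in null statistic t ~ Exponential(1), exact null cdf F(t) = 1 - exp(-t)
z = [ndist.inv_cdf(1.0 - math.exp(-random.expovariate(1.0)))
     for _ in range(50_000)]

# Under the null, z-values built this way are exactly N(0, 1)
m, s = fmean(z), stdev(z)
tail = sum(abs(v) > 1.96 for v in z) / len(z)  # should be near 0.05
```

The empirical mean, standard deviation and two-sided 5% tail fraction all come out close to their N(0, 1) values, which is exactly what (3.2) arranges for the ti's.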
By definition (2.6), f(z) is the marginal density of all N zi's, so we can use all the data to estimate f(z). The algorithm locfdr, an R function available from the CRAN library, does this by means of standard Poisson GLM software (Efron, 2005). Suppose the z-values have been binned, giving bin counts

(3.3)  yk = #{zi in bin k},   k = 1, 2, . . . , K.

The prostate data histogram in panel A of Figure 1 has K = 49 bins of width Δ = 0.2. We take the yk to be independent Poisson counts,

(3.4)  yk ~ind Poi(νk),   k = 1, 2, . . . , K,

with the unknown νk proportional to the density f(z) at the midpoint "xk" of the kth bin, approximately

(3.5)  νk = NΔf(xk).

Modeling log(νk) as a pth-degree polynomial function of xk makes (3.4)–(3.5) a standard Poisson generalized linear model (GLM). The choice p = 7 used in Figure 3 amounts to estimating f(z) by maximum likelihood within the seven-parameter exponential family

(3.6)  f(z) = exp{Σj=0..7 βj z^j}.

Notice that p = 2 would make f(z) normal; the extra parameters in (3.6) allow flexibility in fitting the tails of f(z). Here we are employing Lindsey's method; see Efron and Tibshirani (1996). Despite its unorthodox look, it is no more than a convenient way to obtain maximum likelihood estimates in multiparameter families like (3.6).

The heavy curve in Figure 3 is an estimate of the local false discovery rate for the prostate data,

(3.7)  fdr̂(z) = p0f0(z)/f̂(z),

with f̂(z) constructed as above, f0(z) = φ(z) as in (3.1), and p0 = 0.93, as estimated in Section 4; fdr̂(z) is near 1 for |z| ≤ 2, decreasing to interesting levels for |z| > 3. Fifty-one of the 6033 genes have fdr̂(zi) ≤ 0.2, 26 on the right and 25 on the left, and these could be reported back to the investigators as likely nonnull candidates. [The standard Benjamini–Hochberg procedure, (2.5) with q = 0.1, reports 60 nonnull genes, 28 on the right and 32 on the left.]
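Lindsey's method needs nothing beyond Poisson regression. The pure-Python sketch below (a simplified stand-in for locfdr, using degree 2 so the fitted density is normal, with illustrative bin settings) fits the GLM (3.4)–(3.5) by Newton's method, started from least squares on log counts:

```python
import math

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def lindsey_fit(counts, midpoints, degree=2, n_iter=20):
    """Lindsey's method (3.3)-(3.6): model log(nu_k) as a polynomial in
    x_k and fit the Poisson GLM by Newton's method. Returns beta."""
    p = degree + 1
    X = [[x ** j for j in range(p)] for x in midpoints]
    # least-squares start on log(counts), floored away from zero
    ylog = [math.log(max(y, 1e-3)) for y in counts]
    XtX = [[sum(Xi[j] * Xi[l] for Xi in X) for l in range(p)] for j in range(p)]
    Xty = [sum(Xi[j] * yl for Xi, yl in zip(X, ylog)) for j in range(p)]
    beta = solve(XtX, Xty)
    for _ in range(n_iter):  # Newton steps on the Poisson log-likelihood
        mu = [math.exp(sum(b * xj for b, xj in zip(beta, Xi))) for Xi in X]
        score = [sum((y - m) * Xi[j] for y, m, Xi in zip(counts, mu, X))
                 for j in range(p)]
        info = [[sum(m * Xi[j] * Xi[l] for m, Xi in zip(mu, X))
                 for l in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta

# Demo on exact N(0,1) bin counts, N = 6000 cases, bin width 0.2 (illustrative)
phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
mids = [-3.9 + 0.2 * k for k in range(40)]
beta = lindsey_fit([6000 * 0.2 * phi(x) for x in mids], mids)
```

Fed exact N(0, 1) counts, the fit recovers the normal log-density coefficients (β1 = 0, β2 = −1/2); with real data one would take degree 7 as in (3.6) to let the tails bend away from normality.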
At this point the reader might notice an anomaly: if p0 = 0.93 of the N genes are null, then about (1 − p0) · 6033 = 422 should be nonnull, but only 51 are reported. The trouble is that most of the nonnull genes are located in regions of the z axis where fdr̂(zi) exceeds 0.5, and these cannot be reported without also reporting a bevy of null cases. In other words, the prostate study is underpowered.

The vertical bars in Figure 3 are estimates of the nonnull counts, the histogram we would see if only the nonnull genes provided z-values. In terms of (3.3), (3.7), the nonnull counts "yk^(1)" are

(3.8)  yk^(1) = [1 − fdr̂k]·yk,   fdr̂k = fdr̂(xk),

the estimated fdr value at the center of bin k. Since 1 − fdr̂k approximates the nonnull probability for a gene in bin k, formula (3.8) is an obvious estimate for the expected number of nonnulls.

FIG. 3. Heavy curve is estimated local false discovery rate fdr̂(z) for prostate data. Fifty-one genes, 26 on the right and 25 on the left, have fdr̂(zi) < 0.20. Vertical bars estimate histogram of the nonnull counts (plotted negatively, divided by 50). Most of the nonnull genes will not be reported.

Power diagnostics are obtained from comparisons of fdr̂(z) with the nonnull histogram. High power would be indicated if fdr̂k was small where yk^(1) was large. That obviously is not the case in Figure 3. A simple power diagnostic is

(3.9)  Êfdr^(1) = [Σk=1..K yk^(1)·fdr̂k] / [Σk=1..K yk^(1)],

the expected nonnull fdr. We want Êfdr^(1) to be small, perhaps near 0.2, so that a typical nonnull gene will show up on a list of likely prospects. The prostate data has Êfdr^(1) = 0.68, indicating low power. If the whole study were rerun, we could expect a different list of 50 likely nonnull genes, barely overlapping with the first list. Section 3 of Efron (2006) discusses power calculations for microarray studies, presenting more elaborate power diagnostics.
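The nonnull counts (3.8) and the diagnostic (3.9) take a couple of lines apiece; a sketch with toy numbers (not the prostate fit):

```python
def power_diagnostic(y, fdr_hat):
    """Nonnull counts (3.8) and expected nonnull fdr (3.9), from bin
    counts y_k and estimated fdr values at the bin centers."""
    y1 = [(1.0 - f) * yk for yk, f in zip(y, fdr_hat)]         # (3.8)
    efdr1 = sum(f * c for f, c in zip(fdr_hat, y1)) / sum(y1)  # (3.9)
    return y1, efdr1

# Toy example: one mostly-null bin (fdr 0.9), one mostly-nonnull bin (fdr 0.1)
y1, efdr1 = power_diagnostic([10, 10], [0.9, 0.1])
```

Here y1 works out to [1.0, 9.0] and the diagnostic to 0.18 (up to rounding), a high-power configuration; the prostate data's value of 0.68 signals low power.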
Stripped of technicalities, the idea underlying false discovery rates is appealingly simple, and in fact does not depend on the literal validity of the two-groups model (2.1). Consider the bin zi ∈ [3.1, 3.3] in the prostate data histogram; 17 of the 6033 genes fall into this bin, compared to the expected number 2.68 = p0NΔφ(3.2) of null genes (Δ = 0.2 the bin width), giving

(3.10)  fdr̂ = 2.68/17 = 0.16

as an estimated false discovery rate. (The smoothed estimate in Figure 3 is fdr̂(3.2) = 0.24.) The implication is that only about one-sixth of the 17 are null genes. This conclusion can be sharpened, as in Lehmann and Romano (2005), but (3.10) catches the main idea. Notice that we do not need all the null genes to have the same density f0(z); it is enough to assume that the average null density is f0(z), φ(z) in this case, in order to calculate the numerator 2.68. (This is an advantage of false discovery rate methods, which only control rates, not individual probabilities.) The nonnull density f1(z) in (2.1) plays no role at all since the denominator 17 is an observed quantity. Exchangeability is the key assumption in interpreting (3.10): we expect about 1/6 of the 17 genes to be null, and assign posterior null probability 1/6 to all 17. Nonexchangeability, in the form of differing prior information among the 17, can be incorporated as in (2.15).

Density estimation has a reputation for difficulty, well-deserved in general situations. However, there are good theoretical reasons, presented in Section 6 of Efron (2005), for believing that mixtures of z-values are quite smooth, and that (3.7) will efficiently estimate fdr(z). Independence of the zi's is not required, only that f̂(z) is a reasonably close estimate of f(z).

TABLE 1. Boldface: standard errors of log fdr̂(z) (local fdr) and of log Fdr̂(z) (tail-area), from 250 replications of model (3.11), N = 1500. Parentheses: average from formula (5.9), Efron (2006); fdr is the true value (2.7).
Empirical null results are explained in Section 4.

              Theoretical null            Empirical null
  z    fdr    local  (formula)  tail     local  (formula)  tail
 1.5  0.88    0.05   (0.05)     0.05     0.04   (0.04)     0.10
 2.0  0.69    0.08   (0.09)     0.05     0.09   (0.10)     0.15
 2.5  0.38    0.09   (0.10)     0.05     0.16   (0.16)     0.23
 3.0  0.12    0.08   (0.10)     0.06     0.25   (0.25)     0.32
 3.5  0.03    0.10   (0.13)     0.07     0.38   (0.38)     0.42
 4.0  0.005   0.11   (0.15)     0.10     0.50   (0.51)     0.52

A variety of local fdr estimation methods have been suggested, using parametric, semiparametric, nonparametric and Bayes methods: Pan et al. (2003), Pounds and Morris (2003), Allison et al. (2002), Heller and Qing (2003), Broberg (2005), Aubert et al. (2004), Liao et al. (2004) and Do et al. (2005), all performing reasonably well. The Poisson GLM methodology of locfdr has the advantage of easy implementation with familiar software, and a closed-form error analysis. Estimation efficiency becomes a more serious problem on the "Empirical Null" side of Table 1, where we can no longer trust the theoretical null f0(z) ~ N(0, 1). This is the subject of Section 4.

4. THE EMPIRICAL NULL DISTRIBUTION

Table 1 reports on a small simulation study in which

(3.11)  zi ~ind N(μi, 1),   with μi = 0 with probability 0.9,
                            μi ~ N(3, 1) with probability 0.1,

for i = 1, 2, . . . , N = 1500. The table shows standard deviations for log(fdr̂(z)), (3.7), from 250 simulations of (3.11), and also using a delta-method formula derived in Section 5 of Efron (2006), incorporated in the locfdr algorithm. Rather than (3.6), f(z) was modeled by a seven-parameter natural spline basis, locfdr's default, though this gave nearly the same results as (3.6). Also shown are standard deviations for the corresponding tail-area quantity log(Fdr̂(z)), obtained by substituting F̂(z) = ∫_{−∞}^{z} f̂(z′) dz′ in (2.3). [This is a little less variable than using F̄(z), (2.4).]

The "Theoretical Null" side of the table shows that fdr̂(z) is more variable than Fdr̂(z), but both are more than accurate enough for practical use.
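The "fdr" column of Table 1 follows directly from (2.7): under (3.11) the nonnull z-values are marginally N(3, 2), since μi ~ N(3, 1) receives an additional independent N(0, 1) error. A quick sketch:

```python
import math

def npdf(z, mu=0.0, sd=1.0):
    """Normal density."""
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def true_fdr(z):
    """fdr(z), (2.7), for model (3.11): p0 = 0.9 null N(0,1) against a
    p1 = 0.1 nonnull component with marginal density N(3, 2)."""
    num = 0.9 * npdf(z)
    return num / (num + 0.1 * npdf(z, 3.0, math.sqrt(2.0)))
```

true_fdr reproduces the table's fdr column up to rounding, for example 0.88 at z = 1.5, 0.69 at z = 2.0, 0.12 at z = 3.0 and 0.005 at z = 4.0.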
At z = 3, for example, fdr̂(z) only errs by about 8%, yielding fdr̂(z) ≈ 0.12 ± 0.01. Standard errors are roughly proportional to N^(−1/2), so even reducing N to 250 gives fdr̂(3) ≈ 0.12 ± 0.025, and similarly for other values of z, accurate enough to make pictures like Figure 3 believable.

Empirical Bayes is a bipolar methodology, with alternating episodes of frequentist and Bayesian activity. Frequentists may prefer Fdr̂ [or Fdr̄, (2.5)] to fdr̂ because of connections with classical tail-area hypothesis testing, or because cdfs are more straightforward to estimate than densities, while Bayesians prefer fdr̂ for its more apt a posteriori interpretation. Both, though, combine the Bayesian two-groups model with frequentist estimation methods, and deliver the same basic information.

We have been assuming that f0(z), the null density in (2.1), is known on theoretical grounds, as in (3.1). This leads to false discovery estimates such as fdr̂(z) = p0f0(z)/f̂(z) and Fdr̂(z) = p0F0(z)/F̂(z), where only the denominators need be estimated. Most applications of Benjamini and Hochberg's control algorithm (2.5) make the same assumption (sometimes augmented with permutation calculations, which usually produce only minor corrections to the theoretical null, as discussed in Section 5).

Use of the theoretical null is mandatory in classic one-at-a-time testing, where theory provides the only information available for null behavior. But things change in large-scale simultaneous testing situations: serious defects in the theoretical null may become obvious, while empirical Bayes methods can provide more realistic null distributions.

Figure 4 shows z-value histograms for two additional microarray studies, described more fully in Efron (2006).
These are of the same form as the prostate data: n subjects in two disease categories provide expression levels for N genes; two-sample t-statistics ti comparing the categories are computed for each gene, and then transformed to z-values zi = Φ^(−1)(Fn−2(ti)), as in (3.2). Unlike panel A of Figure 1, however, neither histogram obeys the theoretical N(0, 1) null near z = 0. The BRCA data has a much wider central peak, while the HIV peak is too narrow.

FIG. 4. z-values from two microarray studies. BRCA data (Hedenfalk et al., 2001), comparing seven breast cancer patients having BRCA1 mutation to eight with BRCA2 mutation, N = 3226 genes. HIV data (van't Wout et al., 2003), comparing four HIV+ males with four HIV− males, N = 7680 genes. Theoretical N(0, 1) null (heavy curve) is too narrow for BRCA data, too wide for HIV data. Light curves are empirical nulls: normal densities fit to the central histogram counts.

The lighter curves in Figure 4 are empirical null estimates (Efron, 2004), normal curves fit to the central peak of the z-value histograms. The idea here is simple enough: we make the "zero assumption,"

ZERO ASSUMPTION (4.1). Most of the z-values near 0 come from null genes

(discussed further below), generalize the N(0, 1) theoretical null to N(δ0, σ0²), and estimate (δ0, σ0²) from the histogram counts near z = 0. Locfdr uses two different estimation methods, analytic and geometric, described next.

The two-groups model and the zero assumption suggest that if f0 is normal, f(z) should be well approximated near z = 0 by p0·φδ0,σ0(z), with

(4.2)  φδ0,σ0(z) ≡ (2πσ0²)^(−1/2) exp{−(1/2)[(z − δ0)/σ0]²},

making log f(z) approximately quadratic,

(4.3)  log f(z) ≈ [log p0 − (1/2)(δ0²/σ0² + log(2πσ0²))] + (δ0/σ0²)z − (1/(2σ0²))z².

Figure 5 shows the geometric method in action on the HIV data. The heavy solid curve is log f̂(z), fit from (3.6) using Lindsey's method, as described in Efron and Tibshirani (1996). The beaded curve shows the best quadratic approximation to log f̂(z) near 0. Matching its coefficients (β̂0, β̂1, β̂2) to (4.3) yields estimates (δ̂0, σ̂0, p̂0), for instance σ̂0 = (−2β̂2)^(−1/2), giving

(4.4)  δ̂0 = −0.107,  σ̂0 = 0.753,  p̂0 = 0.931

for the HIV data. Trying the same method with the theoretical null, that is, taking (δ0, σ0) = (0, 1) in (4.3), gives a very poor fit, and p̂0 equals the impossible value 1.20.

FIG. 5. Geometric estimate of null proportion p0 and empirical null mean and standard deviation (δ0, σ0) for the HIV data. Heavy curve is log f̂(z), estimated as in (3.3)–(3.6); beaded curve is best quadratic approximation to log f̂(z) near z = 0.

The analytic method makes more explicit use of the zero assumption, stipulating that the nonnull density f1(z) in the two-groups model (2.1) is supported outside some given interval [a, b] containing zero (actually chosen by preliminary calculations). Let N0 be the number of zi in [a, b], and define

(4.5)  P0(δ0, σ0) = Φ((b − δ0)/σ0) − Φ((a − δ0)/σ0)   and   θ = p0P0.

Then the likelihood function for z0, the vector of N0 z-values in [a, b], is

(4.6)  fδ0,σ0,p0(z0) = [θ^N0 (1 − θ)^(N−N0)] · Π_{zi∈z0} [φδ0,σ0(zi)/P0(δ0, σ0)].

This is the product of two exponential family likelihoods, which is numerically easy to solve for the maximum likelihood estimates (δ̂0, σ̂0, p̂0), equaling (−0.120, 0.787, 0.956) for the HIV data.

Both methods are implemented in locfdr. The analytic method is somewhat more stable but can be more biased than geometric fitting. Efron (2004) shows that geometric fitting gives nearly unbiased estimates of δ0 and σ0 for p0 ≥ 0.90. Table 2 shows how the two methods fared in the simulation study of Table 1.

TABLE 2. Comparison of estimates (δ̂0, σ̂0, p̂0), simulation study of Table 1. "Formula" is average from delta-method standard deviation formulas, Section 5 in Efron (2006), as implemented in locfdr.

             Geometric                     Analytic
        mean  stdev  (formula)        mean  stdev  (formula)
 δ̂0:    0.02  0.056  (0.062)          0.04  0.031  (0.032)
 σ̂0:    1.02  0.029  (0.033)          1.04  0.031  (0.031)
 p̂0:    0.92  0.013  (0.015)          0.93  0.009  (0.011)

A healthy literature has sprung up on the estimation of p0, as in Pawitan et al. (2005) and Langaas et al.
(2005), all of which assume the validity of the theoretical null. The zero assumption plays a central role in this literature [which mostly works with two-sided p-values rather than z-values, e.g., pi = 2(1 − F100(|ti|)) in (3.2), making the "zero region" occur near p = 1]. The two-groups model is unidentifiable if f0 is unspecified in (2.1), since we can redefine f0 as (f0 + cf1)/(1 + c), p0 as (1 + c)p0 and p1 as p1 − cp0, for any c ≤ p1/p0. With p1 small, (2.2), and f1 supposed to yield zi's far from 0 for the most part, the zero assumption is a reasonable way to impose identifiability on the two-groups model. Section 6 considers the meaning of the null density more carefully, among other things explaining the upward bias of p̂0 seen in Table 2.

The empirical null is an expensive luxury from the point of view of estimation efficiency. Comparing the right-hand side of Table 1 with the left reveals factors of 2 or 3 increase in standard error relative to the theoretical null, near the crucial point where fdr(z) = 0.2. Section 4 of Efron (2005) pins the increased variability entirely on the estimation of (δ0, σ0); even knowing the true values of p0 and f(z) would reduce the standard error of log fdr̂(z) by less than 1%. (Using tail-area Fdr's rather than local fdr's does not help—here the local version is less variable.)

TABLE 3. Number of genes identified as true discoveries by two-sided Benjamini–Hochberg procedure, 0.10 control level. Empirical null densities as in Figure 4.

               Theoretical null    Empirical null
 BRCA data:         107                   0
 HIV data:           22                 180
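The geometric method reduces to quadratic regression. The sketch below (a simplified stand-in for locfdr's version, fitting log counts on 1, z, z² over a symmetric central interval and inverting (4.3)) recovers (δ0, σ0, p0) exactly when run on synthetic counts generated from a known normal null; the generating values (−0.107, 0.753, 0.931) are borrowed from the HIV estimates (4.4) purely as a check:

```python
import math

def central_matching(counts, midpoints, n_total, width):
    """Estimate (delta0, sigma0, p0) by a least-squares quadratic fit of
    log(counts) versus z, inverted through (4.3). Assumes midpoints
    symmetric about 0, all counts positive, and counts on the scale
    nu_k = n_total * width * f(x_k)."""
    y = [math.log(c) for c in counts]
    n, S2 = len(midpoints), sum(x * x for x in midpoints)
    S4 = sum(x ** 4 for x in midpoints)
    # symmetric design: odd and even polynomial coefficients decouple
    b1 = sum(x * yi for x, yi in zip(midpoints, y)) / S2
    det = n * S4 - S2 * S2
    Sy, Sx2y = sum(y), sum(x * x * yi for x, yi in zip(midpoints, y))
    b0 = (S4 * Sy - S2 * Sx2y) / det
    b2 = (n * Sx2y - S2 * Sy) / det
    sigma0 = 1.0 / math.sqrt(-2.0 * b2)   # (4.3): b2 = -1/(2 sigma0^2)
    delta0 = b1 * sigma0 ** 2             # (4.3): b1 = delta0/sigma0^2
    p0 = (math.exp(b0 + delta0 ** 2 / (2 * sigma0 ** 2))
          * sigma0 * math.sqrt(2 * math.pi) / (n_total * width))
    return delta0, sigma0, p0

# Synthetic check: exact counts from a pure N(-0.107, 0.753^2) null, p0 = 0.931
phi = lambda z, d, s: (math.exp(-0.5 * ((z - d) / s) ** 2)
                       / (s * math.sqrt(2 * math.pi)))
mids = [-1.0 + 0.1 * k for k in range(21)]
counts = [0.931 * 7680 * 0.1 * phi(x, -0.107, 0.753) for x in mids]
d0, s0, p0_hat = central_matching(counts, mids, n_total=7680, width=0.1)
```

On real data the central counts are noisy and contaminated by nonnull genes, which is why locfdr restricts the fit to a narrow zero region and why p̂0 picks up the upward bias noted above.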
The reason for considering empirical nulls is that the theoretical N(0, 1) null does not seem to fit the data in situations like Figure 4. For the BRCA data we can see that the histogram is overdispersed compared to N(0, 1) around z = 0; the implication is that there will be more null counts far from zero than the theoretical null predicts, making N(0, 1) false discovery rate calculations like (3.10) too optimistic. The opposite happens with the HIV data.

There is a lot at stake here for both Bayesians and frequentists. Table 3 shows the number of gene discoveries identified by the standard Benjamini–Hochberg two-sided Fdr procedure, q = 0.10 in (2.5). The HIV results are much more dramatic using the empirical null f0(z) ∼ N(−0.11, 0.75²), and in fact we will see in the next section that σ0 = 0.75 is quite believable in this case. The BRCA data has been used in the microarray literature to compare analysis techniques, under the presumption that better techniques will produce more discoveries; recently, for instance, in Storey et al. (2005) and Pawitan et al. (2005). Table 3 suggests caution in this interpretation, where using the empirical null negates any discoveries at all.

The z-values in panel C of Figure 1, the proteomics data, were calculated from standard Cox likelihood tests that should yield N(0, 1) null results asymptotically. An N(−0.02, 1.29²) empirical null was obtained from the analytic method, resulting in only one peak with fdr̂ < 0.2; using the theoretical null gave six such peaks.

In panel B of Figure 1, the z-values were obtained from familiar binomial calculations, each zi being calculated as

(4.7)  z = (p̂_ad − p̂_dis − Δ) · [p̂_ad(1 − p̂_ad)/n_ad + p̂_dis(1 − p̂_dis)/n_dis]^{−1/2},

where n_ad was the number of advantaged students in the high school, p̂_ad the proportion passing the test, and likewise n_dis and p̂_dis for the disadvantaged students; Δ = median(p̂_ad) − median(p̂_dis) = 0.192 was the overall difference.
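For concreteness, (4.7) amounts to the following computation per school. Only Δ = 0.192 comes from the text; the school counts in the usage line are hypothetical.

```python
import numpy as np

def education_z(pass_ad, n_ad, pass_dis, n_dis, delta=0.192):
    """z-value for one school, following (4.7).

    pass_ad/n_ad: passing count and total for advantaged students;
    pass_dis/n_dis: the same for disadvantaged students; delta is the
    overall difference median(p_ad) - median(p_dis) across all schools.
    """
    p_ad = pass_ad / n_ad
    p_dis = pass_dis / n_dis
    se = np.sqrt(p_ad * (1 - p_ad) / n_ad + p_dis * (1 - p_dis) / n_dis)
    return (p_ad - p_dis - delta) / se

# hypothetical school: 120 of 200 advantaged and 30 of 100 disadvantaged pass
z = education_z(120, 200, 30, 100)
```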
Here the empirical null standard deviation σ̂0 equals 1.52, half again bigger than the theoretical standard deviation we would use if we had only one school's data. An empirical null fdr analysis yielded 75 schools with fdr̂ < 0.20, 30 on the left and 45 on the right. Example B is discussed a bit further in the next two sections, where its use in the two-groups model is questioned.

My point here is not that the empirical null is always the correct choice. The opposite advice, always use the theoretical null, has been inculcated by a century of classic one-case-at-a-time testing to the point where it is almost subliminal, but it exposes the statistician to obvious criticism in situations like the BRCA and HIV data. Large-scale simultaneous testing produces mass information of a Bayesian nature that impinges on individual decisions. The two-groups model helps bring this information to bear, after one decides on the proper choice of f0 in (2.1). Section 5 discusses this choice, in the form of a list of reasons why the theoretical null, and its close friend the permutation null, might go astray.

5. THEORETICAL, PERMUTATION AND EMPIRICAL NULL DISTRIBUTIONS

Like most statisticians, I have spent my professional life happily testing hypotheses against theoretical null distributions. It came as somewhat of a shock, then, when pictures like Figure 4 suggested that the theoretical null might be more theoretical than I had supposed. Once suspicious, it becomes easy to think of reasons why f0(z), the crucial element in the two-groups model (2.1), might not obey classical guidelines. This section presents four reasons why the theoretical null might fail, and also gives me a chance to say something about the strengths and weaknesses of permutation null distributions.

REASON 1 (Failed mathematical assumptions). The usual derivation of the null hypothesis distribution for a two-sample t-statistic assumes independent and identically distributed (i.i.d.) normal components.
For the BRCA data of Figure 4, direct inspection of the 3226 × 15 matrix "X" of expression values reveals markedly nonnormal components, skewed to the right (even after the columns of X have been standardized to mean 0 and standard deviation 1, as in all my examples here). Is this causing the failure of the N(0, 1) theoretical null?

Permutation techniques offer quick relief from such concerns. The columns of X are randomly permuted, giving a matrix X* with corresponding t-values ti* and z-values zi* = Φ^{−1}(F_{n−2}(ti*)). This is done some large number of times, perhaps 100, and the empirical distribution of the 100 · N zi*'s is used as a permutation null. The well-known SAM algorithm (Tusher, Tibshirani and Chu, 2001) effectively employs the permutation null cdf in the numerator of the Fdr formula (2.3). Applied to the BRCA matrix, the permutation null came out nearly N(0, 1) (as did simply simulating the entries of X* by independent draws from all 3226 · 15 entries of X), so nonnormal distributions were not the cause of BRCA's overwide histogram. In practice the permutation null usually approximates the theoretical null closely, as a long history of research on the permutation t-test demonstrates; see Section 5.9 of Lehmann and Romano (2005).

REASON 2 (Unobserved covariates). The BRCA study is observational rather than experimental—the 15 women were observed to be BRCA1 or BRCA2, not assigned—and likewise with the HIV and prostate studies. There are likely to be covariates—age, race, general health—that affect the microarray expression levels differently for different genes. If these were known to us, they could be factored out using a separate linear model on each gene's data, providing a new and improved zi obtained from the "Treatment" coefficient in the model. This would reduce the spread of the z-value histogram, perhaps even restoring the N(0, 1) theoretical null for the BRCA data.
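The column-permutation recipe described under Reason 1 can be sketched as follows. This is only an illustration, not the SAM algorithm; the function name and its defaults are mine.

```python
import numpy as np
from scipy import stats

def permutation_null(X, n1, n_perm=100, seed=0):
    """Permutation null z-values for a two-sample microarray comparison.

    X: N-genes x n-arrays expression matrix, with the first n1 columns in
    group 1. The columns are permuted n_perm times; each permuted matrix
    gives one t-value per gene, mapped to a z-value via
    z* = Phi^{-1}(F_{n-2}(t*)). Returns the pooled n_perm * N z*'s.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    z_star = []
    for _ in range(n_perm):
        Xp = X[:, rng.permutation(n)]
        t = stats.ttest_ind(Xp[:, :n1], Xp[:, n1:], axis=1).statistic
        z_star.append(stats.norm.ppf(stats.t.cdf(t, df=n - 2)))
    return np.concatenate(z_star)
```

Run on a matrix of standardized nonnormal (e.g., exponential) entries, the pooled z*'s come out close to N(0, 1), mirroring what happened with the BRCA matrix.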
Unobserved covariates act to broaden the null distribution f0(z). They also broaden the nonnull distribution f1(z) in (2.1), and the mixture density f(z), but this does not correct fdr estimates like (3.10), whose numerator still depends on the assumed null f0. Section 4 of Efron (2004) provides an analysis of a simplified model with unobserved covariates. Permutation techniques cannot recognize unobserved covariates, as the model demonstrates.

REASON 3 (Correlation across arrays). False discovery rate methodology does not require independence among the test statistics zi. However, the theoretical null distribution does require independence of the expression values used to calculate each zi; in terms of the elements xij of the expression matrix X, for gene i we need independence among xi1, xi2, . . . , xin in order to validate (1.1). Experimental difficulties can undercut across-microarray independence, while remaining undetectable in a permutation analysis. This happened in both studies of Figure 4 (Efron, 2004, 2006). The BRCA data showed strong positive correlations among the first four BRCA2 arrays, and also among the last four. This reduces the effective degrees of freedom for each t-statistic below the nominal 13, making ti and zi = Φ^{−1}(F13(ti)) overdispersed.

REASON 4 (Correlation across genes). Benjamini and Hochberg's 1995 paper verified Fdr control for rule (2.5) under the assumption of independence among the N z-values (relaxed a little in Benjamini and Yekutieli, 2001). This seems fatal for microarray applications, since we expect genes to be correlated in their actions.
A great virtue of the empirical Bayes/two-groups approach is that independence is not necessary; with Fdr̂(z) = p0F0(z)/F̂(z), for instance, Fdr̂(z) can provide a reasonable estimate of Pr{null|Z ≤ z} as long as F̂(z) is roughly unbiased for F(z)—in formal terms requiring consistency but not independence—and likewise for the local version fdr̂(z) = p0f0(z)/f̂(z), (3.7). There is, however, a black cloud inside the silver lining: the assumption that the null density f0(z) is known to the statistician. The empirical null estimation methods of Section 4 do not require z-value independence, and so disperse the black cloud, at the expense of increased variability in fdr̂ estimates.

Do we really need to use an empirical null? Efron (2007) discusses the following somewhat disconcerting result: even if the theoretical null distribution zi ∼ N(0, 1) holds exactly true for all null genes, with Reasons 1–3 above not causing trouble, correlation among the zi's can make the overall null distribution effectively much wider or much narrower than N(0, 1).

Microarray data sets tend to have substantial z-value correlations. Consider the BRCA data: there are more than five million correlations ρij between pairs of gene z-values zi and zj; by examining the row-wise correlations in the X matrix we can estimate that the distribution of the ρij's has approximately mean 0 and variance α² = 0.153²,

(5.1)  ρ ∼ (0, α²).

(The zero mean is a consequence of standardizing the columns of X.) This is a lot of correlation—as much as if the BRCA genes occurred in 10 independent groups, but with common intraclass correlation 0.50 for all genes within a group. Section 3 of Efron (2006) shows that under assumptions (1.1)–(5.1), the ensemble of null-gene z-values will behave roughly as

(5.2)  zi ∼ N(0, σ0²), approximately,

with

(5.3)  σ0² = 1 + √2·A,  A ∼ (0, α²).
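The dispersion effect in (5.2)–(5.3) is easy to demonstrate by simulation: correlated null z-values, run through a standard Benjamini–Hochberg analysis, show the realized false discovery proportion tracking the half-width of the central z-value peak. The sketch below is my own design, not Efron's exact simulation: it mimics α ≈ 0.153 with ten equicorrelated blocks of nulls (intraclass correlation 0.5, echoing the analogy above) plus a hypothetical signal component.

```python
import numpy as np
from scipy import stats

def one_sim(rng, n_null=2700, n_sig=300, q=0.10):
    """One simulation: correlated N(0,1) nulls plus N(2.5, 1.5) signals.

    Nulls sit in 10 equicorrelated blocks (intraclass correlation 0.5),
    a rough stand-in for BRCA-level correlation alpha ~= 0.153. Returns
    (sigma0_hat, Fdp) for a one-sided Benjamini-Hochberg analysis at
    level q using the theoretical N(0,1) null.
    """
    blocks = rng.standard_normal(10).repeat(n_null // 10)
    nulls = np.sqrt(0.5) * blocks + np.sqrt(0.5) * rng.standard_normal(n_null)
    signals = 2.5 + np.sqrt(1.5) * rng.standard_normal(n_sig)
    z = np.concatenate([nulls, signals])
    is_null = np.arange(z.size) < n_null

    # half-width estimate of the central peak: half the distance
    # between the 16th and 84th percentiles
    lo, hi = np.percentile(z, [16, 84])
    sigma0_hat = (hi - lo) / 2

    # Benjamini-Hochberg on one-sided p-values from the theoretical null
    p = stats.norm.sf(z)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, z.size + 1) / z.size
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = order[:k]
    fdp = is_null[rejected].mean() if k > 0 else 0.0
    return sigma0_hat, fdp
```

Across repeated simulations the Fdp averages near q, but it rises and falls with sigma0_hat, the pattern discussed around Figure 6 below.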
If the variable A equaled α = 0.153, for instance, giving σ0 = 1.10, then the expected number of null counts below z = −3 would be about p0·N·Φ(−3/1.10) rather than p0·N·Φ(−3), more than twice as many. There is even more correlation in the HIV data, α ≈ 0.42, enough so that a moderately negative value of A could cause σ0 = 0.75, as in Figure 4. The random variable A acts like an observable ancillary in the two-groups situation—observable because we can estimate σ0 from the central counts of the z-value histogram, as in Section 4; σ̂0 is essentially the half-width of the central peak.

Figure 6 is a cautionary story on the dangers of ignoring σ̂0. A simulation model with

(5.4)  zi ∼ N(0, 1) for i = 1, 2, . . . , 2700  and  zi ∼ N(2.5, 1.5) for i = 2701, . . . , 3000,

was run, in which the null zi's, the first 2700, were correlated to the same degree as in the BRCA data, α = 0.153. For each of 1000 simulations of (5.4), a standard Benjamini–Hochberg Fdr analysis (2.5) (i.e., using the theoretical null for F0) was run at control level q = 0.10 and used to identify a set of nonnull genes. Each of the thousand points in Figure 6 is (σ̂0, Fdp), where σ̂0 is half the distance between the 16th and 84th percentiles of the 3000 zi's, and Fdp is the "false discovery proportion," the proportion of identified genes that were actually null. Fdp averaged 0.091, close to the target value q = 0.10, but with a strong dependence on σ̂0: the lowest 5% of σ̂0's corresponded to Fdp's averaging only 0.03, while the upper 5% average was 0.29, a factor of 9 difference.

The point here is not that the claimed q-value 0.10 is wrong, but that in any one simulation we may be able to see, from σ̂0, that it is probably misleading. Using the empirical null counteracts this fallacy which, again, is not apparent from the permutation null. (Section 4 of Efron, 2007, discusses more elaborate permutation methods that do bear on Figure 6.
See Qiu et al., 2005, for a gloomier assessment of correlation effects in microarray analyses.)

FIG. 6. Benjamini–Hochberg Fdr control procedure (2.5), q = 0.1, run for 1000 simulations of the correlated model (5.4); true false discovery proportion Fdp plotted versus half-width estimate σ̂0. Overall Fdp averaged 0.091, close to q, but with a strong dependence on σ̂0.

What is causing the overdispersion in the Education data of panel B, (4.7)? Correlation across schools, Reason 4, seems ruled out by the nature of the sampling, leaving Reasons 2 and 3 as likely candidates; unobserved covariates are an obvious threat here, while within-school sampling dependences (Reason 3) are certainly possible. Fdr analysis yields eight times as many "significant" schools based on the theoretical null rather than f0 ∼ N(−0.35, 1.51²), but the theoretical null looks completely untrustworthy to me.

Sometimes the theoretical null distribution is fine, of course. The prostate data had (δ̂0, σ̂0) = (0.00, 1.06) according to the analytic method of (4.6), close enough to (0, 1) to make theoretical null calculations believable. However, there are lots of things that can go wrong with the theoretical null, and lots of data to check it with in large-scale testing situations, making it a matter of due diligence for the statistician to do such checking, even if only by visual inspection of the z-value histogram. All simultaneous testing procedures, not just false discovery rates, go wrong if the null distribution is misrepresented.

6. A ONE-GROUP MODEL

Classical one-at-a-time hypothesis testing depends on having a unique null density f0(z), such as Student's t distribution for the normal two-sample situation. The assumption of a unique f0 has been carried over into most of the microarray testing literature, including our definition (2.1) of the two-groups model.
Realistic examples of large-scale inference are apt to be less clear-cut, with true effect sizes ranging continuously from zero or near zero to very large. Here we consider a "one-group" structural model that allows for a range of effects. We can still usefully apply fdr methods to data from one-group models; doing so helps clarify the choice between theoretical and empirical null hypotheses, and explicates the biases inherent in model (2.1).

The discussion in this section, as in Section 2, will be mostly theoretical, involving probability models rather than collections of observed z-values. Model (2.1) does not require knowing how the z-values were generated, a substantial practical advantage of the two-groups formulation. In contrast, one-group analysis begins with a specific Bayesian structural model. We assume that the ith case has an unobserved true value μi distributed according to some density g(μ), and that the observed zi is normally distributed around μi,

(6.1)  μ ∼ g(·)  and  z|μ ∼ N(μ, 1).

The density g(μ) is allowed to have discrete atoms. It might have an atom at zero, but this is not required, and in any case there is no a priori partition of g(μ) into null and nonnull components.

As an example, suppose g(μ) is a mixture of 90% N(0, 0.5²) and 10% N(2.5, 0.5²),

(6.2)  g(μ) = 0.9 · φ_{0,0.5}(μ) + 0.1 · φ_{2.5,0.5}(μ)

in notation (4.2). The histogram in Figure 7 shows N = 3000 draws of μi from (6.2). I am thinking of this as a situation having a large proportion of uninteresting cases centered near, but not exactly at, zero, and a small proportion of interesting cases centered far to the right. We still want to use the observed zi's from (6.2) to flag cases that are likely to be interesting. The density of z in model (6.1) is

(6.3)  f(z) = ∫_{−∞}^{∞} φ(z − μ) g(μ) dμ,  φ(x) = exp(−x²/2)/√(2π),

shown as the smooth curve in the left-hand panel of Figure 7,

(6.4)  f(z) = 0.9 · φ_{0,1.12}(z) + 0.1 · φ_{2.5,1.12}(z).
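The convolution step from (6.2) to (6.4) is easy to verify numerically: blurring each N(m, 0.5²) prior component with N(0, 1) noise gives an N(m, 1.25) component, sd √1.25 ≈ 1.12. A quick sketch (the integration grid is an arbitrary choice):

```python
import numpy as np

def phi(x, m=0.0, s=1.0):
    """Normal density phi_{m,s}(x), as in notation (4.2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def g(mu):
    """Prior (6.2): 90% N(0, 0.5^2) plus 10% N(2.5, 0.5^2)."""
    return 0.9 * phi(mu, 0.0, 0.5) + 0.1 * phi(mu, 2.5, 0.5)

def f_numeric(z, grid=np.linspace(-8, 10, 4001)):
    """Marginal density (6.3) by numerical convolution of g with N(0,1)."""
    dz = grid[1] - grid[0]
    return np.array([np.sum(phi(zi - grid) * g(grid)) * dz
                     for zi in np.atleast_1d(z)])

def f_closed(z):
    """Closed form (6.4): each prior component blurred to sd sqrt(1.25)."""
    s = np.sqrt(1.25)   # = 1.118..., the 1.12 of (6.4)
    return 0.9 * phi(z, 0.0, s) + 0.1 * phi(z, 2.5, s)
```

The two versions agree to high accuracy over the relevant z range.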
The effect of noise in going from μi to zi ∼ N(μi, 1) has blurred the strongly bimodal μ-histogram into a smoothly unimodal f(z). We can still employ the tactic of Figure 5, fitting a quadratic curve to log f(z) around z = 0 to estimate p0 and the empirical null density f0(z). Using the formulas described later in this section gives

(6.5)  p0 = 0.93  and  f0(z) ∼ N(0.02, 1.14²),

with corresponding fdr curve p0 f0(z)/f(z), labeled "Emp null" in the right-hand panel of Figure 7. Looking at the histogram, it is reasonable to consider as "interesting" those cases with μi ≥ 1.5, and as "uninteresting" those with μi < 1.5. The curve labeled "Bayes" in Figure 7 is the posterior probability Pr{uninteresting|z} based on full knowledge of (6.1), (6.2). The empirical null fdr curve provides an excellent estimate of the full Bayes result, without the prior knowledge. [An fdr based on the theoretical N(0, 1) null is seen to be far off.]

FIG. 7. Left panel: histogram shows N = 3000 draws of μi from model (6.2); the smooth curve is the corresponding density f(z), (6.3). Right panel: "Emp null" is fdr(z) based on the empirical null; it closely matches the full Bayes posterior probability "Bayes" = Pr{μi < 1.5|z} from (6.1)–(6.2); "Theo null" is fdr(z) based on the theoretical null, a poor match to Bayes.

Unobserved covariates, Reason 2 in Section 5, can easily produce blurry null hypotheses like that in (6.2). My point here is that the two-groups model will handle blurry situations if the null hypothesis is empirically estimated. Or, to put things negatively, theoretical or permutation null methods are prone to error in such situations, no matter what kind of analysis technique is used.

Comparing (6.5) with (6.4) shows that f0(z) is just about right, but p0 is substantially larger than the value 0.90 we might expect. The φ_{2.5,0.5} component of g(μ) puts some of its z-values near zero, weakening the zero assumption (4.1) and biasing p0 upward.
The same thing happened in Table 2, even though model (3.11) is "unblurred," g(μ) having a point mass at μ = 0. Fortunately, p0 is the least important part of the two-groups model for estimating fdr(z), under assumption (2.2). "Bias" can be a misleading term in model (6.1), since it presupposes that each μi is clearly defined as null or nonnull. This seems clear enough in (3.11). The null/nonnull distinction is less clear in (6.2), though it still makes sense to search for cases that have μi unusually far from 0.

The results in (6.5) come from a theoretical analysis of model (6.1). The idea in what follows is to generalize the construction in Figure 5 by approximating ℓ(z) = log f(z) with Taylor series other than quadratic. The Jth Taylor approximation to ℓ(z) is

(6.6)  ℓJ(z) = Σ_{j=0}^{J} ℓ^{(j)}(0) z^j / j!,

where ℓ^{(0)}(0) = log f(0) and, for j ≥ 1,

(6.7)  ℓ^{(j)}(0) = d^j log f(z)/dz^j evaluated at z = 0.

Let f0+(z) indicate the subdensity p0 f0(z), the numerator of fdr(z) in (2.7). The choice

(6.8)  f0+(z) = e^{ℓJ(z)}

matches f(z) at z = 0 (a convenient form of the zero assumption) and leads to the fdr expression

(6.9)  fdr(z) = e^{ℓJ(z)}/f(z).

Larger choices of J match f0+(z) more accurately to f(z), increasing ratio (6.9); the interesting z-values, those with small fdr's, are pushed farther away from zero as we allow more of the data structure to be explained by the null density.

Bayesian model (6.1) provides a helpful interpretation of the derivatives ℓ^{(j)}(0):

LEMMA. The derivative ℓ^{(j)}(0), (6.7), is the jth cumulant of the posterior distribution of μ given z = 0, except that ℓ^{(2)}(0) is the second cumulant minus 1. Thus

(6.10)  ℓ^{(1)}(0) = E0  and  −ℓ^{(2)}(0) = 1 − V0 ≡ V̄0,

where E0 and V0 are the posterior mean and variance of μ given z = 0.
TABLE 4
Expressions for p0, f0 and fdr for the first three choices of J in (6.8), (6.9); V̄0 = 1 − V0; J = 0 gives the theoretical null, J = 2 the empirical null; f(z) from (6.3)

J    p0                                  f0(z)                  fdr(z)
0    √(2π) f(0)                          N(0, 1)                f(0) e^{−z²/2}/f(z)
1    √(2π) f(0) e^{E0²/2}                N(E0, 1)               f(0) e^{E0 z − z²/2}/f(z)
2    √(2π/V̄0) f(0) e^{E0²/(2V̄0)}       N(E0/V̄0, 1/V̄0)       f(0) e^{E0 z − V̄0 z²/2}/f(z)

Proof of the lemma appears in Section 7 of Efron (2005). For J = 0, 1, 2, formulas (6.8), (6.9) yield simple expressions for p0 and f0(z) in terms of f(0), E0 and V̄0. These are summarized in Table 4, with p0 obtained from

(6.11)  p0 = ∫_{−∞}^{∞} f0+(z) dz.

Formulas are also available for Fdr(z), (2.8). The choices J = 0, 1, 2 in Table 4 result in a normal null density f0(z), the only difference being the means and variances. Going to J = 3 allows for an asymmetric choice of f0(z),

(6.12)  fdr(z) = (f(0)/f(z)) e^{E0 z − V̄0 z²/2 + S0 z³/6},

where S0 is the posterior third central moment of μ given z = 0 in model (6.1). The program locfdr uses a variant, the "split normal," to model asymmetric null densities, with the exponent of (6.12) replaced by a quadratic spline in z.

The lemma bears on the difference between empirical and theoretical nulls. Suppose that the probability mass of g(μ) occurring within a few units of the origin is concentrated in an atom at μ = 0. Then the posterior mean and variance (E0, V0) of μ given z = 0 will be near 0, making (E0, V̄0) ≈ (0, 1). In this case the empirical null (J = 2) will approximate the theoretical null (J = 0). Otherwise the two nulls differ; in particular, any mass of g(μ) near zero increases V0, swelling the standard deviation (1 − V0)^{−1/2} of the empirical null.

The two-groups model (2.1), (2.2) puts one in a hypothesis-testing frame of mind: a large group of uninteresting cases is to be statistically separated from a small interesting group. Even blurry situations like (6.2) exhibit a clear grouping, as in Figure 7.
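For the blurred prior (6.2), the Table 4 quantities can be computed in closed form by normal conjugacy, and the J = 2 row reproduces (6.5). A sketch (the function name and argument layout are mine):

```python
import numpy as np

SQRT2PI = np.sqrt(2 * np.pi)

def norm_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * SQRT2PI)

def empirical_null_params(weights, means, sd=0.5):
    """E0, V0 and the J = 2 row of Table 4 for a normal-mixture prior.

    Prior as in (6.2): g(mu) = sum_k weights[k] * phi_{means[k], sd}(mu),
    with z|mu ~ N(mu, 1). Conjugacy makes the posterior of mu given z = 0
    a normal mixture, from which E0 and V0 follow directly.
    """
    s2 = sd ** 2
    post_v = s2 / (1 + s2)                 # posterior variance, each component
    post_m = np.array(means) / (1 + s2)    # posterior means at z = 0
    # reweight components by their marginal density at z = 0 (variance 1+s2)
    w = np.array(weights) * norm_pdf(0.0, np.array(means), np.sqrt(1 + s2))
    f_at_0 = w.sum()                       # the mixture density f(0)
    w = w / f_at_0
    E0 = np.sum(w * post_m)
    V0 = post_v + np.sum(w * post_m ** 2) - E0 ** 2
    Vbar0 = 1 - V0
    # Table 4, J = 2: f0 = N(E0/Vbar0, 1/Vbar0),
    # p0 = sqrt(2*pi/Vbar0) * f(0) * exp(E0^2 / (2*Vbar0))
    null_mean = E0 / Vbar0
    null_sd = np.sqrt(1 / Vbar0)
    p0 = np.sqrt(2 * np.pi / Vbar0) * f_at_0 * np.exp(E0 ** 2 / (2 * Vbar0))
    return E0, V0, null_mean, null_sd, p0
```

With weights (0.9, 0.1) and means (0, 2.5), this returns an empirical null close to N(0.02, 1.14²) and p0 close to 0.93, the values quoted in (6.5).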
None of this is necessary for the one-group model (6.1). We might, for example, suppose that g(μ) is normal,

(6.13)  μ ∼ N(A, B²),

and proceed in an empirical Bayes way to estimate A and B, then apply Bayes estimation to the individual cases. This line of thought leads directly to James–Stein estimation (Efron and Morris, 1975). Estimation, as opposed to testing, is the key word here—with possible effect sizes μi varying continuously rather than having a large clump of values near zero. The Education data of panel B, Figure 1, could reasonably be analyzed this way, instead of through simultaneous testing. Scientific context, which says that there is likely to be a large group of (nearly) unaffected genes, as in (2.2), is what makes the two-groups model a reasonable Bayes prior for microarray studies.

7. BAYESIAN AND FREQUENTIST CONFIDENCE STATEMENTS

False discovery rate methods provide a happy marriage between Bayesian and frequentist approaches to multiple testing, as shown in Section 2. Empirical Bayes techniques based on the two-groups model seem to give us the best of both statistical philosophies. Things do not always work out so peaceably; in these next two sections I want to discuss contentious situations where the divorce court looms as a possibility.

An insightful and ingenious paper by Benjamini and Yekutieli (2005) discusses the following problem in simultaneous significance testing: having applied false discovery rate methods to select a set of nonnull cases, how can confidence intervals be assigned to the true effect size for each selected case? (The paper and the ensuing discussion are much more general, but this is all I need for the illustration here.)
Figure 8 concerns Benjamini and Yekutieli's solution applied to the following simulated data set: N = 10,000 (μi, zi) pairs were generated as in (6.1), with 90% of the μi zero, the null cases, and 10% distributed N(−3, 1),

(7.1)  g(μ) = 0.90 · δ0(μ) + 0.10 · φ_{−3,1}(μ),

δ0(μ) a delta function at μ = 0. The Fdr procedure (2.5) was applied with q = 0.05, yielding 566 nonnull "discoveries," those having zi ≤ −2.77. The Benjamini–Yekutieli "false coverage rate" (FCR) control procedure provides upper and lower bounds for the true effect size μi corresponding to each zi less than −2.77; these are indicated by heavy diagonal lines in Figure 8, constructed as described in BY's Definition 1. This construction guarantees that the expected proportion of the 566 intervals not containing the true μi, the false coverage rate, is bounded by q = 0.05.

FIG. 8. Benjamini–Yekutieli FCR controlling intervals applied to a simulated sample of 10,000 cases from (6.1), (7.1); 566 cases have zi ≤ z0 = −2.77, the Fdr (0.05) threshold. Plotted points are (zi, μi) for the 1000 nonnull cases; 14 null cases with zi ≤ z0 are indicated by "+." Heavy diagonal lines indicate the FCR 95% interval limits; light lines are Bayes 95% posterior intervals given μi ≠ 0. Beaded curve at top is fdr(zi), the posterior probability that μi = 0.

In a real application only the zi's and their BY confidence intervals could be seen, but in a simulation we can plot the actual (zi, μi) pairs and compare them to the intervals. Figure 8 plots (zi, μi) for the 1000 nonnull cases, those from μi ∼ N(−3, 1) in (7.1). Of these, 552, plotted as heavy points, lie to the left of z0 = −2.77, the Fdr threshold, with the other 448 plotted as light points; 14 null cases, μi = 0, plotted as "+," also had zi < z0. The first thing to notice is that the FCR property is satisfied: only 17 of the 566 intervals have failed to contain μi (14 of these the +'s), giving 3% noncoverage.
The second thing, though, is that the intervals are frighteningly wide—zi ± 2.77, about √2 longer than the usual individual 95% intervals zi ± 1.96—and poorly centered, particularly at the left, where all the μi's fall in their intervals' upper halves. An interesting comparison is with Bayes' rule applied to (6.1), (7.1), which yields

(7.2)  Pr{μ = 0|zi} = fdr(zi),

where

(7.3)  fdr(z) = 0.9 · φ_{0,1}(z) · [0.9 · φ_{0,1}(z) + 0.1 · φ_{−3,√2}(z)]^{−1}

as in (2.7), and

(7.4)  g(μi | μi ≠ 0, zi) ∼ N((zi − 3)/2, 1/2).

That is, μi is null with probability fdr(zi), and N((zi − 3)/2, 1/2) with probability 1 − fdr(zi). The dashed lines indicate the posterior 95% intervals given that μi is nonnull, (zi − 3)/2 ± 1.96/√2, now √2 shorter than the usual individual intervals; at the top of Figure 8 the beaded curve shows fdr(zi).

The frequentist FCR intervals and the Bayes intervals are pursuing the same goal, to include the nonnull scores μi with 95% probability. At zi = −2.77 the FCR assessment is Pr{μ ∈ [−5.54, 0]} = 0.95; Bayes' rule states that μi = 0 with probability fdr(−2.77) = 0.25, and if μi ≠ 0, then μi ∈ [−4.27, −1.49] with probability 0.95. This kind of disconnected description is natural to the two-groups model. A principal cause of FCR's oversized intervals (the paper shows that no FCR-controlling intervals can be much narrower) comes from using a single connected set to describe a disconnected situation.

FIG. 9. Computing a p-value for z̄S = 0.842, the average of the 15 z-values in the CTL pathway, p53 data. Solid histogram: 500 row randomizations give p-value 0.002. Line histogram: 500 column permutations give p-value 0.048.

Of course Bayes' rule will not be easily available to us in most practical problems. Is there an empirical Bayes solution? Part of the solution certainly is there: estimating fdr(z), as in Section 3. Estimating g(μi | μi ≠ 0, zi), (7.4), is more challenging.
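The Bayes quantities (7.2)–(7.4) are simple enough to compute directly. A sketch (`posterior` is an illustrative name of mine):

```python
import numpy as np

SQRT2PI = np.sqrt(2 * np.pi)

def phi(x, m=0.0, s=1.0):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * SQRT2PI)

def posterior(z, p0=0.9, a=-3.0):
    """Bayes posterior under (7.1): mu = 0 w.p. p0, else mu ~ N(a, 1).

    Returns (fdr(z), lo, hi): the posterior probability that mu = 0, as
    in (7.3), and the central 95% interval for mu given mu != 0, per
    (7.4): mu | mu != 0, z ~ N((z + a)/2, 1/2).
    """
    f0 = p0 * phi(z)                          # null subdensity: z ~ N(0, 1)
    f1 = (1 - p0) * phi(z, a, np.sqrt(2))     # nonnull marginal: z ~ N(a, 2)
    fdr = f0 / (f0 + f1)
    center, sd = (z + a) / 2, np.sqrt(0.5)
    zq = 1.959964                             # Phi^{-1}(0.975)
    return fdr, center - zq * sd, center + zq * sd
```

At z = −2.77 the nonnull interval comes out near the [−4.27, −1.49] quoted above, and is √2 narrower than the usual z ± 1.96.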
A straightforward approach uses the nonnull counts (3.8) to estimate the nonnull density f1(z) in (2.1), deconvolves f1(z) to estimate the nonnull component "g1(μ)" in (7.1), and applies Bayes' rule directly to g1. This works reasonably well in Figure 8's example, but deconvolution calculations are notoriously tricky, and I have not been able to produce a stable general algorithm.

Good frequentist methods like the FCR procedure enjoy the considerable charm of an exact error bound, without requiring a priori specifications, and of course there is no law that they have to agree with any particular Bayesian analysis. In large-scale situations, however, empirical Bayes information can overwhelm both frequentist and Bayesian predilections, hopefully leading to a more satisfactory compromise between the two sets of intervals appearing in Figure 8.

8. IS A SET OF GENES ENRICHED?

Microarray experiments, through a combination of insufficient data per gene and massively multiple simultaneous inference, often yield disappointing results. In search of greater detection power, enrichment analysis considers the combined outcomes of biologically defined sets of genes, such as pathways. As a hypothetical example, if the 20 z-values in a certain pathway all were positive, we might infer significance to the pathway's effect, whether or not any of the individual zi's were deemed nonnull.

Our example here will involve the p53 data, from Subramanian et al. (2005), N = 10,100 genes on n = 50 microarrays, zi's as in (3.2), whose z-value histogram looks like a slightly short-tailed normal distribution having mean 0.04 and standard deviation 1.06. Fdr analysis (2.5), q = 0.1, yielded just one nonnull gene, while enrichment analysis indicated seven or eight significant gene-sets, as discussed at length in Efron and Tibshirani (2006).
Figure 9 concerns the CTL pathway, a set of 15 genes relating to the development of so-called killer T cells, #95 in a catalogue of 522 gene-sets provided by Subramanian et al. (2005). For a given gene-set "S" with m members, let z̄S denote the mean of the m z-values within S; z̄S is the enrichment statistic suggested in the Bioconductor R package limma (Smyth, 2004),

(8.1)  z̄S = 0.842 for the CTL pathway.

How significant is this result? I will consider assigning an individual p-value to (8.1), not taking into account multiple inference for a catalogue of possible gene-sets (which we could correct for later using Fdr methods, for instance, to combine the individual p-values).

Limma computes p-values by "row randomization," that is, by randomizing the order of the rows of the N × n expression matrix X and recomputing the statistic of interest. For a simple average like (8.1) this amounts to choosing random subsets of size m = 15 from the N = 10,100 zi's and comparing z̄S to the distribution of the randomized values z̄S*. Five hundred row randomizations produced only one z̄S* > z̄S, giving p-value 1/500 = 0.002.

Subramanian et al. calculate p-values by permuting the columns of X rather than the rows. The permutations yield a much wider distribution than the row randomizations in Figure 9, with corresponding p-value 0.048. The reason is simple: the genes in the CTL pathway have highly correlated expression levels, which increases the variance of z̄S*; column-wise permutations of X preserve the correlations across genes, while row randomizations destroy them.

At this point it looks like column permutations should always give the right answer. Wrong! For the BRCA data in Figure 4, the ensemble of z-values has (mean, standard deviation) about (0, 1.50), compared to (0, 1) for zi*'s from column permutations.
This shrinks the permutation variability of z̄S*, compared to what one would get from a random selection of genes for S, and can easily reverse the relationship in Figure 9. The trouble here is that there are two obvious, but different, null hypotheses for testing enrichment:

Randomization null hypothesis. S has been chosen by random selection of m genes from the full set of N genes.

Permutation null hypothesis. The order of the n microarrays has been chosen at random with respect to the patient characteristics (e.g., with the patient being in the normal or cancer category in Example A of the Introduction).

Efron and Tibshirani (2006) suggest a compromise method, restandardization, that to some degree accommodates both null hypotheses. Instead of permuting z̄S in (8.1), restandardization permutes (z̄S − μz)/σz, where (μz, σz) are the mean and standard deviation of all N zi's. Subramanian et al. do something similar using a Kolmogorov–Smirnov enrichment statistic.

All of these methods are purely frequentist. Theoretically we might consider applying the two-groups/empirical Bayes approach to sets of z-values "zS," just as we did for individual zi's in Sections 2 and 3. For at least three reasons that turns out to be extremely difficult:

• My technique for estimating the mixture density f, as in (3.6), becomes exponentially more difficult in higher dimensions.

• There is not likely to be a satisfactory theoretical null f0 for the correlated components of zS, while estimating an empirical null faces the same "curse of dimensionality" as for f.

• As discussed following (3.10), false discovery rate interpretation depends on exchangeability, essentially an equal a priori interest in all N genes. There may be just one gene-set S of interest to an investigator, or a catalogue of several hundred S's as in Subramanian et al., but we certainly are not interested in all possible gene-sets.
It would be a daunting exercise in subjective, as opposed to empirical, Bayesianism to assign prior probabilities to any particular gene-set S.

Having said this, it turns out there is one "gene-set" situation where the two-groups/empirical Bayes approach is practical (though it does not involve genes). Looking at panel D of Figure 1, the Imaging data, the obvious spatial correlation among z-values suggests local averaging to reduce the effects of noise. This has been carried out in Figure 10: at each voxel i of the N = 15,445 voxels, the average of z-values for those voxels within city-block distance 2 has been computed, say "z̄i." The results for the same horizontal slice as in panel D are shown using a similar symbol code. Now that we have a single number z̄i for each voxel, we can compute empirical null fdr estimates as in Section 4. The voxels labeled "enriched" in Figure 10 are those having fdr(z̄i) ≤ 0.2.

FIG. 10. Enrichment analysis of the Imaging data, panel D of Figure 1; z-values for the original 15,445 voxels have been averaged over "gene-sets" of neighboring voxels within city-block distance ≤ 2. Coded "−" for z̄i < 0, "+" for z̄i ≥ 0; solid rectangles, labeled "Enriched," show voxels with fdr(z̄i) ≤ 0.2, using the empirical null.

Enrichment analysis looks much more familiar in this example, being no more than local spatial smoothing. The convenient geometry of three-dimensional space has come to our rescue, which it emphatically fails to do in the microarray context.

9. CONCLUSION

Three forces influence the state of statistical science at any one time: mathematics, computation and applications, by which I mean the type of problems subject-area scientists bring to us for solution. The Fisher–Neyman–Pearson theory of hypothesis testing was fashioned for a scientific world where experimentation was slow and difficult, producing small data sets focused on answering single questions.
It was wonderfully successful within this milieu, combining elegant mathematics and limited computational equipment to produce dependable answers in a wide variety of application areas. The three forces have changed relative intensities recently. Computation has become literally millions of times faster and more powerful, while scientific applications now spout data in fire-hose quantities. (Mathematics, of course, is still mathematics.) Statistics is changing in response, as it moves to accommodate massive data sets that aim to answer thousands of questions simultaneously. Hypothesis testing is just one part of the story, but statistical history suggests that it could play a central role: its development in the first third of the twentieth century led directly to confidence intervals, decision theory and the flowering of mathematical statistics. I believe, or maybe just hope, that our new scientific environment will also inspire a new look at old philosophical questions. Neither Bayesians nor frequentists are immune to the pressures of scientific necessity. Lurking behind the specific methodology of this paper is the broader, still mainly unanswered, question of how one should combine evidence from thousands of parallel but not identical hypothesis testing situations. What I called “empirical Bayes information” accumulates in a way that is not well understood yet, but still has to be acknowledged: in the situations of Figure 4, the frequentist is not free to stick with classical null hypotheses, while the Bayesian cannot use prior (6.13), at least not without the risk of substantial inferential confusion. Classical statistics developed in a data-poor environment, as Fisher’s favorite description, “small-sample theory,” suggests. By contrast, modern-day disciplines such as machine learning seem to struggle with the difficulties of too much data. Both problems, too little and too much data, can afflict microarray studies. 
Massive data sets like those in Figure 1 are misleadingly comforting in their suggestion of great statistical accuracy. As I have tried to show here, the power to detect interesting specific cases, genes, may still be quite low. New methods are needed, perhaps along the lines of "enrichment," as well as a theory of experimental design explicitly fashioned for large-scale testing situations.

One floor up from the philosophical basement lives the untidy family of statistical models. In this paper I have tried to minimize modeling decisions by working directly with z-values. The combination of the two-groups model and false discovery rates applied to the z-value histogram is notably light on assumptions, more so when using an empirical null, which does not even require independence across the columns of X (i.e., across microarrays, a dangerous assumption as shown in Section 6 of Efron, 2004). There will certainly be situations when modeling inside the X matrix, as in Newton et al. (2004) or Kerr, Martin and Churchill (2000), yields more information than z-value procedures, but I will leave that for others to discuss.

ACKNOWLEDGMENTS

This research was supported in part by National Science Foundation Grant DMS-00-72360 and by National Institutes of Health Grant 8R01 EB002784.

REFERENCES

Allison, D., Gadbury, G., Heo, M., Fernandez, J., Lee, C. K., Prolla, T. and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Comput. Statist. Data Anal. 39 1–20. MR1895555

Aubert, J., Bar-Hen, A., Daudin, J. and Robin, S. (2004). Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics 5 125.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188. MR1869245

Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 100 71–93. MR2156820

Broberg, P. (2005). A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 6 199.

Do, K.-A., Mueller, P. and Tang, F. (2005). A Bayesian mixture model for differential gene expression. J. Roy. Statist. Soc. Ser. C 54 627–644. MR2137258

Dudoit, S., Shaffer, J. and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103. MR1997066

Efron, B. (2003). Robbins, empirical Bayes, and microarrays. Ann. Statist. 31 366–378. MR1983533

Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104. MR2054289

Efron, B. (2005). Local false discovery rates. Available at http://www-stat.stanford.edu/~brad/papers/False.pdf.

Efron, B. (2006). Size, power, and false discovery rates. Ann. Appl. Statist. To appear. Available at http://www-stat.stanford.edu/~brad/papers/Size.pdf.

Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103. MR2293302

Efron, B. and Gous, A. (2001). Scales of evidence for model selection: Fisher versus Jeffreys. Model Selection. IMS Monograph 38 208–256. MR2000754

Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70 311–319. MR0391403

Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density estimation. Ann. Statist. 24 2431–2461. MR1425960

Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23 70–86.

Efron, B. and Tibshirani, R. (2006). On testing the significance of sets of genes. Ann. Appl. Statist. To appear. Available at http://www-stat.stanford.edu/~brad/papers/genesetpaper.pdf.

Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160. MR1946571

Hedenfalk, I., Duggan, D., Chen, Y. et al. (2001). Gene expression profiles in hereditary breast cancer. New Engl. J. Medicine 344 539–548.

Heller, G. and Qing, J. (2003). A mixture model approach for finding informative genes in microarray studies. Unpublished manuscript.

Kerr, M., Martin, M. and Churchill, G. (2000). Analysis of variance in microarray data. J. Comp. Biology 7 819–837.

Langaas, M., Lindqvist, B. and Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. Roy. Statist. Soc. Ser. B 67 555–572. MR2168204

Lee, M. L. T., Kuo, F., Whitmore, G. and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. USA 97 9834–9838.

Lehmann, E. and Romano, J. (2005). Generalizations of the familywise error rate. Ann. Statist. 33 1138–1154. MR2195631

Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York. MR2135927

Lewin, A., Richardson, S., Marshall, C., Glaser, A. and Aitman, T. (2006). Bayesian modeling of differential gene expression. Biometrics 62 1–9. MR2226550

Liang, C., Rice, J., de Pater, I., Alcock, C., Axelrod, T., Wang, A. and Marshall, S. (2004). Statistical methods for detecting stellar occultations by Kuiper belt objects: The Taiwanese-American occultation survey. Statist. Sci. 19 265–274. MR2146947

Liao, J., Lin, Y., Selvanayagam, Z. and Weichung, J. (2004). A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 20 2694–2701.

Newton, M., Kendziorski, C., Richmond, C., Blattner, F. and Tsui, K. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comp. Biology 8 37–52.

Newton, M., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture model. Biostatistics 5 155–176.

Pan, W., Lin, J. and Le, C. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional and Integrative Genomics 3 117–124.

Parmigiani, G., Garrett, E., Ambazhagan, R. and Gabrielson, E. (2002). A statistical framework for expression-based molecular classification in cancer. J. Roy. Statist. Soc. Ser. B 64 717–736. MR1979385

Pawitan, Y., Murthy, K., Michiels, J. and Ploner, A. (2005). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 21 3865–3872.

Pounds, S. and Morris, S. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of the p-values. Bioinformatics 19 1236–1242.

Qiu, X., Brooks, A., Klebanov, L. and Yakovlev, A. (2005). The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics 6 120.

Rogosa, D. (2003). Accuracy of API index and school base report elements: 2003 Academic Performance Index, California Department of Education. Available at http://www.cde.ca.gov/ta/ac/ap/researchreports.asp.

Schwartzman, A., Dougherty, R. F. and Taylor, J. E. (2005). Cross-subject comparison of principal diffusion direction maps. Magnetic Resonance in Medicine 53 1423–1431.

Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D'Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T. and Sellers, R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 302–309.

Smyth, G. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3 1–29. MR2101454

Storey, J. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. Ser. B 64 479–498. MR1924302

Storey, J., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. Roy. Statist. Soc. Ser. B 66 187–205.

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102 15545–15550.

Turnbull, B. (2006). BEST proteomics data. Available at www.stanford.edu/people/brit.turnbull/BESTproteomics.pdf.

Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proc. Natl. Acad. Sci. USA 98 5116–5121.

van't Wout, A., Lehrman, G., Mikheeva, S., O'Keeffe, G., Katze, M., Bumgarner, R., Geiss, G. and Mullins, J. (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. J. Virology 77 1392–1402.