Molecular subtypes identification to refine breast cancer prognosis

Benjamin Haibe-Kains (1,2)
(1) Functional Genomics Unit, Institut Jules Bordet
(2) Machine Learning Group, Université Libre de Bruxelles
October 16, 2008

Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 1 / 76

Research Groups
Machine Learning Group (Gianluca Bontempi)
- 10 researchers (2 Profs, 1 postdoc, 7 PhD students), 2 graduate students.
- Research topics: bioinformatics, classification, regression, time series prediction, sensor networks.
- Website: http://www.ulb.ac.be/di/mlg.
- Scientific collaborations in ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Functional Genomics Unit (Institut Jules Bordet), Service d'Anesthésie (Erasme).
- Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), George Mason University (US).
- The MLG is part of the "Groupe de Contact FNRS" on Machine Learning and of CINBIOS: http://babylone.ulb.ac.be/Joomla/.

Research Groups
Functional Genomics Unit (Christos Sotiriou)
- 9 researchers (1 Prof, 5 postdocs, 3 PhD students), 5 technicians.
- Research topics: genomic analyses, clinical studies and translational research.
- Website: http://www.bordet.be/en/services/medical/array/practical.htm.
- National scientific collaborations: ULB, Erasme, ULg, Gembloux, IDDI.
- International scientific collaborations: Genome Institute of Singapore, John Radcliffe Hospital, Karolinska Institute and Hospital, MD Anderson Cancer Center, Netherlands Cancer Institute, Swiss Institute for Experimental Cancer Research, NCI/NIH, Gustave-Roussy Institute.

Summary
- Introduction
  - Breast Cancer
  - Prognosis
  - Gene Expression Profiling
- Breast Cancer Molecular Subtypes
- Prognostic Gene Signatures
- Subtypes and Prognosis
  - GENIUS
- Conclusion

Part I
Introduction

Breast Cancer
Breast cancer is a global public health issue. It is the most frequently diagnosed malignancy in women in the western world and the commonest cause of cancer death for European and American women. In Europe, one out of eight to ten women, depending on the country, will develop breast cancer during their lifetime.

Breast Cancer Prognosis
[Figure, panel (a): breast surgery, frozen surgical tumour specimen, DNA microarrays, follow-up of 5–10 years (recurrence or remission), development of a gene expression prognostic classifier. Panel (b): diagnosis, tumour biopsy before therapy, DNA microarrays, pre-operative chemotherapy (response or no response), development of a gene expression predictive classifier.]
Figure caption: The Affymetrix microarray system uses a single-colour detection scheme, whereby one sample is hybridized per chip. Agilent technology uses a two-colour scheme, whereby the same array is hybridized with two different samples. RNA is extracted from frozen breast tumour samples collected either at surgery (a), or before treatment (b), labelled with a detectable marker (fluorescent dye), and hybridized to the array containing individual gene-specific probes. Gene-expression levels are estimated by measuring the fluorescent intensity for each gene probe. A gene-expression vector is then collected by summarizing the expression levels of each gene in the sample. To facilitate the comparison between the different experiments and compensate for differences in labelling, hybridizations and detection methods, a normalization step is usually performed. Gene-expression prognostic classifiers are usually built by correlating gene-expression patterns, generated from tumour surgical specimens, with clinical outcome (development of distant metastases during follow-up) (a). Gene-expression predictive classifiers of response to treatment are generated by correlating gene-expression data, derived from biopsies taken before pre-operative systemic therapy, with clinical and/or pathological response to the given treatment (b).

Current Clinical Tools for Prognosis
[Diagram: clinical variables feed guidelines, which yield a prognosis.]
- Clinical variables: histological grade, nodal status, age, tumor size, ER IHC.
- Guidelines: NIH/St Gallen, NPI, AOL.
Need to improve current clinical tools to detect patients who need adjuvant systemic therapy.

Potential of Genomic Technologies for Prognosis
In the nineties, new biotechnologies emerged:
- Human genome sequencing.
- Gene expression profiling (low to high-throughput).
Genomic data could be used to better understand cancer biology . . . and to build efficient prognostic models.

Biology Paradigm
[Figure: DNA (enhancers, promoter) is transcribed into pre-mRNA (exons and introns), spliced into mRNA (5' UTR, open reading frame, 3' UTR) and translated into protein, with gene expression profiling marked on the diagram.]

Gene Expression Profiling
Gene expression profiling using a microarray chip:
[Figure: microarray chip, hybridization, detection.]

Microarray Data
- Few samples (dozens to hundreds).
  - Microarray technology is expensive.
  - Frozen tumor samples are rare (biobank).
- On the other hand, numerous gene expressions are measured.
  - The new microarray chips cover the whole genome (≈ 50,000 probes representing 30,000 "known genes").
  → High feature-to-sample ratio (curse of dimensionality).
- Microarray is a complex technology.
  → High level of noise in the data.
- Biology is complex.
  → Variables are highly correlated (gene co-expressions due to biological pathways).

Microarray Data: Warning
You can easily find spurious patterns in the data that look biologically "meaningful".
Personal experience:
- At the beginning of my thesis, I had accidentally mixed up the patients' labels, so the relation between input (gene expressions) and output (a mutation) was completely random.
- I gave a list of genes differentially expressed between wild type and mutated patients to the biologists in charge of the project, and they found it very interesting (known genes, meaningful biological story).
- When I saw my mistake, I corrected the bug and sent a new gene list . . .
- and the results were even better!
In conclusion, the complexity of microarray data and of the underlying biology should make you very critical and cautious with your results.
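
The warning above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis from the anecdote: the expression values are pure Gaussian noise, the labels are shuffled at random, and the sample size (40), gene count (2000) and the cutoff |t| > 2.02 (roughly the two-sided 5% level for this sample size) are all hypothetical choices made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def welch_t(a, b):
    """Welch's t statistic for two independent groups."""
    se = (sample_variance(a) / len(a) + sample_variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def count_spurious_genes(n_samples=40, n_genes=2000, threshold=2.02, seed=0):
    """Count genes that look 'differentially expressed' although the labels are random."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # half 'mutated', half 'wild type'
    rng.shuffle(labels)                          # labels carry no information at all
    hits = 0
    for _ in range(n_genes):
        expression = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # pure noise
        mutated = [x for x, y in zip(expression, labels) if y == 1]
        wild_type = [x for x, y in zip(expression, labels) if y == 0]
        if abs(welch_t(mutated, wild_type)) > threshold:
            hits += 1
    return hits


print(count_spurious_genes(), "of 2000 noise genes pass the cutoff")
```

With no multiple-testing correction, roughly 5% of the noise genes are expected to pass such a cutoff, which is already a gene list long enough to build a plausible biological story from.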

Part II
Breast Cancer Molecular Subtypes

Breast Cancer Subtypes
Early microarray studies showed that BC is a molecularly heterogeneous disease [Perou et al., 2000; Sorlie et al., 2001, 2003; Sotiriou et al., 2003].
- Hierarchical clustering on microarray data [Sorlie et al., 2001]:
[Figure: hierarchical clustering of breast tumours.]

Breast Cancer Subtypes: Clinical Outcome
The molecular subtypes exhibited different clinical outcomes, suggesting that the biological processes involved in patients' survival might be different.
[Figure ("Fig. 3"): overall and relapse-free survival analysis of the 49 breast cancer patients (caption truncated in the source).]

Breast Cancer Subtypes: Early Results
These early studies showed similar results, i.e. ER and HER2 pathways are the main discriminators in breast cancer (confirmed by [Kapp et al., 2006]).
However, this classification has strong limitations [Pusztai et al., 2006]:
- Instability: the results are hardly reproducible due to the instability of the hierarchical clustering method in combination with microarray data (high feature-to-sample ratio).
- Crispness: hierarchical clustering produces a crisp partition of the dataset (hard partitioning) without estimating the classification uncertainty.
- Validation: hierarchical clustering is hardly applicable to new data.

Breast Cancer Subtypes: New Clustering Model
Because of these limitations, we sought to develop a simple method to identify the breast cancer subtypes.
→ We introduced a model-based clustering (mixture of Gaussians) in a two-dimensional space defined by the ESR1 and ERBB2 module scores [Wirapati et al., 2008; Desmedt et al., 2008].
  - We used the Bayesian information criterion (BIC) to select the most likely number of subtypes [Fraley and Raftery, 2002].
  - We validated our model (fitted on the Wang et al. series) on 14 independent datasets in terms of number of clusters and prediction strength [Tibshirani and Walther, 2005].

New Clustering Model: Training
[Figure: VDX training set. Left: ESR1 versus ERBB2 module scores with the three clusters ER-/HER2-, HER2+ and ER+/HER2-. Right: BIC versus number of clusters.]

New Clustering Model: Validation
[Figure: ESR1 versus ERBB2 module scores with the clusters ER-/HER2-, HER2+ and ER+/HER2- for the NKI, TBG, UPP, UNT, MAINZ and UNC2 datasets.]
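
The clustering idea above (a mixture of Gaussians in the ESR1/ERBB2 plane, with BIC choosing the number of components) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the fitted model from the talk: the data are synthetic, the covariances are restricted to be diagonal, the farthest-point initialisation and the variance floor are choices made here for simplicity, and all function names are hypothetical. BIC is written as 2 log L − p log n, so larger values are better.

```python
import math
import random


def log_density(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def e_step(points, weights, means, variances):
    """Return per-point subtype probabilities and the total log-likelihood."""
    resp, loglik = [], 0.0
    for p in points:
        logs = [math.log(w) + log_density(p, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(l - top) for l in logs))
        loglik += norm
        resp.append([math.exp(l - norm) for l in logs])
    return resp, loglik


def farthest_point_init(points, k):
    """Deterministic start: each new centre is the point farthest from those chosen."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    return [list(c) for c in centres]


def fit_mixture(points, k, n_iter=60, var_floor=1e-3):
    """Fit a k-component mixture by EM; return BIC and soft assignments."""
    n, dim = len(points), len(points[0])
    means = farthest_point_init(points, k)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    spread = [sum((p[d] - grand[d]) ** 2 for p in points) / n for d in range(dim)]
    variances = [spread[:] for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp, _ = e_step(points, weights, means, variances)
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = [sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                        for d in range(dim)]
            variances[j] = [max(sum(r[j] * (p[d] - means[j][d]) ** 2
                                    for r, p in zip(resp, points)) / nj, var_floor)
                            for d in range(dim)]
    resp, loglik = e_step(points, weights, means, variances)
    n_params = (k - 1) + 2 * k * dim  # mixing weights + means + variances
    return {"bic": 2 * loglik - n_params * math.log(n), "resp": resp, "means": means}


def make_demo_points(seed=1):
    """Synthetic (ESR1, ERBB2) module scores; centres and group sizes are invented."""
    rng = random.Random(seed)
    clusters = [((-1.5, -0.5), 50), ((-0.3, 1.5), 40), ((1.0, -0.5), 160)]
    return [(rng.gauss(cx, 0.25), rng.gauss(cy, 0.25))
            for (cx, cy), size in clusters for _ in range(size)]


points = make_demo_points()
bic_by_k = {k: fit_mixture(points, k)["bic"] for k in range(1, 5)}
print("BIC by number of clusters:", bic_by_k)
print("selected:", max(bic_by_k, key=bic_by_k.get))
```

Each row of `resp` holds one patient's probabilities of belonging to each component, which is the soft partitioning that hierarchical clustering cannot provide. Applying the model to a new dataset only requires the E-step with the fitted parameters.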

New Clustering Model: Validation: Prediction Strength

Dataset   ER-/HER2-   HER2+   ER+/HER2-
NKI       1.00        1.00    0.99
TBG       1.00        1.00    0.83
UPP       1.00        0.93    0.87
UNT       1.00        0.89    0.92
MAINZ     1.00        1.00    0.90
STNO2     1.00        0.69    0.97
NCI       0.85        0.83    0.93
MSK       1.00        1.00    0.96
STK       1.00        0.91    0.87
DUKE      1.00        0.82    0.92
UNC2      1.00        0.87    0.96
CAL       1.00        1.00    0.95
DUKE2     1.00        0.64    0.95
NCH       1.00        0.82    0.98

New Clustering Model: Validation: Number of Clusters
[Figure: average BIC versus number of clusters.]

Breast Cancer Subtypes: Clinical Outcome
Share of the global population of BC patients:
- ER+/HER2-: 60-70%
- HER2+: 15-20%
- ER-/HER2-: 20-25%
[Figure: probability of survival versus time (years) by subtype, node-negative untreated patients from NKI/TBG/UPP/UNT/MAINZ.]

No. at risk
Time (years)   0     1     2     3     4     5     6     7     8     9     10
ER-/HER2-      119   111   91    83    78    71    68    64    53    46    37
HER2+          106   98    91    81    73    69    64    58    52    47    44
ER+/HER2-      516   507   487   462   435   410   363   319   282   257   223

Breast Cancer Subtypes: New Clustering Model (dis)Advantages
Advantages:
- Simple model-based clustering:
  - Easily applicable to new data.
  - Returning for each patient the probability to belong to each subtype (soft partitioning).
- Low dimensional space:
  - Low computational cost to fit the model.
  - Simple visualization of the results.
Disadvantages:
- Low dimensional space: which dimension could we add in order to find another robust subtype?

Part III
Prognostic Gene Signatures

Prognostic Gene Signatures
Use of microarray technology to improve current prognostic models (NIH/St Gallen guidelines, NPI, AOL).
A typical microarray analysis dealing with breast cancer prognostication involves 5 key steps:
1. Data preprocessing: quality controls and normalization.
2. Filtering: discard the genes exhibiting low expression and/or low variance.
3. Identification of a list of prognostic genes (called a gene signature).
4. Building of a prognostic model, i.e. combination of the expression of the genes from the signature in order to predict the clinical outcome of the patients.
5. Validation of the model performance and comparison with current prognostic models.

Prognostic Gene Signatures: Fishing Expedition
Prognostic models derived from gene expression data by looking for genes associated with clinical outcome without any a priori biological assumption [van't Veer et al., 2002; Wang et al., 2005].
[Figure: survival curves for good versus poor signature groups. Panels: GENE70 signature (van't Veer et al.; van de Vijver) and GENE76 signature (Wang et al.). Axes: probability of remaining metastasis-free, probability of overall survival and distant-metastasis-free survival versus years. Recoverable annotations: P<0.001 (twice); validation set with good (n=59) versus poor (n=112), hazard ratio 5.67 (95% CI 2.59–12.4).]
Promising results, but they drew a lot of criticism from a statistical point of view.

Prognostic Gene Signatures: Hypothesis-driven
Prognostic models were also derived from gene expression data based on a biological assumption.
- Example: GGI [Sotiriou et al., 2006] was designed to discriminate patients with low and high histological grade (proliferation).
- GGI was able to discriminate patients with intermediate histological grade (HG2).
[Figure: survival curves by histological grade, by GGI within HG2, and by GGI.]

Prognostic Gene Signatures: Independent Validation
These preliminary results were promising, but validation was required.
A first validation was published by the authors of the GENE70 and GENE76 signatures in [van de Vijver et al., 2002] and [Foekens et al., 2006] respectively.
Our group was involved in a second validation:
- Complete independence: the authors of the signatures were not aware of the clinical data of the patients in the dataset.
- The statistical analyses were performed by an independent group.
- Aim: definitively validate the prognostic power of these two models in order to start a large clinical trial called MINDACT (Microarray In Node negative Disease may Avoid ChemoTherapy).

Prognostic Gene Signatures: Independent Validation (cont.)
Although the performance in this validation series was less impressive than in the original publications, GENE70 and GENE76 sufficiently improved the current clinical models to go ahead with MINDACT.
[Figure: Kaplan-Meier curves of the independent validations. Panels: GENE76 ("Independent Validation of the 76-Gene Signature") and GENE70. Risk groups: gene signature low or high risk crossed with clinical low or high risk. Axes: probability of event versus year, with numbers at risk.]
→ Validation of GENE70 [Buyse et al., 2006] and GENE76 [Desmedt et al., 2007].

Prognostic Gene Signatures: Independent Validation (cont.)
We sought to compare the GGI to the GENE70 and GENE76 signatures in this validation series . . . and showed that GGI has very similar performance [Haibe-Kains et al., 2008b].
[Figure: Venn diagram illustrating the classification of the tumor samples according to the prognostic signatures GENE70, GENE76 and GGI (dark red = high-risk patients, blue = low-risk patients), and forest plot of the concordance index (axis from 0.5 to 0.9) for GENE70, GENE76, GGI and AOL.]
Excerpt visible in the figure: ". . . and the GGI were higher than the HR of the 76-gene signature, the differences were not statistically significant (see Table 2). Figure 3 illustrates the Kaplan-Meier estimates of DMFS for the four groups of patients (two groups with concordant results in risk assessment and two with discordant results) for the different signatures two by two. From the multivariate analyses (Table 3), we can conclude that the three signatures added significant information to the traditional parameters and were the strongest predictive variables of DMFS, as reflected by their lowest p-values compared to the other variables. The additional information of these signatures over the clinical risk was also confirmed by the fact that the univariate HRs for the three signatures remained similar when adjusted for the clinical risk, with a HR of 7.25 (95% CI: 2.4–21.5; p = 3.5 × 10^-4), 2.8 (95% CI: 1.2–6.8; p = 0.018) and 6.25 (95% CI: 2.3–17; p = 3.3 × 10^-4) for the 70-gene signature, 76-gene signature and the GGI respectively."

Part IV
Subtypes and Prognosis

Prognosis in Specific Subtypes
The first publications attempted to build a prognostic model from the global population of BC patients.
In 2005, Wang et al. were the first to divide the global population based on ER status:
- As BC biology is very different according to the ER status, prognostic models might be different too.
- They built a prognostic model for each subgroup of patients (ER+ and ER-).
- To make a prediction, they used one of the two models depending on the ER status of the tumor.
- Unfortunately, the group of ER- tumors was too small and the corresponding model was not generalizable.

Prognosis in Specific Subtypes (cont.)
Recently, Teschendorff et al. built a new prognostic model for ER- tumors [Teschendorff et al., 2007] and validated it [Teschendorff and Caldas, 2008] using large datasets.
- The signature is composed of 7 immune-related genes.
We showed in two meta-analyses [Wirapati et al., 2008; Desmedt et al., 2008] that:
- Proliferation (AURKA) was the strongest prognostic factor in ER+/HER2- tumors and the common driving force of the early gene signatures.
  - Actually, these early signatures (e.g. GENE70, GENE76, GGI) are prognostic in ER+/HER2- tumors only.
- Immune response (STAT1) is prognostic in ER-/HER2- and HER2+ tumors.
- Tumor invasion (PLAU or uPA) is prognostic in HER2+ tumors.
Finak et al. introduced a stroma-derived prognostic predictor (SDPP) that is particularly efficient in HER2+ tumors [Finak et al., 2008].

New Prognostic Model
Since current prognostic models/gene signatures are limited to some subtypes, we sought to develop a new prognostic model integrating the breast cancer subtypes identification in order to:
- Build prognostic gene signatures specifically targeting each subtype.
- Build a global prognostic model able to predict the risk of the patients whatever the tumor subtype (ER-/HER2-, HER2+ or ER+/HER2-).
We assessed the performance and compared it to current prognostic models using the thorough statistical framework developed in [Haibe-Kains et al., 2008a].
This new prognostic model is called GENIUS, standing for Gene Expression progNostic Index Using Subtypes.

GENIUS
[Diagram, panel A (training set): subtypes identification yields, for each patient, the probabilities (Pr1, Pr2, Pr3) of belonging to the ER-/HER2-, HER2+ and ER+/HER2- subtypes. For each subtype: identification of prognostic genes (GGI is used for one subtype), subtype signature, model building, subtype risk score. The subtype risk scores are combined into the GENIUS risk score, and a cutoff gives the risk group. Panel B (validation set): the GENIUS risk score and, after the cutoff, the risk group are computed, followed by performance assessment and comparison.]

GENIUS: Performance Assessment and Comparison
We trained GENIUS on VDX:
- 286 node-negative untreated BC patients.
We assessed the performance in an independent dataset composed of
- 765 node-negative untreated patients coming from 5 different datasets (NKI, TBG, UPP, UNT and MAINZ).
Risk score prediction: continuous value.
Risk group prediction: binary value (application of a cutoff on the risk score).
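
The combination step of the diagram can be sketched as follows. The slides do not spell out the combination rule, so this sketch assumes it is a probability-weighted sum of the subtype risk scores; the function names, the example probabilities, the example scores and the cutoff of 0 are all hypothetical.

```python
SUBTYPES = ("ER-/HER2-", "HER2+", "ER+/HER2-")


def genius_risk_score(subtype_probs, subtype_scores):
    """Combine subtype risk scores, weighting each by the subtype probability.

    Assumption: the combination is a probability-weighted sum.
    """
    total = sum(subtype_probs[s] for s in SUBTYPES)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("subtype probabilities must sum to 1")
    return sum(subtype_probs[s] * subtype_scores[s] for s in SUBTYPES)


def risk_group(risk_score, cutoff):
    """Binary risk group obtained by applying a cutoff to the continuous score."""
    return "high" if risk_score > cutoff else "low"


# Hypothetical patient: mostly ER+/HER2-, with some uncertainty.
probs = {"ER-/HER2-": 0.1, "HER2+": 0.2, "ER+/HER2-": 0.7}
scores = {"ER-/HER2-": 1.0, "HER2+": -0.5, "ER+/HER2-": 0.2}
score = genius_risk_score(probs, scores)
print(round(score, 3), risk_group(score, cutoff=0.0))
```

Under this assumption, a patient whose subtype is uncertain receives a blend of the subtype-specific predictions, so the soft partitioning of the clustering model carries through to the risk score.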

GENIUS: Risk Score Prediction
[Figure: forest plots of the concordance index, annotated with the p-values of the test for GENIUS superiority, in the global population (ALL) and in the ER+/HER2-, ER-/HER2- and HER2+ subtypes. Panel A, state-of-the-art gene signatures: GENIUS, GGI, STAT1, PLAU (uPA), IRMODULE, SDPP. Panel B, clinical prognostic indices: GENIUS, AOL, NPI.]

GENIUS: Risk Group Prediction
[Figures: probability of survival versus time (years) for the low and high risk groups defined by GENIUS and by GGI, in the global population and in each subtype.]

Population          Model    No. at risk at time 0 (Low / High)   HR    95% CI        p-value
Global population   GENIUS   488 / 253                            3.6   [2.6, 4.8]    9.6E-17
Global population   GGI      484 / 257                            2.5   [1.8, 3.3]    1.7E-09
ER+/HER2-           GENIUS   391 / 125                            3.4   [2.3, 5]      1.8E-09
ER+/HER2-           GGI      391 / 125                            3.4   [2.3, 5]      1.3E-09
ER-/HER2-           GENIUS   47 / 72                              3.1   [1.5, 6.6]    2.8E-03
ER-/HER2-           GGI      43 / 76                              0.8   [0.4, 1.6]    5.4E-01
HER2+               GENIUS   50 / 56                              4.2   [1.9, 9.4]    5.7E-04
HER2+               GGI      49 / 57                              1.4   [0.71, 2.7]   3.5E-01

Part V
Conclusion

Conclusion
Numerous studies confirmed the great potential of gene expression profiling using microarrays to better understand cancer biology and to improve current prediction models.
This technology becomes more and more mature (MAQC [shi, 2006]) and is now ready for clinical applications. The promising results of early publications were validated in different independent studies. Recent meta-analyses successfully recapitulated the main discoveries made these late decades and refined our knowledge on breast cancer biology. Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 44 / 76 Conclusion (cont.) We benefit from this strong basis to go a step further to improve breast cancer prognosis using microarrays. I Prognostic models/gene signatures in specific subtypes [Teschendorff et al., 2007; Desmedt et al., 2008; Finak et al., 2008]. I Development of GENIUS, a prognostic model integrating BC molecular subtypes identification [manuscript in preparation]. A major issue remains: ”How to combine these microarray prognostic models with clinical variables?” I I Several studies showed the additional information of tumor size, nodal status, . . . However, we currently lack of data to fit robust prognostic models combining microarray and clinical variables. Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 45 / 76 Thank you for your attention. This presentation is available from http://www.ulb.ac.be/di/map/ bhaibeka/papers/haibekains2008molecular.pdf. Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 46 / 76 Part VI Bibliography Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 47 / 76 Bibliography I The microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotech, 24(9):1151–1161, 2006. URL http://dx.doi.org/10.1038/nbt1239. Marc Buyse, Sherene Loi, Laura van’t Veer, Giuseppe Viale, Mauro Delorenzi, Annuska M. Glas, Mahasti Saghatchian d’Assignies, Jonas Bergh, Rosette Lidereau, Paul Ellis, Adrian Harris, Jan Bogaerts, Patrick Therasse, Arno Floore, Mohamed Amakrane, Fanny Piette, Emiel Rutgers, Christos Sotiriou, Fatima Cardoso, and Martine J. Piccart. 
Validation and Clinical Utility of a 70-Gene Prognostic Signature for Women With Node-Negative Breast Cancer. J. Natl. Cancer Inst., 98(17): 1183–1192, 2006. doi: 10.1093/jnci/djj329. URL http://jnci.oxfordjournals.org/cgi/content/abstract/jnci;98/17/1183. C. A. Davis, F. Gerick, V. Hintermair, C. C. Friedel, K. Fundel, R. Kuffner, and R. Zimmer. Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics, 22(19):2356–2363, 2006. Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 48 / 76 Bibliography II Christine Desmedt, Fanny Piette, Sherene Loi, Yixin Wang, Francoise Lallemand, Benjamin Haibe-Kains, Giuseppe Viale, Mauro Delorenzi, Yi Zhang, Mahasti Saghatchian d’Assignies, Jonas Bergh, Rosette Lidereau, Paul Ellis, Adrian L. Harris, Jan G.M. Klijn, John A. Foekens, Fatima Cardoso, Martine J. Piccart, Marc Buyse, Christos Sotiriou, and on behalf of the TRANSBIG Consortium. Strong Time Dependence of the 76-Gene Prognostic Signature for Node-Negative Breast Cancer Patients in the TRANSBIG Multicenter Independent Validation Series. Clin Cancer Res, 13(11):3207–3214, 2007. doi: 10.1158/1078-0432.CCR-06-2765. URL http://clincancerres.aacrjournals.org/cgi/content/abstract/13/11/3207. Christine Desmedt, Benjamin Haibe-Kains, Pratyaksha Wirapati, Marc Buyse, Denis Larsimont, Gianluca Bontempi, Mauro Delorenzi, Martine Piccart, and Christos Sotiriou. Biological Processes Associated with Breast Cancer Clinical Outcome Depend on the Molecular Subtypes. Clin Cancer Res, 14(16):5158–5165, 2008. doi: 10.1158/1078-0432.CCR-07-4756. URL http://clincancerres.aacrjournals.org/cgi/content/abstract/14/16/5158. Benjamin Haibe-Kains (ULB) Visit to IGC October 16, 2008 49 / 76 Bibliography III Greg Finak, Nicholas Bertos, Francois Pepin, Svetlana Sadekova, Margarita Souleimanova, Hong Zhao, Haiying Chen, Gulbeyaz Omeroglu, Sarkis Meterissian, Atilla Omeroglu, Michael Hallett, and Morag Park. 
Stromal gene expression predicts clinical outcome in breast cancer. Nat Med, 14(5):518–527, 2008. URL http://dx.doi.org/10.1038/nm1764.

J. A. Foekens, D. Atkins, Y. Zhang, F. C. Sweep, N. Harbeck, A. Paradiso, T. Cufer, A. M. Sieuwerts, D. Talantov, P. N. Span, V. C. Tjan-Heijnen, A. F. Zito, K. Specht, H. Höfler, R. Golouh, F. Schittulli, M. Schmitt, L. V. Beex, J. G. Klijn, and Y. Wang. Multicenter validation of a gene expression–based prognostic signature in lymph node–negative primary breast cancer. Journal of Clinical Oncology, 24(11), 2006.

C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, June 2002.

B. Haibe-Kains, C. Desmedt, C. Sotiriou, and G. Bontempi. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24(19):2200–2208, 2008a. doi: 10.1093/bioinformatics/btn374. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/19/2200.

Bibliography IV

Benjamin Haibe-Kains, Christine Desmedt, Fanny Piette, Marc Buyse, Fatima Cardoso, Laura van’t Veer, Martine Piccart, Gianluca Bontempi, and Christos Sotiriou. Comparison of prognostic gene expression signatures for breast cancer. BMC Genomics, 9(1):394, 2008b. ISSN 1471-2164. doi: 10.1186/1471-2164-9-394. URL http://www.biomedcentral.com/1471-2164/9/394.

Amy Kapp, Stefanie Jeffrey, Anita Langerod, Anne-Lise Borresen-Dale, Wonshik Han, Dong-Young Noh, Ida Bukholm, Monica Nicolau, Patrick Brown, and Robert Tibshirani. Discovery and validation of breast cancer subtypes. BMC Genomics, 7(1):231, 2006. ISSN 1471-2164. doi: 10.1186/1471-2164-7-231. URL http://www.biomedcentral.com/1471-2164/7/231.

MJ Pencina and RB D’Agostino.
Overall c as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med, 23(13):2109–2123, 2004. doi: 10.1002/sim.1802.

Charles M. Perou, Therese Sorlie, Michael B. Eisen, Matt van de Rijn, Stefanie S. Jeffrey, Christian A. Rees, Jonathan R. Pollack, Douglas T. Ross, Hilde Johnsen, Lars A. Akslen, Oystein Fluge, Alexander Pergamenschikov, Cheryl Williams, Shirley X. Zhu, Per E. Lonning, Anne-Lise Borresen-Dale, Patrick O. Brown, and David Botstein. Molecular portraits of human breast tumours. Nature, 406(6797):747–752, 2000. URL http://dx.doi.org/10.1038/35021093.

Bibliography V

Lajos Pusztai, Chafika Mazouni, Keith Anderson, Yun Wu, and W. Fraser Symmans. Molecular Classification of Breast Cancer: Limitations and Potential. Oncologist, 11(8):868–877, 2006. doi: 10.1634/theoncologist.11-8-868. URL http://theoncologist.alphamedpress.org/cgi/content/abstract/11/8/868.

T. Sorlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. E. Lonning, and A. L. Borresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. USA, 98(19):10869–10874, 2001.

T. Sorlie, R. Tibshirani, J. Parker, T. Hastie, J. S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C. Perou, P. E. Lonning, P. O. Brown, A. L. Borresen-Dale, and D. Botstein. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. USA, 100(14):8418–8423, 2003.

C. Sotiriou, S. Y. Neo, L. M. McShane, E. L. Korn, P. M. Long, A. Jazaeri, P. Martiat, S. B. Fox, A. L. Harris, and E. T. Liu. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc. Natl. Acad.
Sci., 100(18):10393–10398, 2003.

Bibliography VI

Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J. Van de Vijver, Jonas Bergh, Martine Piccart, and Mauro Delorenzi. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J. Natl. Cancer Inst., 98(4):262–272, 2006. doi: 10.1093/jnci/djj052. URL http://jnci.oxfordjournals.org/cgi/content/abstract/jnci;98/4/262.

Andrew Teschendorff and Carlos Caldas. A robust classifier of high predictive value to identify good prognosis patients in ER-negative breast cancer. Breast Cancer Research, 10(4):R73, 2008. ISSN 1465-5411. doi: 10.1186/bcr2138. URL http://breast-cancer-research.com/content/10/4/R73.

Andrew Teschendorff, Ahmad Miremadi, Sarah Pinder, Ian Ellis, and Carlos Caldas. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biology, 8(8):R157, 2007. ISSN 1465-6906. doi: 10.1186/gb-2007-8-8-r157. URL http://genomebiology.com/2007/8/8/R157.

R. Tibshirani and G. Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511–528, 2005.

Bibliography VII

M. J. van de Vijver, Y. D. He, L. van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.

L. J. van’t Veer, H. Dai, M. J.
van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.

Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365:671–679, 2005.

Bibliography VIII

Pratyaksha Wirapati, Christos Sotiriou, Susanne Kunkel, Pierre Farmer, Sylvain Pradervand, Benjamin Haibe-Kains, Christine Desmedt, Michail Ignatiadis, Thierry Sengstag, Frederic Schutz, Darlene Goldstein, Martine Piccart, and Mauro Delorenzi. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research, 10(4):R65, 2008. ISSN 1465-5411. doi: 10.1186/bcr2124. URL http://breast-cancer-research.com/content/10/4/R65.

Part VII Appendix

Gene Expression Profiling: Technologies

Several technologies exist to measure the expression of genes.
Low-throughput technologies, such as RT-PCR, measure the expression of a few genes.
High-throughput technologies, such as microarrays, simultaneously measure the expression of thousands of genes (whole genome).
Microarray principles will be illustrated through the Affymetrix technology.

Microarray

A microarray is composed of
- DNA fragments (probes) fixed on a solid support.
- Ordered position of probes.
- Principle of hybridization to a specific probe of complementary sequence.
- Molecular labeling.
⇒ Simultaneous detection of thousands of sequences in parallel.

[Figure-only slides: Affymetrix GeneChip; Affymetrix GeneChip Probes; Hybridization; Detection; Affymetrix Design; Affymetrix Equipment]

Prognostic Gene Signatures: A Single Gene?

From the validation studies, we learned that GGI yields performance similar to (sometimes better than) that of other gene signatures [Haibe-Kains et al., 2008b].
Since GGI is a very simple model from both a statistical and a biological (proliferation genes) point of view, we challenged the use of complex statistical methods for BC prognostication.
We compared statistical methods ranging from simple to complex against a single proliferation gene (AURKA) [Haibe-Kains et al., 2008a].
⇒ Due to the complexity of microarray data, it is very hard to build prognostic models that are statistically better than AURKA.

Prognostic Gene Signatures: A Single Gene? (cont.)

A single gene for breast cancer prognostication?
[Figure: forest plots of the concordance index (x-axis) for the risk scores predicted by each method, in four panels: training set (VDX), validation set 1 (TBG), validation set 2 (TAM), and validation set 3 (UPP). Methods shown: AURKA, BD.COMBUNIV.WILCOXON.HG, BD.COMBUNIV.COX.SURV, BD.MULTIV.LM.TOE, BD.MULTIV.COX.SURV, GW.RANK.COMBUNIV.WILCOXON.HG, GW.RANKCV.COMBUNIV.WILCOXON.HG, GW.RANK.COMBUNIV.COX.SURV, GW.RANKCV.COMBUNIV.COX.SURV, GW.RANK.MULTIV.RCOX.SURV, GW.RANKCV.MULTIV.RCOX.SURV, GW.PCA.COMBUNIV.WILCOXON.HG, GW.PCACV.COMBUNIV.WILCOXON.HG, GW.PCA.COMBUNIV.COX.SURV, GW.PCACV.COMBUNIV.COX.SURV, GW.PCA.MULTIV.RCOX.SURV, GW.PCACV.MULTIV.RCOX.SURV, GENE76, GGI, AOL, NPI.]

Supplementary Figure 1: Forest plot of the concordance indices for the risk scores predicted by all the methods in the training set (VDX) and in the three validation sets (TBG, TAM, and UPP). The AURKA and GGI models were not fitted on VDX, which can therefore be considered a validation set for these models.

GENIUS: Identification of Subtypes

The first step of the GENIUS method is the identification of subtypes in the dataset.
In BC, we applied the clustering model developed previously (training set: VDX).
The model returns the probabilities Pr(s) for a patient to belong to each subtype s ∈ S.
- S is composed of the ER-/HER2-, HER2+ and ER+/HER2- subtypes.

GENIUS: Identification of Prognostic Genes

We used a ranking-based gene selection method.
The score (relevance) given to each gene is based on the significance of the concordance index.
We introduced a weighted version of the concordance index in order to select genes relevant for a specific subtype; the weights were defined as the probability for a patient to belong to the subtype of interest.
⇒ This feature selection allowed us to use all the patients in the dataset.

GENIUS: Weighted Concordance Index

Survival data for the $i$th patient:
- $t_i$ stands for the event time
- $c_i$ for the censoring time

The C-index computes the probability that, for a pair of randomly chosen comparable patients, the patient with the higher risk prediction will experience an event before the lower-risk patient:

$$ C\text{-index} = \frac{\sum_{i,j \in \Omega} 1\{r_i > r_j\}}{|\Omega|} $$

- where $r_i$ and $r_j$ are the risk predictions of patients $i$ and $j$
- $\Omega$ is the set of all the pairs of patients $\{i, j\}$ such that:
  - $r_i \neq r_j$ (no ties in $r$)
  - one of the following conditions is met: (i) both patients $i$ and $j$ experienced an event and $t_i < t_j$, or (ii) only patient $i$ experienced an event and $t_i < c_j$.

GENIUS: Weighted Concordance Index (cont.)

We introduced a weighted version of the concordance index:

$$ C\text{-index}_{\mathrm{wted}} = \frac{\sum_{i,j \in \Omega} w_{ij}\, 1\{r_i > r_j\}}{\sum_{i,j \in \Omega} w_{ij}} $$

- where $w_{ij} = w_i w_j$ is the weight for the pair of patients $\{i, j\} \in \Omega$.

The significance of the C-index was computed by assuming asymptotic normality [Pencina and D’Agostino, 2004].

GENIUS: Signature Stability

Once the genes were ranked, the only hyperparameter to tune was the signature size $k$ (number of selected genes in the signature).
We assessed the stability with respect to the signature size by resampling the training set.
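The concordance index and its weighted variant, as defined on the Weighted Concordance Index slides, can be sketched in Python as follows. This is a minimal illustration and not the GENIUS implementation: the function name `concordance_index`, the list-based inputs, and the toy data are assumptions, and the significance test by asymptotic normality is not included. With all weights equal to 1, the function reduces to the unweighted C-index.

```python
def concordance_index(time, event, risk, weights=None):
    """Weighted C-index over comparable pairs (hypothetical helper).

    time[i]  : observed time of patient i (event time t_i, or censoring
               time c_i if no event was observed)
    event[i] : True if patient i experienced an event
    risk[i]  : risk prediction r_i
    weights  : per-patient weights w_i (for example the probability of
               belonging to the subtype of interest); defaults to 1
    """
    n = len(time)
    if weights is None:
        weights = [1.0] * n

    concordant = 0.0  # sum of w_ij * 1{r_i > r_j} over Omega
    total = 0.0       # sum of w_ij over Omega
    for i in range(n):
        if not event[i]:
            continue  # patient i must have experienced an event
        for j in range(n):
            if i == j or risk[i] == risk[j]:
                continue  # ties in r are excluded from Omega
            # Covers both conditions: (i) j had an event and t_i < t_j,
            # (ii) j was censored and t_i < c_j.
            if time[i] < time[j]:
                w_ij = weights[i] * weights[j]
                total += w_ij
                if risk[i] > risk[j]:
                    concordant += w_ij
    if total == 0:
        raise ValueError("no comparable pair with non-zero weight")
    return concordant / total


# Toy data (illustrative only): four patients, the third one censored.
time = [1, 2, 3, 4]
event = [True, True, False, True]
risk = [4, 1, 2, 3]

c_plain = concordance_index(time, event, risk)
c_weighted = concordance_index(time, event, risk, weights=[1, 1, 0, 1])
```

Setting a patient's weight to 0 removes every pair involving that patient, which is how the weights let a subtype-specific ranking use the whole dataset while emphasising patients likely to belong to the subtype of interest.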
The stability criterion was inspired by [Davis et al., 2006]:
- Let $X$ be the set of features and $\mathrm{freq}(x_j)$ be the number of sampling steps in which a feature $x_j \in X$ has been selected out of $m$ sampling steps.
- The set $X$ is sorted by frequency into the set $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ where $\mathrm{freq}(x_{(i)}) \geq \mathrm{freq}(x_{(j)})$ if $i < j$, with $i, j \in \{1, 2, \ldots, n\}$.
- A first measure of stability for a given signature size $k$ is returned by

$$ \mathrm{Stab}(k) = \frac{\sum_{i=1}^{k} \mathrm{freq}(x_{(i)})}{k\,m} $$

GENIUS: Signature Stability (cont.)

Since the Stab statistic can be made artificially high by simply increasing $k$, we formulated an adjusted statistic

$$ \mathrm{Stab}_{\mathrm{adj}}(k) = \max\left(0,\ \mathrm{Stab}(k) - \alpha \frac{k}{n}\right) $$

- where $\alpha$ is a penalty factor depending on the number of selected features (usually $\alpha = 1$).

GENIUS: Signature Stability (cont.)

[Figure: signature stability (y-axis: Stability, 0 to 1) as a function of signature size (x-axis, 0 to 1000); panels A and B, for the ER-/HER2- and HER2+ subtypes.]

In the training set (VDX), the most stable signatures were composed of 63 and 22 genes for the ER-/HER2- and HER2+ subtypes, respectively.

GENIUS: Risk Score Prediction

The risk score prediction for the subtype $s$ is defined as

$$ R(s) = \frac{\sum_{i \in Q} w_i x_i}{n_Q} $$

- where $Q$ is the set of genes in the signature for subtype $s$
- $x_i$ is the expression of gene $i$
- $w_i \in \{-1, +1\}$ depending on the concordance index ($> 0.5$ or $\leq 0.5$)
- $n_Q$ is the signature size.
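The subtype score R(s) can be sketched in Python as follows. This is an illustration under stated assumptions, not the GENIUS implementation: the function names, the dictionary-based data layout, and the gene and subtype labels are hypothetical, and mapping a concordance index above 0.5 to a weight of +1 (and -1 otherwise) is an assumed reading of the sign convention. The last function combines the subtype scores using the membership probabilities Pr(s) returned by the subtype clustering model.

```python
def sign_from_cindex(c_index):
    """Assumed convention: +1 if the gene's C-index is > 0.5, else -1."""
    return 1 if c_index > 0.5 else -1


def subtype_risk_score(expression, signature):
    """R(s): signed average of the signature genes' expression.

    expression : dict mapping gene -> expression value x_i
    signature  : dict mapping gene -> weight w_i in {-1, +1}
    """
    if not signature:
        raise ValueError("empty signature")
    total = sum(w * expression[gene] for gene, w in signature.items())
    return total / len(signature)  # divide by the signature size n_Q


def global_risk_score(expression, signatures, subtype_probs):
    """Sum over subtypes of Pr(s) * R(s)."""
    return sum(
        subtype_probs[s] * subtype_risk_score(expression, signatures[s])
        for s in signatures
    )


# Hypothetical patient, genes and subtypes (illustrative values only).
expression = {"g1": 2.0, "g2": 1.0, "g3": 4.0}
signatures = {
    "A": {"g1": sign_from_cindex(0.7), "g2": sign_from_cindex(0.4)},
    "B": {"g3": sign_from_cindex(0.6)},
}
subtype_probs = {"A": 0.75, "B": 0.25}

score_a = subtype_risk_score(expression, signatures["A"])
score_global = global_risk_score(expression, signatures, subtype_probs)
```

Because each subtype score is an average rather than a sum, signatures of different sizes (such as the 63-gene and 22-gene signatures) contribute on a comparable scale before being weighted by Pr(s).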
The global risk score is defined as

$$ R = \sum_{s \in S} \Pr(s)\, R(s) $$

Tools

Bioinformatics software:
- R is a widely used open source language and environment for statistical computing and graphics.
- Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.
- Java Treeview is open source software for clustering visualization.
- BRB Array Tools is a software suite for microarray analysis that works as an Excel macro.

Links

Personal webpage: http://www.ulb.ac.be/di/map/bhaibeka/
Machine Learning Group: http://www.ulb.ac.be/di/mlg
Functional Genomics Unit: http://www.bordet.be/en/services/medical/array/practical.htm
Master in Bioinformatics at ULB and other Belgian universities: http://www.bioinfomaster.ulb.ac.be/
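Returning to the signature stability statistics Stab(k) and Stab_adj(k) from the GENIUS appendix slides, the computation can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `signature_stability`, the representation of each resampling step as a set of selected feature names, and the toy selections are assumptions.

```python
from collections import Counter


def signature_stability(selections, k, n_features, alpha=1.0):
    """Return (Stab(k), Stab_adj(k)) for a signature size k.

    selections : list of m collections, each holding the features
                 selected in one resampling step
    k          : signature size
    n_features : total number of features n
    alpha      : penalty factor (the slides mention alpha = 1 as usual)
    """
    m = len(selections)
    if m == 0 or k <= 0:
        raise ValueError("need at least one sampling step and k >= 1")

    # freq(x): number of sampling steps in which feature x was selected.
    freq = Counter(f for selected in selections for f in set(selected))

    # Sum the k highest frequencies, i.e. freq(x_(1)) ... freq(x_(k)).
    top_k = sorted(freq.values(), reverse=True)[:k]
    stab = sum(top_k) / (k * m)

    # Penalise large signatures, which would otherwise inflate Stab.
    stab_adj = max(0.0, stab - alpha * k / n_features)
    return stab, stab_adj


# Toy example: m = 3 resampling steps, signatures of size k = 2,
# drawn from n = 10 hypothetical features.
selections = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
stab, stab_adj = signature_stability(selections, k=2, n_features=10)
```

In this toy case feature "a" is selected in all three steps and "b" in two, so Stab(2) = 5/6, and the penalty α·k/n = 0.2 is then subtracted to obtain the adjusted value.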