A Predictive Based Regression Algorithm for Gene Network Selection

Stéphane Guerrier (1), Nabil Mili (2) & Samuel Orso (2)
(1) Department of Statistics, University of Illinois at Urbana-Champaign, USA
(2) Research Center for Statistics, University of Geneva, Switzerland
Joint work with Marco Avella Medina (U. Geneva), Yanyuan Ma (USC), Roberto Molinari (U. Geneva)
June 6, 2016

S. Guerrier, N. Mili & S. Orso Panning Algorithm for Gene Selection June 6, 2016 1 / 32

Introduction - Motivation

Gene Selection Problems:
Selection of relevant genes is a common task in most gene expression studies. Researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (Díaz-Uriarte & Alvarez de Andrés, 2006).

How statisticians (typically) understand this definition:
- We are looking for a single model.
- For a given candidate model, picking the most likely parameters given the data is optimal.
- Predictive performance can be measured by the likelihood function (typically out-of-sample).
- The order in which the variables enter the model is unimportant (implying: Model AB is equivalent to Model BA).

Introduction - Motivation

[Figure: equivalence of outcomes according to the likelihood function]

Introduction - Potential drawbacks

Is this a good idea?
According to our understanding of the problem (i.e. a single model based on likelihood methods): YES!

However:
- Focusing on a single model suggests a level of confidence in our final result that is not justified by the data, as other models with a similarly good fit generally exist (Whittingham et al., 2006).
- Maximizing the likelihood function does not guarantee finding the best model(s) (and parameters) according to a given out-of-sample (medically chosen) objective function (e.g. classification error, quality of life, mortality, ...).
- Ignoring the order of the variables can cause interpretation issues.
- These methods are prone to overfitting (due to the asymmetric effects of “under” vs “over” fitting).

Introduction - Random Medical News

This can lead to...
[Figure: Random Medical News]

Introduction - Alternative framework

An alternative approach
The “Panning” algorithm aims to address (some of) these issues. Its goals are the following:
- Finding all models (and parameters) minimizing an out-of-sample (medically chosen) objective function (e.g. classification error, quality of life, mortality, ...).
- Restricting our attention to the models with the smallest dimension.
- Investigating in which order the variables should enter the model (implying: Model AB is NOT equivalent to Model BA).

Introduction - Stylized example

A Stylized Example:
- Consider a typical gene selection problem (e.g. n < 10^2 and p > 10^4).
- Suppose we employ a commonly used technique such as the LASSO and obtain a model containing the following variables: A, B, E, X, Y, C, Z.
- Suppose we could compute the prediction error of all models. Then, one might find that there exists a model with two variables having the smallest prediction error. Moreover, several models have prediction errors that are indistinguishable from it. For example: {A, B}, {A, C}, {E, F}, {A, D}.

Introduction - Stylized example

A Stylized Example (cont):
Next, we could investigate the marginal performance of these models and obtain:
A → B, A → C, E → F, A → D

The “group” of models A → B, A → C and A → D shares variable A and can be understood as a network. Given that A is in the model, B, C and D can be thought of as “synonyms”.
The remaining model, E → F, represents another “network”. This network may have a very different biological meaning than the previous one, but it is equivalent in terms of prediction error.

Introduction - Stylized example

A Stylized Example (cont):
Would it be possible to find a better representation than A → B, A → C, E → F, A → D?
A possible solution (paradigmatic network?) is:
[Diagram: A linked to B, C and D; E linked to F]

Methodology - Model selection paradigm

[Figure] Source: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer, 2009.

Methodology - Model selection paradigm (cont)

[Figure: out-of-sample error against model dimension. Labelled regions and markers: noisy models, good models, frontier (lower bound), best models dimension, indistinguishable models, best models set.]

Methodology - Panning algorithm

Motivations
- Low-dimensional models generalize better.
- There are many models with good performance that are indistinguishable.
- It is impossible to explore every single model. For example, with 60 variables there are 2^60 ≈ 1.15 × 10^18 models to explore. If each model takes 10^-10 seconds to compute, it would take about 4 years to explore them all.

The panning algorithm aims at finding the best set of low-dimensional models in a “reasonable” time.

Methodology - Panning heuristic procedure

Initialisation step
1. Evaluate the performance of all models for a given low dimension.
2. Compute the frontier.
3. Separate variables between “best” (below the frontier) and “noisy” (above the frontier).
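The initialisation step above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `initialisation_step`, the callable `error_fn` (standing in for any out-of-sample performance measure) and the choice of the empirical alpha-quantile of the errors as the frontier are all assumptions made for the example.

```python
from itertools import combinations
from math import ceil


def initialisation_step(variables, error_fn, dim=1, alpha=0.01):
    """Evaluate every model of size `dim` and split variables at the frontier.

    Assumption of this sketch: the frontier is the empirical alpha-quantile
    of the errors, and a variable is "best" if it appears in at least one
    model whose error is at or below the frontier.
    """
    # 1. Evaluate the performance of all models of the given low dimension.
    errors = {model: error_fn(model) for model in combinations(variables, dim)}
    # 2. Compute the frontier (alpha-quantile of the observed errors).
    ranked = sorted(errors.values())
    k = max(1, ceil(alpha * len(ranked)))
    frontier = ranked[k - 1]
    # 3. Separate "best" variables (below the frontier) from "noisy" ones.
    best = {v for model, err in errors.items() if err <= frontier for v in model}
    noisy = set(variables) - best
    return errors, frontier, best, noisy


def toy_error(model):
    """Hypothetical error count: only variables A and B carry signal."""
    return 10 - 4 * ("A" in model) - 4 * ("B" in model)


variables = list("ABCDEFGHIJ")
errors, frontier, best, noisy = initialisation_step(
    variables, toy_error, dim=1, alpha=0.2
)
print(sorted(best), frontier)
```

In practice `error_fn` would be something like a tenfold cross-validation error of a classifier fitted on the selected genes; the toy function is used here only to keep the example self-contained.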
General step
1. Increase the dimension by one.
2. Generate some models by mixing “best” and “noisy” variables.
3. Evaluate their performance.
4. Compute the new frontier.
5. Redefine “best” and “noisy” variables.
6. Repeat steps 1 to 5 until a given dimension is reached.

Methodology - Panning heuristic procedure (cont)

Best set of models
1. Find the lowest dimension on the frontier.
2. Determine the set of best models.
3. Eliminate duplicates.

Why a heuristic?
There is no guarantee that the region containing the best set of models is reached, nor of how closely it is approached. In practice, however, preliminary results are promising.

Methodology - Panning algorithm (cont)

The user can choose:
- The dimensions of interest (where to start and stop).
- The number of models to explore.
- The measure of performance.
- And even more...

We are reproducible!
We are currently developing an R package. A beta version is freely available at https://github.com/SMAC-Group/panning

Acute Leukemia - Estimated Number of Parameters

[Figure: tenfold CV error against subset dimension, showing the quantile q̂_j for α = .01 and the ENet, Golub, NSC, PLR and SVM results.]
Figure: Estimated number of biomarkers necessary to distinguish ALL from AML

Leukemia - ALL versus AML - Classification Error

Method                                                              Tenfold CV error  Test error  Number of genes
Golub                                                               3/38              4/34        50
Support vector machine (with recursive feature elimination)         2/38              1/34        31
Penalised logistic regression (with recursive feature elimination)  2/38              1/34        26
Nearest shrunken centroids                                          2/38              2/34        21
Elastic net                                                         3/38              0/34        45
Panning Algorithm (107)
  Model a                                                           0/38              2/34        2
  Model b                                                           0/38              2/34        2
  Model c                                                           0/38              2/34        2
  [. . .]
  Model averaging                                                                     2/34        2

Table: Summary of Leukemia classification results.

Case Studies - Acute Leukemia and Breast Cancer

Research questions
1. Acute Leukemia: distinction between ALL and AML in a pediatric population (Golub et al., 1999)
2. Breast Cancer: distinction between Estrogen Receptor positive (ER+) and Estrogen Receptor negative (ER-) breast cancers (Chin et al., 2006)

What it is - What it isn't
- What it is: examples from the clinical literature (transcriptome level) in order to illustrate the usefulness of our method.
- What it isn't: a discussion about the etiology of Acute Leukemias and Breast Cancers.

Leukemia - ALL versus AML - Biomarker Network

[Figure: biomarker network. Node labels: L07633_at, X66899_at, HG2815-HT2931_at, U90549_at, U51166_at, M94345_at, L33075_at, M20778_s_at, U57316_at, U49248_at, M27891_at, J03589_at, Z32765_at, Z69881_at, D83920_at, D80006_at, X03934_at, X89109_s_at, M84526_at, M74088_s_at, M28130_rna1_s_at, U84388_at, L07758_at, M33680_at, Y00291_at, U94855_at, X04526_at, U93867_at, M92287_at, X95735_at, HG1612-HT1612_at, U29175_at, S80437_s_at, U32645_at, D17532_at, M60483_rna1_s_at, D78577_s_at, D63506_at, M83233_at, HG3521-HT3715_at]
Figure: Biomarker Network - Leukemia data set
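To make the idea of a biomarker network concrete, the sketch below assembles a small directed network from a set of ordered models, using the stylized example from the introduction (A → B, A → C, E → F, A → D). It is a minimal illustration under stated assumptions, not the procedure used to draw the figure: the helper names `build_network`, `degrees` and `components` are hypothetical, and treating the most connected variable as a "hub" is an assumption of this example.

```python
from collections import defaultdict


def build_network(ordered_models):
    """Link each variable to the variable entering the model right after it."""
    edges = set()
    for model in ordered_models:
        edges.update(zip(model, model[1:]))
    return edges


def degrees(edges):
    """Count the distinct edges touching each variable."""
    deg = defaultdict(int)
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    return dict(deg)


def components(edges):
    """Group variables into networks (connected components, ignoring direction)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Ordered models from the stylized example (order of entry matters).
models = [("A", "B"), ("A", "C"), ("E", "F"), ("A", "D")]
edges = build_network(models)
deg = degrees(edges)
hub = max(deg, key=deg.get)  # assumption: hub = most connected variable
networks = sorted(sorted(group) for group in components(edges))
print(hub, networks)
```

With the stylized models, variable A is the only variable touching more than one edge, and the variables split into two separate networks, {A, B, C, D} and {E, F}.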

Leukemia - ALL versus AML - Biological Interpretation

Three hubs were identified:
1. Cystatin C: a secreted cysteine protease inhibitor. Cystatin C is implicated in cell apoptosis induction by decreasing B-cell leukemia-2 (BCL-2) activity. BCL-2 regulation is known to be implicated in resistant AML (Sakamoto et al., 2015).
2. Zyxin: a zinc-binding phosphoprotein that concentrates at focal adhesions and along the actin cytoskeleton. Zyxin interacts with Vasodilator-Stimulated Phosphoprotein (VASP). VASP is a substrate of the Bcr-Abl oncoprotein, which drives oncogenesis in patients with chronic myeloid leukemia (Bernusso et al., 2015).
3. Complement factor D: a rate-limiting enzyme in the alternative pathway of complement activation. Ratajczak (2014) has stressed the role of the complement cascade as a trigger for the migration of hematopoietic stem cells from bone marrow into blood.

Leukemia - ALL versus AML - Network Organisation

NETWORK 1
Position    Affy ID           Gene ID (ENSG)  Gene Function                                        Biological Process
Position 1  M27891_at         00000101439     Cystatin C                                           AA
Position 2  D80006_at         00000114978     MOB kinase activator 1A                              AA
            M20778_s_at       00000163359     Collagen, type VI, alpha 3                           AA
            U57316_at         00000108773     K(lysine) acetyltransferase 2A                       TF
            U90549_at         00000182952     High mobility group nucleosomal binding domain 4     TF
            X66899_at         00000182944     Ewing Sarcoma region 1; RNA binding protein          TF
            M74088_s_at       00000134982     Adenomatous polyposis coli, DP2, DP3, PPP1R46        TF
            U51166_at         00000139372     Thymine-DNA glycosylase                              TF
            Z69881_at         00000074370     ATPase, Ca++ transporting, ubiquitous                IPT
            U49248_at         00000023839     ATP-binding cassette, sub-family C (CFTR/MRP)        IPT
            X89109_s_at       00000102879     Coronin, actin binding protein, 1A                   IPT
            HG2815-HT2931_at  00000092841     Myosin, Light Chain, Alkali, Smooth Muscle           ACC
            M94345_at         00000042493     Capping protein (actin filament), gelsolin-like      ACC
            L33075_at         00000140575     IQ motif containing GTPase activating protein 1      ACC
            L07633_at         00000092010     Proteasome (prosome, macropain) activator subunit 1  APC
            J03589_at         00000102178     Ubiquitin-like 4A                                    APC
            D83920_at         00000085265     FCN1, Ficolin-1                                      IR
            X03934_at         00000167286     CD3d molecule, delta (CD3-TCR complex)               IR

Leukemia - ALL versus AML - MOB kinase activator 1

MOB kinase activator 1
Hergovich A. MOB control: reviewing a conserved family of kinase regulators. Cell Signal. 2011 Sep; 23(9): 1433-40.

The family of Mps One binder (MOB) co-activator proteins is highly conserved from yeast to man. Loss of dMOB1 resulted in increased cell proliferation and decreased cell death, suggesting that MOB1 acts as a tumour suppressor protein.

Leukemia - ALL versus AML - Thymine-DNA glycosylase

Thymine-DNA glycosylase
Sjolund AB, Senejani AG, Sweasy JB. MBD4 and TDG: multifaceted DNA glycosylases with ever expanding biological roles. Mutat Res. 2013 Mar-Apr; 743-744: 12-25.

The base excision repair system is vital to the repair of endogenous and exogenous DNA damage. This pathway is initiated by one of several DNA glycosylases that recognizes and excises specific DNA lesions in a coordinated fashion. MBD4 has been closely linked to apoptosis, while TDG has been clearly implicated in transcriptional regulation.

Leukemia - ALL versus AML - CD3-TCR complex

CD3-TCR complex
Poggi A, Zocchi MR. Role of bone marrow stromal cells in the generation of human CD8+ regulatory T cells. Hum. Immunol. 2008 Nov; 69(11): 755-9.

Fibroblast-like stromal cells exert a strong inhibitory effect on lymphocyte proliferation, both directly by interacting with responding lymphocytes and indirectly by inducing the generation of regulatory T cells. Upon triggering via the CD3/TCR complex, highly effective CD8(+) regulatory cells strongly inhibit lymphocyte proliferation at a ratio of 1:1 to 1:100 between CD8(+)Reg(c) and responding lymphocytes.

Breast Cancer - Estimated number of parameters

[Figure: Chin et al. breast cancer data; tenfold CV error against dimension.]
Figure: Estimated number of biomarkers necessary to distinguish ER+ from ER- breast cancers

Breast Cancer - Classification Error

Method                                                                                Tenfold CV error  Test error  Number of genes
Support vector machine (with recursive feature elimination)                           0/60              10/58       3/22215
Penalised logistic regression (with forward selection followed by backward deletion)  2/60              12/58       15/22215
Logistic regression (with greedy forward selection)                                   2/60              11/58       2/22215
Nearest shrunken centroids                                                            2/60              11/58       5/22215
Elastic net                                                                           3/60              11/58       196/22215
Panning Algorithm (274)
  Model a                                                                             0/60              9/58        3/22215
  Model b                                                                             2/60              9/58        3/22215
  Model c                                                                             0/60              12/58       3/22215
  [. . .]                                                                                                           3/22215
  Model averaging                                                                                       10/58       3/22215

Table: Summary of Breast Cancer classification results.

Breast Cancer - Biomarker Network

[Figure: biomarker network. Node labels: 205766_at, 205907_s_at, 202498_s_at, 212288_at, 221698_s_at, 209287_s_at, 214318_s_at, 218877_s_at, 210477_x_at, 49049_at, 201316_at, 221696_s_at, 205520_at, 212702_s_at, 220443_s_at, 205152_at, 204902_s_at, 221030_s_at, 214194_at, 216604_s_at, 207303_at, 209604_s_at, 202951_at, 207518_at, 209713_s_at, 201102_s_at, 212195_at, 221955_at, 212956_at, 221901_at, 208964_s_at, 206270_at, 208915_s_at, 214972_at, 201197_at, 210221_at, 208019_at, 216814_at, 219168_s_at, 221103_s_at, 210021_s_at, 219493_at, 204590_x_at]
Figure: Biomarker Network - Breast Cancer data set

Breast Cancer - Biological Interpretation

Three hubs were identified:
1. GATA binding protein 3 (GATA3): a transcription factor regulating the differentiation of breast luminal epithelial cells. GATA3 expression is progressively lost during luminal breast cancer progression as cancer cells acquire a stem cell-like phenotype (Chou et al., 2010).
2. IL6 Signal Transducer (IL6ST): a pro-inflammatory cytokine signal transducer. IL6ST has been linked to breast cancer epithelial-mesenchymal transition and cancer stem cell traits (Chung et al., 2014), and to a cancer-promoting microenvironment (Bohrer et al., 2014).
3. TBC1 domain family, member 9 (TBC1D9): a GTPase-activating protein for Rab family proteins, involved in the expression of the ER in breast tumors. Expression of the ER on the surface of breast tumor cells is highly correlated with the coordinated expression of several genes, including TBC1D9 and GATA3 (Andres et al., 2012).

Breast Cancer - Network Organisation

Position           Affy ID      Gene ID (ENSG)  Gene Function                                     Biological Process
NETWORK 2
Position 1         212195_at    00000134352     IL6 Signal Transducer                             ICT
Positions 2 and 3  202951_at    00000112079     Serine/threonine kinase 38                        CG
                   221955_at    00000088256     Guanine nucleotide binding protein                ITT
                   207303_at    00000154678     Phosphodiesterase 1C, calmodulin-dependent 70kDa  ICT
NETWORK 3
Position 1         212956_at    00000109436     TBC1 domain family, member 9 (with GRAM domain)   IPT
Positions 2 and 3  202951_at    00000112079     Serine/threonine kinase 38                        CG
                   205152_at    00000157103     Solute carrier family 6, member 1                 ST
                   207518_at    00000153933     Diacylglycerol kinase, epsilon 64kDa              ST
Positions 2 and 3  216814_at    00000232267     ACTR3 pseudogene 2                                PUP
                   221103_s_at  00000206530     Cilia and flagella associated protein 44          ACC

Conclusions - Summary

Panning is a new model selection framework. It provides:
- an estimate of the dimension of the problem;
- a set of equivalent models, rather than a single final model. This allows the construction of “paradigmatic networks”;
- an overview of the architecture of the selected models, and not only an unordered list of variables.

This approach makes it easier to give a biological meaning to the set of selected models.

Conclusions - Thanks!

Thank you very much for your attention!
Any questions?

More info:
SMAC-group.com
github.com/SMAC-Group