A Predictive Based Regression Algorithm for Gene Network Selection

Stéphane Guerrier¹, Nabil Mili² & Samuel Orso²
¹ Department of Statistics, University of Illinois at Urbana-Champaign, USA
² Research Center for Statistics, University of Geneva, Switzerland
joint work with
Marco Avella Medina (U. Geneva), Yanyuan Ma (USC),
Roberto Molinari (U. Geneva)
June 6, 2016
S. Guerrier, N. Mili & S. Orso
Panning Algorithm for Gene Selection
June 6, 2016
Introduction
Motivation
Gene Selection Problems:
Selection of relevant genes is a common task in most gene expression
studies. Researchers try to identify the smallest possible set of genes
that can still achieve good predictive performance (Díaz-Uriarte &
Alvarez de Andrés, 2006).
How statisticians (typically) understand this definition:
We are looking for a single model.
For a given candidate model, picking the most likely parameters
given the data is optimal.
Predictive performance can be measured by the likelihood function
(typically out-of-sample).
The order in which the variables enter the model is unimportant
(implying: Model AB is equivalent to Model BA).
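To illustrate the last point, here is a minimal sketch with made-up data (the vectors `A`, `B` and `y` are hypothetical and not taken from the talk). It assumes a Gaussian linear model without intercept and shows that the maximised log-likelihood is the same whether the variables are entered as AB or as BA.

```python
import math


def gaussian_max_loglik(x1, x2, y):
    """Maximised Gaussian log-likelihood of y ~ b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    # Least-squares coefficients from the 2x2 normal equations.
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((v - b1 * a - b2 * b) ** 2 for a, b, v in zip(x1, x2, y))
    n = len(y)
    # Log-likelihood evaluated at the ML estimates (sigma^2 = rss / n).
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)


# Hypothetical toy data, for illustration only.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.1, 2.9, 7.2, 6.8, 11.1]

loglik_ab = gaussian_max_loglik(A, B, y)  # "Model AB"
loglik_ba = gaussian_max_loglik(B, A, y)  # "Model BA"
print(abs(loglik_ab - loglik_ba) < 1e-9)
```

The two values agree up to floating-point rounding, so a likelihood-based criterion cannot tell the two orderings apart.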
Equivalence of outcomes according to the likelihood function
Potential drawbacks
Is this a good idea?
According to our understanding of the problem (i.e. single model based on
likelihood methods): YES! However:
Focusing on a single model suggests a level of confidence in our final
result that is not justified by the data, as other models with a similarly
good fit generally exist (Whittingham et al., 2006).
Maximizing the likelihood function does not guarantee finding the
best model(s) (and parameters) according to a given out-of-sample
(medically chosen) objective function (e.g. classification error, quality
of life, mortality, ...).
Ignoring the order of the variables can cause
interpretation issues.
These methods are prone to overfitting (due to the asymmetric
effects of “under” vs “over” fitting).
Random Medical News
This can lead to...
Alternative framework
An alternative approach
The “Panning” algorithm aims to address (some of) these issues. Its
goals are the following:
Finding all models (and parameters) minimizing an out-of-sample
(medically chosen) objective function (e.g. classification error, quality
of life, mortality, ...).
Restricting our attention to the models with the smallest dimension.
Investigating in which order the variables should enter the model
(implying: Model AB is NOT equivalent to Model BA).
Stylized example
A Stylized Example:
Consider a typical gene selection problem (e.g. n < 10^2 and p > 10^4).
Suppose we employ a commonly used technique such as the LASSO and obtain a
model containing the following variables:
A, B, E, X, Y, C, Z
Suppose we could compute the prediction error of all models. Then, one
might find that there exists a model with two variables that has the smallest
prediction error. Moreover, several models have a prediction error that is
indistinguishable from it. For example:
{A, B}   {A, C}   {E, F}   {A, D}
A Stylized Example (cont):
Next, we could investigate the marginal performance of these models
and obtain:
{A, B}   {A, C}   {E, F}   {A, D}
A → B   A → C   E → F   A → D
The “group” of models A → B, A → C and A → D shares variable A and can be
understood as a network. Given that A is in the model, B, C and
D can be thought of as “synonyms”.
The model E → F represents another “network”. This network may have
a very different biological meaning than the previous one, but it is
equivalent in terms of prediction error.
A Stylized Example (cont):
Would it be possible to find a better representation than:
A → B A → C E → F A → D ?
A possible solution (paradigmatic network?) is:
[Figure: network in which A points to B, C and D, and E points to F]
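The regrouping of the ordered models into networks can be sketched as follows. This is only an illustration under a simplifying assumption: each model is an ordered pair and models are grouped by their first variable. The function name `build_networks` is hypothetical and is not part of the panning package.

```python
from collections import defaultdict

# Ordered two-variable models from the stylized example (first -> second).
ordered_models = [("A", "B"), ("A", "C"), ("E", "F"), ("A", "D")]


def build_networks(pairs):
    """Group ordered pairs by their first variable (assumed to act as the hub)."""
    followers = defaultdict(set)
    for first, second in pairs:
        followers[first].add(second)
    # Sort the followers so that the output is deterministic.
    return {hub: sorted(members) for hub, members in followers.items()}


networks = build_networks(ordered_models)
print(networks)  # {'A': ['B', 'C', 'D'], 'E': ['F']}
```

Here A is the hub of the first network, with B, C and D as interchangeable second variables, while E → F forms a separate network.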
Methodology
Introduction
Model selection paradigm
Source: T. Hastie, R. Tibshirani and J. Friedman The Elements of Statistical Learning, Springer, 2009.
Model selection paradigm (cont.)
[Figure: out-of-sample error against model dimension, showing the noisy models, the good models, the indistinguishable models, the frontier (lower bound), the best models dimension and the best models set]
Algorithm
Panning algorithm
Motivations
Models of low-dimension generalize better.
There are many models with good performance that are
indistinguishable from one another.
It is impossible to explore every single model. For example, with
60 variables there are 2^60 ≈ 1.15 × 10^18 models to explore. If one
model takes 10^-10 seconds to compute, it would take about 4
years to explore them all.
The panning algorithm aims at:
finding the best set of low-dimension models in a “reasonable” time.
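The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given on the slide (60 variables, 10^-10 seconds per model) and assumes a year of 365.25 days.

```python
n_variables = 60
seconds_per_model = 1e-10  # cost of evaluating one model, as stated above

# Every variable is either in or out of a model, hence 2^60 candidate models.
n_models = 2 ** n_variables
total_seconds = n_models * seconds_per_model
years = total_seconds / (365.25 * 24 * 3600)

print(f"{n_models:.3e} models, about {years:.1f} years")
```

The exhaustive search takes between three and four years at that rate, and the count doubles with every additional variable.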
Panning heuristic procedure
Initialisation step
1. Evaluate the performance of all models for a given low dimension.
2. Compute the frontier.
3. Separate variables between “best” (below the frontier) and “noisy”
(above the frontier).
General step
1. Increase the dimension by one.
2. Generate some models by mixing “best” and “noisy” variables.
3. Evaluate their performance.
4. Compute the new frontier.
5. Redefine the “best” and “noisy” variables.
6. Repeat steps 1 to 5 of the general step until a given dimension is reached.
Panning heuristic procedure con’t
Best set of models
1. Find the lowest dimension on the frontier.
2. Determine the set of best models.
3. Eliminate duplicates.
Why a heuristic?
There is no guarantee that the region containing the best set of models is
reached, nor of how well it is explored.
But in practice, preliminary results are promising.
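The procedure can be sketched in a few lines of Python. This is a toy illustration and not the implementation of the panning R package. Several choices are assumptions made for the example: the frontier at each dimension is taken to be the empirical `alpha`-quantile of the evaluated errors, models are drawn by picking each variable from the “best” pool with probability `p_best`, all models are enumerated whenever there are at most `n_models` of them, and `toy_error` is a made-up error function in which variables 0 to 5 play the roles of A to F in the stylized example.

```python
import itertools
import math
import random


def panning(n_vars, error_fn, d_start=1, d_max=3, n_models=30,
            alpha=0.01, p_best=0.8, seed=0):
    """Toy panning search; returns (best dimension, set of best models)."""
    rng = random.Random(seed)
    all_vars = list(range(n_vars))
    frontier, best_models = {}, {}
    best_vars = all_vars
    for d in range(d_start, d_max + 1):
        noisy_vars = [v for v in all_vars if v not in best_vars]
        if d == d_start or math.comb(n_vars, d) <= n_models:
            # Initialisation step (or small enough space): evaluate every model.
            models = {frozenset(c) for c in itertools.combinations(all_vars, d)}
        else:
            # General step: draw models by mixing "best" and "noisy" variables.
            models, attempts = set(), 0
            while len(models) < n_models and attempts < 100 * n_models:
                attempts += 1
                model = set()
                while len(model) < d:
                    use_best = rng.random() < p_best or not noisy_vars
                    model.add(rng.choice(best_vars if use_best else noisy_vars))
                models.add(frozenset(model))  # the set drops duplicate models
        errors = {m: error_fn(m) for m in models}
        ranked = sorted(errors.values())
        # Assumed frontier: empirical alpha-quantile of the errors at dimension d.
        frontier[d] = ranked[max(0, math.ceil(alpha * len(ranked)) - 1)]
        best_models[d] = {m for m, e in errors.items() if e <= frontier[d]}
        # Redefine the "best" variables as those appearing in a best model.
        best_vars = sorted(set().union(*best_models[d]))
    # Best set: lowest dimension that reaches the smallest frontier value.
    best_dim = min(frontier, key=lambda d: (frontier[d], d))
    return best_dim, best_models[best_dim]


def toy_error(model):
    """Made-up out-of-sample error: {A,B}, {A,C}, {A,D} and {E,F} are perfect."""
    if {0, 1} <= model or {0, 2} <= model or {0, 3} <= model or {4, 5} <= model:
        return 0.0
    if 0 in model or 4 in model:
        return 0.3
    return 0.5


best_dim, best = panning(n_vars=8, error_fn=toy_error)
print(best_dim, sorted(sorted(m) for m in best))
```

On this toy problem the search stops at dimension 2 and returns the four equivalent models of the stylized example. In a real application `error_fn` would be an out-of-sample measure such as a tenfold CV error.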
Implementation
Panning algorithm con’t
The user can choose:
The dimensions of interest (where to start and stop).
The number of models to explore.
The measure of performance.
And even more...
We are reproducible!
We are currently developing an R package. A beta version is freely available
at https://github.com/SMAC-Group/panning
Acute Leukemia - Estimated Number of Parameters
[Figure: tenfold CV error against subset dimension, with the quantile q̂_j for α = .01 and the ENet, Golub, NSC, PLR and SVM results marked]
Figure: Estimated number of biomarkers necessary
to distinguish ALL from AML
Leukemia - ALL versus AML - Classification Error
Method                                                               Tenfold CV error   Test error   Number of genes
Golub                                                                3/38               4/34         50
Support vector machine (with recursive feature elimination)          2/38               1/34         31
Penalised logistic regression (with recursive feature elimination)   2/38               1/34         26
Nearest shrunken centroids                                           2/38               2/34         21
Elastic net                                                          3/38               0/34         45
Panning Algorithm (107)
  Model a                                                            0/38               2/34         2
  Model b                                                            0/38               2/34         2
  Model c                                                            0/38               2/34         2
  [...]
  Model averaging                                                                       2/34         2
Table: Summary of Leukemia classification results.
Case Studies
Leukemia
Case Studies - Acute Leukemia and Breast Cancer
Research questions
1. Acute Leukemia
Distinction between ALL and AML in a pediatric population
(Golub et al., 1999)
2. Breast Cancer
Distinction between Estrogen Receptor positive (ER+) and Estrogen Receptor negative (ER-) Breast Cancers
(Chin et al., 2006)
What it is - What it isn’t
What it is: examples from the clinical literature (transcriptome level) that
illustrate the usefulness of our method.
What it isn’t: a discussion about the etiology of Acute Leukemias and Breast
Cancers.
Acute Leukemia - Estimated Number of Parameters
[Figure: tenfold CV error against subset dimension, with the quantile q̂_j for α = .01 and the ENet, Golub, NSC, PLR and SVM results marked]
Figure: Estimated number of biomarkers necessary
to distinguish ALL from AML
Leukemia - ALL versus AML - Biomarker Network
Nodes (Affy IDs): L07633_at, X66899_at, HG2815-HT2931_at, U90549_at, U51166_at, M94345_at, L33075_at, M20778_s_at, U57316_at, U49248_at, M27891_at, J03589_at, Z32765_at, Z69881_at, D83920_at, D80006_at, X03934_at, X89109_s_at, M84526_at, M74088_s_at, M28130_rna1_s_at, U84388_at, L07758_at, M33680_at, Y00291_at, U94855_at, X04526_at, U93867_at, M92287_at, X95735_at, HG1612-HT1612_at, U29175_at, S80437_s_at, U32645_at, D17532_at, M60483_rna1_s_at, D78577_s_at, D63506_at, M83233_at, HG3521-HT3715_at
Figure: Biomarker Network - Leukemia data set
Leukemia - ALL versus AML - Biological Interpretation
Three hubs were identified:
1. Cystatin C: a secreted cysteine protease inhibitor. Cystatin C is
implicated in cell apoptosis induction by decreasing B-cell leukemia-2
(BCL-2) activity. BCL-2 regulation is known to be implicated in
resistant AML (Sakamoto et al., 2015).
2. Zyxin: a zinc-binding phosphoprotein that concentrates at focal
adhesions and along the actin cytoskeleton. Zyxin interacts with
Vasodilator-Stimulated Phosphoprotein (VASP). VASP is a substrate
of the BcrAbl oncoprotein which drives oncogenesis in patients with
chronic myeloid leukemia (Bernusso et al., 2015).
3. Complement factor D: a rate-limiting enzyme in the alternative
pathway of complement activation. Ratajczak (2014) has stressed the
role of the complement cascade as a trigger for the migration of
hematopoietic stem cells from bone marrow into blood.
Leukemia - ALL versus AML - Network Organisation
Affy ID            Gene ID (ENSG)   Gene Function                                         Biological Process
NETWORK 1
Position 1
M27891_at          00000101439      Cystatin C                                            AA
Position 2
D80006_at          00000114978      MOB kinase activator 1A                               AA
M20778_s_at        00000163359      Collagen, type VI, alpha 3                            AA
U57316_at          00000108773      K(lysine) acetyltransferase 2A                        TF
U90549_at          00000182952      High mobility group nucleosomal binding domain 4      TF
X66899_at          00000182944      Ewing Sarcoma region 1; RNA binding protein           TF
M74088_s_at        00000134982      Adenomatous polyposis coli, DP2, DP3, PPP1R46         TF
U51166_at          00000139372      Thymine-DNA glycosylase                               TF
Z69881_at          00000074370      ATPase, Ca++ transporting, ubiquitous                 IPT
U49248_at          00000023839      ATP-binding cassette, sub-family C (CFTR/MRP)         IPT
X89109_s_at        00000102879      Coronin, actin binding protein, 1A                    IPT
HG2815-HT2931_at   00000092841      Myosin, Light Chain, Alkali, Smooth Muscle            ACC
M94345_at          00000042493      Capping protein (actin filament), gelsolin-like       ACC
L33075_at          00000140575      IQ motif containing GTPase activating protein 1       ACC
L07633_at          00000092010      Proteasome (prosome, macropain) activator subunit 1   APC
J03589_at          00000102178      Ubiquitin-like 4A                                     APC
D83920_at          00000085265      FCN1, Ficolin-1                                       IR
X03934_at          00000167286      CD3d molecule, delta (CD3-TCR complex)                IR
Leukemia - ALL versus AML - MOB kinase activator 1
MOB kinase activator 1
Cell Signal. 2011 Sep; 23(9): 1433-40.
MOB control: reviewing a conserved family of kinase regulators.
Hergovich A.
The family of Mps One binder (MOB) co-activator proteins is highly
conserved from yeast to man. Loss of dMOB1 resulted in increased cell
proliferation and decreased cell death, suggesting that MOB1 acts as
a tumour suppressor protein.
Leukemia - ALL versus AML - Thymine-DNA glycosylase
Thymine-DNA glycosylase
Mutat Res. 2013 Mar-Apr; 743-744: 12-25.
MBD4 and TDG: multifaceted DNA glycosylases with ever
expanding biological roles.
Sjolund AB, Senejani AG, Sweasy JB.
The base excision repair system is vital to the repair of endogenous and
exogenous DNA damage. This pathway is initiated by one of several DNA
glycosylases that recognizes and excises specific DNA lesions in a
coordinated fashion. MBD4 has been closely linked to apoptosis, while
TDG has been clearly implicated in transcriptional regulation.
Leukemia - ALL versus AML - CD3-TCR complex
CD3-TCR complex
Hum. Immunol. 2008 Nov; 69(11): 755-9.
Role of bone marrow stromal cells in the generation of human
CD8+ regulatory T cells.
Poggi A, Zocchi MR.
Fibroblast-like stromal cells exert a strong inhibitory effect on lymphocyte
proliferation, both directly by interacting with responding lymphocytes and
indirectly by inducing the generation of regulatory T cells. Upon triggering
via the CD3/TCR complex, highly effective CD8(+) regulatory cells
strongly inhibit lymphocyte proliferation at a ratio of 1:1 to 1:100 between
CD8(+)Reg(c) and responding lymphocytes.
Breast cancer
Breast Cancer - Estimated number of parameters
[Figure: tenfold CV error against dimension for the Chin et al. breast cancer data]
Figure: Estimated number of biomarkers necessary to distinguish ER+ from ER- Breast Cancers
Breast Cancer - Classification Error
Method                                                                                 Tenfold CV error   Test error   Number of genes
Support vector machine (with recursive feature elimination)                            0/60               10/58        3/22215
Penalised logistic regression (with forward selection followed by backward deletion)   2/60               12/58        15/22215
Logistic regression (with greedy forward selection)                                    2/60               11/58        2/22215
Nearest shrunken centroids                                                             2/60               11/58        5/22215
Elastic net                                                                            3/60               11/58        196/22215
Panning Algorithm (274)
  Model a                                                                              0/60               9/58         3/22215
  Model b                                                                              2/60               9/58         3/22215
  Model c                                                                              0/60               12/58        3/22215
  [...]                                                                                                                3/22215
  Model averaging                                                                                         10/58        3/22215
Table: Summary of Breast Cancer classification results.
Breast Cancer - Biomarker Network
Nodes (Affy IDs): 205766_at, 205907_s_at, 202498_s_at, 212288_at, 221698_s_at, 209287_s_at, 214318_s_at, 218877_s_at, 210477_x_at, 49049_at, 201316_at, 221696_s_at, 205520_at, 212702_s_at, 220443_s_at, 205152_at, 204902_s_at, 221030_s_at, 214194_at, 216604_s_at, 207303_at, 209604_s_at, 202951_at, 207518_at, 209713_s_at, 201102_s_at, 212195_at, 221955_at, 212956_at, 221901_at, 208964_s_at, 206270_at, 208915_s_at, 214972_at, 201197_at, 210221_at, 208019_at, 216814_at, 219168_s_at, 221103_s_at, 210021_s_at, 219493_at, 204590_x_at
Figure: Biomarker Network - Breast Cancer data set
Breast Cancer - Biological Interpretation
Three hubs were identified:
1. GATA binding protein 3 (GATA3): a transcription factor regulating
the differentiation of breast luminal epithelial cells. GATA3 expression
is progressively lost during luminal breast cancer progression as cancer
cells acquire a stem cell-like phenotype (Chou et al., 2010).
2. IL6 Signal Transducer (IL6ST): a pro-inflammatory cytokine signal
transducer. IL6ST has been linked to breast cancer epithelial-mesenchymal
transition and cancer stem cell traits (Chung et al., 2014), and to a
cancer-promoting microenvironment (Bohrer et al., 2014).
3. TBC1 domain family, member 9 (TBC1D9): a GTPase-activating
protein for Rab family proteins involved in the expression of the ER in
breast tumors. Expression of the ER on the surface of breast tumor
cells is highly correlated with the coordinate expression of different
genes, among which TBC1D9 and GATA3 (Andres et al., 2012).
Breast Cancer - Network Organisation
Affy ID       Gene ID (ENSG)   Gene Function                                      Biological Process
NETWORK 2
Position 1
212195_at     00000134352      IL6 Signal Transducer                              ICT
Positions 2 and 3
202951_at     00000112079      Serine/threonine kinase 38                         CG
221955_at     00000088256      Guanine nucleotide binding protein                 ITT
207303_at     00000154678      Phosphodiesterase 1C, calmodulin-dependent 70kDa   ICT
NETWORK 3
Position 1
212956_at     00000109436      TBC1 domain family, member 9 (with GRAM domain)    IPT
Positions 2 and 3
202951_at     00000112079      Serine/threonine kinase 38                         CG
205152_at     00000157103      Solute carrier family 6, member 1                  ST
207518_at     00000153933      Diacylglycerol kinase, epsilon 64kDa               ST
Positions 2 and 3
216814_at     00000232267      ACTR3 pseudogene 2                                 PUP
221103_s_at   00000206530      Cilia and flagella associated protein 44           ACC
Conclusions
Summary
Panning is a new model selection framework. It provides:
an estimate of the dimension of the problem;
a set of equivalent models, rather than a single final model. This
allows the construction of “paradigmatic networks”.
an overview of the architecture of the selected models, and not
only an unordered list of variables.
This approach makes it easier to give a biological meaning to the set of
selected models.
Thanks!
Thank you very much for your attention!
Any questions?
More info...
SMAC-group.com
github.com/SMAC-Group