A “GENIUS” Approach for Breast Cancer Prognostication
Benjamin Haibe-Kains (1, 2)
1 Functional Genomics Unit, Institut Jules Bordet
2 Machine Learning Group, Université Libre de Bruxelles
Talk at Karolinska Institute
May 18, 2009
Research Groups
Functional Genomics Unit - Jules Bordet Institute
Head: Prof. Christos Sotiriou.
8 researchers (1 prof, 5 postDocs, 2 PhD students), 5 technicians.
Research topics : Genomic analyses, clinical studies and translational
research.
Website :
http://www.bordet.be/en/services/medical/array/practical.htm.
National scientific collaborations : ULB, Erasme, ULg, Gembloux
International scientific collaborations : Genome Institute of Singapore,
John Radcliffe Hospital, Karolinska Institute and Hospital, MD
Anderson Cancer Center, Netherlands Cancer Institute, Swiss Institute
of Bioinformatics, NCI/NIH, Gustave-Roussy Institute, IDDI.
Research Groups
Machine Learning Group - Université Libre de Bruxelles
Head: Prof. Gianluca Bontempi.
10 researchers (2 prof, 4 postDoc, 4 PhD students), 4 graduate
students.
Research topics : Bioinformatics, Classification, Regression, Time
series prediction, Sensor networks.
Website : http://www.ulb.ac.be/di/mlg.
Scientific collaborations inside ULB : IRIDIA (Sciences Appliquées),
Physiologie Moléculaire de la Cellule (IBMM), Conformation des
Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI
(Sciences), Functional Genomics Unit (Institut Jules Bordet), Service
d’Anesthésie (Erasme).
Scientific collaborations outside ULB : UCL Machine Learning Group
(B), Politecnico di Milano (I), Università del Sannio (I), George
Mason University (US).
Outline
1 Introduction
  - Breast Cancer
  - Traditional Approach for Prognostication
  - Gene Expression Profiling Approach for Prognostication
2 Global Prognostic Gene Signatures
3 Breast Cancer Molecular Subtypes
4 Local Prognostic Gene Signatures
5 Conclusions
Introduction
Breast Cancer
Huge public health issue:
- One of the most frequently diagnosed malignancies in the Western world.
- 1 out of 9 women will develop breast cancer during her lifetime.
Prognostication: input (clinical or microarray data), output (survival data).
[Figure: patient timeline from diagnosis through breast surgery + radiotherapy
and a follow-up of 5-10 years, ending in recurrence or remission; the prognosis
is the unknown ("?") to be predicted.]
Traditional Approach for Prognostication
[Figure labels: clinical variables (histological grade, tumor size, nodal
status, age, ER IHC, PGR IHC, HER2 IHC, HER2 FISH); NPI, AOL and NIH/St Gallen
guidelines; prognostication; prediction (hormonotherapy, chemotherapy).]
NPI = 0.2 × tumor size + grade + nodal status
AOL uses a life table technique considering age, nodal status, tumor size, and
ER.
Problem: Prognostic clinical models predict numerous low-risk patients with early breast
cancer (nodal status = 0) as high-risk.
⇒ Overtreatment
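The NPI formula above is simple enough to sketch in a few lines. The following is a minimal illustration only: the function name is hypothetical, and the units (tumor size in centimetres) and the ordinal coding of grade and nodal status are assumptions that the slide does not state.

```python
def nottingham_prognostic_index(tumor_size_cm: float, grade: int, nodal_status: int) -> float:
    """NPI = 0.2 x tumor size + grade + nodal status, as written on the slide.

    Assumptions (not stated on the slide): tumor size is in centimetres,
    grade and nodal status are ordinal scores.
    """
    if tumor_size_cm < 0:
        raise ValueError("tumor size must be non-negative")
    return 0.2 * tumor_size_cm + grade + nodal_status


# Hypothetical patient: 2 cm tumor, grade 2, nodal status score 1.
npi = nottingham_prognostic_index(2.0, 2, 1)
print(f"NPI = {npi:.2f}")
```

A higher index corresponds to a worse expected outcome; the cut-offs used to define risk groups are not given in the talk and are therefore left out here.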
Gene Expression Profiling
[Figure: microarray workflow: microarray chip, hybridization, detection.]
Gene Expression Profiling Approach for Prognostication
Improvement of breast cancer prognostication by using machine
learning techniques to analyze microarray and survival data.
Our methodology is composed of three main parts:
1 Global prognostic gene signature identification
2 Breast cancer molecular subtypes identification
3 Local prognostic gene signature identification
Global Prognostic Gene Signatures
Global Prognostic Gene Signature Identification
Objective: Identification of prognostic gene signatures without
taking into account the presence of molecular subtypes.
The idea is to identify prognostic gene signatures and their
corresponding risk prediction models exhibiting the following
characteristics:
- Good performance with independent data.
- Interpretable from a biological point of view.
- Usable with data generated by different microarray platforms and/or
  normalization techniques.
Global Prognostic Gene Signature Identification
Methodology
[Figure: pipeline. Inputs: clinical data (clinical outcome) and raw microarray
data of breast cancer patients. Steps: data preprocessing, giving normalized
gene expressions; feature selection (stability-based feature selection), giving
the signature; risk prediction modeling (robust model building), giving the
risk predictions.]
Breast Cancer Molecular Subtypes
Molecular Subtypes Identification
Objective: Robust identification of breast cancer molecular
subtypes along with estimation of classification uncertainty.
The idea is to extend the study of Perou et al. by building a clustering
model exhibiting the following characteristics:
- Estimation of classification uncertainty.
- Good performance in independent datasets.
- Usable with data generated by different microarray platforms and/or
  normalization techniques.
Molecular Subtypes Identification
[Figure: patients plotted by the expression of gene 1 and gene 2 and grouped
into subtypes 1, 2 and 3; patients lying between the clusters (marked "?")
illustrate classification uncertainty.]
Molecular Subtypes Identification
Methodology
[Figure: pipeline. Input: raw microarray data of breast cancer patients.
Steps: data preprocessing, giving normalized gene expressions; dimension
reduction (prototype-based feature transformation), giving gene module scores;
clustering (subtype clustering), giving the subtypes.]
Dimension Reduction
Aim: To reduce the dimensionality of microarray data while keeping
information relevant for subtype identification.
We used a prototype-based feature transformation
[Desmedt et al., 2008, Haibe-Kains, 2009] to identify sets of genes
specifically co-expressed with the prototypes of key biological processes in
breast cancer.
The method is composed of 2 main steps:
1 Prototype-based clustering: Selection of prototypes and identification
  of gene modules.
2 Summarization: Each cluster is summarized by a single feature called
  gene module score (weighted average of genes in cluster).
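The summarization step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights are taken to be signed (+1 or -1 for the direction of co-expression with the prototype), the score is normalized by the sum of absolute weights, and the gene names and values are hypothetical.

```python
def gene_module_score(expression, weights):
    """Weighted average of the expression of the genes in one module.

    expression: dict mapping gene name -> normalized expression value
    weights:    dict mapping gene name -> signed weight (assumed +1 / -1)
    """
    total_weight = sum(abs(w) for w in weights.values())
    if total_weight == 0:
        raise ValueError("module has no weighted genes")
    return sum(w * expression[gene] for gene, w in weights.items()) / total_weight


# Hypothetical module of three genes; gene_c is anti-correlated with the prototype.
module_weights = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": -1.0}
patient_expression = {"gene_a": 1.0, "gene_b": 0.5, "gene_c": -0.3, "gene_d": 2.0}

score = gene_module_score(patient_expression, module_weights)
print(f"module score = {score:.3f}")
```

Genes outside the module (gene_d here) are ignored, which is what makes the transformation a dimension reduction: thousands of probes collapse to one score per module.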
Gene Modules
Results
We applied this method with 7 prototypes [Hanahan and Weinberg, 2000]
in two large datasets (NKI & VDX) [Desmedt et al., 2008].
[Figure: the seven prototypes and the biological processes they represent.]
Gene module   Biological process   Size
ESR1          ER signaling         468
AURKA         Proliferation        228
STAT1         Immune response       94
PLAU          Tumor invasion        67
ERBB2         HER2 signaling        27
VEGF          Angiogenesis          13
CASP3         Apoptosis              8
Subtype Clustering
Aim: To develop a robust clustering model able to estimate
classification uncertainty.
Low dimensional input space defined by ESR1 and ERBB2 module
scores (see [Kapp et al., 2006]).
We developed a model-based clustering:
- Probability to belong to subtype j:
\[
\Pr(j \mid x_i) = \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{l=1}^{m} \pi_l \, \mathcal{N}(x_i; \mu_l, \Sigma_l)}
\]
  where $m$ is the number of Gaussians and $\pi_j$ is the prior probability of
  $x_i$ being generated by the j-th Gaussian $\mathcal{N}(x_i; \mu_j, \Sigma_j)$.
- Hyperparameters:
  - Number of clusters (Gaussians): BIC [Schwarz, 1978].
  - Means $\mu_j$ and covariance matrices $\Sigma_j$: EM algorithm
    [Dempster et al., 1977].
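The posterior probability above can be evaluated directly once the mixture parameters are known. The sketch below assumes a two-dimensional input (ESR1 and ERBB2 module scores) and uses made-up priors, means and covariances purely for illustration; fitting them with EM and choosing the number of Gaussians with BIC is not shown.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a bivariate Gaussian at point x, for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form dx^T cov^{-1} dx, using the closed-form 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def subtype_posteriors(x, priors, means, covs):
    """Pr(j | x) for every Gaussian j of the mixture."""
    weighted = [p * gaussian_pdf_2d(x, m, s) for p, m, s in zip(priors, means, covs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical mixture parameters in the (ESR1, ERBB2) module score space.
SUBTYPES = ["ER-/HER2-", "HER2+", "ER+/HER2-"]
PRIORS = [0.20, 0.15, 0.65]
MEANS = [(-1.0, -0.5), (-0.3, 1.2), (0.8, -0.3)]
COVS = [[[0.25, 0.0], [0.0, 0.25]]] * 3

posteriors = subtype_posteriors((0.8, -0.3), PRIORS, MEANS, COVS)
for name, p in zip(SUBTYPES, posteriors):
    print(f"{name}: {p:.3f}")
```

Because the output is a probability vector rather than a single label, a tumor lying between two clusters receives intermediate probabilities, which is the classification uncertainty the slide refers to.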
Subtype Clustering
Pros & Cons
Pros
- The low dimensional space increases the model robustness (low
  feature-to-sample ratio).
- The low dimensional space facilitates the representation of the results
  and their interpretation.
- Estimation of classification uncertainty + ability to perform soft
  partitioning of the data.
- Easy to use this clustering model to predict the subtype of the tumor
  of a new patient (unlike hierarchical clustering).
Cons
- Although the use of ESR1 and ERBB2 module scores is beneficial (see
  above), it prevents us from finding any new subtypes.
Subtype Clustering
Results
Training set: 344 patients with early breast cancer (VDX).
Validation set:
- Clustering: 17 independent datasets (> 3400 patients).
- Prognosis: 745 untreated patients with early breast cancer from 5
  datasets (NKI, TBG, UPP, UNT and MAINZ).
Subtype Clustering Model
Training set (VDX)
[Figure: BIC as a function of the number of clusters; density and scatter plot
of the patients in the ESR1 / ERBB2 module score space, with three clusters
labeled ER-/HER2-, HER2+ and ER+/HER2-.]
Subtype Clustering Model
Independent datasets
[Figure: scatter plots of ESR1 versus ERBB2 module scores for the NKI, TBG,
UPP, UNT, UNC2 and MAINZ datasets, with patients assigned to the ER-/HER2-,
HER2+ and ER+/HER2- subtypes.]
Subtype Clustering
Prognosis of early breast cancer patients
[Figure: survival curves (probability of survival versus time in years, 0 to
10) for the ER-/HER2-, HER2+ and ER+/HER2- subtypes.]
No. at risk:
Time (years)    0    1    2    3    4    5    6    7    8    9   10
ER-/HER2-     116  108   91   83   78   71   68   64   53   46   37
HER2+         105   97   91   81   73   69   64   58   52   47   44
ER+/HER2-     503  494  474  450  425  401  355  312  276  251  217
Local Prognostic Gene Signatures
Local Prognostic Gene Signature Identification
Objective: Identification of prognostic gene signatures taking
into account the presence of molecular subtypes.
[Wang et al., 2005] conducted a study similar to [van’t Veer et al., 2002]
but took into account the molecular heterogeneity of breast
cancer with respect to ER status (first local signatures).
⇒ Prognostic gene signature: GENE76.
Open issues:
- The local models are developed for two subgroups (ER- and ER+)
  only, without considering the heterogeneity of the HER2+ subgroup.
- Hard partitioning of the population of breast cancer patients.
- The local model for the ER- subgroup was trained on few samples and
  yielded poor performance in validation studies
  [Foekens et al., 2006, Desmedt et al., 2007].
Local Prognostic Gene Signature Identification
Methodology
[Figure: pipeline. Inputs: clinical data (clinical outcome) and microarray data
of breast cancer patients. Steps: data preprocessing, giving normalized gene
expressions; subtype clustering, giving the membership probabilities
{Pr(1), Pr(2), ..., Pr(m)} for subtypes S1, S2, ..., Sm; for each subtype,
feature selection (subtype signature) and model building (subtype risk score);
combination of the subtype risk scores into the risk predictions (local risk
prediction modeling).]
Local Model Network
Aim: To develop a prognostic model taking into account breast cancer molecular subtypes.
We adopted a divide-and-conquer strategy, through a modular
modeling approach, and developed a Local Model Network (fuzzy)
[Johansen and Foss, 1993] for prognostication:
\[
r = \sum_{j=1}^{m} \rho_j(x, \theta_j) \, h_j(x, \alpha_j)
\]
where $x$ and $r$ stand for the gene expressions and the patient's risk
respectively, $\rho_j(x, \theta_j)$ is the j-th basis function with parameters
$\theta_j$, and $h_j(x, \alpha_j)$ is the j-th local risk prediction model with
parameters $\alpha_j$.
The $\rho_j$ are constrained to satisfy
\[
\sum_{j=1}^{m} \rho_j(x, \theta_j) = 1
\]
Local Model Network
Example of LMN:
[Figure: surface plots of the basis functions, of the local models, and of the
resulting Local Model Network.]
The j-th basis function is defined as the probability to belong to the j-th
breast cancer molecular subtype (subtype clustering):
\[
\rho_j(x) = \frac{\pi_j \, \mathcal{N}(x; \mu_j, \Sigma_j)}{\sum_{l=1}^{m} \pi_l \, \mathcal{N}(x; \mu_l, \Sigma_l)}
\]
The j-th local model is defined by our robust model:
\[
h_j(x) = \sum_{i=1}^{k(j)} \beta'_i x_i
\]
where signature size $k$ now depends on subtype $j$ (local feature selection).
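The way a Local Model Network combines its parts can be sketched as below. This is a minimal illustration, not the GENIUS implementation: the membership probabilities, gene names and coefficients are hypothetical, and the local models are plain linear combinations of the genes in each subtype signature.

```python
def linear_local_model(expression, coefficients):
    """h_j(x): linear combination of the genes in the j-th subtype signature."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def local_model_network_risk(memberships, local_risks):
    """r = sum_j rho_j(x) * h_j(x), with the rho_j summing to one."""
    if abs(sum(memberships) - 1.0) > 1e-9:
        raise ValueError("basis functions must sum to one")
    return sum(rho * h for rho, h in zip(memberships, local_risks))


# Hypothetical patient and hypothetical per-subtype signatures.
patient = {"gene_a": 1.0, "gene_b": -0.5, "gene_c": 0.2}
signatures = [
    {"gene_a": 1.0},                  # local model for subtype 1
    {"gene_a": 1.0, "gene_b": 1.0},   # local model for subtype 2
    {"gene_c": -1.0},                 # local model for subtype 3
]
memberships = [0.1, 0.2, 0.7]  # rho_j(x), e.g. from the subtype clustering model

local_risks = [linear_local_model(patient, s) for s in signatures]
risk = local_model_network_risk(memberships, local_risks)
print(f"local risks = {local_risks}, combined risk = {risk:.3f}")
```

Each local model is linear, but because the weights rho_j(x) themselves depend on x, the combined risk is a nonlinear function of the gene expressions.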
Local Feature Selection
In order to take advantage of the subtypes identification, we adapted
the stability-based feature ranking to identify the genes prognostic in
specific subtypes.
We introduced a weighted version of the concordance index:
\[
C_{\mathrm{wted}}(x_i, y, \rho_j) = \frac{\sum_{k,l \in \Omega} w_{kl} \, 1\{x_{ki} > x_{li}\}}{\sum_{k,l \in \Omega} w_{kl}}
\]
where $x_i$ are the expressions of gene $i$, $y$ are the survival data and
$w_{kl} = \rho_j(x_k) \rho_j(x_l)$ is the weight for the pair of comparable
patients $\{k, l\}$ in $\Omega$, with $\rho_j(x)$ being the probability for a
patient's tumor $x$ to belong to the subtype $j$.
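A direct, quadratic-time sketch of the weighted concordance index follows. The slide does not spell out how the set of comparable pairs is built; the sketch assumes the usual convention that a pair (k, l) is comparable when patient k had an event and a shorter follow-up time than patient l, and it ignores ties. All data are hypothetical.

```python
def weighted_concordance_index(x, time, event, membership):
    """Weighted C-index of one gene within one subtype.

    x:          expression of the gene for each patient
    time:       follow-up time for each patient
    event:      1 if the event was observed, 0 if censored
    membership: rho_j(x_k), probability that patient k belongs to subtype j
    """
    numerator = 0.0
    denominator = 0.0
    n = len(x)
    for k in range(n):
        if not event[k]:
            continue  # a censored patient cannot be the "earlier event" of a pair
        for l in range(n):
            if time[k] < time[l]:  # assumed definition of a comparable pair
                w = membership[k] * membership[l]
                denominator += w
                if x[k] > x[l]:
                    numerator += w
    if denominator == 0:
        raise ValueError("no comparable pair with non-zero weight")
    return numerator / denominator


# Hypothetical data for three patients.
expr = [1.0, 3.0, 2.0]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]

c_all = weighted_concordance_index(expr, times, events, [1.0, 1.0, 1.0])
c_sub = weighted_concordance_index(expr, times, events, [0.0, 1.0, 1.0])
print(f"unweighted: {c_all:.3f}, weighted towards patients 2 and 3: {c_sub:.3f}")
```

With uniform memberships the function reduces to the ordinary concordance index; setting the first patient's membership to zero removes every pair that involves that patient, so the same gene can look uninformative globally and informative within a subtype.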
Local Model Network
Pros & Cons
Pros
- Use of robust linear risk prediction models to perform nonlinear
  modeling.
- Easy interpretation for doctors (breast cancer molecular subtypes +
  corresponding risk predictions).
- Potentially more biological insights since the molecular heterogeneity of
  breast cancer is taken into account.
Local Prognostic Gene Signature Identification
GENIUS
GENIUS = Gene Expression progNostic Index Using Subtypes.
- Aim: Refining breast cancer prognosis according to molecular subtypes.
GENIUS refers to:
- The local prognostic gene signatures identified through local
  stability-based feature ranking.
- The risk prediction model using robust model building and Local Model
  Network.
Training set:
- Input data: 22,283 (1050 after filtering) gene expressions of 344
  untreated patients with early breast cancer (VDX).
- Output data: Survival data.
Validation set: 745 untreated patients with early breast cancer from 5
datasets (NKI, TBG, UPP, UNT and MAINZ).
GENIUS
[Figure: GENIUS pipeline. Inputs: clinical data (clinical outcome) and
microarray data of breast cancer patients. Steps: data preprocessing, giving
normalized gene expressions; subtype clustering, giving {Pr(1), Pr(2), Pr(3)}
for the ER-/HER2-, HER2+ and ER+/HER2- subtypes; a subtype signature for each
subtype (feature selection for ER-/HER2- and HER2+, AURKA for ER+/HER2-);
model building, giving one risk score per subtype; combination of the subtype
risk scores into the GENIUS risk predictions.]
GENIUS
Results
Local prognostic gene signatures identified for the ER-/HER2- and
HER2+ subtypes involved different genes but are enriched in genes
related to immune response.
GENIUS yielded significantly better performance than all the
state-of-the-art prognostic gene signatures in the global population.
GENIUS was not significantly superior in all subtypes but it was the
only signature that performed well whatever the subtype.
GENIUS significantly outperformed prognostic clinical models in the
global population. Its superiority almost reached significance in all the
subtypes.
GENIUS
Performance comparison with prognostic gene signatures
[Figure: forest plot of the concordance index (axis from 0.2 to 0.8) of GENIUS,
AURKA, GGI, STAT1, PLAU, IRMODULE and SDPP in all patients and in each subtype.]
Test for GENIUS superiority (p-values):
Signature   ALL     ER+/HER2-   ER-/HER2-   HER2+
AURKA       0.012   0.25        0.0051      0.25
GGI         0.018   0.27        0.016       0.039
STAT1       5E-12   2E-6        0.1         0.15
PLAU        9E-14   5E-9        0.0037      0.10
IRMODULE    1E-4    0.065       0.44        0.62
SDPP        0.015   0.094       0.028       0.39
GENIUS
Results
Local prognostic gene signatures identified for the ER-/HER2- and
HER2+ subtypes involved different genes but shared genes related to
immune response.
GENIUS yielded significantly better performance than all the
state-of-the-art prognostic gene signatures in the global population.
GENIUS was not significantly superior in all subtypes but it was the
only signature that performed well whatever the subtype.
GENIUS significantly outperformed prognostic clinical models in the
global population. Its superiority did not reach significance in all the
subtypes.
GENIUS
Performance comparison with prognostic clinical models
[Figure: forest plot of the concordance index (axis from 0.2 to 0.8) of GENIUS,
AOL and NPI in all patients and in each subtype.]
Test for GENIUS superiority (p-values):
Model   ALL      ER+/HER2-   ER-/HER2-   HER2+
AOL     0.0013   0.043       0.037       0.082
NPI     0.02     0.19        0.03        0.077
Conclusions
Biological Insights
From global prognostic gene signatures we highlighted the importance
of proliferation when robustly quantified through gene expression
profiling [Sotiriou et al., 2006, Wirapati et al., 2008, Desmedt et al., 2008].
From local prognostic gene signatures we showed that the prognostic
factors dramatically depend on the molecular subtypes, i.e.
proliferation for ER+/HER2-, immune response for ER-/HER2- and
HER2+ and angiogenesis for HER2+
[Wirapati et al., 2008, Desmedt et al., 2008, Haibe-Kains, 2009].
Since the ER+/HER2- subtype is the most common one (≈ 60% of
the patients), we showed that the prognostic ability of most global
gene signatures (e.g. GENE70, GENE76 or GGI) is driven by
proliferation-related genes
[Wirapati et al., 2008, Desmedt et al., 2008, Haibe-Kains et al., 2008b].
GENIUS Approach
These findings suggested that a local modeling strategy taking into
account the presence of molecular subtypes might improve breast
cancer prognosis.
We showed that GENIUS yielded promising performance in
independent datasets.
The next step is to compare this fuzzy approach with a deterministic
one to raise confidence in this local modeling strategy.
Translational Research
Improving the clinical value of tumor grading
Clinical utility of the Genomic Grade index
MapQuant Dx™ Genomic Grade is the cornerstone of the MapQuant Dx™ assay
series. It is the very first microarray-based and clinically validated
molecular diagnostic test to accurately measure tumor grade, a consensus
indicator of tumor proliferation, risk of metastasis and response to
chemotherapy.
Resolving histological grading uncertainty
Tumor grade is a decision factor in most national and international guidelines
for breast cancer treatment. It is generally recommended to treat high-grade
("grade 3") breast carcinomas with chemotherapy because they are chemosensitive
and will often recur otherwise. By contrast, most low-grade ("grade 1") tumors
should not be treated with chemotherapy because they have a good prognosis and
are often chemo-insensitive.
A critical clinical issue is how to treat the 50% of breast cancers graded
today as intermediate grade. The MapQuant Dx™ Genomic Grade test resolves more
than 80% of these uncertain "grade 2" tumors into "grade 1" or "grade 3"
tumors, potentially sparing tens of thousands of patients a year unnecessary
chemotherapy.
Related publications:
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P,
Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D,
Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M.
Gene expression profiling in breast cancer: understanding the molecular basis
of histologic grade to improve prognosis.
J Natl Cancer Inst. 2006 Feb 15;98(4):262-72.
Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P,
Harris A, Bergh J, Foekens JA, Klijn JG, Larsimont D, Buyse M, Bontempi G,
Delorenzi M, Piccart MJ, Sotiriou C.
Definition of clinically distinct molecular subtypes in estrogen
receptor-positive breast carcinomas through genomic grade.
J Clin Oncol. 2007 Apr 1;25(10):1239-46.
Software
survcomp R package available from CRAN:
- Implementation of 6 performance criteria:
  1 Hazard ratio
  2 D index
  3 Concordance index
  4 Time-dependent ROC curve
  5 Cross-validated partial likelihood
  6 Brier score
- Implementation of 2 statistical tests for performance comparison:
  - Paired Student t test for 1, 2, and 3.
  - Wilcoxon signed-rank test for 4, 5, and 6.
Sweave code: The LaTeX and R code performing the whole analysis
of the microarray and survival data is publicly available for
[Haibe-Kains et al., 2008a, Haibe-Kains et al., 2008b].
Future of Bioinformatics
Integrative bioinformatics
[Figure: for each technology (Technology 1, Technology 2, ..., Technology q),
several datasets are combined by a meta-analysis; the meta-analyses of the
different technologies are then combined in an integrative analysis.]
Thank you for your attention.
This presentation is available from
http://www.ulb.ac.be/di/map/bhaibeka/papers/haibekains2009genius_talk_ki.pdf.
Bibliography
Bibliography I
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, B, 39(1):1–38.
Desmedt, C., Haibe-Kains, B., Wirapati, P., Buyse, M., Larsimont, D., Bontempi, G.,
Delorenzi, M., Piccart, M., and Sotiriou, C. (2008).
Biological Processes Associated with Breast Cancer Clinical Outcome Depend on the
Molecular Subtypes.
Clin Cancer Res, 14(16):5158–5165.
Desmedt, C., Piette, F., Loi, S., Wang, Y., Lallemand, F., Haibe-Kains, B., Viale, G.,
Delorenzi, M., Zhang, Y., d’Assignies, M. S., Bergh, J., Lidereau, R., Ellis, P., Harris,
A. L., Klijn, J. G., Foekens, J. A., Cardoso, F., Piccart, M. J., Buyse, M., and Sotiriou, C.
(2007).
Strong Time Dependence of the 76-Gene Prognostic Signature for Node-Negative Breast
Cancer Patients in the TRANSBIG Multicenter Independent Validation Series.
Clin Cancer Res, 13(11):3207–3214.
Eleuteri, A., Tagliaferri, R., Milano, L., De Placido, S., and De Laurentiis, M. (2003).
A novel neural network-based survival analysis model.
Neural Netw, 16(5-6):855–864.
Bibliography II
Foekens, J. A., Atkins, D., Zhang, Y., Sweep, F. C., Harbeck, N., Paradiso, A., Cufer, T.,
Sieuwerts, A. M., Talantov, D., Span, P. N., Tjan-Heijnen, V. C., Zito, A. F., Specht, K.,
Höfler, H., Golouh, R., Schittulli, F., Schmitt, M., Beex, L. V., Klijn, J. G., and Wang,
Y. (2006).
Multicenter validation of a gene expression–based prognostic signature in lymph
node–negative primary breast cancer.
Journal of Clinical Oncology, 24(11).
Green, B. F. (1977).
Parameter sensitivity in multivariate methods.
Multivariate Behavioral Research, 12(3):263–287.
Haibe-Kains, B. (2009).
Identification and Assessment of Gene Signatures in Human Breast Cancer.
PhD thesis, Université Libre de Bruxelles.
Haibe-Kains, B., Desmedt, C., Loi, S., Delorenzi, M., Sotiriou, C., and Bontempi, G.
(2008a).
Computational Intelligence in Clinical Oncology : Lessons Learned from an Analysis of a
Clinical Study, volume 122 of Studies in Computational Intelligence, chapter 10, pages
237–268.
Springer-Verlag Berlin/Heidelberg.
Bibliography III
Haibe-Kains, B., Desmedt, C., Sotiriou, C., and Bontempi, G. (2008b).
A comparative study of survival models for breast cancer prognostication based on
microarray data: does a single gene beat them all?
Bioinformatics, 24(19):2200–2208.
Hanahan, D. and Weinberg, R. A. (2000).
The hallmarks of cancer.
Cell, 100:57–70.
Harrell, F. J., Lee, K., and Mark, D. (1996).
Multivariable prognostic models: issues in developing models, evaluating assumptions and
adequacy, and measuring and reducing errors.
Stat Med, 15(4):361–387.
Johansen, T. A. and Foss, B. A. (1993).
Constructing NARMAX models using ARMAX models.
International Journal of Control, 58:1125–1153.
Kapp, A., Jeffrey, S., Langerod, A., Borresen-Dale, A.-L., Han, W., Noh, D.-Y., Bukholm,
I., Nicolau, M., Brown, P., and Tibshirani, R. (2006).
Discovery and validation of breast cancer subtypes.
BMC Genomics, 7(1):231.
Bibliography IV
Kittler, J., Hatef, M., Duin, R., and Matas, J. (1998).
On combining classifiers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–238.
Loi, S., Haibe-Kains, B., Desmedt, C., Wirapati, P., Lallemand, F., Tutt, A., Gillet, C.,
Ellis, P., Ryder, K., Reid, J., Daidone, M., Pierotti, M., Berns, E., Jansen, M., Foekens, J.,
Delorenzi, M., Bontempi, G., Piccart, M., and Sotiriou, C. (2008).
Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer
treated with tamoxifen.
BMC Genomics, 9(1):239.
Meyer, P. E. (2008).
Information-Theoretic Variable Selection and Network Inference from Microarray Data.
PhD thesis, Université Libre de Bruxelles.
Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack,
J. R., Ross, D. T., Johnsen, H., Akslen, L. A., Fluge, O., Pergamenschikov, A., Williams,
C., Zhu, S. X., Lonning, P. E., Borresen-Dale, A.-L., Brown, P. O., and Botstein, D.
(2000).
Molecular portraits of human breast tumours.
Nature, 406(6797):747–752.
Bibliography V
Schwarz, G. (1978).
Estimating the dimension of a model.
Annals of Statistics, 6:461–464.
Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T.,
Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C.,
Brown, P. O., Botstein, D., Lonning, P. E., and Borresen-Dale, A. L. (2001).
Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical
implications.
Proc. Natl. Acad. Sci. USA, 98(19):10869–10874.
Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S.,
Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C., Lonning, P. E., Brown, P. O.,
Borresen-Dale, A. L., and Botstein, D. (2003).
Repeated observation of breast tumor subtypes in independent gene expression data sets.
Proc. Natl. Acad. Sci. USA, 100(14):8418–8423.
Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Fox, S., Smeds, J., Nordgren, H., Farmer, P.,
Praz, V., Haibe-Kains, B., Desmedt, C., Larsimont, D., Cardoso, F., Peterse, H., Nuyten,
D., Buyse, M., Van de Vijver, M. J., Bergh, J., Piccart, M., and Delorenzi, M. (2006).
Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of
Histologic Grade To Improve Prognosis.
J. Natl. Cancer Inst., 98(4):262–272.
Bibliography VI
Tibshirani, R. and Walther, G. (2005).
Cluster validation by prediction strength.
Journal of Computational and Graphical Statistics, 14(3):511–528.
van Belle, V., Pelckmans, K., Suykens, J. A., and van Huffel, S. (2007).
Support vector machines for survival analysis.
In Third International Conference on Computational Intelligence in Medicine and
Healthcare.
van de Vijver, M. J., He, Y. D., van’t Veer, L., Dai, H., Hart, A. M., Voskuil, D. W.,
Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D.,
Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S.,
Rutgers, E. T., Friend, S. H., and Bernards, R. (2002).
A gene expression signature as a predictor of survival in breast cancer.
New England Journal of Medicine, 347(25):1999–2009.
van’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse,
H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven,
R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002).
Gene expression profiling predicts clinical outcome of breast cancer.
Nature, 415:530–536.
Bibliography VII
Wainer, H. (1976).
Estimating coefficients in linear models: It don’t make no nevermind.
Psychological Bulletin, 83(2):213–217.
Wang, Y., Klijn, J. G., Zhang, Y., Sieuwerts, A. M., Look, M. P., Yang, F., Talantov, D.,
Timmermans, M., van Gelder, M. E. M., Yu, J., Jatkoe, T., Berns, E. M., Atkins, D., and
Foekens, J. A. (2005).
Gene-expression profiles to predict distant metastasis of lymph-node-negative primary
breast cancer.
Lancet, 365:671–679.
Wessels, L. F. A., Reinders, M. J. T., Hart, A. A. M., Veenman, C. J., Dai, H., He, Y. D.,
and van’t Veer, L. J. (2005).
A protocol for building and evaluating predictors of disease state based on microarray data.
Bioinformatics, 21(19):3755–3762.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H.,
Olson, J. A., Marks, J. R., and Nevins, J. R. (2001).
Predicting the clinical status of human breast cancer by using gene expression profiles.
PNAS, 98(20):11462–11467.
Bibliography VIII
Wirapati, P., Sotiriou, C., Kunkel, S., Farmer, P., Pradervand, S., Haibe-Kains, B.,
Desmedt, C., Ignatiadis, M., Sengstag, T., Schutz, F., Goldstein, D., Piccart, M., and
Delorenzi, M. (2008).
Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding
of breast cancer subtyping and prognosis signatures.
Breast Cancer Research, 10(4):R65.
Appendix
Survival Data
Retrospective observation plan (data retrieved from hospital DB).
[Figure: follow-up timelines of patients A to G on a time axis in years (0 to 10); "+" and "o" markers close each timeline, with a "Censoring" label.]
Survival data:
Patient id   Time (years)   Event
A            1              1
B            2              0
C            4              1
D            5              0
E            5              1
F            7              1
G            8              0
Censoring in survival data requires specific methods (survival analysis).
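As a minimal sketch of one such method, the Kaplan-Meier (product-limit) estimator can be computed on the seven toy patients of this slide. The function name `kaplan_meier` and the pairing of times with events (rows A to G, in order) are assumptions made for illustration; this is not code from the talk.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    # Number of observed events at each time (censored observations excluded).
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    surv, curve = 1.0, []
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, surv))
    return curve

# Toy data from the slide (patients A to G); event = 1, censored = 0.
times = [1, 2, 4, 5, 5, 7, 8]
events = [1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t} years, S(t) = {s:.3f}")
```

Censored patients never trigger a drop in the curve, but they still count in the risk set until their censoring time, which is why they cannot simply be discarded.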
Survival Distributions and Performance
Times of event occurrence are realizations of a random variable T.
Survival distributions:
  Survival function: $S(t) = 1 - \Pr\{T \le t\}$.
  Hazard function: $h(t) = \lim_{\Delta t \to 0} \frac{\Pr\{t \le T < t + \Delta t \mid T \ge t\}}{\Delta t}$.
  [Figure: Kaplan-Meier survival curve, S(t) from 0.0 to 1.0 over 0 to 10 years.]
Influence of covariates x:
  Proportional hazards model (Cox): $h_i(t) = \lambda_0(t) \exp(\beta x_i)$,
  with $\lambda_0(t)$ the unspecified baseline and $\exp(\beta x_i)$ the risk,
  where $x_i$ is the gene expression profile of patient i in the thesis.
Performance:
  Concordance index: Estimate of the probability that, for a pair of randomly chosen comparable patients, the patient with the higher risk prediction will experience an event before the lower risk patient.
  → C-index ∈ [0, 1] (higher is better in the thesis).
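The concordance index described on this slide can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the thesis: a pair is treated as comparable when the patient with the shorter time has an observed event, ties in follow-up time are ignored, and ties in risk count for one half. The name `concordance_index` is hypothetical.

```python
def concordance_index(risk, times, events):
    """Fraction of comparable pairs in which the higher predicted risk has the earlier event."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for k in range(n):
        for l in range(n):
            # Assumption: pair (k, l) is comparable if k has an observed event
            # strictly before the (event or censoring) time of l.
            if events[k] == 1 and times[k] < times[l]:
                comparable += 1
                if risk[k] > risk[l]:
                    concordant += 1.0
                elif risk[k] == risk[l]:
                    concordant += 0.5
    return concordant / comparable

times = [1, 2, 4, 5, 5, 7, 8]
events = [1, 0, 1, 0, 1, 1, 0]
perfect = concordance_index([-t for t in times], times, events)   # risk decreases with time
inverted = concordance_index(list(times), times, events)          # risk increases with time
constant = concordance_index([0.0] * 7, times, events)            # uninformative risk
print(perfect, inverted, constant)
```

An uninformative prediction sits at 0.5, which is why forest plots of the C-index are usually read against that reference value.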
Central Dogma of Molecular Biology
[Figure: DNA (enhancers, promoter) → transcription → pre-mRNA (exons and introns) → splicing → mRNA (5' UTR, open reading frame, 3' UTR) → translation → protein. "Gene expression profiling" is marked on the diagram.]
Microarray Data
Few samples (dozens to hundreds of patients).
  Microarray technology is expensive.
  Frozen tumor samples are rare (biobank).
Numerous gene expressions are measured (tens of thousands of genes).
  Recent microarray chips cover the whole genome (≈ 50,000 probes representing 30,000 "known genes").
→ High feature-to-sample ratio (curse of dimensionality requiring feature selection).
Microarray is a complex technology.
→ High level of noise in the measurements.
Biology is complex (e.g. gene co-expressions due to pathways).
→ Redundancy.
Microarray data matrix:
DATA       x1    x2    ...   xp
patient1   1.3   3.2   ...   7
patient2   2.7   1.2   ...   2.3
...        ...   ...   ...   ...
patientn   5.4   1.4   ...   9.1
Datasets
Dataset  Technology     Survival  Treatment      Patients  Probes
VDX      Affymetrix     YES       untreated      344       22,283
NKI      Agilent        YES       heterogeneous  345       24,481
STNO2    cDNA Stanford  YES       heterogeneous  122       7,787
NCI      cDNA NCI       YES       heterogeneous  99        6,878
MGH      Arcturus       YES       hormono        60        11,421
MSK      Affymetrix     YES       heterogeneous  99        22,283
UPP      Affymetrix     YES       heterogeneous  251       22,283
STK      Affymetrix     YES       heterogeneous  159       22,283
UNT      Affymetrix     YES       untreated      137       22,283
UNC2     Agilent        YES       heterogeneous  248       21,495
DUKE     Affymetrix     YES       heterogeneous  171       12,625
CAL      Affymetrix     YES       heterogeneous  118       22,283
TBG      Affymetrix     YES       untreated      198       22,283
NCH      Agilent        YES       heterogeneous  135       17,086
DUKE2    Affymetrix     NO        chemo          160       61,359
MAINZ    Affymetrix     YES       untreated      200       22,283
TAM      Affymetrix     YES       hormono        354       44,928
TAM2     Aminolink      YES       hormono        155       21,332
LUND2    Swegene        NO        hormono        105       27,648
LUND     Swegene        NO        heterogeneous  143       26,824
MUG      Operon         NO        NA             152       16,783
State-of-the-art: [van’t Veer et al., 2002]
Study of [van’t Veer et al., 2002]:
  Training set:
    Input data: 24,481 (5,000 after filtering) gene expressions of 78 untreated patients with early breast cancer.
    Output data: Dichotomization of survival data (5 years).
  Feature selection: Feature ranking (correlation).
  Model building: Nearest centroid classifier.
  Hyperparameter tuning: Supervised tuning of the signature size (cross-validation).
  Validation set:
    Internal: Cross-validation !!!
    External: 295 treated and untreated patients with breast cancer at various stages [van de Vijver et al., 2002] (NKI) !!!
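To make the "nearest centroid classifier" step concrete, here is a generic sketch of the technique. It is not a reproduction of the published signature: the squared Euclidean distance, the toy two-gene data and the function names are assumptions chosen for illustration.

```python
def nearest_centroid_fit(X, y):
    """X: list of expression vectors, y: class labels. Returns {label: centroid}."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Centroid = gene-wise mean of the training profiles of that class.
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))

# Hypothetical toy data: two genes, dichotomized outcome.
X = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [3.8, 4.2]]
y = ["good", "good", "poor", "poor"]
centroids = nearest_centroid_fit(X, y)
pred_a = nearest_centroid_predict(centroids, [1.1, 1.0])
pred_b = nearest_centroid_predict(centroids, [4.1, 3.9])
print(pred_a, pred_b)
```

The only tunable quantity in such a pipeline is the number of genes kept by the ranking step, which is why the signature size is the hyperparameter listed on the slide.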
State-of-the-art: [Perou et al., 2000, Sorlie et al., 2001]
Study of [Perou et al., 2000, Sorlie et al., 2001]:
  Training set:
    Input data: 8,102 gene expressions of 84 patients with breast cancer at various stages.
  Feature selection: Feature ranking based on variance (intrinsic gene list).
  Model building: Hierarchical clustering.
  Hyperparameter tuning: Number of clusters is selected based on visualization assessment !!!
Validation:
  Internal: None !!!
  External: 38 new patients in [Sorlie et al., 2003] + datasets from [van’t Veer et al., 2002] and [West et al., 2001] !!!
State-of-the-art: [Wang et al., 2005]
Study of [Wang et al., 2005]:
  Training set:
    Input data: 22,283 (17,819 after filtering) gene expressions of 115 untreated patients with early breast cancer.
    Output data: "Full" survival data and dichotomization (5 years) !!!
  Feature selection: Feature ranking based on significance of univariate Cox’s models (bootstrap) for ER- and ER+ separately.
  Model building: Combination of univariate Cox’s models.
  Hyperparameter tuning: Signature size tuned by optimizing the sensitivity and specificity in the training set.
  Validation set:
    Internal: 171 patients (single "random" split) !!!
    External: 180 untreated patients with early breast cancer.
Feature Selection
Desirable properties of feature selection:
  Computational efficiency (thousands of features).
  Robustness to overfitting (not dataset dependent).
  Intuitive tuning of (few) hyperparameters.
For microarray data, feature ranking was shown to be efficient [Wessels et al., 2005].
For prognostication, we used the concordance index [Harrell et al., 1996] as scoring function:
$\text{C-index}(x_i, y) = \frac{\sum_{k,l \in \Omega} 1\{x_{ki} > x_{li}\}}{|\Omega|}$
where $x_i$ are the expressions of gene i, y are the survival data and Ω is the set of all the pairs of comparable patients {k, l}.
Signature size?
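The ranking step can be sketched as follows. This is an illustration under assumptions rather than the implementation used in the studies: pairs are taken as comparable when the first patient has an observed event strictly earlier than the second patient's time, and genes are ordered by the distance of their C-index from 0.5 so that both directions of association rank high. The names `gene_cindex` and `rank_features`, and the toy expression values, are hypothetical.

```python
def gene_cindex(x, times, events):
    """C-index of one gene: share of comparable pairs (k, l) with x[k] > x[l]."""
    pairs = [(k, l) for k in range(len(x)) for l in range(len(x))
             if events[k] == 1 and times[k] < times[l]]
    return sum(1.0 for k, l in pairs if x[k] > x[l]) / len(pairs)

def rank_features(X, times, events):
    """X: dict gene -> expressions. Returns (genes sorted by |C - 0.5|, scores)."""
    scores = {g: gene_cindex(x, times, events) for g, x in X.items()}
    ranking = sorted(scores, key=lambda g: abs(scores[g] - 0.5), reverse=True)
    return ranking, scores

times = [1, 2, 4, 5, 5, 7, 8]
events = [1, 0, 1, 0, 1, 1, 0]
X = {
    "gene_high_risk": [7, 6, 5, 4, 3, 2, 1],   # high expression, early event
    "gene_low_risk": [1, 2, 3, 4, 5, 6, 7],    # high expression, late event
    "gene_noise": [3, 1, 4, 1, 5, 9, 2],
}
ranking, scores = rank_features(X, times, events)
print(ranking, scores)
```

Each gene is scored independently, so the cost grows linearly with the number of probes, but redundant genes receive similar scores and are selected together.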
Stability-Based Feature Ranking
Aim: To select a relevant and stable (low variance) set of features
[Figure: the dataset (n patients) is sampled m times (n' patients each); feature ranking on each sample yields a signature of k features (illustrated with signature size k = 4), giving m signatures; signature stability is derived from the selection frequency of the features. Plot: adjusted stability (Stab adj) versus signature size, with its maximum at k*.]
Let X be the set of p features and $\mathrm{freq}(x_j)$ be the number of sampling steps in which $x_j$ has been selected out of m samplings without replacement. X is sorted into $\{x_{(1)}, \ldots, x_{(p)}\}$ where $\mathrm{freq}(x_{(i)}) \ge \mathrm{freq}(x_{(j)})$ if $i < j$.
$\mathrm{Stab}(k) = \frac{\sum_{l=1}^{k} \mathrm{freq}(x_{(l)})}{k m}$
$\mathrm{Stab}_{adj}(k) = \max\left\{0, \mathrm{Stab}(k) - \frac{k}{p}\right\}$
→ $k^* = \operatorname{argmax}_k \mathrm{Stab}_{adj}(k)$
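The stability criterion can be sketched as below, assuming that each of the m resampling steps has already produced a full ranking of the p features (how the rankings are obtained is left out). The function names and the toy rankings are hypothetical; the adjustment subtracts k/p, which is the value Stab(k) takes when k = p.

```python
from collections import Counter

def stab_adj(rankings, k, p):
    """Adjusted stability of the top-k signatures over m resampled rankings."""
    m = len(rankings)
    # freq(x): number of resamplings in which x is among the k top-ranked features.
    freq = Counter(f for r in rankings for f in r[:k])
    top = sorted(freq.values(), reverse=True)[:k]
    stab = sum(top) / (k * m)
    return max(0.0, stab - k / p)

def best_signature_size(rankings, p):
    """k* = argmax over k of the adjusted stability."""
    return max(range(1, p + 1), key=lambda k: stab_adj(rankings, k, p))

# Hypothetical rankings from m = 3 resamplings of p = 6 features.
rankings = [
    ["a", "b", "c", "d", "e", "f"],
    ["b", "a", "d", "c", "f", "e"],
    ["a", "b", "e", "f", "c", "d"],
]
k_star = best_signature_size(rankings, p=6)
print(k_star, stab_adj(rankings, k_star, 6))
```

In this toy example the two features "a" and "b" are selected in every resampling, so the adjusted stability peaks at k = 2 and falls back to 0 when all features are kept.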
Stability-Based Feature Ranking
Pros & Cons
Pros
  Computational scalability.
  Reduction in risk of overfitting (low variance).
  Easy and intuitive tuning of the hyperparameter.
Cons
  Potential risk of bias since feature ranking is unable to deal with redundancy and complementarity of the features [Meyer, 2008].
This method was first applied in a study related to tamoxifen resistance in breast cancer [Loi et al., 2008].
Risk Prediction Modeling
Risk prediction models can be:
  Linear or non-linear.
  Univariate or multivariate.
The intrinsic nature of microarray data puts non-linear multivariate prediction models at risk of overfitting.
But univariate linear prediction models cannot account for the multiple interactions underlying tumorigenesis.
→ Trade-off?
Since 2002, large comparative studies have been conducted to identify
successful prediction models for class discovery and classification
. . . leaving aside risk prediction (survival models) in breast cancer.
Comparative Study of Risk Prediction Models
We compared numerous linear risk prediction models in the context of
breast cancer prognostication using microarray and survival data
[Haibe-Kains et al., 2008b].
→ "Complex" methods, although promising in the training set, yielded poor performance in independent sets (overfitting).
→ The loss of interpretability deriving from the use of overcomplex methods may not be sufficiently counterbalanced by significantly better performance.
The few nonlinear risk prediction models (e.g. survival nnet
[Eleuteri et al., 2003] and cSVM [van Belle et al., 2007]) developed in the
field also yielded poor performance (data not shown).
Robust Model Building
We therefore focused on linear multivariate risk prediction models and relied on a simple additive combination scheme [Kittler et al., 1998] and equal-weights linear regression [Wainer, 1976, Green, 1977]:
$r = \sum_{i=1}^{k} \beta'_i x_i$
where r is the risk of a patient, $x_i$ are the gene expressions, and
$\beta'_i = -\frac{1}{k}$ if $x_i$ is positively correlated with survival, $\beta'_i = +\frac{1}{k}$ otherwise.
The use of equal weights (coefficients)
  reduces the risk of overfitting,
  ensures that the coefficient estimates are not influenced by outliers,
  and causes only a modest expected loss in accuracy.
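A signed average of this kind takes only a few lines. The sketch below is illustrative: the gene names and values are made up, and averaging over the probes actually present on the platform (rather than over all k signature genes) is an assumption about how missing probes might be handled.

```python
def risk_score(expr, signs):
    """Equal-weights risk score: signed average of the signature genes.

    expr:  dict gene -> expression value for one patient
    signs: dict gene -> -1 if the gene is positively correlated with survival,
           +1 otherwise (the sign of beta'_i in the slide)
    """
    # Assumption: probes missing on this platform are skipped and the
    # weights 1/k are recomputed over the genes that remain.
    genes = [g for g in signs if g in expr]
    return sum(signs[g] * expr[g] for g in genes) / len(genes)

# Hypothetical four-gene signature; "g4" is missing for this patient.
signs = {"g1": +1, "g2": -1, "g3": +1, "g4": -1}
patient = {"g1": 2.0, "g2": 1.0, "g3": 4.0}
r = risk_score(patient, signs)
print(r)
```

Because no coefficient is estimated from the training data beyond the sign, the score stays meaningful when part of the signature is absent from a given chip.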
Robust Model Building
Pros & Cons
Pros
  Additive models are multivariate models with low variance.
  Less sensitive to redundancy than traditional multivariate models (collinearity).
  Low computational cost (especially in combination with feature ranking).
  Equal weights regression (i.e. signed average in our case) facilitates the computation of risk predictions in different microarray platforms with missing probes.
Cons
  Unable to deal with complementarity of features.
This method was used in the study related to GGI = Gene expression Grade Index [Sotiriou et al., 2006].
Comparative Study of Risk Prediction Models
We compared numerous linear risk prediction models in the context of
breast cancer prognostication using microarray and survival data
[Haibe-Kains et al., 2008b].
Risk prediction models:
#    Genotype  Dim. reduction  Structure  Learning algo.  Phenotype
1    AURKA
2    BD                        COMBUNIV   WILCOXON        HG
3    BD                        COMBUNIV   COX             SURV
4    BD                        MULTIV     LM              TOE
5    BD                        MULTIV     COX             SURV
6    GW        RANK(CV)        COMBUNIV   WILCOXON        HG
7    GW        RANK(CV)        COMBUNIV   COX             SURV
8    GW        RANK(CV)        MULTIV     RCOX            SURV
9    GW        PCA(CV)         COMBUNIV   WILCOXON        HG
10   GW        PCA(CV)         COMBUNIV   COX             SURV
11   GW        PCA(CV)         MULTIV     RCOX            SURV
12   GENE76
13   GGI
Comparative Study of Risk Prediction Models
Results
Forestplot of the concordance index for each method in the training
set and the three validation sets:
[Figure (Supplementary Figure 1): forest plot of the concordance indices for the risk scores predicted by all the methods, one panel per dataset: training set VDX, validation sets TBG, TAM and UPP. x-axis: concordance index. Rows: AURKA, BD.COMBUNIV.WILCOXON.HG, BD.COMBUNIV.COX.SURV, BD.MULTIV.LM.TOE, BD.MULTIV.COX.SURV, GW.RANK.COMBUNIV.WILCOXON.HG, GW.RANKCV.COMBUNIV.WILCOXON.HG, GW.RANK.COMBUNIV.COX.SURV, GW.RANKCV.COMBUNIV.COX.SURV, GW.RANK.MULTIV.RCOX.SURV, GW.RANKCV.MULTIV.RCOX.SURV, GW.PCA.COMBUNIV.WILCOXON.HG, GW.PCACV.COMBUNIV.WILCOXON.HG, GW.PCA.COMBUNIV.COX.SURV, GW.PCACV.COMBUNIV.COX.SURV, GW.PCA.MULTIV.RCOX.SURV, GW.PCACV.MULTIV.RCOX.SURV, GENE76, GGI, AOL, NPI.]
→ "Complex" methods, although promising in the training set, yielded poor performance in independent sets (overfitting).
→ The loss of interpretability deriving from the use of overcomplex methods may not be sufficiently counterbalanced by significantly better performance.
Global Prognostic Gene Signature Identification
Biological validation
GGI = Gene expression Grade Index.
  Aim: Understanding the molecular basis of histological grade to improve prognosis.
GGI refers to:
  The signature of 128 probes selected through feature ranking with fixed threshold.
  The robust model.
Training set:
  Input data: 22,283 gene expressions of 64 ER+ tamoxifen treated patients with breast cancer at various stages.
  Output data: Binary class defined by histological grade 1 or 3.
Validation set: 745 untreated patients with early breast cancer from 5 datasets (NKI, TBG, UPP, UNT and MAINZ).
GGI
Results
Histological grade:
  HG1 and HG3 tumors have distinct gene expression profiles characterized by proliferation-related genes, HG2 tumors being heterogeneous.
  GGI is strongly predictive of HG1 vs HG3.
Prognosis:
  Strong prognostic value: GGI is strongly associated with distant metastasis free survival (DMFS). The three-category histological grading system could be replaced with a two-category one using GG to improve prognosis.
  Proliferation and modeling: We showed in [Haibe-Kains et al., 2008b] that GGI is the only risk prediction model to outperform the single proliferation gene AURKA.
  Similar performance compared with state-of-the-art gene signatures: GGI is competitive with GENE70 [van’t Veer et al., 2002] and GENE76 [Wang et al., 2005].
GGI
Prognosis
[Figure: Kaplan-Meier curves, probability of survival versus time (years, 0 to 10), in two panels.
Panel "Histological grade (HG)": curves HG1, HG2 and HG3.
Panel "Gene expression grade (GG)": curves HG2/GG1 and HG2/GG3, with labels HG1-like and HG3-like.
Annotations: HR = 3.2, 95% CI [1.98, 5.18], logrank p < 1.2E-05; HR = 3.45, 95% CI [2.18, 5.45], logrank p < 9.6E-09.]
No. at risk:
Time (years)  0    1    2    3    4    5    6    7    8    9    10
HG1           144  144  144  139  136  132  115  101  86   74   58
HG2           338  326  311  295  274  257  227  203  176  163  141
HG3           226  212  185  166  153  139  133  119  107  96   89
HG2/GG1       217  212  206  202  190  180  156  139  121  112  102
HG2/GG3       121  115  105  94   85   78   73   65   56   52   39
GGI
Performance comparison (TBG)
Concordance in predictions:
[Figure: Venn diagram of the risk predictions of GENE70, GENE76 and GGI.]
Similar performance:
[Figure: forest plot of the concordance index (axis from 0.5 to 1) for GENE70, GENE76, GGI and AOL.]
Comparison         Test difference
GENE70 vs GENE76   0.15
GENE70 vs GGI      0.53
GENE76 vs GGI      0.22
Comparison         Test superiority
GENE70 vs AOL      0.007
GENE76 vs AOL      0.13
GGI vs AOL         0.015
GGI
Discussion
GGI is simple:
  From a statistical point of view: robust model building.
  From a biological point of view: GGI signature is mainly composed of proliferation-related genes.
GGI yielded performance similar to that of gene signatures including genes related to many biological processes (GENE70 and GENE76).
→ Actually, we showed in [Wirapati et al., 2008] that the prognostic value of these signatures is mainly driven by the proliferation-related genes.
The studies related to GGI challenged the use of risk prediction models of high statistical and biological complexity [Sotiriou et al., 2006, Haibe-Kains et al., 2008b, Desmedt et al., 2008].
Molecular Subtypes Identification
State-of-the-Art
State-of-the-art:
  [Perou et al., 2000, Sorlie et al., 2001] were the first to study the molecular heterogeneity of breast cancer in order to build a subtype classifier.
  Open Issues:
    Lack of estimation of classification uncertainty.
    The use of large number of genes led to a clustering model prone to overfitting (low robustness).
    The (number of) clusters were subjectively selected.
    Lack of existing statistics to evaluate the robustness of the clustering model in independent datasets.
    Association with survival tested in heterogeneous populations of breast cancer patients.
[Kapp et al., 2006] confirmed the importance of ER and HER2 signaling pathways.
Prototype-Based Clustering
[Figure: prototype-based clustering illustrated in several panels with axis "gene j", showing patient1, patient2 and patient3 together with prototypes P1, P2 and P3. Steps: prototypes selection, dissimilarity estimation, assignment. Panel labels: Specific, Unspecific.]
Mixture of Gaussians?
[Figure: density of the module scores (modules ESR1, ERBB2 and AURKA) in the NKI and VDX datasets; x-axis: score (−2 to 2), y-axis: density.]
Clustering Performance
Performance assessment of clustering is a difficult task since the "truth" is hidden.
In [Tibshirani and Walther, 2005], the authors introduced a new framework viewing clustering as a supervised learning problem in which the "true" class labels have to be estimated.
Let X and Y be the training and validation sets, respectively.
Let $C_X(Y)$ denote the use, in the dataset Y, of a clustering model C fitted on dataset X.
Let $D[C_\cdot(\cdot)]$ be the co-membership matrix of the clustering $C_\cdot(\cdot)$.
Idea:
  1. Cluster Y → $C_Y(Y)$.
  2. Cluster X → $C_X(X)$.
  3. Cluster Y using $C_X$ → $C_X(Y)$.
  4. Compare $C_X(Y)$ and $C_Y(Y)$ through the co-membership matrix D → prediction strength.
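Step 4 can be sketched as follows, assuming the two labelings of Y are already available. The statistic is written here as it is commonly stated for prediction strength: for each cluster of C_Y(Y), the proportion of its patient pairs that are also co-members under C_X(Y), with the minimum taken over clusters. The function name and the toy labels are hypothetical, and singleton clusters are simply skipped.

```python
from itertools import combinations

def prediction_strength(labels_own, labels_pred):
    """labels_own: C_Y(Y); labels_pred: C_X(Y); both index the same patients of Y."""
    clusters = {}
    for idx, lab in enumerate(labels_own):
        clusters.setdefault(lab, []).append(idx)
    strengths = []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a singleton cluster has no pair to compare
        # Co-membership: both patients fall in the same cluster under C_X(Y).
        agree = sum(1 for i, j in pairs if labels_pred[i] == labels_pred[j])
        strengths.append(agree / len(pairs))
    return min(strengths)

own = [0, 0, 0, 1, 1, 1]                      # clustering of Y by itself
pred = ["a", "a", "b", "c", "c", "c"]         # clustering of Y by the model fitted on X
ps = prediction_strength(own, pred)
ps_same = prediction_strength(own, ["u", "u", "u", "v", "v", "v"])
print(ps, ps_same)
```

Only co-membership matters, so the cluster names produced by the two models do not need to match; a value of 1 means that every cluster found in Y is reproduced intact by the model fitted on X.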
Subtype Clustering vs Perou’s Method
Prediction strength in independent datasets (higher is better)
         Subtype clustering  Perou’s method
Dataset  3 clusters          2 clusters  3 clusters  4 clusters  5 clusters
NKI      1.00                0.89        0.42        0.25        0.20
TBG      0.83                0.92        0.39        0.24        0.22
UPP      0.87                0.71        0.51        0.33        0.22
UNT      0.89                0.78        0.55        0.36        0.00
STNO2    0.69                0.86        0.44        0.28        0.25
NCI      0.83                0.93        0.44        0.33        0.13
STK      0.87                0.61        0.38        0.28        0.25
MSK      0.96                0.77        0.66        0.20        0.00
UNC2     0.87                0.81        0.57        0.31        0.00
NCH      0.82                0.66        0.49        0.36        0.29
DUKE     0.82                0.57        0.42        0.37        0.42
DUKE2    0.64                0.92        0.63        0.47        0.00
MAINZ    0.90                0.68        0.39        0.24        0.18
CAL      0.95                0.84        0.41        0.31        0.00
LUND2    0.87                0.92        0.51        0.17        0.17
LUND     0.81                0.55        0.36        0.24        0.20
MUG      0.49                0.50        0.33        0.27        0.23
mean     0.83                0.76        0.47        0.30        0.16
sd       0.12                0.14        0.10        0.07        0.12
Links
Personal webpage: http://www.ulb.ac.be/di/map/bhaibeka/
Machine Learning Group: http://www.ulb.ac.be/di/mlg
Functional Genomics Unit:
http://www.bordet.be/en/services/medical/array/practical.htm
Master in Bioinformatics at ULB and other Belgian universities:
http://www.bioinfomaster.ulb.ac.be/