Genomic Analyses across Six Cancer Types Identify Basal-like Breast Cancer as a Unique Molecular Entity
Aleix Prat(1,2,3), Barbara Adamo(3), Cheng Fan(4), Vicente Peg(5,6), Maria Vidal(1,2,3), Patricia Galván(1), Ana Vivancos(7), Paolo Nuciforo(8), Héctor G. Palmer(9), Shaheenah Dawood(10), Jordi Rodón(3), Santiago Ramon y Cajal(5), Josep Maria Del Campo(3), Enriqueta Felip(3), Josep Tabernero(3) & Javier Cortés(2,3)
(1) Translational Genomics Group, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(2) Breast Cancer Unit, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(3) Medical Oncology Department, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(4) Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, USA
(5) Pathology Department, Vall d’Hebron University Hospital, Barcelona, Spain
(6) Morphological Sciences Department, Universitat Autònoma de Barcelona, Spain
(7) Cancer Genomics Group, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(8) Molecular Oncology Group, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(9) Stem Cells and Cancer Group, Vall d’Hebron Institute of Oncology (VHIO), Barcelona, Spain
(10) Department of Medical Oncology, Dubai Hospital, U.A.E.
To improve our understanding of the biological relationships among different types of cancer, we have
characterized variation in gene expression patterns in a set of 1,707 samples representing 6 human cancer
types (breast, ovarian, brain, colorectal, lung adenocarcinoma and squamous cell lung cancer). In the unified
dataset, breast tumors of the Basal-like subtype were found to be as distinct a molecular entity as any
other cancer type, including the rest of breast tumors, while showing striking similarities with squamous cell
lung cancers. Moreover, gene signatures tracking various cancer- and stromal-related biological processes
such as proliferation, hypoxia and immune activation were found to be highly expressed in varying
proportions of tumors across the various cancer types. These data suggest that clinical trials focusing on
tumors with common profiles and/or biomarker expression rather than their tissue of origin are warranted
with a special focus on Basal-like breast cancer and squamous cell lung carcinoma.
Classification and treatment of the majority of solid tumors are generally based on the tumor’s tissue of origin and histological appearance (e.g. squamous cell lung cancer). In some cancer types, identification of single molecular alterations has proven very useful in the clinical setting because of its ability to predict treatment efficacy. For example, epidermal growth factor receptor (EGFR) mutations predict benefit from anti-EGFR drugs in lung adenocarcinoma [1], amplification of human epidermal growth factor receptor 2 (HER2) predicts benefit from anti-HER2 drugs in breast cancer [2], Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations predict lack of benefit from anti-EGFR drugs in colorectal cancer [3] and BRCA1 mutations predict benefit from poly (ADP-ribose) polymerase 1 (PARP1) inhibitors in ovarian cancer [4]. Thus, the search for novel biomarkers, drug targets and better classification algorithms to individualize treatment of cancer patients is an area of active preclinical and clinical research.
In recent years, The Cancer Genome Atlas (TCGA) project has improved our understanding of the molecular alterations occurring in glioblastoma multiforme [5,6], high-grade serous ovarian cancer [7], colorectal cancer [8], squamous cell lung cancer [9] and breast cancer [10], and many other cancer types are being evaluated. In addition, these studies have revealed that particular molecular alterations such as TP53 mutations, MYC amplifications or CDKN2A deletions can occur in subsets of tumors of different cancer types. In fact, the TCGA breast cancer project observed that breast tumors of the Basal-like subtype share many genetic alterations with high-grade serous ovarian cancers, including TP53, RB1 and BRCA1 loss, CCNE1 and MYC amplifications, and high expression of HIF1-α/ARNT, MYC and FOXM1 gene signatures [10,11]. Overall, these data suggest that particular treatment strategies could be effective in tumors with similar genetic alterations and/or gene expression profiles regardless of the tumor’s tissue of origin [11,12]. Indeed, the observed benefit of anti-HER2 therapy in HER2-amplified breast and gastric cancers supports this hypothesis [13,14].
Subject areas: Cancer; Diagnostic markers
Received 30 September 2013; Accepted 3 December 2013; Published 18 December 2013
Scientific Reports 3:3544 | DOI: 10.1038/srep03544
To help better understand the relationships among different types
of cancer, we have compared head-to-head variation in global gene
expression patterns in a dataset of 1,707 samples representing 6
human cancer types.
Results
Combined microarray dataset. To study the relationships among different cancer types, we combined expression data for 17,987 genes and 1,707 samples representing 6 cancer types (glioblastoma multiforme [GBM] [5,6], high-grade serous ovarian carcinoma [OVARIAN] [7], lung adenocarcinoma [LUAD], squamous cell lung carcinoma [SQCLC] [9], colorectal adenocarcinoma [CCR] [8] and breast cancer [10]) from the TCGA project (Fig. 1A). The cancer type with the highest gene expression variability was ovarian cancer, with 9.1% of the genes showing an interquartile range of expression above 3-fold, followed closely by breast cancer (8.9%), LUAD (8.8%) and SQCLC (8.3%). CCR (4.6%) and GBM (4.5%) showed the lowest gene expression variability, suggesting that these two cancer types are biologically more homogeneous.
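The variability measure described above can be illustrated with a minimal sketch. It assumes log2-scale expression values, so that a 3-fold interquartile range corresponds to a difference of log2(3) between the third and first quartiles. The function name, the toy gene names and values, and the quartile convention are all assumptions made for illustration; the text does not specify them.

```python
import math
import statistics


def fraction_highly_variable(log2_expr, fold=3.0):
    """Fraction of genes whose interquartile range exceeds `fold`-fold.

    log2_expr maps a gene name to its log2 expression values across samples.
    On a log2 scale, a fold threshold becomes a difference of log2(fold).
    """
    threshold = math.log2(fold)
    n_variable = 0
    for values in log2_expr.values():
        # Quartiles with the "inclusive" convention (an assumption here).
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        if q3 - q1 > threshold:
            n_variable += 1
    return n_variable / len(log2_expr)


# Toy data with hypothetical gene names and values (not from the study).
toy = {
    "GENE_A": [0.0, 0.5, 4.0, 4.5, 5.0],  # wide spread across samples
    "GENE_B": [2.0, 2.1, 2.2, 2.3, 2.4],  # narrow spread across samples
}
fraction = fraction_highly_variable(toy)
```

Applied separately to the samples of each cancer type, a function of this kind would yield one percentage per cancer type, comparable in spirit to the figures quoted above.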
Global gene expression landscape. To assess the global landscape of expression in the unified dataset, we performed principal components analysis (PCA) [15]. Brain, colorectal and breast cancer explained most of the gene expression variation displayed by Principal Components 1 and 2 (PC1 and PC2), with samples of ovarian cancer, LUAD and SQCLC showing various levels of intermediate PC1 and PC2 scores (Fig. 1B). Strikingly, a subgroup of breast cancers almost entirely composed of the Basal-like subtype (in red), as determined by the PAM50 subtype predictor, showed significantly higher PC2 scores than the rest of breast tumors (i.e. Luminal/HER2-enriched/Normal-like) and was found close to ovarian cancers, SQCLCs and LUADs (Fig. 1B). Similar PC1 versus PC2 results were obtained from an independent gene expression-based microarray dataset of 153 samples representing breast cancer, LUAD, SQCLC and CCR (Suppl. Fig. 1).
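As a rough illustration of how a principal component yields both per-gene weights and per-sample scores, the sketch below extracts PC1 from a small samples-by-genes matrix. It uses power iteration on the covariance matrix as a simple stand-in for a full PCA routine; this is not necessarily the procedure used in the study, and the function name and toy values are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (weights, scores) for PC1 of `rows` (samples x genes).

    Power iteration on the gene-gene covariance matrix; an illustrative
    substitute for a full PCA implementation. The uniform start vector
    would fail only if it were exactly orthogonal to PC1.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w_new = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w_new))
        w = [x / norm for x in w_new]
    # Sample scores are projections of the centered data onto the weights.
    scores = [sum(c[j] * w[j] for j in range(p)) for c in centered]
    return w, scores


# Toy matrix: 4 samples x 3 genes (hypothetical values).
# Genes 1 and 2 vary together; gene 3 is constant and gets zero weight.
toy_rows = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]
weights, scores = first_principal_component(toy_rows)
```

Genes with the largest absolute weights contribute most to the component, which is the sense in which gene weights are used when interpreting PC1 and PC2.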
To better understand the biological significance of PC1 and PC2, we evaluated the top-300 genes having the largest positive and negative weights for both PCs (Fig. 1C and Supplemental Data). Gene weights are indicative of the relative contribution of each gene to the principal components. For PC1, the top-300 genes having the largest positive weight were found enriched for neuron differentiation (e.g. neuronal cell adhesion molecule [NRCAM] and N-cadherin [CDH2]), gliogenesis (e.g. SRY [sex determining region Y]-box 11 [SOX11]), cell-cell signaling (e.g. synaptotagmin IV [SYT4]) and synaptogenesis (e.g. neurexin 1 [NRXN1]), whereas the top-300 genes having the largest negative weight were found enriched for tight junctions (e.g. claudin-3 [CLDN3]), epithelial cell differentiation (e.g. FOXA1) and extracellular matrix (e.g. collagen, type XII, alpha 1 [COL12A1]). For PC2, the top-300 genes having the largest positive weight were found enriched for serine proteases (e.g. kallikrein-related peptidase 7 [KLK7]), drug metabolism (e.g. CYP3A7) and chemokines (e.g. interleukin-8 [IL8]), whereas the top-300 genes having the largest negative weight were found enriched for response to hormone stimulus (e.g. estrogen receptor [ESR1] and GATA3), cell adhesion (e.g. claudin-8 [CLDN8]) and extracellular matrix (e.g. fibronectin 1 [FN1]). Similar biological findings were obtained when the top-100, top-200 and top-400 genes were evaluated (data not shown).
Figure 1 | Combined gene expression microarray-based dataset of 1,707 samples representing 6 different cancer types from The Cancer Genome Atlas Project (TCGA; http://cancergenome.nih.gov/). (A) Microarray samples analyzed from each cancer type (number of samples and color identification). (B) Principal component 1 and 2 (PC1 and PC2) loading plot using the 3,486 most variable genes. Samples have been colored based on their cancer type, except for Basal-like breast tumors (n = 98) that are colored in red. Weights of each gene for each PC can be found in Supplemental Data. (C) Correlations between PC1 or PC2 scores and expression of selected genes in the entire dataset. (D) Consensus average linkage clustering matrix for k = 3 to k = 6 of all samples and the 3,486 most variable genes. The colored bar above the matrix identifies the various cancer types represented in each k group. A single cancer type is shown in the bar if >98% of the samples of each k group are from that particular cancer type. Orange, GBM; Dark blue, OVARIAN; Light blue, CCR; Grey, SQCLC; Green, BREAST; Violet, LUAD; Red, Basal-like breast cancer.
Testing the molecular uniqueness of Basal-like breast cancer. The previous results suggested that Basal-like breast cancer is molecularly distinct from the other cancer types, including the rest of breast tumors. To test the level of uniqueness of Basal-like breast tumors, we performed consensus average linkage hierarchical clustering of all samples (n = 1,707) and the 3,486 most variable genes (Fig. 1D). The consensus clustering method provides quantitative and visual stability evidence for estimating the number of unsupervised classes in a dataset [16]. The results showed that clustering stability increased for k = 2 to k = 7 (Suppl. Fig. 2). Strikingly, Basal-like breast cancer was identified as an unsupervised class at k = 5, before colorectal cancer was separated from both lung cancer types (i.e. at k = 6) and before both lung cancer types were separated from each other (i.e. at k = 7). Overall, this result suggests that Basal-like breast cancer is a reproducible and robust cancer type.
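The idea behind consensus clustering can be sketched as follows: samples are repeatedly subsampled, each subsample is clustered with average linkage into k groups, and the consensus value for a pair of samples is the fraction of co-sampled runs in which the two fall in the same group. The subsampling fraction, number of resamples, Euclidean distance and all function names below are assumptions made for illustration, not the settings used in the study.

```python
import random


def average_linkage(points, k):
    """Agglomerative average-linkage clustering; returns k lists of indices."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Pairwise fraction of co-sampled runs in which two samples co-cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(frac * n))
    for _ in range(n_resamples):
        subset = rng.sample(range(n), size)
        sub_points = [points[i] for i in subset]
        for cluster in average_linkage(sub_points, k):
            members = [subset[c] for c in cluster]
            for a in members:
                for b in members:
                    together[a][b] += 1
        for a in subset:
            for b in subset:
                sampled[a][b] += 1
    return [[together[a][b] / sampled[a][b] if sampled[a][b] else 0.0
             for b in range(n)] for a in range(n)]


# Toy data: two well-separated groups of three samples (hypothetical values).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cm = consensus_matrix(points, k=2)
```

A stable class shows consensus values near 1 within the group and near 0 against all other samples; repeating the procedure over a range of k indicates at which k a given group first emerges as its own class.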
Expression of gene signatures corresponding to human DNA regions. Somatic copy number aberrations (CNAs) in breast cancer are associated with expression in ~40% of genes [17]. To estimate the status of CNAs in our combined dataset, we evaluated the expression of 326 gene sets corresponding to each human chromosome and each cytogenetic band with at least one gene. These gene lists were obtained from the C1-positional gene sets of the Molecular Signatures Database (Broad Institute; http://www.broadinstitute.org/gsea/msigdb/), and are helpful in identifying effects related to chromosomal deletions or amplifications.
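A per-sample score for a positional gene set can be computed in several ways. The sketch below assumes the score is the median expression of the member genes present in the dataset, which is only one plausible reading of a signature score; the function name, gene names and set names are hypothetical.

```python
import statistics


def signature_scores(expr, gene_sets):
    """Per-sample score for each gene set.

    expr maps gene -> list of expression values (one per sample).
    gene_sets maps set name -> list of member genes.
    The score is the median over member genes found in `expr`
    (an assumed definition); sets with no genes present are skipped.
    """
    n_samples = len(next(iter(expr.values())))
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in expr]
        if not present:
            continue
        scores[name] = [
            statistics.median(expr[g][s] for g in present)
            for s in range(n_samples)
        ]
    return scores


# Toy data: 4 genes x 2 samples, hypothetical names and values.
expr = {"G1": [1.0, 4.0], "G2": [2.0, 5.0], "G3": [3.0, 9.0], "G4": [0.0, 0.5]}
gene_sets = {
    "toy_band_1": ["G1", "G2", "G3", "NOT_MEASURED"],
    "toy_band_2": ["G4"],
    "toy_band_3": ["ALSO_NOT_MEASURED"],
}
scores = signature_scores(expr, gene_sets)
```

Because genes in an amplified or deleted region tend to shift together, a region-level score of this kind can act as an expression-based proxy for copy number status.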
Unsupervised hierarchical clustering of the 326 signature scores and the 1,707 samples revealed significant changes in the expression of genes located in specific DNA regions known to be aberrant in these cancer types (Fig. 2A–B). Examples include high expression of arm 1q-related genes in breast cancer (including Basal-like tumors) [10], high expression of arm 13q-related genes in CCRs [8] and low expression of arm 10q-related genes in GBMs [5,6]. In addition, we identified high expression of arm 10p-related genes and low expression of arm 5q-related genes in Basal-like breast tumors, concordant with the known CNA status of these two chromosomal regions in Basal-like disease [10]. Finally, Basal-like breast cancers, SQCLCs and OVARIAN carcinomas clustered together, consistent with the hypothesis that these three cancer types share a similar genetic profile, particularly at the 3q21-28 (amplified) and 5q13-22 (deleted) chromosomal regions (Fig. 2B).
Gene expression relationships among cancer types. To address the relationships among the 7 cancer types (i.e. Basal-like breast cancer [identified by the PAM50 subtype predictor], non-Basal-like breast cancer, CCR, GBM, SQCLC, LUAD and OVARIAN), we first identified gene expression-based centroids, representing the 7 groups, using all available genes (n = 17,987). Second, we evaluated the relationships among the different centroids within all samples (Fig. 3A), Basal-like breast tumors (Fig. 3B), OVARIAN tumors (Fig. 3C), SQCLCs (Fig. 3D), LUADs (Fig. 3E), CCRs (Fig. 3F), GBMs (Fig. 3G) and non-Basal-like breast tumors (Fig. 3H).
Figure 2 | Expression of 326 gene signatures corresponding to human DNA regions across 7 cancer types. Signatures were obtained from the Molecular Signatures Database (MSigDB) on the Broad Institute website (http://www.broadinstitute.org/gsea/msigdb/collections.jsp; C1: positional gene sets). (A) Unsupervised clustering of 326 signature scores across 1,707 samples. Each colored square on the heatmap represents the relative median signature score for each sample with highest expression being red, lowest expression being green and average expression being black. Below the array tree, samples have been colored based on their cancer type. (B) The top-10 up-regulated and down-regulated significant signatures for each cancer type (or group) are shown. These signatures were identified by performing an unpaired two-class SAM analysis between each cancer type versus the rest using the 326 signatures and a FDR = 0%.
Strikingly, the Basal-like breast tumor centroid was found more similar to the SQCLC centroid than to the centroid of non-Basal-like breast cancer (Fig. 3B). Concordant with this, 55% of Basal-like breast tumors were found more similar (i.e. lower distances) to SQCLCs than to non-Basal-like breast cancers. When compared to the different intrinsic subtypes of breast cancer, 76%, 72% and 17% of Basal-like breast tumors were found more similar to SQCLC than to Luminal A, Luminal B and HER2-enriched breast tumors, respectively. Interestingly, Basal-like breast tumors were found more similar to both lung cancer types and to non-Basal-like breast cancers than to OVARIAN tumors (Fig. 3B).
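The comparison described above, in which each sample is measured against group centroids, can be sketched as follows. The sketch assumes a centroid is the per-gene mean of a group and that similarity is Euclidean distance; the function names and the two-gene toy profiles are hypothetical.

```python
import math


def centroid(samples):
    """Per-gene mean expression across a group of samples."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


def fraction_closer(samples, centroid_a, centroid_b):
    """Fraction of samples strictly closer to centroid_a than to centroid_b."""
    closer = sum(
        math.dist(s, centroid_a) < math.dist(s, centroid_b) for s in samples
    )
    return closer / len(samples)


# Toy two-gene profiles (hypothetical values).
sqclc = [[6.0, 1.0], [6.0, 3.0]]
non_basal = [[1.0, 6.0], [1.0, 4.0]]
basal = [[5.0, 1.0], [6.0, 2.0], [1.0, 5.0], [2.0, 4.0]]

sqclc_centroid = centroid(sqclc)          # [6.0, 2.0]
non_basal_centroid = centroid(non_basal)  # [1.0, 5.0]
share = fraction_closer(basal, sqclc_centroid, non_basal_centroid)
```

In this toy example, half of the "basal" samples lie closer to the SQCLC centroid; the percentages reported in the text are the same kind of quantity computed over all genes and real tumors.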
To determine the biological processes in common between Basal-like breast cancers and SQCLC, we identified genes that are significantly differentially expressed in both cancer types compared with luminal cancers (Luminal A and B tumors combined). Among the top 300 up-regulated genes (False Discovery Rate = 0%) in Basal-like breast cancer and SQCLC, we identified genes involved in ectodermal differentiation (e.g. keratin 5, 14 and 17), inflammatory response (e.g. chemokine [C-X-C motif] ligand 1 [CXCL1] and CXCL3) and cell cycle (e.g. cyclin E1 [CCNE1] and centromere protein A [CENPA]). Among the top 300 down-regulated genes, we identified genes involved in the response to hormone stimulus (e.g. estrogen receptor [ESR1] and GATA3), mammary gland development (e.g. prolactin receptor [PRLR] and ERBB4) and microtubule-based process (e.g. kinesin family member 12 [KIF12] and microtubule-associated protein tau [MAPT]). These data are concordant with the histological appearance and the immunohistochemical expression of ER, keratins 5/6 and the proliferation-related biomarker Ki67 in a Basal-like breast tumor, a SQCLC with a Basal-like profile and a breast Luminal A tumor (Fig. 4).
Multiclass tumor prediction. To identify genes that are distinctive of each cancer type, including Basal-like breast cancer, we applied ClaNC, a nearest centroid-based classifier that balances the number of genes per class (Fig. 5A). A 126-gene signature (18 genes per cancer type) was established as the smallest gene set with the lowest cross-validation and prediction error (2.0%) (Fig. 5B). Among the various cancer types, Basal-like breast cancers and SQCLCs showed the highest prediction errors (7.1% and 15.6%, respectively), and the majority of misclassified SQCLCs (n = 5, 71.4%) were identified as Basal-like breast cancer. Of note, two previously identified diagnostic biomarkers, one of serous ovarian cancer (Wilms’ tumor [WT]-1) [18] and one of lung adenocarcinoma (thyroid nuclear factor 1 [TITF-1]) [19], were found in the 18-gene lists of these two cancer types (Fig. 5C).
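The prediction step of a nearest-centroid classifier, and the per-class error it produces, can be sketched as below. This is a generic nearest-centroid illustration only: it does not implement ClaNC's gene selection or class balancing, and the function names, labels and values are hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Mean profile per class label."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(x, centroids):
    """Label of the centroid nearest to x (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))


def per_class_error(samples, labels, centroids):
    """Fraction of misclassified samples within each true class."""
    wrong, total = defaultdict(int), defaultdict(int)
    for x, y in zip(samples, labels):
        total[y] += 1
        wrong[y] += predict(x, centroids) != y
    return {y: wrong[y] / total[y] for y in total}


# Toy training and held-out sets with hypothetical labels "A" and "B".
train_x = [[0, 0], [0, 2], [10, 10], [10, 12]]
train_y = ["A", "A", "B", "B"]
test_x = [[1, 1], [9, 10], [8, 9], [11, 11]]
test_y = ["A", "A", "B", "B"]

centroids = fit_centroids(train_x, train_y)
errors = per_class_error(test_x, test_y, centroids)
```

Tabulating which class each misclassified sample was assigned to, rather than only the error rate, is what reveals patterns such as one tumor type being mistaken mainly for one other.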
Common patterns of gene expression across cancer types. Although each cancer type is molecularly distinct, we sought to identify groups of genes (i.e. gene signatures) with independent patterns of variation. To accomplish this, we clustered all samples with the 3,486 most variable genes (Fig. 6) and identified 19 gene clusters of at least 10 genes and an intraclass correlation coefficient > 0.70 (Supplemental Data). Among them, we identified gene signatures tracking lymphocyte activation/infiltration (e.g. CD8A and CD2), ectodermal development (e.g. keratin 6B and 15), interleukin-8 pathway (e.g. IL8 and CXCL1), tight junctions (e.g. claudin-3 and occludin), proliferation (e.g. budding uninhibited by benzimidazoles 1 homolog [BUB1] and CENPA) and interferon-response pathways (e.g. STAT1 and interferon-induced protein with tetratricopeptide repeats 1 [IFIT1]) (Fig. 6).
Common patterns of gene signature expression across cancer types. Similar to the previous analysis, we determined the expression scores of 329 gene signatures (or modules) [20] in all samples, including 115 previously published signatures, and then performed an unsupervised hierarchical clustering (Fig. 7). Thirteen clusters of at least 5 signatures and an intraclass correlation coefficient > 0.70 were identified. These groups of gene signatures were found to track various types of biological processes/features likely coming from the tumor cell, the microenvironment or both. Interestingly, the expression of signatures tracking microenvironment-related biological processes (e.g. lymphocyte activation/infiltration) was found to be less cancer type specific than the expression of gene signatures tracking tumor-related biological processes (e.g. proliferation).
To illustrate the overlap among cancer types regarding the expression of a single signature, we evaluated 6 previously identified gene signatures that are known to track various cancer-related and stromal/microenvironment-related biological processes related to breast cancer biology [21,47–51]. The results showed that high expression of these signatures (i.e. the top 20% expressers in the unified dataset) occurs across all cancer types, albeit with different proportions (Fig. 8). Of note, the TP53 signature [21], which was trained in a previously reported breast cancer dataset, predicted TP53 somatic mutations in the combined TCGA dataset (area under the receiver operating characteristic curve = 0.782; Suppl. Fig. 3). Moreover, the scores of the previously reported PTEN-loss signature were found to correlate with INPP4B (correlation coefficient = −0.424, p-value < 0.0001) and phospho-4E-BP1 (correlation coefficient = 0.368, p-value < 0.0001) protein expression in the TCGA breast cancer dataset (Suppl. Fig. 4).
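The "top 20% expressers" comparison can be illustrated with a short sketch: samples are ranked by signature score across the whole dataset, the top fifth is flagged, and the share of flagged samples is reported per cancer type. The function name, labels and scores are hypothetical, and ties at the cutoff are broken by input order, which is an arbitrary choice made here.

```python
from collections import Counter


def top_fraction_by_type(scores, cancer_types, top=0.20):
    """Share of each cancer type's samples that fall in the top `top` of scores."""
    n_top = max(1, round(top * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_idx = set(ranked[:n_top])
    totals = Counter(cancer_types)
    hits = Counter(cancer_types[i] for i in top_idx)
    return {t: hits[t] / totals[t] for t in totals}


# Toy data: 10 samples from three hypothetical groups.
types = ["BASAL"] * 2 + ["LUMINAL"] * 4 + ["SQCLC"] * 4
scores = [0.9, 0.2, 0.1, 0.3, 0.25, 0.4, 0.8, 0.5, 0.6, 0.05]
proportions = top_fraction_by_type(scores, types)
```

Here the two highest scores belong to one BASAL and one SQCLC sample, so the signature is "high" in 50% of one group and 25% of another while being absent from the third.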
Figure 3 | Transcriptomic relationships among cancer types. Relationships have been determined by calculating the Euclidean distances of each sample to each of the 7 centroids, which represent each cancer type, using all genes of the unified dataset. Clustering has been performed after median centering the Euclidean distances of each sample. The genomic relationships among cancer types are shown for the following subsets of patients: (A) all patients (ALL); (B) Basal-like breast cancer (BASAL-LIKE); (C) ovarian cancer (OVARIAN); (D) squamous cell lung cancer (SQCLC); (E) lung adenocarcinoma (LUAD); (F) colorectal adenocarcinoma (CCR); (G) glioblastoma multiforme (GBM); (H) non-Basal-like breast cancer (BREAST).
Breast cancer intrinsic subtyping of non-breast tumors. To evaluate whether the breast cancer ‘intrinsic’ profiles (Luminal A, Luminal B, HER2-enriched and Basal-like) can be identified in non-breast tumors, we performed breast cancer intrinsic subtyping of non-breast cancer types using the PAM50 and Claudin-low subtype predictors [22,23]. Interestingly, all the breast cancer ‘intrinsic’ profiles were identified, albeit with different proportions (Table 1). For example, the Basal-like profile was identified in 55% and 53% of SQCLC and ovarian cancers, respectively, whereas virtually all colorectal cancers (99%) and most lung adenocarcinomas (59%) showed the HER2-enriched profile. Of note, 28% of ovarian cancers and 24% of SQCLC tumors also showed the HER2-enriched profile. Finally, the Claudin-low profile was identified in 20% and 16% of SQCLCs and LUADs, respectively.
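Centroid-based subtype predictors assign a sample to the subtype whose reference profile it most resembles. The sketch below assumes a rank (Spearman) correlation against each centroid and picks the highest. The centroid values, the five-gene panel and the function names are hypothetical and are not the published PAM50 or Claudin-low centroids.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def call_subtype(sample, centroids):
    """Name of the centroid with the highest rank correlation to the sample."""
    return max(centroids, key=lambda name: spearman(sample, centroids[name]))


# Hypothetical centroids over a toy panel of five genes.
toy_centroids = {
    "Basal-like": [3.0, 2.0, -2.0, -3.0, 1.0],
    "Luminal A": [-3.0, -2.0, 2.0, 3.0, 0.0],
}
toy_sample = [2.5, 1.0, -1.0, -4.0, 0.5]
call = call_subtype(toy_sample, toy_centroids)
```

Because a rule of this kind always returns the best-matching profile, every tumor receives a subtype label regardless of tissue, which is worth keeping in mind when reading the proportions reported for non-breast cancers.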
To provide further evidence, we performed breast cancer intrinsic subtyping of non-breast cancer types in two independent datasets (Suppl. Fig. 5 and 6). First, we evaluated a publicly available microarray dataset (GSE23768) that includes 153 samples of breast cancer
Figure 4 | Immunohistochemical (IHC) and PAM50 gene expression analyses of a Basal-like breast cancer, a SQCLC with a Basal-like profile and a Luminal A breast cancer. Hematoxylin/eosin (H/E); Estrogen receptor (ER) expression; Keratin 5/6 (KRT5/6) expression; Proliferation-related Ki-67 expression. Each colored square on the heatmap below the IHC images represents the relative transcript abundance (in log2 space) of each PAM50 gene with highest expression being red, lowest expression being green and average expression being black.