Vast uncharted territories in the human interactome landscape

publicité
Vast uncharted territories in the human interactome landscape
uncovered by a second-generation map
Murat Ta!an1,2,3,4*, Thomas Rolland1,5*, Samuel J. Pevzner1,5,6,7*, Benoit Charloteaux1,5*, Irma
Lemmens8^, Celia Fontanillo9^, Roberto Mosca10^, Nidhi Sahni1,5^, Song Yi1,5^, Atanas Kamburov1,5^,
Susan Ghiassian1,11^, Quan Zhong1,5^, Xinping Yang1,5^, Lila Ghamsari1,5^, Dawit Balcha1,5, Pascal
Braun1,5, Martin P. Broly1,5, Anne-Ruxandra Carvunis1,5, Dan Convery-Zupan1,5, Roser Corominas12,
Elizabeth Dann1,5, Matija Dreze1,5, Amélie Dricot1,5, Changyu Fan1,5, Eric Franzosa1,13, Fana
Gebreab1,5, Bryan J. Gutierrez1,5, Mike Jin1,5, Shuli Kang12, Ruth Kiros1,5, Guan Ning Lin12, Andrew
MacWilliams1,5, Jörg Menche1,11, Ryan R. Murray1,5, Matthew M. Poulin1,5, Xavier Rambout1,5,14, John
Rasla1,5, Patrick Reichert1,5, Viviana Romero1,5, Julie M. Sahalie1,5, Annemarie Scholz1,5, Akash A.
Shah1,5, Amitabh Sharma1,11, Yun Shen1,5, Stanley Tam1,5, Shelly A. Trigg1,5, Jean-Claude
Twizere1,5,14, Kerwin Vega1,5, Jennifer Walsh1,5, Michael E. Cusick1,5, Yu Xia1,13, Albert-László
Barabási1,11,15, Lilia Iakoucheva12, Patrick Aloy10,16, Javier De Las Rivas9, Jan Tavernier8, Michael A.
Calderwood1,5, David E. Hill1,5, Tong Hao1,5, Frederick P. Roth1,2,3,4 & Marc Vidal1,5
1
Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute,
2
Boston, Massachusetts 02215, USA. Departments of Molecular Genetics and Computer Science, University of
3
Toronto, Toronto, Ontario M5S 3E1, Canada. Donnelly Centre, University of Toronto, Toronto, Ontario M5S
4
3E1, Canada. Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto, Ontario M5G 1X5, Canada.
5
6
Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA. Department of
7
Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA. Boston University School of
8
Medicine, Boston, Massachusetts 02118, USA. Department of Medical Protein Research, VIB, 9000 Ghent,
9
Belgium. Cancer Research Center (Centro de Investigación del Cancer), University of Salamanca and Consejo
10
Superior de Investigaciones Científicas, Salamanca 37008, Spain. Joint IRB-BSC Program in Computational
11
Biology, Institute for Research in Biomedicine, Barcelona 08028, Spain. Center for Complex Networks
Research (CCNR) and Department of Physics, Northeastern University, Boston, Massachusetts 02115, USA.
12
13
Department of Psychiatry, University of California, San Diego, La Jolla, California 92093, USA. Department
14
of Bioengineering, McGill University, Montreal, Quebec H3A 0C3, Canada. Protein Signaling and Interactions
15
Lab, GIGA-R, Université de Liège, 4000 Liège, Belgium. Department of Medicine, Brigham and Women’s
16
Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA. Institució Catalana de Recerca i
Estudis Avançats (ICREA), Barcelona 08010, Spain.
* These authors contributed equally to this work and should be considered co-first authors.
^ These authors contributed equally to this work and should be considered co-second authors.
Tasan et al.
Complex cellular systems and interactome networks underlie most genotypephenotype relationships. Just as a high-quality human reference genome sequence
revolutionized genotype determination a decade ago, proteome-scale reference maps
of protein-protein, protein-nucleic acid, and enzyme-substrate interactions should
provide scaffold information necessary to interpret the increasing amount of genotypic
data. Here we describe a new empirically-controlled “second-generation” map of the
human binary protein-protein interaction network, expanding coverage over similarquality interactions from the literature by three-fold, altogether bringing us only one
order of magnitude away from the complete network. The map uncovers vast
uncharted interactome “territories” rich in genetic material related to complex human
phenotypes. Interaction “detectedness” correlates with co-functionality, suggesting
that the biophysical interactome is biologically heterogeneous. Increased and uniform
coverage of the interactome network helps characterize genotype-phenotype
associations, particularly for cancer. Thus a reference map of the human interactome
network is now within reach.
1
Tasan et al.
Complex cellular systems formed by large numbers of interactions amongst genes and gene
products, or interactome networks, underlie most cellular functions. Understanding how
perturbations of interactome networks relate to inherited and somatic disease susceptibilities
will be required for predictive models of genotype-phenotype relationships in human.
A nearly full description of all genotypic variations in the human population is expected
to be available soon. Several thousand genes associated with Mendelian disorders have
been defined1, susceptibility loci have been uncovered for a few hundred complex traits2, and
several hundred putative cancer genes have been identified3-5. With rapidly evolving nextgeneration sequencing platforms, large numbers of human genome sequences will soon be
available from unaffected individuals6, patients stricken by virtually any disease7, and diverse
tumours8.
As genome-wide genotypic information accumulates, many fundamental questions
pertaining to genotype-phenotype relationships remain unresolved9. The causal changes and
mechanisms that connect genotype to phenotype are still generally unknown, especially for
complex trait loci and cancer-associated mutations. Even when identified, it is often unclear
how a causal change perturbs the function of the corresponding gene product. Lastly, how
such changes perturb interactome networks is largely unknown, which has hindered
comprehensive understanding of gene-gene interactions. All in all, to “connect the dots” of
the genomic revolution, functions and context must be assigned to hundreds of thousands of
genotypic changes.
Globally functionalizing and contextualizing genotype-phenotype relationships requires
genome and proteome-scale maps of interactome networks formed by macromolecular
interactions such as protein-protein interactions (PPIs), protein-DNA interactions, and posttranslational modifiers and their targets9,10. Although first generation interactome maps have
2
Tasan et al.
provided network-based explanations for some genotype-phenotype relationships9, it is
difficult to establish how fully these maps represent the complete network and whether their
quality is sufficient to derive accurate interpretations (Fig. 1a). There is a dire need for
empirically controlled11 and high-quality “second-generation” reference maps, reminiscent of
the high-quality human reference genome sequence12 that revolutionized genotyping.
The challenges are manifold. Even considering only one splice variant per gene, close
to 20,000 protein-coding genes must be handled and ~200 million protein pairs tested to
generate a comprehensive binary protein-protein interactome reference map. Accordingly,
whether such a comprehensive network could ever be mapped by the collective efforts of
small-scale studies remains uncertain. Computational predictions of protein interactions can
generate information at proteome scale13 but are inherently limited by biases in currently
available knowledge used to infer such interactome models. Should interactome maps be
generated for all individual human tissues using biochemical co-complex association data, or
alternatively, might direct binary interaction information on all possible PPIs be preferable or
complementary? Finally, even with nearly complete, high-quality reference interactome maps
of biophysical interactions in hand, how to evaluate the biological relevance of each
interaction in physiological conditions11,14? Here, we begin to address these questions by
generating a second-generation map of the human binary interactome and comparing it to
alternative network maps.
3
Tasan et al.
An uncharted interactome territory in small-scale interaction studies
To investigate whether small-scale studies described in the literature are adequate to
qualitatively and comprehensively map the human binary protein-protein interactome
network, we assembled data available in 2010 from seven public databases15-21 (Lit-2010).
After eliminating protein pairs lacking any evidence of direct binary interaction and filtering out
evidence from proteome-scale interactome screens11,22,23, ~14,000 putative direct binary
PPIs remained (Fig. 1b). Pairs reported in only a single publication and detected by only a
single method have higher rates of curation errors than interactions reported in multiple
papers and/or with multiple detection methods24, yet represented nearly two thirds of the
binary literature set.
We therefore segregated binary interactions with a single piece of evidence (Lit-BS10) from binary interactions with multiple pieces of evidence (Lit-BM-10), and experimentally
tested subsets of both by mammalian PPI trap (MAPPIT)25 and yeast two-hybrid (Y2H)26
assays. Lit-BS-10 pairs validated at rates that were only slightly higher than the randomly
selected protein pairs used as negative control (random reference set; RRS) and
considerably lower than Lit-BM-10 pairs (Fig. 1c and Supplementary Notes). Protein pairs
from Lit-BS-10 were also found to co-occur in the literature significantly less than Lit-BM-10
pairs as indicated by STRING literature mining scores27 (Fig. 1c and Supplementary Notes),
suggesting that these pairs were less thoroughly studied even in their corresponding
publications. Thus, we restricted Lit-2010 to 4,906 high-quality Lit-BM-10 binary interactions
(Supplementary Table 1). This highlights the need for systematic approaches, since
estimates of the number of PPIs in the human interactome are close to two orders of
magnitude larger than Lit-BM-1011,28.
4
Tasan et al.
The limited number of high-quality PPIs drastically diminishes the ability to prioritize
putative disease genes identified in systematic studies. Although cancer driver gene
products5 are well represented in Lit-BM-10, less than 30% of proteins encoded by genes
identified in systematic genome-wide cancer screens have any reported interactions in LitBM-10 (Fig. 1d). Coverage was similarly low for proteins associated with Mendelian traits or
identified in genome-wide association studies (GWAS). Such differences might be explained
by the inspection bias inherent to small-scale studies. The ~20,000 proteins of the human
proteome are not studied with equal intensity; some are described in hundreds of publications
while most have been mentioned only in a few (Fig. 1e). Ranking proteins by the number of
publications that mention them, we observed a strong bias of interactions towards highly
studied proteins in contrast to a sparsely populated territory for interactions involving less
studied proteins. Candidate gene products identified in genome-wide disease screens
distribute
homogeneously
across
the
publication-ranked
interactome
space,
again
highlighting the need for systematic PPI maps to cover this uncharted interactome territory.
A second-generation binary interactome map
The enrichment of interactions detected for highly studied proteins in Lit-BM-10 might indicate
that well studied proteins are truly involved in more PPIs. Under this model, more than half of
all predicted human proteins would not participate substantially in the interactome network.
Alternatively, the sparse territory could be homogenously populated by real PPIs that have
been missed so far due to sociological or experimental biases.
To distinguish between these possibilities and to increase the coverage of the
interactome space, we generated a new systematic binary interaction map, interrogating a
search space four times larger and with deeper sensitivity than in our first proteome-scale
attempt22 (Fig. 2a, Supplementary Fig. 1 and Supplementary Notes). All pairwise
5
Tasan et al.
combinations of proteins corresponding to ~13,000 genes (Space II; Supplementary Table
2)29 were screened twice, and we identified pairs of genes encoding interacting proteins from
~85,000 Y2H positive colonies using next-generation sequencing30 (Fig. 2b and
Supplementary Notes). After quadruplicate retesting, ~14,000 distinct protein pairs were
considered verified interactions.
To validate the quality of these PPIs, we compared a sample of ~800 verified
interactions both to a positive reference set of ~500 binary interactions from Lit-BM-10 and to
our RRS as a negative control. All ~2,000 protein pairs were tested in three binary assays:
MAPPIT25, a well-based nucleic acid programmable protein array (wNAPPA)31,32, and a
fluorescence-reconstituting protein-fragment complementation assay (PCA)33. The verified
pairs exhibited validation rates that were statistically indistinguishable from the positive
reference set (Fig. 2c and Supplementary Tables 3 and 4), demonstrating the high quality of
our PPIs. Our new human interactome dataset covering Space II and first released online in
2011 (“HI-II-11”; Supplementary Table 5) contains 13,944 interactions amongst 4,303 distinct
proteins (517 homodimers and 13,427 heterodimers).
To determine the extent to which HI-II-11 contains direct binary interactions, we used
three-dimensional co-crystal structures from the Protein Databank (PDB)21 to distinguish
direct contacts from indirect associations. Out of 191 HI-II-11 interactions with such structural
information, 189 were in direct contact in the corresponding crystal structure, demonstrating
the high precision of HI-II-11 in reporting direct PPIs (99% precision; P = 4 " 10-5, one-sided
binomial test; Supplementary Table 6 and Supplementary Notes).
To investigate whether particular classes of PPIs might be refractory to our assays, we
evaluated how the number of residue-residue contacts in three-dimensional structures related
to our ability to detect interactions34. Among interacting proteins found in PDB complexes, the
pairs reported in HI-II-11 tended to involve more reside-residue contacts than those that were
6
Tasan et al.
missed (medians of 62 and 44 respectively, P = 1.27 " 10-10, one-sided Wilcoxon rank-sum
test; Fig. 2d and Supplementary Table 7). This result suggests that Y2H sensitivity can be
affected by the number of residue-residue contacts and thus presumably by interaction
affinity.
PPIs are commonly mediated by protein domains such as those catalogued in the
Protein family (Pfam) database35. Observing that HI-II-11 is enriched for proteins containing
domains known to be involved in domain-domain interactions36 (Supplementary Fig. 2), we
reasoned that the pairing of such domains in interacting proteins provides a strong
hypothesis for the mechanism of that interaction. To identify domain pairs mediating HI-II-11
interactions, we looked for pairs of Pfam domains in interacting proteins that appeared more
frequently than expected (Fig. 2e and Supplementary Table 8). HI-II-11 is significantly
enriched for domain pairs supported by three-dimensional structures36 (P = 6 " 10-11,
Wilcoxon rank sum test), further demonstrating that HI-II-11 is comprised of direct binary
interactions. For PPIs involving the 25 most enriched domain pairs, we could identify the
structurally supported domain pairs among all possible ones with high accuracy (70%
sensitivity and 80% specificity), as exemplified by the calmodulin-binding IQ and EF-hand
domains (Fig. 2f).
To compare the growth rates of systematic and literature interactome maps, we
updated the literature binary interaction dataset to include information available up to 2013
(Supplementary Fig. 3). The number of high-quality binary literature PPIs has grown at a
steady rate to ~10,000 interactions, whereas the growth of systematic datasets is punctuated
with a small number of large-scale releases (Fig. 2g). The union of previous systematic
datasets11,22,23,30 with HI-II-11 provides close to ~20,000 binary interactions which, combined
with the high-quality binary literature pairs available as of 2013 (“Lit-BM-13”; Supplementary
Table 9) provides ~30,000 individual PPIs. Collectively, this brings us to only an order of
7
Tasan et al.
magnitude from the estimated complete interactome network11,28.
To examine how the sparsely populated territory has evolved, we visualized adjacency
matrices for literature-based and systematic maps at regular time points since 1994. Although
the sparse territory of the literature map has been slowly populated with interactions, growth
in this territory continues to lag behind that in the densely populated one (Fig. 2h and
Supplementary Fig. 4). In contrast, current systematic binary maps appear to more uniformly
cover the full interactome space. This finding suggests that heterogeneity in literature
coverage across the space reflects uneven investigative efforts, and supports the existence
of true biophysical interactions across the full human interactome space.
Apparently sparse territories contain genuine biophysical interactions
To more deeply explore the heterogeneous coverage of the human interactome, we
compared HI-II-11 and Lit-BM-13 to a collection of ~25,000 predicted binary PPIs of highconfidence (“PrePPI-HC”)13 and a co-fractionation map of ~14,000 putative binary
interactions (“Co-Frac”)37 (Fig. 3a). Although statistically significant (P < 1 " 10-60, Fisher’s
exact test), the overlap of interactions between these four maps is small (Fig. 3b), and the
number of interactors, or “degree”, of a protein in one map does not correlate with its degree
in the other three maps (Supplementary Fig. 5). Each mapping strategy apparently probes
different, perhaps complementary, territories of the underlying true interactome landscape.
To investigate whether these four maps show similar sparse territories as did Lit-BM13 (Fig. 2h), we generated adjacency matrices for each map with respect to publication
frequency (Fig. 3c). Similar to Lit-BM-13, the Co-Frac and PrePPI-HC maps exhibited a
strong tendency to report interactions amongst well-studied proteins (Fig. 3c). A potential
source of bias for Co-Frac towards well-studied genes arises from the literature-based
computational methods used to identify the reported subset of direct PPIs (Supplementary
8
Tasan et al.
Fig. 6)37. Similarly, the structural modelling approach used to generate PrePPI-HC relies on
existing structures as templates for identifying new interactions and on additional genomic
and functional data to enhance prediction specificity. In contrast, HI-II-11 covered the
publication-ranked interactome space uniformly.
Because the sensitivity of some interaction-mapping approaches depends on the level
of gene expression, we examined interactome maps for expression-related sparse territories.
The Co-Frac map shows a strong bias of interactions towards proteins encoded by highly
expressed genes (Fig. 3c and Supplementary Fig. 7). The literature map also shows such a
bias, likely due to the restricted use of common cell lines in the underlying experiments. In
contrast, both HI-II-11 and PrePPI-HC exhibit a uniform interaction density across the full
spectrum of expression levels, likely explained by the more uniform expression of proteins
tested in Y2H and by the independence of the homology-based predictions from levels of
expression.
We next sought more broadly for sources of intrinsic bias that might influence the
appearance of sparsely populated territories, such as biological or knowledge-based
properties of proteins and their corresponding genes. For example, PrePPI-HC is virtually
devoid of interactions between proteins lacking Pfam domains, which is consistent with
conserved domains forming the basis of the prediction method used to produce PrePPI-HC
(Fig. 3c). HI-II-11 appears to be depleted, but not devoid, of interactions amongst proteins
containing predicted transmembrane helices (TMH) (Fig. 3c), consistent with postulated
limitations of the Y2H assay38. The Co-Frac network also appears to be depleted in
interactions involving TMH-containing proteins, which may be a consequence of membranebound proteins being filtered out during biochemical separations37.
In all we examined 21 protein or gene properties, roughly classified as expression-,
sequence-, and knowledge-based (Fig. 3d and Supplementary Table 10). We formally
9
Tasan et al.
defined a “dense zone” and a “sparse zone” of interactions by determining the location where
the imbalance of interaction density is the highest (Supplementary Table 11 and
Supplementary Fig. 8). We considered dense and sparse zones only when this imbalance
was significantly different and strongly deviating from a uniform distribution of interactions.
Besides the dense and sparse zones already described, we uncovered other biases, such as
HI-II-11 exhibiting an enrichment of interactions amongst proteins with extensive disordered
regions, a bias suggested in earlier Y2H binary datasets39. No property resulted in the
existence of dense and sparse zones in all four maps, suggesting that the observed
imbalances reflect heterogeneous testing and technical limitations, rather than the existence
of territories that are truly devoid of interactions.
To confirm that HI-II-11 interactions found in the apparently sparse zones of the three
other maps are of high biophysical quality, we assessed MAPPIT validation of HI-II-11
interactions across the interactome space ranked by the various protein properties examined.
MAPPIT validation rates of dense and sparse zone pairs were consistent for nearly all
properties (Fig. 3e and Supplementary Figure 9), indicating that HI-II-11 interactions
throughout the full interactome space are of equivalent biophysical quality.
In summary, current interactome maps contain vast uncharted territories that limit
knowledge of the full interactome landscape, whereas HI-II-11 provides high-quality
biophysical interactions within these under-explored regions.
Functional heterogeneity in interactome maps
One intuitive model is that all biophysical interactions of an organism correspond to functional
interactions, so that the biophysical interactome network strictly overlaps with a collapsed
version of all functional networks, in all cells and at all stages of development. An alternative
model suggests that the biophysical interactome overlaps more loosely with functional
10
Tasan et al.
networks, with not all biophysical interactions being biologically relevant11.
To determine whether the four interactome maps contain functionally relevant PPIs,
we compared them to functional networks where links between genes represent shared
functional annotations such as Gene Ontology (GO) terms or shared phenotypic annotations
inferred from mouse orthologs. All four interactome maps showed significant enrichments for
the four types of functional annotation examined (Fig. 4a), confirming their overall biological
relevance. Thus, our systematic biophysical map generated independently from pre-existing
biological information can indeed reveal functional relationships. Segregating HI-II-11
interactions with respect to dense and sparse zones found in other maps, we observed
consistent functional enrichments in both dense and sparse zones for all studied protein
rankings (Fig. 4b), supporting that HI-II-11 interactions in sparsely populated territories of
other maps are biologically relevant.
To investigate which properties of biophysical interactions might explain cofunctionality, we categorized interactions according to their “detectedness”, i.e. how
frequently they have been detected. We partitioned the high-quality Lit-BM-13 interactions
between those with two, three, and four or more pieces of evidence. We observed a clear
trend between detectedness and MAPPIT recovery rate, where pairs with the highest
detectedness were recovered approximately three times more often than those with the
lowest detectedness (Fig. 4c). There was a significantly higher functional enrichment for
interactions having four or more pieces of evidence than those having two or three pieces of
evidence, suggesting a possible correlation between how amenable interactions are to
detection and functional relevance. However, this trend may result from a stronger inspection
bias for interactions of higher detectedness, artificially increasing their observed functional
relevance.
To discriminate between these two possibilities, we partitioned our systematic binary
11
Tasan et al.
interaction datasets into sets of increasing detectedness: HI-II-11 interactions found in one of
the two initial screens, those found in both screens (Supplementary Table 5), and the subset
of our first systematic map (HI-I-05)22 detected in two different binary assays, MAPPIT and
Y2H (Supplementary Table 4). MAPPIT recovery rate and functional enrichment both
increased with increasing detectedness (Fig. 4c), suggesting a relationship between
detectedness and functional relevance in the human biophysical interactome (Fig. 4d).
Measurements of overlap between literature and systematic datasets have been used
to characterize the overall quality of the latter, with the presumption that small overlaps reflect
poor quality40,41. An alternative explanation is that interactome networks are larger than
originally assumed, with small overlaps resulting when each map detects only a small fraction
of the whole42. We reasoned that dense zones of the literature datasets could be surrogates
for the effective search space. Hence interacting protein pairs found in systematic screens
that reside in literature sparse zones should not be taken into account to measure overlaps.
Restricting the effective search space of both the systematic and literature maps, we
observed a significant increase in overlap for pairs of increasing detectedness (Fig. 4e and
Supplementary Notes). As detectedness increases, the fraction of interactions in the
systematic maps found in the literature maps rises from 2% to 25%, suggesting that literature
and systematic approaches converge.
We therefore re-investigated estimates of the human binary interactome size using
reference sets consisting of literature binary interactions with increasing detectedness. Our
estimates vary between ~600,000 and ~200,000 interactions, within the range of previous
estimates11,28 (Fig. 4f, Supplementary Table 12 and Supplementary Notes). The decrease of
size estimates with reference set detectedness suggests that the interactome is
heterogeneous with respect to detection and functional relevance. This result was robust
when controlling for effective search space and matching detectedness of systematic and
12
Tasan et al.
literature datasets. More importantly, these estimates suggest that current systematic and
literature datasets are only about an order of magnitude away from the full interactome at
every detectedness level.
Systematic binary interactome networks and human disease
Topological properties of nodes within interactome networks provide insights into disease10,43,
though studies of disease based solely on literature maps might reach potentially biased
conclusions because of inspection biases, as identified in the Lit-BM-13 map. The
significantly higher connectivity in Lit-BM-13 of cancer-associated gene products from the
Sanger Cancer Gene Census (“Cancer Census”)3 might be because they are more heavily
studied. Here we confirm using an unbiased systematic map that cancer proteins do indeed
have higher connectivity (P = 1 " 10-3, one-sided Wilcoxon test; Fig. 5a).
To investigate the role interactome maps may play in identifying cancer genes, we
tested the extent to which products of putative cancer genes identified in GWAS publications
can be connected to known cancer-related proteins (Supplementary Table 13). In HI-II-11
proteins encoded by genes in cancer-associated loci interacted with known cancer proteins
significantly more frequently than expected by chance, demonstrating how our map can be
used to prioritize genes within large loci (Fig. 5b). We found similar results when using a
range of linkage disequilibrium thresholds to define loci (Supplementary Fig. 10), and when
using the Lit-BM-13 or PrePPI-HC maps (Supplementary Fig. 11).
To further assess how well interactions detected without influence of inspection bias
may serve to prioritize cancer-associated genes, we investigated alternative network-free
genomic and functional genomic approaches for identifying genes likely to be involved in
tumorigenesis. We looked at three candidate gene sets: (i) ~200 genes with heightened nonsynonymous and deleterious somatic mutation rates in cancers ("Mut"), (ii) ~350 human
13
Tasan et al.
orthologs of mouse genes associated with tumorigenesis in Sleeping Beauty (“SB”)
transposon-based mutagenesis screens, and (iii) ~400 human proteins found to interact with
proteins from four DNA tumour virus families ("Vir")44. For the proteins present in HI-II-11, the
overlap between Mut, SB, and Vir sets and a reference set of 128 Cancer Census genes
enables us to evaluate the predictive ability of each approach (Fig. 5c). The frequency of
interaction in HI-II-11 between a protein and members of the SB and Vir sets significantly
increased predictive ability (SB: P = 1 " 10-31, Vir: P = 1 " 10-43, one-sided paired Wilcoxon
rank-sum test; Fig. 5c), arguing for network-guided analyses of systematic cancer gene
discovery efforts.
We directly measured for each protein its corrected frequency of interaction with
Cancer Census proteins, directly testing the “guilt-by-association” approach for transfer of
function and annotation through PPIs. This method provided vastly improved predictions (SB:
P = 1 " 10-115, one-sided paired Wilcoxon rank-sum test; Fig. 5c), and validates the existence
of modularity in interactome networks: that proteins sharing similar phenotypes or function
tend to directly interact with each other45. Instead of relying solely on Cancer Census
annotation, we sought to enhance cancer-gene prioritization by integrating information both
about (i) membership of a query protein in candidate lists (“guilt-by-profiling”) and (ii) that of
its interacting partners (“guilt-by-association”). We constructed a forward-stepwise logistic
regression model to combine these fundamentally node- and edge-centric data types, and
found a marked improvement in ability to highly rank Cancer Census proteins (P = 1 " 10-164,
one-sided paired Wilcoxon rank-sum test; Fig. 5c and Supplementary Table 14).
Our top-ranked prediction was the cyclin-dependent kinase 4 (CDK4), a gene already
in the Cancer Census list. CDK4 is connected in HI-II-11 to its inhibitors CDKN2C and
CDKN2D, and to its associated G1/S-specific cyclins CCND1 and CCND3 (Fig. 5d).
CDKN2C, CCND1, and CCND3 all have been causally implicated in tumorigenesis, while
14
Tasan et al.
CDK4 is frequently mutated in cancerous tissue and serves as a viral target of the E7 protein
of human papillomavirus and of the Large T protein of polyomavirus44.
The second ranked and top novel prediction is the coat protein II-coated vesicle
(COPII) component SEC24A, which associates with the COPII complex to mediate protein
transport from the endoplasmic reticulum (ER) (Fig. 5d). Two neighbours of SEC24A in HI-II11 are the Cancer Census members Ewing Sarcoma Breakpoint Region 1 (EWSR1) and
TRK-Fused Gene (TFG). Both contain a similar regulatory N-terminal region that is frequently
present in oncogenic fusion proteins found in several types of sarcoma. Inhibition of TFG
prevents protein secretion from the ER due to ER structural failure46 and depletion of the C.
elegans ortholog TFG-1 leads to a decline in COPII levels at ER exit sites47, together arguing
that TFG shares a function with SEC24A. MAPK1IP1L, a poorly studied protein found in
literature-defined sparse zones of the three other interactome maps, interacts not only with
the sarcoma-associated EWSR1 and TFG, but also with SS18L1, another sarcomaassociated Cancer Census gene. Thus, HI-II-11 offers compelling disease-related
hypotheses in sparsely populated territories of the interactome.
15
Tasan et al.
Discussion
While PPI networks can shed light on complex genotype-phenotype relationships, help to
identify relevant disease genes and unravel pathobiology mechanisms, the quality of current
maps and their ability to thoroughly cover the human interactome landscape has to be
carefully considered. Systematically screening half of the interactome space, independently
from any inspection bias, we tripled the number of high-quality binary PPIs available from the
literature. Covering regions of the human interactome landscape that are hardly touched by
other approaches, the systematic binary map provides functional context to thousands of
proteins, as demonstrated for candidates identified in unbiased cancer genetic screens.
Systematic binary mapping therefore stands as a powerful tool providing necessary scaffold
information to “connect the dots” of the genomic revolution.
How far away is a complete binary interactome map? Combining high-quality binary
pairs from the literature and systematic binary maps, 30,000 high-confidence interactions are
now available, only one order of magnitude away from what is to be found. A more pertinent
question might be whether all interactions are equally functional? In both systematic and
literature maps, functional enrichment increases with detectedness, suggesting that
detectedness can provide a means to better quantify how functional are individual
interactions, to identify functional modules45, and to understand heterogeneity in interactome
networks. However, detectedness may in part reflect variation in dissociation constants and
interactions with lower dissociation constants are more readily detected and hence better
characterized. Development of new, cost-effective systematic experimental approaches will
help to complete the mapping of the human interactome and also further assess the
relationships between detectedness, affinity and biological functionality.
Reference binary interactome maps of increased coverage and quality are prerequisite to interpret condition-specific interactions48 and characterize the effects of splicing
16
Tasan et al.
and genetic variation on interactions49. Just as a reference genome enabled detailed maps of
human genetic variation6, completion of a reference interactome map will enable deeper
insight into genotype-phenotype relationships in human.
17
Tasan et al.
METHODS SUMMARY
The systematic binary map was generated by screening all pairwise combinations of 15,517
ORFs from the hORFeome v5.1 collection. Interactome mapping was performed as
described previously26 but with the following modifications: (i) the space was screened twice
by Y2H to identify primary candidate interacting protein pairs, (ii) candidate interacting protein
pairs were identified either by Sanger or by next generation sequencing30, (iii) candidate
interacting protein pairs were verified in four independent retests and only those interactions
that scored positive at least three times were retained, (iv) to confirm the quality of the full
dataset, a subset of verified interactions were validated in three orthogonal assays MAPPIT,
PCA and wNAPPA relative to a high-quality literature derived reference set and a random
reference set of 460 and 698 pairs, respectively.
The Lit-2010 and Lit-2013 maps were generated by extracting all human PPIs annotated with
tractable publication records from seven databases15-21 through December 2010 and August
2013, respectively. PSI-MI experimental method codes were used to segregate the binary
PPIs and the number of pieces of evidence were counted.
All statistical tests were performed using implementations of the R software50. Empirical P
values were obtained using randomizations detailed in the respective methods. Full details
are provided in Supplementary Methods.
18
Tasan et al.
FIGURE LEGENDS
Figure 1 | An uncharted interactome territory in small-scale interaction maps. a,
Understanding genotype-phenotype relationships through interactome maps of increasing
completeness and quality. b, Flow chart (left) showing the assembly and segregation of
protein-protein interactions from the literature. Different pieces of evidence correspond to
different papers or methods. Histogram (right) of number of interactions for the full literature
and different binary subsets. c, Fraction of Lit-BS-10 and Lit-BM-10 pairs recovered by
MAPPIT at increasing RRS recovery rates (upper left). The shading indicates standard error
of the proportion. Fraction of Lit-BS-10, Lit-BM-10 or RRS pairs recovered by MAPPIT at 1%
RRS recovery rate (lower left), not found to co-occur in the literature as reported in the
STRING database (upper right), and recovered by Y2H (lower right). P values are based on
two-sided Wilcoxon rank sum tests. d, Fraction of full proteome and putative diseaseassociated gene products with protein-protein interactions in Lit-2010, Lit-BM-10, and
systematic binary screens. Error bars indicate standard error of the proportion. Data sources
are cited in Supplementary Information. e, Adjacency matrix showing Lit-BM-10 interactions,
with proteins ordered symmetrically along vertical and horizontal axes by number of
publications (upper histograms), and grouped in bins of ~350 proteins with median number of
publications per bin shown. The colour intensity of each square reflects the total number of
interactions between proteins in each bin. Total number of interactions per bin (lower
histogram). Number of gene products from GWAS loci, Mendelian disease genes, and
Sanger Cancer Gene Census genes per bin (circles).
Figure 2 | A second-generation binary interactome map. a, Improved empirical
parameters from first-generation to second-generation interactome mapping. b, Experimental
19
Tasan et al.
pipeline for identifying high-quality binary protein-protein interactions. ORF: Open-reading
frame, IST: Interaction Sequence Tag. c, Fraction of HI-II-11, Lit-BM-10, and RRS pairs
recovered by MAPPIT (left), PCA (centre), and wNAPPA (right) at increasing assay scores.
Shading indicates standard error of the proportion. d, Fraction of protein pairs in the PDB that
HI-II-11 detected or missed as a function of number of residue-residue contacts. Vertical
dashed lines indicate distribution medians. P value is based on a one-sided Wilcoxon rank
sum test. Schematic of residue-residue contact count (bottom). e, Schematic of domain pair
enrichment scoring method (top). Normalized frequency of all (left) and the top 25 (right)
domain pairs. 3did: database of three-dimensional interacting domains. f, Network showing
all possible domain combinations between HI-II-11 protein pairs involving the top 25 enriched
domain pairs (left). Example of enriched domain pair corresponding to a known direct
interaction (right). g, Total number of binary interactions in binary literature and systematic
binary interactome maps over the past 20 years. h, Adjacency matrices showing binary
literature and systematic binary maps at selected time points, with proteins ordered by
number of publications. Colour intensity reflects the total number of interactions between
proteins in each bin of ~350 proteins.
Figure 3 | Apparent sparse zones contain genuine biophysical interactions. a, Network
representations of HI-II-11, Lit-BM-13, Co-Frac, and PrePPI-HC interactome maps. b,
Topological properties of the four interactome maps. Venn diagrams illustrating the number of
shared proteins (upper) and interactions (lower) in the four maps. c, Adjacency matrices for
HI-II-11, Lit-BM-13, Co-Frac, and PrePPI-HC maps, with proteins ordered by number of
publications, mRNA abundance in HEK cells, fraction of protein sequence covered by Pfam
domains, and fraction of protein sequence covered by transmembrane helices. Median
values of the protein properties for each bin of ~350 proteins are shown (top histograms).
20
Tasan et al.
Colour intensity reflects the total number of interactions between proteins in each bin. Lower
histograms show total number of interactions per bin. d, Interaction density imbalances for 21
protein properties in the four maps. e, MAPPIT recovery rate of HI-II-11 pairs found in dense
and sparse zones of Lit-BM-13, Co-Frac, and PrePPI-HC at 1% RRS recovery. Error bars
indicate standard error of the proportion.
Figure 4 | Functional heterogeneity in interactome maps. a, Functional enrichment of
PPIs in the four interactome maps. Error bars indicate 95% confidence intervals. BP:
Biological process, MF: Molecular function, CC: Cellular component. b, Functional
enrichment of HI-II-11 interactions found in dense and sparse zones of Lit-BM-13, Co-Frac,
and PrePPI-HC. Error bars indicate 95% confidence intervals. c, MAPPIT recovery rate of
subsets of systematic binary and binary literature interactions of increasing detectedness at
1% RRS rate (left histograms; literature: Lit-BM-13 with 2, 3, and 4 or more pieces of
evidence; systematic: HI-II-11 found in one screen, two screens, and HI-I-05 found in two
assays). Error bars indicate standard error of the proportion. Functional enrichment of
systematic binary and binary literature interactions of increasing detectedness (right
histograms). Error bars indicate 95% confidence intervals. P values are based on two-sided
Fisher’s exact tests (* means P ! 0.05). d, Model for how interaction detectedness may
correlate with functionality. e, Fraction of shared interactions between the binary literature
and systematic binary maps of increasing detectedness, controlling or not for their effective
search spaces, i.e. space I and II for the systematic maps and the dense zones for the
literature maps. f, Binary human interactome size estimated using literature reference sets of
increasing detectedness (top left), or using reference sets and systematic binary maps of
matched detectedness and controlling for their effective search spaces (top right). Total
21
Tasan et al.
number of high-quality interactions at increasing detectedness in the union of binary literature
and systematic binary maps (bottom).
Figure 5 | Systematic binary interactome networks and human disease. a, Average
degree of Sanger Cancer Gene Census (Cancer Census) proteins and proteins not
associated to any disease in all four maps. P values are based on Wilcoxon rank sum test. b,
Schematic representing interactions between proteins encoded by genes in cancerassociated GWAS loci and known Cancer Census proteins (top left). Fraction of cancerrelated GWAS loci containing at least one gene encoding a protein that interacts with a
Cancer Census protein in 1,000 randomly selected loci genes (histogram) and the observed
value (arrow) for HI-II-11 (top right). Empirical P value is shown. Network representing
proteins encoded by genes in cancer-associated GWAS loci interacting with Cancer Census
proteins in HI-II-11, and an example network after randomizing genes (bottom). c, Schematic
of the cancer association scoring system (top). Predictive ability of separate and combined
models for cancer association scoring (box plot). P values are based on one-sided paired
Wilcoxon rank sum test. d, The 20 proteins with the highest cancer-association scores
according to the combined model. For four cancer candidates (bold) their characteristics and
their interacting partners in HI-II-11 are indicated. * Cancer Census proteins.
22
Tasan et al.
REFERENCES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian
Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic
Acids Res. 33, D514-517 (2005).
Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association
loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362-9367 (2009).
Futreal, P. A. et al. A census of human cancer genes. Nature Rev. Cancer 4, 177-183 (2004).
Forbes, S. A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic
Mutations in Cancer. Nucleic Acids Res. 39, D945-950 (2011).
Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546-1558 (2013).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human
genomes. Nature 491, 56-65 (2012).
Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nature
Rev. Genet. 12, 745-755 (2011).
Chin, L., Hahn, W. C., Getz, G. & Meyerson, M. Making sense of cancer genomic data. Genes
Dev. 25, 534-555 (2011).
Vidal, M., Cusick, M. E. & Barabási, A.-L. Interactome networks and human disease. Cell 144,
986-998 (2011).
Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to
human disease. Nature Rev. Genet. 12, 56-68 (2011).
Venkatesan, K. et al. An empirical framework for binary interactome mapping. Nature Methods 6,
83-90 (2009).
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of
the human genome. Nature 431, 931-945 (2004).
Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide
scale. Nature 490, 556-560 (2012).
Dreze, M. et al. 'Edgetic' perturbation of a C. elegans BCL2 ortholog. Nature Methods 6, 843-849
(2009).
Bader, G. D., Betel, D. & Hogue, C. W. BIND: the Biomolecular Interaction Network Database.
Nucleic Acids Res. 31, 248-250 (2003).
Chatr-Aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res.
41, D816-823 (2013).
Salwinski, L. et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 32,
D449-451 (2004).
Prasad, T. S. K. et al. Human Protein Reference Database-2009 update. Nucleic Acids Res. 37,
D767-772 (2009).
Kerrien, S. et al. The IntAct molecular interaction database in 2012. Nucleic Acids Res. 40,
D841-846 (2012).
Licata, L. et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40,
D857-861 (2012).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235-242 (2000).
Rual, J.-F. et al. Towards a proteome-scale map of the human protein-protein interaction
network. Nature 437, 1173-1178 (2005).
Stelzl, U. et al. A human protein-protein interaction network: a resource for annotating the
proteome. Cell 122, 957-968 (2005).
Cusick, M. E. et al. Literature-curated protein interaction datasets. Nature Methods 6, 39-46
(2009).
Eyckerman, S. et al. Design and application of a cytokine-receptor-based interaction trap. Nature
Cell Biol. 3, 1114-1119 (2001).
Dreze, M. et al. High-quality binary interactome mapping. Methods Enzymol. 470, 281-315
(2010).
von Mering, C. et al. STRING: a database of predicted functional associations between proteins.
23
Tasan et al.
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
Nucleic Acids Res. 31, 258-261 (2003).
Stumpf, M. P. et al. Estimating the size of the human interactome. Proc. Natl Acad. Sci. USA
105, 6959-6964 (2008).
Yang, X. et al. A public genome-scale lentiviral expression library of human ORFs. Nature
Methods 8, 659-661 (2011).
Yu, H. et al. Next-generation sequencing to generate interactome datasets. Nature Methods 8,
478-480 (2011).
Braun, P. et al. An experimentally derived confidence score for binary protein-protein
interactions. Nature Methods 6, 91-97 (2009).
Ramachandran, N. et al. Next-generation high-density self-assembling functional protein arrays.
Nature Methods 5, 535-538 (2008).
Nyfeler, B., Michnick, S. W. & Hauri, H. P. Capturing protein interactions in the secretory
pathway of living cells. Proc. Natl Acad. Sci. USA 102, 6350-6355 (2005).
Wass, M. N., David, A. & Sternberg, M. J. Challenges for the prediction of macromolecular
interactions. Curr. Opin. Struct. Biol. 21, 382-390 (2011).
Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290-301 (2012).
Stein, A., Ceol, A. & Aloy, P. 3did: identification and classification of domain-based interactions
of known three-dimensional structure. Nucleic Acids Res. 39, D718-723 (2011).
Havugimana, P. C. et al. A census of human soluble protein complexes. Cell 150, 1068-1081
(2012).
Stagljar, I. & Fields, S. Analysis of membrane protein interactions using yeast-based
technologies. Trends Biochem. Sci. 27, 559-563 (2002).
Bjorklund, A. K., Light, S., Hedin, L. & Elofsson, A. Quantitative assessment of the structural bias
in protein-protein interaction assays. Proteomics 8, 4657-4667 (2008).
Bader, G. D. & Hogue, C. W. Analyzing yeast protein-protein interaction data obtained from
different sources. Nature Biotechnol. 20, 991-997 (2002).
von Mering, C. et al. Comparative assessment of large-scale data sets of protein-protein
interactions. Nature 417, 399-403 (2002).
Futschik, M. E., Chaurasia, G. & Herzel, H. Comparison of human protein-protein interaction
maps. Bioinformatics 23, 605-611 (2007).
Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685-8690 (2007).
Rozenblatt-Rosen, O. et al. Interpreting cancer genomes using systematic host network
perturbations by tumour virus proteins. Nature 487, 491-495 (2012).
Mitra, K., Carvunis, A.-R., Ramesh, S. K. & Ideker, T. Integrative approaches for finding modular
structure in biological networks. Nature Rev. Genet. 14, 719-732 (2013).
Beetz, C. et al. Inhibition of TFG function causes hereditary axon degeneration by impairing
endoplasmic reticulum structure. Proc. Natl Acad. Sci. USA 110, 5091-5096 (2013).
Witte, K. et al. TFG-1 function in protein secretion and oncogenesis. Nature Cell Biol. 13, 550558 (2011).
Zhang, J. & Lautar, S. A yeast three-hybrid method to clone ternary protein complex
components. Anal. Biochem. 242, 68-72 (1996).
Zhong, Q. et al. Edgetic perturbation models of human inherited disorders. Mol. Syst. Biol. 5, 321
(2009).
R Core Team. R: A language and environment for statistical computing. (R Foundation for
Statistical Computing, 2013).
Supplementary information is available in the online version of the paper.
24
Tasan et al.
Acknowledgements: The authors wish to acknowledge members of the Center for Cancer
Systems Biology (CCSB) and particularly H. Yu for helpful discussions. This work was
supported by the following grants and agencies: NHGRI grants R01HG001715 to M.V.,
D.E.H., F.P.R. and J.T. and P50HG004233 to M.V., F.P.R. and A.-L.B.; NHLBI grants
U01HL098166 subaward to M.V. and U01HL108630 subaward to A.-L.B.; NCI grant
U54CA112962 subaward to M.V.; NICHD ARRA R01HD065288 to L.M.I.; NIMH
R01MH091350 to L.M.I. and T.H.; NSF Grant CCF-1219007 to Y.X.; Krembil Foundation,
Canada Excellence Research Chair, Ontario Research Fund–Research Excellence Award,
Canadian Institute for Advanced Research Fellowship to F.P.R.; grant CSI07A09 from Junta
de Castilla y Leon (Valladolid, Spain), grant PI12/00624 from Ministerio de Economia y
Competitividad (AES 2012, ISCiii, Madrid, Spain,) and grant i-Link0398 from Consejo
Superior de Investigaciones Científicas (CSIC, Madrid, Spain) to J.D.L.R.; Spanish Ministerio
de Ciencia e Innovación (BIO2010-22073) and the European Commission through the FP7
project SyStemAge grant agreement n:306240 to P.A.; Group-ID Multidisciplinary Research
Partnerships of Ghent University and grant FWO-V G.0864.10 from the Fund for Scientific
Research Flanders to J.T.; I.L. is a postdoctoral fellow with the FWO-V; Institute Sponsored
Research funds from the Dana-Farber Cancer Institute Strategic Initiative to M.V. M.V. is a
"Chercheur Qualifié Honoraire" from the Fonds de la Recherche Scientifique (FRS-FNRS,
Wallonia-Brussels Federation, Belgium).
Author Contributions I.L., N.S., S.Y., Q.Z., X.Y., L.G., D.B., P.B., M.P.B., D.C.-Z., R.C.,
E.D., M.D., A.D., F.G., B.J.G., R.K., A.M., R.R.M., M.M.P., X.R., J.R., P.R., V.R., J.S., An.S.,
Ak.S., S.T., S.A.T., J.-C.T., K.V. and J.W. performed experiments or contributed new
reagents. M.T., T.R., S.J.P., B.C., Ce.F., R.M., A.K., S.G., A.-R.C., Ch.F., E.F., M.J., S.K.,
G.N.L., J.M., Am.S., Y.S. and Y.X. performed computational analyses. M.T., T.R., S.J.P.,
B.C., M.E.C., J.D.L.R., M.A.C., D.E.H., T.H., F.P.R. and M.V. wrote the manuscript. A.-L.B.,
L.I., P.A., J.D.L.R., J.T., M.A.C., D.E.H., T.H., F.P.R. and M.V. designed or advised research.
Author Information The authors declare no competing financial interests. Correspondence
and requests for materials should be addressed to M.A.C.
([email protected]). D.E.H. ([email protected]), T.H.
([email protected]), F.P.R. ([email protected]) and M.V.
([email protected]).
25
Tasan et al.
1a
Mendelian
traits
Complex
traits
Genotype-phenotype relationships
Underlying
interactome
View from
available maps
Completeness and quality of maps
Gene or gene product
Biophysical interaction
Functional interaction relevant for a phenotype of interest
Gene or gene product with no characterized interactions
Uncharacterized interaction
26
Tasan et al.
Lit-BS-10
27
0
em
at
-1
st
M
Sy
01
t-2
Li
Lit-BM-10
ic
0
0
Pieces of evidence
1
2
3
4+
9,056 2,646 1,013 1,247
10
10
Binary
5,871
t-B
Non-binary
0
20
S-
Binary
13,962
30
t-B
Non-binary
42,781
Systematic
5,871
40
Li
Non-systematic
56,743
50
Li
7 public databases (Lit-2010)
(PPIs with valid evidence)
62,163
Number of interactions (thousands)
1b
Tasan et al.
1c
STRING literature mining
Fraction of pairs not found to co-occur (%)
MAPPIT assay
Fraction of pairs recovered (%)
30
25
20
15
10
5
0
0
1
2
3
4
100
P < 2 × 10-16
80
60
40
20
0
5
Fraction of RRS pairs recovered (%)
Y2H assay
P = 1 × 10-17
20
Fraction of pairs recovered (%)
Fraction of pairs recovered (%)
20
P = 5 × 10-4
15
10
P = 0.03
5
P = 0.09
15
10
5
0
0
RRS pairs
Binary literature pairs:
Lit-BS-10
Lit-BM-10
28
P = 5 × 10-15
P = 8 × 10-3
Tasan et al.
1d
Fraction of gene products with reported interactions (%)
100
80
60
40
20
0
Whole
proteome
(n = 19,111)
Well-accepted cancer
driver
(n = 124)
Sanger Cancer
Gene Census
(n = 465)
Mendelian disorder
gene
(n = 2,826)
Systematic cancer
screen gene
(n = 2,996)
GWAS loci gene
for any trait
(n = 2,883)
100
80
60
40
20
0
Lit-2010
Lit-BM-10
Systematic binary screens
29
Tasan et al.
Number of
publications
1e
1,000
100
10
1
Lit-BM-10
Total number
of interactions
Number of interactions
0 10 •
1,000
500
0
Sanger Cancer Gene Census products
GWAS loci gene products
Mendelian disorder gene products
30
Protein count
128
32
8
Tasan et al.
2a
Parameters
Methods
7K
Completeness
Relative gain
19K
13K 19K 3.1X space
7K
19K
Interactions
Sampling
sensitivity
19K
Interactions
13K
1
Assay
sensitivity
Precision
PRS
RRS
1.6X sensitivity
1
2
PRS
RRS
1.3X sensitivity
92 PRS
92 RRS
1 assay
~500 PRS
~700 RRS
3 assays
2005
2011
31
> 5X reference sets
Tasan et al.
2b
Screen (2X)
Verification
Validation
~13,000 proteins
hORFeome 5.1
~66,000 ORF pairs
4x pairwise retested
~800 protein pairs
selected at random
~82,000,000
protein pairs tested
Sequence verification
of 3x and 4x positives
Tested in 3 binary
orthogonal assays
~35,000 IST pairs
identified
~15,000 gene pairs
verified
Recovery compared
to Lit/RRS pairs
HI-II-11
~14,000 interactions
between ~4,300 proteins
32
Tasan et al.
2c
Fraction of pairs recovered (%)
MAPPIT
30
PCA
wNAPPA
30
30
25
25
20
20
15
15
15
10
10
10
5
5
5
0
0
0
HI-II-11
Lit-BM-10
25
RRS
20
1
2
3
4
5
1
2
3
4
5
Score threshold (arbitrary units)
33
1
2
3
4
5
Tasan et al.
2d
Fraction of protein pairs (%)
P = 1.3 × 10-10
Detected (189)
Missed (498)
8
6
4
2
0
0
50
100
150
200
1XPEHURIUHVLGXHïUHVLGXHFRQWDFWV
34
Tasan et al.
2e
Domain pairs in
HI-II-11 interacting
proteins
1,000 randomized
HI-II-11 networks
?
Normalized
frequency
60% precision
'RPDLQSDLUVQ Normalized frequency
0 60 80
Three-dimensional
structures of domain-based
interactions
Normalized frequency
0 60 80
Synaptobrevin :: SNARE
SCAN :: SCAN
PX :: Reticulon
Hormone_recep :: Hormone_recep
Syntaxin :: Synaptobrevin
Arm :: Ran_BP1
6NSB32=)ïEo[ïOLke
6NS)ïEo[ïOLke
SNARE :: SNARE
HorPRQHBUHFHS]Iï&
Syntaxin :: Syntaxin
CARD :: BIR
]Iï&]Iï&
IQ :: EF_hand_5
6&$10\EB'1$ïELQGB
Ras :: PBD
3'=/
84BFRQ]Iï5,1*B
LIM :: LIM_bind
bZIP_1 :: bZIP_1
Syntaxin :: SNARE
GAGE :: BACK
CH :: CH
Filament :: Filament
UQ_con :: IBR
35
Domain pairs:
,QGLG
1RWLQGLG
Tasan et al.
2f
Protein
Possible domain pairs for HI-II-11 interactions:
Predicted and in 3did (TP = 182)
Predicted and not in 3did (FP = 81)
Not predicted and in 3did (FN = 91)
Not predicted and not in 3did (TN = 318)
IQCE
KIAA1683
IQGAP3
CALM1
CALM2
Protein domains
EF_hand_5 CH
IQ
RasGAP_C
36
CAMTA2
CALML3
RasGAP
CG_1
TIG
Tasan et al.
Number of binary
interactions (thousands)
2g
20
Binary literature (multiple pieces of evidence)
Systematic binary screens
15
10
5
0
Time
1994
2000
2005
37
2010
2013
Tasan et al.
2h
Space ordered by number of publications
1994
1998
2002
2006
Number of interactions
0 10 •
Systematic binary maps
Binary literature maps
38
2010
2013
Time
Tasan et al.
3a
HI-II-11
Lit-BM-13
Co-Frac
PrePPI-HC
Protein
Protein-protein interaction
39
Tasan et al.
3b
HI-II-11
Lit-BM-13
Co-Frac
PrePPI-HC
Nodes
4,303
5,545
2,999
4,989
Edges
13,944
11,045
13,982
25,403
Nodes in largest component
4,100
4,919
2,820
3,681
Mean degree
6.348
3.772
9.324
10.183
Mean clustering
0.053
0.065
0.164
0.399
Mean distance
4.063
5.305
4.506
5.251
Power law exponent
2.195
1.659
1.462
1.451
Proteins (heterodimers)
Lit-BM-13
Co-Frac
HI-II-11
1,491
893
499
701
1,766
276
269
253
416
299
94
PrePPI-HC
1,870
489 1,214
336
Protein-protein interactions (heterodimers)
Lit-BM-13
Co-Frac
HI-II-11
9,128
13,347
195
255
13,045
14
27
287
92
9
10
27
40
40
PrePPI-HC
24,410
435
Tasan et al.
3c
Number of
publications
Number of interactions
0 10 •
HI-II-11
Lit-BM-13
Co-Frac
PrePPI-HC
mRNA abundance
in HEK cells
1,000
1,000
100
100
10
10
1
1
HI-II-11
Lit-BM-13
Co-Frac
Total number of interactions
PrePPI-HC
500
0
2,000
1,000
0
3,000
2,500
2,000
1,500
1,000
500
0
2,500
2,000
1,500
1,000
500
0
41
Fraction of sequence
in transmembrane
helices
Fraction of sequence
in Pfam domains
100
50
50
25
0
0
Tasan et al.
HII
Lit I-11
-B
Co M-1
-F 3
Pre rac
PP
I-H
C
3d
mRNA abundance in HEK cells
mRNA abundance in HeLa cells
Protein abundance in HEK cells
Protein abundance in HeLa cells
Number of publications
First publication date
Number of protein complexes
Number of pathways
Number of GO associations
Fraction of sequence in Pfam domains
Fraction of basic amino-acids
Fraction of charged amino-acids
Fraction of acidic amino-acids
Fraction of polar amino-acids
Fraction of polar and uncharged amino-acids
Fraction of hydrophobic amino-acids
Fraction of sequence in transmembrane helices
Number of binding sites
Fraction of sequence in disordered regions
Number of linear motifs
ORF length
Enrichment
Balanced representation
Depletion
42
Abundance
Knowledge
Sequence
Tasan et al.
3e
Number of
publications
(mirrored from Lit-BM-13)
Dense
zone
Sparse
zone
Fraction of pairs
recovered in
MAPPIT (%)
Mirrored zones
in HI-II-11
20
15
10
5
0
43
mRNA abundance
in HEK cells
(mirrored from Co-Frac)
Fraction of sequence
in Pfam domains
(mirrored from PrePPI-HC)
Tasan et al.
4a
1,000
HI-II-11
100
10
1
Lit-BM-13
1,000
10
1
1,000
Co-Frac
100
10
1
1,000
PrePPI-HC
100
10
M
en
e
ou On
se tol
ph ogy
en
ot
yp
e
U s
ni
on
BP
M
F
C
C
1
G
Odds ratio
100
44
Tasan et al.
4b
Mirrored zones
in HI-II-11
GO MF
GO CC
1,000
100
10
1
1,000
100
10
1
Union
Odds ratio
1,000
100
10
1
Mouse
phenotypes
GO BP
Dense
zone
Sparse
zone
Number of
publications
(mirrored from Lit-BM-13)
1,000
100
10
1
1,000
100
10
1
45
mRNA abundance
Fraction of sequence
in HEK cells
in Pfam domains
(mirrored from Co-Frac) (mirrored from PrePPI-HC)
Tasan et al.
*
*
10
RRS
20
Odds ratio
30
0
100
20
10
*
*
10
*
*
1
MF
CC
Mouse
phenotypes
Union
1,000
100
*
*
10
1
0
*
*
*
BP
Odds ratio
*
*
*
Gene Ontology
100
30
*
1,000
Literature map
detectedness
RRS
MAPPIT recovery (%)
MAPPIT recovery (%)
4c
Systematic map
detectedness
46
Tasan et al.
4d
“True” biophysical interaction network
Pseudointeractions
Functional
interactions
Detectedness
Fraction of
interactions
47
Tasan et al.
4e
Fraction of
systematic map
in literature map (%)
Fraction of
literature map
in systematic map (%)
Detectedness
Lit-BM-13 map
(pieces of evidence)
Systematic map
2
3
•
HI-II-11 in 1 screen
HI-II-11 in 2 screens
HI-I-05 in 2 assays
10
10
10
8
8
6
6
4
4
4
2
2
2
0
0
0
10
10
40
8
8
30
8
6
P = 2 × 10-3
6
4
P = 9 × 10-5
6
P = 1 × 10-6
4
P = 1 × 10-3
20
2
2
10
0
0
0
Effective search space:
48
P = 5 × 10-6
Not controlled
Controlled
Tasan et al.
4f
Detectedness
Binary interactome
size (thousands)
800
600
400
200
0
Lit-BM-13 map
(pieces of evidence)
Systematic map
Controlled for
systematic map
•2
•3
Controlled for
both maps
•4
HI-II-11
•2
•3
•4
•
2
2
HI-II-11 HI-I-05
(screens) (assays)
Mapped binary
interactions (thousands)
Effective search space
49
50
40
30
20
10
0
Tasan et al.
5a
Knowledge based
P = 1 × 10-7
Average degree
20
15
10
Systematic
P = 0.01
P = 0.2
P = 1 × 10-31
5
0
ac
3
-1
t-
BM
Li
Co
1
-1
C
r
-F
e
Pr
P
-H
PI
I
I-I
H
Cancer Census proteins
Proteins not in OMIM or Cancer Census
50
Tasan et al.
Sanger Cancer
Gene Census
Cancer-associated GWAS loci
and
400
Frequency in
random networks
5b
300
P = 0.02
200
100
0
0.0
0.2
0.4
0.6
Fraction of loci with
cancer candidates
Gene product in HI-II-11
Interaction in HI-II-11
MIR4646 MSH5-SAPCD1 LY6G6F
CDSN
PSORS1C2
PSORS1C1
CCHCR1
SRSF2
CSNK2B
CLIC1
HSPA1B
HSPA1A
APOM
C6orf47
HSPA1L
C6orf48
SNORA38
VARS
LY6G5B
LY6G6D
BAG6
ABHD16A
LY6G6E
PRRC2A
LY6G6C
LY6G5C
VWA7
C6orf25
GPANK1
LSM2
C15orf55
DDX6
DDAH2
SNORD48 SNORD52
TPM3
ZBTB38
REL
LMO1
XPA
FLI1
TRIM27
PDE4DIP
LASP1
LOC401242
TNFRSF6B
ZGPAT
CTBP2
STAMBPL1
CHRNA5 CHRNA3
ARHGEF5
ARFRP1 RTEL1-TNFRSF6B
LIME1
RTEL1
ACTA2
FAS
IKZF1
CDK6
C9orf53
DOCK5
KCTD9
FARP2
CDCA2
GNRH1
HDLBP
PPP2R2A
EBF2
ANKLE1
CDKN2B-AS1 CDKN2A
MTAP
MIR4296
BABAM1
SEPT5
SEPT2
PSMA4 AGPHD1
SLBP
CEP57L1
TMEM129
SESN1
ARMC2
ASPSCR1
TACC3
CDKN2B
SEPT6
Randomized genes
Gene in GWAS locus whose product:
Is present in HI-II-11
Is absent from HI-II-11
GWAS locus
Protein in Sanger Cancer Gene Census
HI-II-11 interaction
51
Tasan et al.
5c
HI-II-11 neighbor data
“guilt-by-association”
SB
Mu
Vir t
SB
Mu
Vir t
Ce
n
su
s
Query node data only
“guilt-by-profiling”
Protein A
Protein P
Protein B
Protein C
0 0 1 2
In the set
Not in the set
correct for frequencies
accross random networks
Cancer
association
score
SB = Sleeping Beauty transposon-based mouse cancer screen
Mut = Somatic mutations screen in cancer tissues
Vir = Tumor virus host targets
Census = Sanger Cancer Gene Census
“Guilt-byprofiling”
Mut
Vir
P = 1 × 10-43
SB
P = 1 × 10-31
Mut
“Guilt-byassociation”
Vir
SB
Census
P = 1 × 10-115
P = 1 × 10-164
Combined model
5
10
15
20
Average precision of predicted cancer proteins (%)
52
Tasan et al.
SB
Mu
Vir t
Ce
ns
us
*
5d
Top 20 ranked proteins
SB
Mu
Vir t
CCND1*
CCND3*
CDKN2C*
CDK4*
CDK4* 43
SEC24A 35
CDKN2D
STAT3 33
HOOK1
GRB2 32
CDKN2C* 31
1 0 1 3
MAPK1IP1L 27
EWSR1*
SEC24A
TFG*
1 0 1 2
HNRNPM 23
ATPAF2 22
ARHGEF16
DDX6* 21
C1orf94
NAGK 20
CEP55
NFE2L2* 20
EWSR1*
CDKAL1 20
HGS
MAPK1IP1L
SH2D4A 25
AP2B1 19
HNRNPUL1
ARMC8 17
RBFOX2
PSME4 17
SS18L1*
KRAS* 16
SUMO1
SUMO1P1
TFG*
UBE2I
WAC 16
TAX1BP1 15
NEUROG3 15
VPS37C
4 1 4 3
TCF12*
NEUROG3
REL*
2 0 1 1
53
In the set
Not in the set
Téléchargement