Vast uncharted territories in the human interactome landscape uncovered by a second-generation map Murat Ta!an1,2,3,4*, Thomas Rolland1,5*, Samuel J. Pevzner1,5,6,7*, Benoit Charloteaux1,5*, Irma Lemmens8^, Celia Fontanillo9^, Roberto Mosca10^, Nidhi Sahni1,5^, Song Yi1,5^, Atanas Kamburov1,5^, Susan Ghiassian1,11^, Quan Zhong1,5^, Xinping Yang1,5^, Lila Ghamsari1,5^, Dawit Balcha1,5, Pascal Braun1,5, Martin P. Broly1,5, Anne-Ruxandra Carvunis1,5, Dan Convery-Zupan1,5, Roser Corominas12, Elizabeth Dann1,5, Matija Dreze1,5, Amélie Dricot1,5, Changyu Fan1,5, Eric Franzosa1,13, Fana Gebreab1,5, Bryan J. Gutierrez1,5, Mike Jin1,5, Shuli Kang12, Ruth Kiros1,5, Guan Ning Lin12, Andrew MacWilliams1,5, Jörg Menche1,11, Ryan R. Murray1,5, Matthew M. Poulin1,5, Xavier Rambout1,5,14, John Rasla1,5, Patrick Reichert1,5, Viviana Romero1,5, Julie M. Sahalie1,5, Annemarie Scholz1,5, Akash A. Shah1,5, Amitabh Sharma1,11, Yun Shen1,5, Stanley Tam1,5, Shelly A. Trigg1,5, Jean-Claude Twizere1,5,14, Kerwin Vega1,5, Jennifer Walsh1,5, Michael E. Cusick1,5, Yu Xia1,13, Albert-László Barabási1,11,15, Lilia Iakoucheva12, Patrick Aloy10,16, Javier De Las Rivas9, Jan Tavernier8, Michael A. Calderwood1,5, David E. Hill1,5, Tong Hao1,5, Frederick P. Roth1,2,3,4 & Marc Vidal1,5 1 Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, 2 Boston, Massachusetts 02215, USA. Departments of Molecular Genetics and Computer Science, University of 3 Toronto, Toronto, Ontario M5S 3E1, Canada. Donnelly Centre, University of Toronto, Toronto, Ontario M5S 4 3E1, Canada. Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto, Ontario M5G 1X5, Canada. 5 6 Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA. Department of 7 Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA. Boston University School of 8 Medicine, Boston, Massachusetts 02118, USA. Department of Medical Protein Research, VIB, 9000 Ghent, 9 Belgium. Cancer Research Center (Centro de Investigación del Cancer), University of Salamanca and Consejo 10 Superior de Investigaciones Científicas, Salamanca 37008, Spain. Joint IRB-BSC Program in Computational 11 Biology, Institute for Research in Biomedicine, Barcelona 08028, Spain. Center for Complex Networks Research (CCNR) and Department of Physics, Northeastern University, Boston, Massachusetts 02115, USA. 12 13 Department of Psychiatry, University of California, San Diego, La Jolla, California 92093, USA. Department 14 of Bioengineering, McGill University, Montreal, Quebec H3A 0C3, Canada. Protein Signaling and Interactions 15 Lab, GIGA-R, Université de Liège, 4000 Liège, Belgium. Department of Medicine, Brigham and Women’s 16 Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA. Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona 08010, Spain. * These authors contributed equally to this work and should be considered co-first authors. ^ These authors contributed equally to this work and should be considered co-second authors. Tasan et al. Complex cellular systems and interactome networks underlie most genotypephenotype relationships. Just as a high-quality human reference genome sequence revolutionized genotype determination a decade ago, proteome-scale reference maps of protein-protein, protein-nucleic acid, and enzyme-substrate interactions should provide scaffold information necessary to interpret the increasing amount of genotypic data. Here we describe a new empirically-controlled “second-generation” map of the human binary protein-protein interaction network, expanding coverage over similarquality interactions from the literature by three-fold, altogether bringing us only one order of magnitude away from the complete network. The map uncovers vast uncharted interactome “territories” rich in genetic material related to complex human phenotypes. Interaction “detectedness” correlates with co-functionality, suggesting that the biophysical interactome is biologically heterogeneous. Increased and uniform coverage of the interactome network helps characterize genotype-phenotype associations, particularly for cancer. Thus a reference map of the human interactome network is now within reach. 1 Tasan et al. Complex cellular systems formed by large numbers of interactions amongst genes and gene products, or interactome networks, underlie most cellular functions. Understanding how perturbations of interactome networks relate to inherited and somatic disease susceptibilities will be required for predictive models of genotype-phenotype relationships in human. A nearly full description of all genotypic variations in the human population is expected to be available soon. Several thousand genes associated with Mendelian disorders have been defined1, susceptibility loci have been uncovered for a few hundred complex traits2, and several hundred putative cancer genes have been identified3-5. With rapidly evolving nextgeneration sequencing platforms, large numbers of human genome sequences will soon be available from unaffected individuals6, patients stricken by virtually any disease7, and diverse tumours8. As genome-wide genotypic information accumulates, many fundamental questions pertaining to genotype-phenotype relationships remain unresolved9. The causal changes and mechanisms that connect genotype to phenotype are still generally unknown, especially for complex trait loci and cancer-associated mutations. Even when identified, it is often unclear how a causal change perturbs the function of the corresponding gene product. Lastly, how such changes perturb interactome networks is largely unknown, which has hindered comprehensive understanding of gene-gene interactions. All in all, to “connect the dots” of the genomic revolution, functions and context must be assigned to hundreds of thousands of genotypic changes. Globally functionalizing and contextualizing genotype-phenotype relationships requires genome and proteome-scale maps of interactome networks formed by macromolecular interactions such as protein-protein interactions (PPIs), protein-DNA interactions, and posttranslational modifiers and their targets9,10. Although first generation interactome maps have 2 Tasan et al. provided network-based explanations for some genotype-phenotype relationships9, it is difficult to establish how fully these maps represent the complete network and whether their quality is sufficient to derive accurate interpretations (Fig. 1a). There is a dire need for empirically controlled11 and high-quality “second-generation” reference maps, reminiscent of the high-quality human reference genome sequence12 that revolutionized genotyping. The challenges are manifold. Even considering only one splice variant per gene, close to 20,000 protein-coding genes must be handled and ~200 million protein pairs tested to generate a comprehensive binary protein-protein interactome reference map. Accordingly, whether such a comprehensive network could ever be mapped by the collective efforts of small-scale studies remains uncertain. Computational predictions of protein interactions can generate information at proteome scale13 but are inherently limited by biases in currently available knowledge used to infer such interactome models. Should interactome maps be generated for all individual human tissues using biochemical co-complex association data, or alternatively, might direct binary interaction information on all possible PPIs be preferable or complementary? Finally, even with nearly complete, high-quality reference interactome maps of biophysical interactions in hand, how to evaluate the biological relevance of each interaction in physiological conditions11,14? Here, we begin to address these questions by generating a second-generation map of the human binary interactome and comparing it to alternative network maps. 3 Tasan et al. An uncharted interactome territory in small-scale interaction studies To investigate whether small-scale studies described in the literature are adequate to qualitatively and comprehensively map the human binary protein-protein interactome network, we assembled data available in 2010 from seven public databases15-21 (Lit-2010). After eliminating protein pairs lacking any evidence of direct binary interaction and filtering out evidence from proteome-scale interactome screens11,22,23, ~14,000 putative direct binary PPIs remained (Fig. 1b). Pairs reported in only a single publication and detected by only a single method have higher rates of curation errors than interactions reported in multiple papers and/or with multiple detection methods24, yet represented nearly two thirds of the binary literature set. We therefore segregated binary interactions with a single piece of evidence (Lit-BS10) from binary interactions with multiple pieces of evidence (Lit-BM-10), and experimentally tested subsets of both by mammalian PPI trap (MAPPIT)25 and yeast two-hybrid (Y2H)26 assays. Lit-BS-10 pairs validated at rates that were only slightly higher than the randomly selected protein pairs used as negative control (random reference set; RRS) and considerably lower than Lit-BM-10 pairs (Fig. 1c and Supplementary Notes). Protein pairs from Lit-BS-10 were also found to co-occur in the literature significantly less than Lit-BM-10 pairs as indicated by STRING literature mining scores27 (Fig. 1c and Supplementary Notes), suggesting that these pairs were less thoroughly studied even in their corresponding publications. Thus, we restricted Lit-2010 to 4,906 high-quality Lit-BM-10 binary interactions (Supplementary Table 1). This highlights the need for systematic approaches, since estimates of the number of PPIs in the human interactome are close to two orders of magnitude larger than Lit-BM-1011,28. 4 Tasan et al. The limited number of high-quality PPIs drastically diminishes the ability to prioritize putative disease genes identified in systematic studies. Although cancer driver gene products5 are well represented in Lit-BM-10, less than 30% of proteins encoded by genes identified in systematic genome-wide cancer screens have any reported interactions in LitBM-10 (Fig. 1d). Coverage was similarly low for proteins associated with Mendelian traits or identified in genome-wide association studies (GWAS). Such differences might be explained by the inspection bias inherent to small-scale studies. The ~20,000 proteins of the human proteome are not studied with equal intensity; some are described in hundreds of publications while most have been mentioned only in a few (Fig. 1e). Ranking proteins by the number of publications that mention them, we observed a strong bias of interactions towards highly studied proteins in contrast to a sparsely populated territory for interactions involving less studied proteins. Candidate gene products identified in genome-wide disease screens distribute homogeneously across the publication-ranked interactome space, again highlighting the need for systematic PPI maps to cover this uncharted interactome territory. A second-generation binary interactome map The enrichment of interactions detected for highly studied proteins in Lit-BM-10 might indicate that well studied proteins are truly involved in more PPIs. Under this model, more than half of all predicted human proteins would not participate substantially in the interactome network. Alternatively, the sparse territory could be homogenously populated by real PPIs that have been missed so far due to sociological or experimental biases. To distinguish between these possibilities and to increase the coverage of the interactome space, we generated a new systematic binary interaction map, interrogating a search space four times larger and with deeper sensitivity than in our first proteome-scale attempt22 (Fig. 2a, Supplementary Fig. 1 and Supplementary Notes). All pairwise 5 Tasan et al. combinations of proteins corresponding to ~13,000 genes (Space II; Supplementary Table 2)29 were screened twice, and we identified pairs of genes encoding interacting proteins from ~85,000 Y2H positive colonies using next-generation sequencing30 (Fig. 2b and Supplementary Notes). After quadruplicate retesting, ~14,000 distinct protein pairs were considered verified interactions. To validate the quality of these PPIs, we compared a sample of ~800 verified interactions both to a positive reference set of ~500 binary interactions from Lit-BM-10 and to our RRS as a negative control. All ~2,000 protein pairs were tested in three binary assays: MAPPIT25, a well-based nucleic acid programmable protein array (wNAPPA)31,32, and a fluorescence-reconstituting protein-fragment complementation assay (PCA)33. The verified pairs exhibited validation rates that were statistically indistinguishable from the positive reference set (Fig. 2c and Supplementary Tables 3 and 4), demonstrating the high quality of our PPIs. Our new human interactome dataset covering Space II and first released online in 2011 (“HI-II-11”; Supplementary Table 5) contains 13,944 interactions amongst 4,303 distinct proteins (517 homodimers and 13,427 heterodimers). To determine the extent to which HI-II-11 contains direct binary interactions, we used three-dimensional co-crystal structures from the Protein Databank (PDB)21 to distinguish direct contacts from indirect associations. Out of 191 HI-II-11 interactions with such structural information, 189 were in direct contact in the corresponding crystal structure, demonstrating the high precision of HI-II-11 in reporting direct PPIs (99% precision; P = 4 " 10-5, one-sided binomial test; Supplementary Table 6 and Supplementary Notes). To investigate whether particular classes of PPIs might be refractory to our assays, we evaluated how the number of residue-residue contacts in three-dimensional structures related to our ability to detect interactions34. Among interacting proteins found in PDB complexes, the pairs reported in HI-II-11 tended to involve more reside-residue contacts than those that were 6 Tasan et al. missed (medians of 62 and 44 respectively, P = 1.27 " 10-10, one-sided Wilcoxon rank-sum test; Fig. 2d and Supplementary Table 7). This result suggests that Y2H sensitivity can be affected by the number of residue-residue contacts and thus presumably by interaction affinity. PPIs are commonly mediated by protein domains such as those catalogued in the Protein family (Pfam) database35. Observing that HI-II-11 is enriched for proteins containing domains known to be involved in domain-domain interactions36 (Supplementary Fig. 2), we reasoned that the pairing of such domains in interacting proteins provides a strong hypothesis for the mechanism of that interaction. To identify domain pairs mediating HI-II-11 interactions, we looked for pairs of Pfam domains in interacting proteins that appeared more frequently than expected (Fig. 2e and Supplementary Table 8). HI-II-11 is significantly enriched for domain pairs supported by three-dimensional structures36 (P = 6 " 10-11, Wilcoxon rank sum test), further demonstrating that HI-II-11 is comprised of direct binary interactions. For PPIs involving the 25 most enriched domain pairs, we could identify the structurally supported domain pairs among all possible ones with high accuracy (70% sensitivity and 80% specificity), as exemplified by the calmodulin-binding IQ and EF-hand domains (Fig. 2f). To compare the growth rates of systematic and literature interactome maps, we updated the literature binary interaction dataset to include information available up to 2013 (Supplementary Fig. 3). The number of high-quality binary literature PPIs has grown at a steady rate to ~10,000 interactions, whereas the growth of systematic datasets is punctuated with a small number of large-scale releases (Fig. 2g). The union of previous systematic datasets11,22,23,30 with HI-II-11 provides close to ~20,000 binary interactions which, combined with the high-quality binary literature pairs available as of 2013 (“Lit-BM-13”; Supplementary Table 9) provides ~30,000 individual PPIs. Collectively, this brings us to only an order of 7 Tasan et al. magnitude from the estimated complete interactome network11,28. To examine how the sparsely populated territory has evolved, we visualized adjacency matrices for literature-based and systematic maps at regular time points since 1994. Although the sparse territory of the literature map has been slowly populated with interactions, growth in this territory continues to lag behind that in the densely populated one (Fig. 2h and Supplementary Fig. 4). In contrast, current systematic binary maps appear to more uniformly cover the full interactome space. This finding suggests that heterogeneity in literature coverage across the space reflects uneven investigative efforts, and supports the existence of true biophysical interactions across the full human interactome space. Apparently sparse territories contain genuine biophysical interactions To more deeply explore the heterogeneous coverage of the human interactome, we compared HI-II-11 and Lit-BM-13 to a collection of ~25,000 predicted binary PPIs of highconfidence (“PrePPI-HC”)13 and a co-fractionation map of ~14,000 putative binary interactions (“Co-Frac”)37 (Fig. 3a). Although statistically significant (P < 1 " 10-60, Fisher’s exact test), the overlap of interactions between these four maps is small (Fig. 3b), and the number of interactors, or “degree”, of a protein in one map does not correlate with its degree in the other three maps (Supplementary Fig. 5). Each mapping strategy apparently probes different, perhaps complementary, territories of the underlying true interactome landscape. To investigate whether these four maps show similar sparse territories as did Lit-BM13 (Fig. 2h), we generated adjacency matrices for each map with respect to publication frequency (Fig. 3c). Similar to Lit-BM-13, the Co-Frac and PrePPI-HC maps exhibited a strong tendency to report interactions amongst well-studied proteins (Fig. 3c). A potential source of bias for Co-Frac towards well-studied genes arises from the literature-based computational methods used to identify the reported subset of direct PPIs (Supplementary 8 Tasan et al. Fig. 6)37. Similarly, the structural modelling approach used to generate PrePPI-HC relies on existing structures as templates for identifying new interactions and on additional genomic and functional data to enhance prediction specificity. In contrast, HI-II-11 covered the publication-ranked interactome space uniformly. Because the sensitivity of some interaction-mapping approaches depends on the level of gene expression, we examined interactome maps for expression-related sparse territories. The Co-Frac map shows a strong bias of interactions towards proteins encoded by highly expressed genes (Fig. 3c and Supplementary Fig. 7). The literature map also shows such a bias, likely due to the restricted use of common cell lines in the underlying experiments. In contrast, both HI-II-11 and PrePPI-HC exhibit a uniform interaction density across the full spectrum of expression levels, likely explained by the more uniform expression of proteins tested in Y2H and by the independence of the homology-based predictions from levels of expression. We next sought more broadly for sources of intrinsic bias that might influence the appearance of sparsely populated territories, such as biological or knowledge-based properties of proteins and their corresponding genes. For example, PrePPI-HC is virtually devoid of interactions between proteins lacking Pfam domains, which is consistent with conserved domains forming the basis of the prediction method used to produce PrePPI-HC (Fig. 3c). HI-II-11 appears to be depleted, but not devoid, of interactions amongst proteins containing predicted transmembrane helices (TMH) (Fig. 3c), consistent with postulated limitations of the Y2H assay38. The Co-Frac network also appears to be depleted in interactions involving TMH-containing proteins, which may be a consequence of membranebound proteins being filtered out during biochemical separations37. In all we examined 21 protein or gene properties, roughly classified as expression-, sequence-, and knowledge-based (Fig. 3d and Supplementary Table 10). We formally 9 Tasan et al. defined a “dense zone” and a “sparse zone” of interactions by determining the location where the imbalance of interaction density is the highest (Supplementary Table 11 and Supplementary Fig. 8). We considered dense and sparse zones only when this imbalance was significantly different and strongly deviating from a uniform distribution of interactions. Besides the dense and sparse zones already described, we uncovered other biases, such as HI-II-11 exhibiting an enrichment of interactions amongst proteins with extensive disordered regions, a bias suggested in earlier Y2H binary datasets39. No property resulted in the existence of dense and sparse zones in all four maps, suggesting that the observed imbalances reflect heterogeneous testing and technical limitations, rather than the existence of territories that are truly devoid of interactions. To confirm that HI-II-11 interactions found in the apparently sparse zones of the three other maps are of high biophysical quality, we assessed MAPPIT validation of HI-II-11 interactions across the interactome space ranked by the various protein properties examined. MAPPIT validation rates of dense and sparse zone pairs were consistent for nearly all properties (Fig. 3e and Supplementary Figure 9), indicating that HI-II-11 interactions throughout the full interactome space are of equivalent biophysical quality. In summary, current interactome maps contain vast uncharted territories that limit knowledge of the full interactome landscape, whereas HI-II-11 provides high-quality biophysical interactions within these under-explored regions. Functional heterogeneity in interactome maps One intuitive model is that all biophysical interactions of an organism correspond to functional interactions, so that the biophysical interactome network strictly overlaps with a collapsed version of all functional networks, in all cells and at all stages of development. An alternative model suggests that the biophysical interactome overlaps more loosely with functional 10 Tasan et al. networks, with not all biophysical interactions being biologically relevant11. To determine whether the four interactome maps contain functionally relevant PPIs, we compared them to functional networks where links between genes represent shared functional annotations such as Gene Ontology (GO) terms or shared phenotypic annotations inferred from mouse orthologs. All four interactome maps showed significant enrichments for the four types of functional annotation examined (Fig. 4a), confirming their overall biological relevance. Thus, our systematic biophysical map generated independently from pre-existing biological information can indeed reveal functional relationships. Segregating HI-II-11 interactions with respect to dense and sparse zones found in other maps, we observed consistent functional enrichments in both dense and sparse zones for all studied protein rankings (Fig. 4b), supporting that HI-II-11 interactions in sparsely populated territories of other maps are biologically relevant. To investigate which properties of biophysical interactions might explain cofunctionality, we categorized interactions according to their “detectedness”, i.e. how frequently they have been detected. We partitioned the high-quality Lit-BM-13 interactions between those with two, three, and four or more pieces of evidence. We observed a clear trend between detectedness and MAPPIT recovery rate, where pairs with the highest detectedness were recovered approximately three times more often than those with the lowest detectedness (Fig. 4c). There was a significantly higher functional enrichment for interactions having four or more pieces of evidence than those having two or three pieces of evidence, suggesting a possible correlation between how amenable interactions are to detection and functional relevance. However, this trend may result from a stronger inspection bias for interactions of higher detectedness, artificially increasing their observed functional relevance. To discriminate between these two possibilities, we partitioned our systematic binary 11 Tasan et al. interaction datasets into sets of increasing detectedness: HI-II-11 interactions found in one of the two initial screens, those found in both screens (Supplementary Table 5), and the subset of our first systematic map (HI-I-05)22 detected in two different binary assays, MAPPIT and Y2H (Supplementary Table 4). MAPPIT recovery rate and functional enrichment both increased with increasing detectedness (Fig. 4c), suggesting a relationship between detectedness and functional relevance in the human biophysical interactome (Fig. 4d). Measurements of overlap between literature and systematic datasets have been used to characterize the overall quality of the latter, with the presumption that small overlaps reflect poor quality40,41. An alternative explanation is that interactome networks are larger than originally assumed, with small overlaps resulting when each map detects only a small fraction of the whole42. We reasoned that dense zones of the literature datasets could be surrogates for the effective search space. Hence interacting protein pairs found in systematic screens that reside in literature sparse zones should not be taken into account to measure overlaps. Restricting the effective search space of both the systematic and literature maps, we observed a significant increase in overlap for pairs of increasing detectedness (Fig. 4e and Supplementary Notes). As detectedness increases, the fraction of interactions in the systematic maps found in the literature maps rises from 2% to 25%, suggesting that literature and systematic approaches converge. We therefore re-investigated estimates of the human binary interactome size using reference sets consisting of literature binary interactions with increasing detectedness. Our estimates vary between ~600,000 and ~200,000 interactions, within the range of previous estimates11,28 (Fig. 4f, Supplementary Table 12 and Supplementary Notes). The decrease of size estimates with reference set detectedness suggests that the interactome is heterogeneous with respect to detection and functional relevance. This result was robust when controlling for effective search space and matching detectedness of systematic and 12 Tasan et al. literature datasets. More importantly, these estimates suggest that current systematic and literature datasets are only about an order of magnitude away from the full interactome at every detectedness level. Systematic binary interactome networks and human disease Topological properties of nodes within interactome networks provide insights into disease10,43, though studies of disease based solely on literature maps might reach potentially biased conclusions because of inspection biases, as identified in the Lit-BM-13 map. The significantly higher connectivity in Lit-BM-13 of cancer-associated gene products from the Sanger Cancer Gene Census (“Cancer Census”)3 might be because they are more heavily studied. Here we confirm using an unbiased systematic map that cancer proteins do indeed have higher connectivity (P = 1 " 10-3, one-sided Wilcoxon test; Fig. 5a). To investigate the role interactome maps may play in identifying cancer genes, we tested the extent to which products of putative cancer genes identified in GWAS publications can be connected to known cancer-related proteins (Supplementary Table 13). In HI-II-11 proteins encoded by genes in cancer-associated loci interacted with known cancer proteins significantly more frequently than expected by chance, demonstrating how our map can be used to prioritize genes within large loci (Fig. 5b). We found similar results when using a range of linkage disequilibrium thresholds to define loci (Supplementary Fig. 10), and when using the Lit-BM-13 or PrePPI-HC maps (Supplementary Fig. 11). To further assess how well interactions detected without influence of inspection bias may serve to prioritize cancer-associated genes, we investigated alternative network-free genomic and functional genomic approaches for identifying genes likely to be involved in tumorigenesis. We looked at three candidate gene sets: (i) ~200 genes with heightened nonsynonymous and deleterious somatic mutation rates in cancers ("Mut"), (ii) ~350 human 13 Tasan et al. orthologs of mouse genes associated with tumorigenesis in Sleeping Beauty (“SB”) transposon-based mutagenesis screens, and (iii) ~400 human proteins found to interact with proteins from four DNA tumour virus families ("Vir")44. For the proteins present in HI-II-11, the overlap between Mut, SB, and Vir sets and a reference set of 128 Cancer Census genes enables us to evaluate the predictive ability of each approach (Fig. 5c). The frequency of interaction in HI-II-11 between a protein and members of the SB and Vir sets significantly increased predictive ability (SB: P = 1 " 10-31, Vir: P = 1 " 10-43, one-sided paired Wilcoxon rank-sum test; Fig. 5c), arguing for network-guided analyses of systematic cancer gene discovery efforts. We directly measured for each protein its corrected frequency of interaction with Cancer Census proteins, directly testing the “guilt-by-association” approach for transfer of function and annotation through PPIs. This method provided vastly improved predictions (SB: P = 1 " 10-115, one-sided paired Wilcoxon rank-sum test; Fig. 5c), and validates the existence of modularity in interactome networks: that proteins sharing similar phenotypes or function tend to directly interact with each other45. Instead of relying solely on Cancer Census annotation, we sought to enhance cancer-gene prioritization by integrating information both about (i) membership of a query protein in candidate lists (“guilt-by-profiling”) and (ii) that of its interacting partners (“guilt-by-association”). We constructed a forward-stepwise logistic regression model to combine these fundamentally node- and edge-centric data types, and found a marked improvement in ability to highly rank Cancer Census proteins (P = 1 " 10-164, one-sided paired Wilcoxon rank-sum test; Fig. 5c and Supplementary Table 14). Our top-ranked prediction was the cyclin-dependent kinase 4 (CDK4), a gene already in the Cancer Census list. CDK4 is connected in HI-II-11 to its inhibitors CDKN2C and CDKN2D, and to its associated G1/S-specific cyclins CCND1 and CCND3 (Fig. 5d). CDKN2C, CCND1, and CCND3 all have been causally implicated in tumorigenesis, while 14 Tasan et al. CDK4 is frequently mutated in cancerous tissue and serves as a viral target of the E7 protein of human papillomavirus and of the Large T protein of polyomavirus44. The second ranked and top novel prediction is the coat protein II-coated vesicle (COPII) component SEC24A, which associates with the COPII complex to mediate protein transport from the endoplasmic reticulum (ER) (Fig. 5d). Two neighbours of SEC24A in HI-II11 are the Cancer Census members Ewing Sarcoma Breakpoint Region 1 (EWSR1) and TRK-Fused Gene (TFG). Both contain a similar regulatory N-terminal region that is frequently present in oncogenic fusion proteins found in several types of sarcoma. Inhibition of TFG prevents protein secretion from the ER due to ER structural failure46 and depletion of the C. elegans ortholog TFG-1 leads to a decline in COPII levels at ER exit sites47, together arguing that TFG shares a function with SEC24A. MAPK1IP1L, a poorly studied protein found in literature-defined sparse zones of the three other interactome maps, interacts not only with the sarcoma-associated EWSR1 and TFG, but also with SS18L1, another sarcomaassociated Cancer Census gene. Thus, HI-II-11 offers compelling disease-related hypotheses in sparsely populated territories of the interactome. 15 Tasan et al. Discussion While PPI networks can shed light on complex genotype-phenotype relationships, help to identify relevant disease genes and unravel pathobiology mechanisms, the quality of current maps and their ability to thoroughly cover the human interactome landscape has to be carefully considered. Systematically screening half of the interactome space, independently from any inspection bias, we tripled the number of high-quality binary PPIs available from the literature. Covering regions of the human interactome landscape that are hardly touched by other approaches, the systematic binary map provides functional context to thousands of proteins, as demonstrated for candidates identified in unbiased cancer genetic screens. Systematic binary mapping therefore stands as a powerful tool providing necessary scaffold information to “connect the dots” of the genomic revolution. How far away is a complete binary interactome map? Combining high-quality binary pairs from the literature and systematic binary maps, 30,000 high-confidence interactions are now available, only one order of magnitude away from what is to be found. A more pertinent question might be whether all interactions are equally functional? In both systematic and literature maps, functional enrichment increases with detectedness, suggesting that detectedness can provide a means to better quantify how functional are individual interactions, to identify functional modules45, and to understand heterogeneity in interactome networks. However, detectedness may in part reflect variation in dissociation constants and interactions with lower dissociation constants are more readily detected and hence better characterized. Development of new, cost-effective systematic experimental approaches will help to complete the mapping of the human interactome and also further assess the relationships between detectedness, affinity and biological functionality. Reference binary interactome maps of increased coverage and quality are prerequisite to interpret condition-specific interactions48 and characterize the effects of splicing 16 Tasan et al. and genetic variation on interactions49. Just as a reference genome enabled detailed maps of human genetic variation6, completion of a reference interactome map will enable deeper insight into genotype-phenotype relationships in human. 17 Tasan et al. METHODS SUMMARY The systematic binary map was generated by screening all pairwise combinations of 15,517 ORFs from the hORFeome v5.1 collection. Interactome mapping was performed as described previously26 but with the following modifications: (i) the space was screened twice by Y2H to identify primary candidate interacting protein pairs, (ii) candidate interacting protein pairs were identified either by Sanger or by next generation sequencing30, (iii) candidate interacting protein pairs were verified in four independent retests and only those interactions that scored positive at least three times were retained, (iv) to confirm the quality of the full dataset, a subset of verified interactions were validated in three orthogonal assays MAPPIT, PCA and wNAPPA relative to a high-quality literature derived reference set and a random reference set of 460 and 698 pairs, respectively. The Lit-2010 and Lit-2013 maps were generated by extracting all human PPIs annotated with tractable publication records from seven databases15-21 through December 2010 and August 2013, respectively. PSI-MI experimental method codes were used to segregate the binary PPIs and the number of pieces of evidence were counted. All statistical tests were performed using implementations of the R software50. Empirical P values were obtained using randomizations detailed in the respective methods. Full details are provided in Supplementary Methods. 18 Tasan et al. FIGURE LEGENDS Figure 1 | An uncharted interactome territory in small-scale interaction maps. a, Understanding genotype-phenotype relationships through interactome maps of increasing completeness and quality. b, Flow chart (left) showing the assembly and segregation of protein-protein interactions from the literature. Different pieces of evidence correspond to different papers or methods. Histogram (right) of number of interactions for the full literature and different binary subsets. c, Fraction of Lit-BS-10 and Lit-BM-10 pairs recovered by MAPPIT at increasing RRS recovery rates (upper left). The shading indicates standard error of the proportion. Fraction of Lit-BS-10, Lit-BM-10 or RRS pairs recovered by MAPPIT at 1% RRS recovery rate (lower left), not found to co-occur in the literature as reported in the STRING database (upper right), and recovered by Y2H (lower right). P values are based on two-sided Wilcoxon rank sum tests. d, Fraction of full proteome and putative diseaseassociated gene products with protein-protein interactions in Lit-2010, Lit-BM-10, and systematic binary screens. Error bars indicate standard error of the proportion. Data sources are cited in Supplementary Information. e, Adjacency matrix showing Lit-BM-10 interactions, with proteins ordered symmetrically along vertical and horizontal axes by number of publications (upper histograms), and grouped in bins of ~350 proteins with median number of publications per bin shown. The colour intensity of each square reflects the total number of interactions between proteins in each bin. Total number of interactions per bin (lower histogram). Number of gene products from GWAS loci, Mendelian disease genes, and Sanger Cancer Gene Census genes per bin (circles). Figure 2 | A second-generation binary interactome map. a, Improved empirical parameters from first-generation to second-generation interactome mapping. b, Experimental 19 Tasan et al. pipeline for identifying high-quality binary protein-protein interactions. ORF: Open-reading frame, IST: Interaction Sequence Tag. c, Fraction of HI-II-11, Lit-BM-10, and RRS pairs recovered by MAPPIT (left), PCA (centre), and wNAPPA (right) at increasing assay scores. Shading indicates standard error of the proportion. d, Fraction of protein pairs in the PDB that HI-II-11 detected or missed as a function of number of residue-residue contacts. Vertical dashed lines indicate distribution medians. P value is based on a one-sided Wilcoxon rank sum test. Schematic of residue-residue contact count (bottom). e, Schematic of domain pair enrichment scoring method (top). Normalized frequency of all (left) and the top 25 (right) domain pairs. 3did: database of three-dimensional interacting domains. f, Network showing all possible domain combinations between HI-II-11 protein pairs involving the top 25 enriched domain pairs (left). Example of enriched domain pair corresponding to a known direct interaction (right). g, Total number of binary interactions in binary literature and systematic binary interactome maps over the past 20 years. h, Adjacency matrices showing binary literature and systematic binary maps at selected time points, with proteins ordered by number of publications. Colour intensity reflects the total number of interactions between proteins in each bin of ~350 proteins. Figure 3 | Apparent sparse zones contain genuine biophysical interactions. a, Network representations of HI-II-11, Lit-BM-13, Co-Frac, and PrePPI-HC interactome maps. b, Topological properties of the four interactome maps. Venn diagrams illustrating the number of shared proteins (upper) and interactions (lower) in the four maps. c, Adjacency matrices for HI-II-11, Lit-BM-13, Co-Frac, and PrePPI-HC maps, with proteins ordered by number of publications, mRNA abundance in HEK cells, fraction of protein sequence covered by Pfam domains, and fraction of protein sequence covered by transmembrane helices. Median values of the protein properties for each bin of ~350 proteins are shown (top histograms). 20 Tasan et al. Colour intensity reflects the total number of interactions between proteins in each bin. Lower histograms show total number of interactions per bin. d, Interaction density imbalances for 21 protein properties in the four maps. e, MAPPIT recovery rate of HI-II-11 pairs found in dense and sparse zones of Lit-BM-13, Co-Frac, and PrePPI-HC at 1% RRS recovery. Error bars indicate standard error of the proportion. Figure 4 | Functional heterogeneity in interactome maps. a, Functional enrichment of PPIs in the four interactome maps. Error bars indicate 95% confidence intervals. BP: Biological process, MF: Molecular function, CC: Cellular component. b, Functional enrichment of HI-II-11 interactions found in dense and sparse zones of Lit-BM-13, Co-Frac, and PrePPI-HC. Error bars indicate 95% confidence intervals. c, MAPPIT recovery rate of subsets of systematic binary and binary literature interactions of increasing detectedness at 1% RRS rate (left histograms; literature: Lit-BM-13 with 2, 3, and 4 or more pieces of evidence; systematic: HI-II-11 found in one screen, two screens, and HI-I-05 found in two assays). Error bars indicate standard error of the proportion. Functional enrichment of systematic binary and binary literature interactions of increasing detectedness (right histograms). Error bars indicate 95% confidence intervals. P values are based on two-sided Fisher’s exact tests (* means P ! 0.05). d, Model for how interaction detectedness may correlate with functionality. e, Fraction of shared interactions between the binary literature and systematic binary maps of increasing detectedness, controlling or not for their effective search spaces, i.e. space I and II for the systematic maps and the dense zones for the literature maps. f, Binary human interactome size estimated using literature reference sets of increasing detectedness (top left), or using reference sets and systematic binary maps of matched detectedness and controlling for their effective search spaces (top right). Total 21 Tasan et al. number of high-quality interactions at increasing detectedness in the union of binary literature and systematic binary maps (bottom). Figure 5 | Systematic binary interactome networks and human disease. a, Average degree of Sanger Cancer Gene Census (Cancer Census) proteins and proteins not associated to any disease in all four maps. P values are based on Wilcoxon rank sum test. b, Schematic representing interactions between proteins encoded by genes in cancerassociated GWAS loci and known Cancer Census proteins (top left). Fraction of cancerrelated GWAS loci containing at least one gene encoding a protein that interacts with a Cancer Census protein in 1,000 randomly selected loci genes (histogram) and the observed value (arrow) for HI-II-11 (top right). Empirical P value is shown. Network representing proteins encoded by genes in cancer-associated GWAS loci interacting with Cancer Census proteins in HI-II-11, and an example network after randomizing genes (bottom). c, Schematic of the cancer association scoring system (top). Predictive ability of separate and combined models for cancer association scoring (box plot). P values are based on one-sided paired Wilcoxon rank sum test. d, The 20 proteins with the highest cancer-association scores according to the combined model. For four cancer candidates (bold) their characteristics and their interacting partners in HI-II-11 are indicated. * Cancer Census proteins. 22 Tasan et al. REFERENCES 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514-517 (2005). Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362-9367 (2009). Futreal, P. A. et al. A census of human cancer genes. Nature Rev. Cancer 4, 177-183 (2004). Forbes, S. A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 39, D945-950 (2011). Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546-1558 (2013). 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65 (2012). Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nature Rev. Genet. 12, 745-755 (2011). Chin, L., Hahn, W. C., Getz, G. & Meyerson, M. Making sense of cancer genomic data. Genes Dev. 25, 534-555 (2011). Vidal, M., Cusick, M. E. & Barabási, A.-L. Interactome networks and human disease. Cell 144, 986-998 (2011). Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to human disease. Nature Rev. Genet. 12, 56-68 (2011). Venkatesan, K. et al. An empirical framework for binary interactome mapping. Nature Methods 6, 83-90 (2009). International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931-945 (2004). Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556-560 (2012). Dreze, M. et al. 'Edgetic' perturbation of a C. elegans BCL2 ortholog. Nature Methods 6, 843-849 (2009). Bader, G. D., Betel, D. & Hogue, C. W. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31, 248-250 (2003). Chatr-Aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816-823 (2013). Salwinski, L. et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 32, D449-451 (2004). Prasad, T. S. K. et al. Human Protein Reference Database-2009 update. Nucleic Acids Res. 37, D767-772 (2009). Kerrien, S. et al. The IntAct molecular interaction database in 2012. Nucleic Acids Res. 40, D841-846 (2012). Licata, L. et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40, D857-861 (2012). Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235-242 (2000). Rual, J.-F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173-1178 (2005). Stelzl, U. et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell 122, 957-968 (2005). Cusick, M. E. et al. Literature-curated protein interaction datasets. Nature Methods 6, 39-46 (2009). Eyckerman, S. et al. Design and application of a cytokine-receptor-based interaction trap. Nature Cell Biol. 3, 1114-1119 (2001). Dreze, M. et al. High-quality binary interactome mapping. Methods Enzymol. 470, 281-315 (2010). von Mering, C. et al. STRING: a database of predicted functional associations between proteins. 23 Tasan et al. 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Nucleic Acids Res. 31, 258-261 (2003). Stumpf, M. P. et al. Estimating the size of the human interactome. Proc. Natl Acad. Sci. USA 105, 6959-6964 (2008). Yang, X. et al. A public genome-scale lentiviral expression library of human ORFs. Nature Methods 8, 659-661 (2011). Yu, H. et al. Next-generation sequencing to generate interactome datasets. Nature Methods 8, 478-480 (2011). Braun, P. et al. An experimentally derived confidence score for binary protein-protein interactions. Nature Methods 6, 91-97 (2009). Ramachandran, N. et al. Next-generation high-density self-assembling functional protein arrays. Nature Methods 5, 535-538 (2008). Nyfeler, B., Michnick, S. W. & Hauri, H. P. Capturing protein interactions in the secretory pathway of living cells. Proc. Natl Acad. Sci. USA 102, 6350-6355 (2005). Wass, M. N., David, A. & Sternberg, M. J. Challenges for the prediction of macromolecular interactions. Curr. Opin. Struct. Biol. 21, 382-390 (2011). Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290-301 (2012). Stein, A., Ceol, A. & Aloy, P. 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 39, D718-723 (2011). Havugimana, P. C. et al. A census of human soluble protein complexes. Cell 150, 1068-1081 (2012). Stagljar, I. & Fields, S. Analysis of membrane protein interactions using yeast-based technologies. Trends Biochem. Sci. 27, 559-563 (2002). Bjorklund, A. K., Light, S., Hedin, L. & Elofsson, A. Quantitative assessment of the structural bias in protein-protein interaction assays. Proteomics 8, 4657-4667 (2008). Bader, G. D. & Hogue, C. W. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnol. 20, 991-997 (2002). von Mering, C. et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403 (2002). Futschik, M. E., Chaurasia, G. & Herzel, H. Comparison of human protein-protein interaction maps. Bioinformatics 23, 605-611 (2007). Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685-8690 (2007). Rozenblatt-Rosen, O. et al. Interpreting cancer genomes using systematic host network perturbations by tumour virus proteins. Nature 487, 491-495 (2012). Mitra, K., Carvunis, A.-R., Ramesh, S. K. & Ideker, T. Integrative approaches for finding modular structure in biological networks. Nature Rev. Genet. 14, 719-732 (2013). Beetz, C. et al. Inhibition of TFG function causes hereditary axon degeneration by impairing endoplasmic reticulum structure. Proc. Natl Acad. Sci. USA 110, 5091-5096 (2013). Witte, K. et al. TFG-1 function in protein secretion and oncogenesis. Nature Cell Biol. 13, 550558 (2011). Zhang, J. & Lautar, S. A yeast three-hybrid method to clone ternary protein complex components. Anal. Biochem. 242, 68-72 (1996). Zhong, Q. et al. Edgetic perturbation models of human inherited disorders. Mol. Syst. Biol. 5, 321 (2009). R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2013). Supplementary information is available in the online version of the paper. 24 Tasan et al. Acknowledgements: The authors wish to acknowledge members of the Center for Cancer Systems Biology (CCSB) and particularly H. Yu for helpful discussions. This work was supported by the following grants and agencies: NHGRI grants R01HG001715 to M.V., D.E.H., F.P.R. and J.T. and P50HG004233 to M.V., F.P.R. and A.-L.B.; NHLBI grants U01HL098166 subaward to M.V. and U01HL108630 subaward to A.-L.B.; NCI grant U54CA112962 subaward to M.V.; NICHD ARRA R01HD065288 to L.M.I.; NIMH R01MH091350 to L.M.I. and T.H.; NSF Grant CCF-1219007 to Y.X.; Krembil Foundation, Canada Excellence Research Chair, Ontario Research Fund–Research Excellence Award, Canadian Institute for Advanced Research Fellowship to F.P.R.; grant CSI07A09 from Junta de Castilla y Leon (Valladolid, Spain), grant PI12/00624 from Ministerio de Economia y Competitividad (AES 2012, ISCiii, Madrid, Spain,) and grant i-Link0398 from Consejo Superior de Investigaciones Científicas (CSIC, Madrid, Spain) to J.D.L.R.; Spanish Ministerio de Ciencia e Innovación (BIO2010-22073) and the European Commission through the FP7 project SyStemAge grant agreement n:306240 to P.A.; Group-ID Multidisciplinary Research Partnerships of Ghent University and grant FWO-V G.0864.10 from the Fund for Scientific Research Flanders to J.T.; I.L. is a postdoctoral fellow with the FWO-V; Institute Sponsored Research funds from the Dana-Farber Cancer Institute Strategic Initiative to M.V. M.V. is a "Chercheur Qualifié Honoraire" from the Fonds de la Recherche Scientifique (FRS-FNRS, Wallonia-Brussels Federation, Belgium). Author Contributions I.L., N.S., S.Y., Q.Z., X.Y., L.G., D.B., P.B., M.P.B., D.C.-Z., R.C., E.D., M.D., A.D., F.G., B.J.G., R.K., A.M., R.R.M., M.M.P., X.R., J.R., P.R., V.R., J.S., An.S., Ak.S., S.T., S.A.T., J.-C.T., K.V. and J.W. performed experiments or contributed new reagents. M.T., T.R., S.J.P., B.C., Ce.F., R.M., A.K., S.G., A.-R.C., Ch.F., E.F., M.J., S.K., G.N.L., J.M., Am.S., Y.S. and Y.X. performed computational analyses. M.T., T.R., S.J.P., B.C., M.E.C., J.D.L.R., M.A.C., D.E.H., T.H., F.P.R. and M.V. wrote the manuscript. A.-L.B., L.I., P.A., J.D.L.R., J.T., M.A.C., D.E.H., T.H., F.P.R. and M.V. designed or advised research. Author Information The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to M.A.C. ([email protected]). D.E.H. ([email protected]), T.H. ([email protected]), F.P.R. ([email protected]) and M.V. ([email protected]). 25 Tasan et al. 1a Mendelian traits Complex traits Genotype-phenotype relationships Underlying interactome View from available maps Completeness and quality of maps Gene or gene product Biophysical interaction Functional interaction relevant for a phenotype of interest Gene or gene product with no characterized interactions Uncharacterized interaction 26 Tasan et al. Lit-BS-10 27 0 em at -1 st M Sy 01 t-2 Li Lit-BM-10 ic 0 0 Pieces of evidence 1 2 3 4+ 9,056 2,646 1,013 1,247 10 10 Binary 5,871 t-B Non-binary 0 20 S- Binary 13,962 30 t-B Non-binary 42,781 Systematic 5,871 40 Li Non-systematic 56,743 50 Li 7 public databases (Lit-2010) (PPIs with valid evidence) 62,163 Number of interactions (thousands) 1b Tasan et al. 1c STRING literature mining Fraction of pairs not found to co-occur (%) MAPPIT assay Fraction of pairs recovered (%) 30 25 20 15 10 5 0 0 1 2 3 4 100 P < 2 × 10-16 80 60 40 20 0 5 Fraction of RRS pairs recovered (%) Y2H assay P = 1 × 10-17 20 Fraction of pairs recovered (%) Fraction of pairs recovered (%) 20 P = 5 × 10-4 15 10 P = 0.03 5 P = 0.09 15 10 5 0 0 RRS pairs Binary literature pairs: Lit-BS-10 Lit-BM-10 28 P = 5 × 10-15 P = 8 × 10-3 Tasan et al. 1d Fraction of gene products with reported interactions (%) 100 80 60 40 20 0 Whole proteome (n = 19,111) Well-accepted cancer driver (n = 124) Sanger Cancer Gene Census (n = 465) Mendelian disorder gene (n = 2,826) Systematic cancer screen gene (n = 2,996) GWAS loci gene for any trait (n = 2,883) 100 80 60 40 20 0 Lit-2010 Lit-BM-10 Systematic binary screens 29 Tasan et al. Number of publications 1e 1,000 100 10 1 Lit-BM-10 Total number of interactions Number of interactions 0 10 1,000 500 0 Sanger Cancer Gene Census products GWAS loci gene products Mendelian disorder gene products 30 Protein count 128 32 8 Tasan et al. 2a Parameters Methods 7K Completeness Relative gain 19K 13K 19K 3.1X space 7K 19K Interactions Sampling sensitivity 19K Interactions 13K 1 Assay sensitivity Precision PRS RRS 1.6X sensitivity 1 2 PRS RRS 1.3X sensitivity 92 PRS 92 RRS 1 assay ~500 PRS ~700 RRS 3 assays 2005 2011 31 > 5X reference sets Tasan et al. 2b Screen (2X) Verification Validation ~13,000 proteins hORFeome 5.1 ~66,000 ORF pairs 4x pairwise retested ~800 protein pairs selected at random ~82,000,000 protein pairs tested Sequence verification of 3x and 4x positives Tested in 3 binary orthogonal assays ~35,000 IST pairs identified ~15,000 gene pairs verified Recovery compared to Lit/RRS pairs HI-II-11 ~14,000 interactions between ~4,300 proteins 32 Tasan et al. 2c Fraction of pairs recovered (%) MAPPIT 30 PCA wNAPPA 30 30 25 25 20 20 15 15 15 10 10 10 5 5 5 0 0 0 HI-II-11 Lit-BM-10 25 RRS 20 1 2 3 4 5 1 2 3 4 5 Score threshold (arbitrary units) 33 1 2 3 4 5 Tasan et al. 2d Fraction of protein pairs (%) P = 1.3 × 10-10 Detected (189) Missed (498) 8 6 4 2 0 0 50 100 150 200 1XPEHURIUHVLGXHïUHVLGXHFRQWDFWV 34 Tasan et al. 2e Domain pairs in HI-II-11 interacting proteins 1,000 randomized HI-II-11 networks ? Normalized frequency 60% precision 'RPDLQSDLUVQ Normalized frequency 0 60 80 Three-dimensional structures of domain-based interactions Normalized frequency 0 60 80 Synaptobrevin :: SNARE SCAN :: SCAN PX :: Reticulon Hormone_recep :: Hormone_recep Syntaxin :: Synaptobrevin Arm :: Ran_BP1 6NSB32=)ïEo[ïOLke 6NS)ïEo[ïOLke SNARE :: SNARE HorPRQHBUHFHS]Iï& Syntaxin :: Syntaxin CARD :: BIR ]Iï&]Iï& IQ :: EF_hand_5 6&$10\EB'1$ïELQGB Ras :: PBD 3'=/ 84BFRQ]Iï5,1*B LIM :: LIM_bind bZIP_1 :: bZIP_1 Syntaxin :: SNARE GAGE :: BACK CH :: CH Filament :: Filament UQ_con :: IBR 35 Domain pairs: ,QGLG 1RWLQGLG Tasan et al. 2f Protein Possible domain pairs for HI-II-11 interactions: Predicted and in 3did (TP = 182) Predicted and not in 3did (FP = 81) Not predicted and in 3did (FN = 91) Not predicted and not in 3did (TN = 318) IQCE KIAA1683 IQGAP3 CALM1 CALM2 Protein domains EF_hand_5 CH IQ RasGAP_C 36 CAMTA2 CALML3 RasGAP CG_1 TIG Tasan et al. Number of binary interactions (thousands) 2g 20 Binary literature (multiple pieces of evidence) Systematic binary screens 15 10 5 0 Time 1994 2000 2005 37 2010 2013 Tasan et al. 2h Space ordered by number of publications 1994 1998 2002 2006 Number of interactions 0 10 Systematic binary maps Binary literature maps 38 2010 2013 Time Tasan et al. 3a HI-II-11 Lit-BM-13 Co-Frac PrePPI-HC Protein Protein-protein interaction 39 Tasan et al. 3b HI-II-11 Lit-BM-13 Co-Frac PrePPI-HC Nodes 4,303 5,545 2,999 4,989 Edges 13,944 11,045 13,982 25,403 Nodes in largest component 4,100 4,919 2,820 3,681 Mean degree 6.348 3.772 9.324 10.183 Mean clustering 0.053 0.065 0.164 0.399 Mean distance 4.063 5.305 4.506 5.251 Power law exponent 2.195 1.659 1.462 1.451 Proteins (heterodimers) Lit-BM-13 Co-Frac HI-II-11 1,491 893 499 701 1,766 276 269 253 416 299 94 PrePPI-HC 1,870 489 1,214 336 Protein-protein interactions (heterodimers) Lit-BM-13 Co-Frac HI-II-11 9,128 13,347 195 255 13,045 14 27 287 92 9 10 27 40 40 PrePPI-HC 24,410 435 Tasan et al. 3c Number of publications Number of interactions 0 10 HI-II-11 Lit-BM-13 Co-Frac PrePPI-HC mRNA abundance in HEK cells 1,000 1,000 100 100 10 10 1 1 HI-II-11 Lit-BM-13 Co-Frac Total number of interactions PrePPI-HC 500 0 2,000 1,000 0 3,000 2,500 2,000 1,500 1,000 500 0 2,500 2,000 1,500 1,000 500 0 41 Fraction of sequence in transmembrane helices Fraction of sequence in Pfam domains 100 50 50 25 0 0 Tasan et al. HII Lit I-11 -B Co M-1 -F 3 Pre rac PP I-H C 3d mRNA abundance in HEK cells mRNA abundance in HeLa cells Protein abundance in HEK cells Protein abundance in HeLa cells Number of publications First publication date Number of protein complexes Number of pathways Number of GO associations Fraction of sequence in Pfam domains Fraction of basic amino-acids Fraction of charged amino-acids Fraction of acidic amino-acids Fraction of polar amino-acids Fraction of polar and uncharged amino-acids Fraction of hydrophobic amino-acids Fraction of sequence in transmembrane helices Number of binding sites Fraction of sequence in disordered regions Number of linear motifs ORF length Enrichment Balanced representation Depletion 42 Abundance Knowledge Sequence Tasan et al. 3e Number of publications (mirrored from Lit-BM-13) Dense zone Sparse zone Fraction of pairs recovered in MAPPIT (%) Mirrored zones in HI-II-11 20 15 10 5 0 43 mRNA abundance in HEK cells (mirrored from Co-Frac) Fraction of sequence in Pfam domains (mirrored from PrePPI-HC) Tasan et al. 4a 1,000 HI-II-11 100 10 1 Lit-BM-13 1,000 10 1 1,000 Co-Frac 100 10 1 1,000 PrePPI-HC 100 10 M en e ou On se tol ph ogy en ot yp e U s ni on BP M F C C 1 G Odds ratio 100 44 Tasan et al. 4b Mirrored zones in HI-II-11 GO MF GO CC 1,000 100 10 1 1,000 100 10 1 Union Odds ratio 1,000 100 10 1 Mouse phenotypes GO BP Dense zone Sparse zone Number of publications (mirrored from Lit-BM-13) 1,000 100 10 1 1,000 100 10 1 45 mRNA abundance Fraction of sequence in HEK cells in Pfam domains (mirrored from Co-Frac) (mirrored from PrePPI-HC) Tasan et al. * * 10 RRS 20 Odds ratio 30 0 100 20 10 * * 10 * * 1 MF CC Mouse phenotypes Union 1,000 100 * * 10 1 0 * * * BP Odds ratio * * * Gene Ontology 100 30 * 1,000 Literature map detectedness RRS MAPPIT recovery (%) MAPPIT recovery (%) 4c Systematic map detectedness 46 Tasan et al. 4d “True” biophysical interaction network Pseudointeractions Functional interactions Detectedness Fraction of interactions 47 Tasan et al. 4e Fraction of systematic map in literature map (%) Fraction of literature map in systematic map (%) Detectedness Lit-BM-13 map (pieces of evidence) Systematic map 2 3 HI-II-11 in 1 screen HI-II-11 in 2 screens HI-I-05 in 2 assays 10 10 10 8 8 6 6 4 4 4 2 2 2 0 0 0 10 10 40 8 8 30 8 6 P = 2 × 10-3 6 4 P = 9 × 10-5 6 P = 1 × 10-6 4 P = 1 × 10-3 20 2 2 10 0 0 0 Effective search space: 48 P = 5 × 10-6 Not controlled Controlled Tasan et al. 4f Detectedness Binary interactome size (thousands) 800 600 400 200 0 Lit-BM-13 map (pieces of evidence) Systematic map Controlled for systematic map 2 3 Controlled for both maps 4 HI-II-11 2 3 4 2 2 HI-II-11 HI-I-05 (screens) (assays) Mapped binary interactions (thousands) Effective search space 49 50 40 30 20 10 0 Tasan et al. 5a Knowledge based P = 1 × 10-7 Average degree 20 15 10 Systematic P = 0.01 P = 0.2 P = 1 × 10-31 5 0 ac 3 -1 t- BM Li Co 1 -1 C r -F e Pr P -H PI I I-I H Cancer Census proteins Proteins not in OMIM or Cancer Census 50 Tasan et al. Sanger Cancer Gene Census Cancer-associated GWAS loci and 400 Frequency in random networks 5b 300 P = 0.02 200 100 0 0.0 0.2 0.4 0.6 Fraction of loci with cancer candidates Gene product in HI-II-11 Interaction in HI-II-11 MIR4646 MSH5-SAPCD1 LY6G6F CDSN PSORS1C2 PSORS1C1 CCHCR1 SRSF2 CSNK2B CLIC1 HSPA1B HSPA1A APOM C6orf47 HSPA1L C6orf48 SNORA38 VARS LY6G5B LY6G6D BAG6 ABHD16A LY6G6E PRRC2A LY6G6C LY6G5C VWA7 C6orf25 GPANK1 LSM2 C15orf55 DDX6 DDAH2 SNORD48 SNORD52 TPM3 ZBTB38 REL LMO1 XPA FLI1 TRIM27 PDE4DIP LASP1 LOC401242 TNFRSF6B ZGPAT CTBP2 STAMBPL1 CHRNA5 CHRNA3 ARHGEF5 ARFRP1 RTEL1-TNFRSF6B LIME1 RTEL1 ACTA2 FAS IKZF1 CDK6 C9orf53 DOCK5 KCTD9 FARP2 CDCA2 GNRH1 HDLBP PPP2R2A EBF2 ANKLE1 CDKN2B-AS1 CDKN2A MTAP MIR4296 BABAM1 SEPT5 SEPT2 PSMA4 AGPHD1 SLBP CEP57L1 TMEM129 SESN1 ARMC2 ASPSCR1 TACC3 CDKN2B SEPT6 Randomized genes Gene in GWAS locus whose product: Is present in HI-II-11 Is absent from HI-II-11 GWAS locus Protein in Sanger Cancer Gene Census HI-II-11 interaction 51 Tasan et al. 5c HI-II-11 neighbor data “guilt-by-association” SB Mu Vir t SB Mu Vir t Ce n su s Query node data only “guilt-by-profiling” Protein A Protein P Protein B Protein C 0 0 1 2 In the set Not in the set correct for frequencies accross random networks Cancer association score SB = Sleeping Beauty transposon-based mouse cancer screen Mut = Somatic mutations screen in cancer tissues Vir = Tumor virus host targets Census = Sanger Cancer Gene Census “Guilt-byprofiling” Mut Vir P = 1 × 10-43 SB P = 1 × 10-31 Mut “Guilt-byassociation” Vir SB Census P = 1 × 10-115 P = 1 × 10-164 Combined model 5 10 15 20 Average precision of predicted cancer proteins (%) 52 Tasan et al. SB Mu Vir t Ce ns us * 5d Top 20 ranked proteins SB Mu Vir t CCND1* CCND3* CDKN2C* CDK4* CDK4* 43 SEC24A 35 CDKN2D STAT3 33 HOOK1 GRB2 32 CDKN2C* 31 1 0 1 3 MAPK1IP1L 27 EWSR1* SEC24A TFG* 1 0 1 2 HNRNPM 23 ATPAF2 22 ARHGEF16 DDX6* 21 C1orf94 NAGK 20 CEP55 NFE2L2* 20 EWSR1* CDKAL1 20 HGS MAPK1IP1L SH2D4A 25 AP2B1 19 HNRNPUL1 ARMC8 17 RBFOX2 PSME4 17 SS18L1* KRAS* 16 SUMO1 SUMO1P1 TFG* UBE2I WAC 16 TAX1BP1 15 NEUROG3 15 VPS37C 4 1 4 3 TCF12* NEUROG3 REL* 2 0 1 1 53 In the set Not in the set