Sriganesh Srihari

Computational Prediction of Protein Complexes from Protein Interaction Networks



Organism         Interactions    Proteins
E. coli                    99         104
H. sapiens            230,843      20,006
M. musculus            18,465       8,611
R. norvegicus           4,537       3,328
S. cerevisiae          82,327       6,278
S. pombe                9,492       2,944
X. laevis                 532         471

      Based on BioGrid version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].

      Despite differences in procedures and technologies, the different experimental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detect mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detect even transient and between-complex interactions. Based on BioGrid version 3.4.130 (November 2015) (http://thebiogrid.org/) [Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ~82,300 in S. cerevisiae and ~230,800 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped.

      The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply, PPI network. The PPI network presents a global or “systems” view of the interactome, and provides a mathematical (topological) framework to analyze these interactions. Protein complexes are expected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely connected subsets of proteins separated by less-dense regions in the network [Newman 2004, Newman 2010]. Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Computational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising due to the shortcomings highlighted above in experimental protocols, severely restrict the feasibility of accurately predicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:

      1. presence of a large number of spurious (noisy) interactions;

      2. relative paucity of interactions between “complexed” proteins; and

      3. missing contextual—e.g., temporal and spatial—information about the interactions.

      These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:

      1. difficulty in detecting sparse complexes;

      2. difficulty in detecting small complexes (containing fewer than four proteins) and sub-complexes; and

      3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cellular contexts.
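      To make the first challenge concrete, the sketch below computes the interaction density of a candidate set of proteins, i.e., the fraction of possible protein pairs that are actually linked in the network. This is only an illustration: the function name `density` and the toy protein names are assumptions introduced here, and density is just one of several ways to quantify how "dense" a subnetwork is.

```python
def density(proteins, interactions):
    """Fraction of possible protein pairs within `proteins` that interact."""
    proteins = set(proteins)
    n = len(proteins)
    if n < 2:
        return 0.0
    # Keep only distinct interactions with both endpoints inside the set;
    # frozenset makes (A, B) and (B, A) count as the same interaction.
    internal = {
        frozenset(pair)
        for pair in interactions
        if len(set(pair)) == 2 and set(pair) <= proteins
    }
    return 2 * len(internal) / (n * (n - 1))


# Toy network with hypothetical protein names (not real data).
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

dense_complex = density({"A", "B", "C"}, interactions)   # every pair interacts
sparse_complex = density({"C", "D", "E"}, interactions)  # one pair is missing
```

      A complex whose members are all mutually connected scores 1.0, whereas a complex with missing internal interactions scores lower and is therefore harder to separate from the surrounding network on density alone.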

      While the interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between the datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interaction has to be assessed (confidence scoring) and low-confidence interactions have to be first removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):

      1. integrating interactions from multiple experiments and stringently assessing the confidence (reliability) of these interactions;

      2. constructing a reliable PPI network using only the high-confidence interactions;

      3. identifying modular subnetworks from the PPI network to generate a candidate list of protein complexes; and

      4. evaluating these candidate complexes against bona fide complexes, and validating and assigning roles for novel complexes.

      Figure 1.1 Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after discarding low-confidence (potentially spurious) interactions. (b) Candidate protein complexes are predicted from this PPI network using network-clustering approaches. The quality of the predicted complexes is validated against bona fide complexes, whereas novel complexes are functionally assessed and assigned new roles where possible.
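      The four steps above can be sketched end to end on a toy example. The code below is a minimal illustration under stated assumptions, not a method from the literature: the confidence scores, the 0.5 threshold, and the protein names are invented for the example; connected components stand in for a real network-clustering algorithm; and the Jaccard index is used here as one simple way to compare a candidate with a reference complex.

```python
from collections import defaultdict


def filter_interactions(scored, threshold):
    """Steps 1-2: keep interactions whose confidence meets the threshold."""
    return [(a, b) for a, b, score in scored if score >= threshold]


def build_network(interactions):
    """Step 2: represent the PPI network as protein -> set of neighbors."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def connected_components(network):
    """Step 3 (stand-in for clustering): groups of mutually reachable proteins."""
    seen, components = set(), []
    for start in network:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(network[node] - component)
        seen |= component
        components.append(component)
    return components


def jaccard(a, b):
    """Step 4: overlap between a candidate and a reference complex."""
    return len(a & b) / len(a | b)


# Hypothetical scored interactions: (protein, protein, confidence).
scored = [
    ("A", "B", 0.90), ("B", "C", 0.80), ("A", "C", 0.85),
    ("C", "D", 0.20),  # low-confidence link bridging two groups
    ("D", "E", 0.90), ("E", "F", 0.70), ("D", "F", 0.75),
]
# Hypothetical reference ("bona fide") complexes.
references = [{"A", "B", "C"}, {"D", "E", "F", "G"}]

network = build_network(filter_interactions(scored, threshold=0.5))
candidates = connected_components(network)
best_scores = [max(jaccard(c, ref) for ref in references) for c in candidates]
```

      In this toy case, removing the single low-confidence interaction between C and D is what separates the two candidate complexes; without filtering, the whole network would form one component.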

      As we shall see in the following chapters, several sophisticated approaches have been developed over the years to overcome some of the above-mentioned challenges.

      Computational methods have co-evolved with proteomics technologies, and over the last ten years a plethora of computational methods have been developed to predict complexes from PPI networks, which is the subject of this book. In general, computational methods complement experimental approaches in several ways. These methods have helped counter some of the limitations arising in proteomic studies, e.g., by eliminating spurious interactions via interaction scoring, and by enriching true interactions via prediction of missing interactions. The novel interactions and protein complexes predicted from these methods have been added back to proteomics databases, and these have helped to further enhance our resources and knowledge in the field.

      Several high-quality resources for protein complexes have been developed over the years, covering both lower-order model organisms and higher-order organisms (summarized in Table 1.3). Together, Aloy [Aloy et al. 2004], CYC2008 [Pu et al. 2009], and MIPS [Mewes et al. 2008] contain over 450 manually curated complexes from S. cerevisiae (budding yeast). CORUM [Ruepp et al. 2008, 2010] contains ∼3,000 mammalian complexes, of which ∼1,970 are protein complexes identified from human cells. The European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI) maintain a database of manually curated protein complexes from 18 different species including C. elegans, H. sapiens, M. musculus, S. cerevisiae, and S. pombe [Meldal et al. 2015].

      Havugimana et al. [2012] present a dataset of 622 putative human soluble protein complexes (http://human.med.utoronto.ca/) identified using high-throughput AP/MS pulldown and PPI-clustering approaches. Huttlin et al. [2015] present 352 putative human complexes identified from human embryonic kidney (HEK293T) cells (http://wren.hms.harvard.edu/bioplex/). Wan et al. [2015] present a catalog of conserved metazoan complexes (http://metazoa.med.utoronto.ca/) identified by clustering of high-quality