for prediction of interactions on a larger scale is not feasible. Zhang et al. proposed to overcome this limitation to some extent by deriving homology models for proteins without available 3D structures. Homology models were derived for an additional ∼3,600 yeast proteins using the ModBase (http://modbase.compbio.ucsf.edu/) [Pieper et al. 2006] and Skybase (http://skybase.c2b2.columbia.edu/pdb60_new/struct_show.php) [Mirkovic et al. 2007] databases. Given a query protein, these databases predict the most likely 3D structure for the protein based on its sequence similarity with templates built from proteins with available 3D structures. The final set of structurally predicted interactions from the Zhang et al. study is available in the PrePPI database (http://bhapp.c2b2.columbia.edu/PrePPI/). Struct2Net (http://groups.csail.mit.edu/cb/struct2net/webserver/) [Singh et al. 2010] uses a structure-threading approach to predict interactions between proteins. Given two protein sequences, Struct2Net “threads” the sequences to known 3D structures from PDB, and then based on the best-matching structures, estimates the interaction between the two proteins. PredictProtein (http://www.predictprotein.org/) [Yachdav et al. 2014] combines structure and GO-based methods to predict new interactions. Wang et al. [2012] curated a 3D-structure resolved dataset of 4,222 high-quality human PPIs enriched for human disease genes by examining relationships between 3,949 genes, 62,663 mutations, and 3,453 associated with human disorders.
Literature Mining. Interactions missed in PPI datasets but have direct or indirect reference in scientific publications can be identified by mining the literature. For example, these references may include abstracts or full-texts of publications maintained in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) by the National Center for Biotechnology Information (NCBI). Text-mining tools based on natural language processing (NLP) and other machine-learning techniques mine for co-occurrence of protein names in these literature sources, and proteins frequently referenced together can be predicted to interact. For example, the PubGene tool, which is a part of COREMINE (http://www.coremine.com/medical/), mines for information on genes and proteins including their co-occurrences in abstracts of publications, their sequence homology, and association with cell processes and diseases. This information can be used to predict interactions between frequently co-occurring proteins. Similarly, iHOP (http://www.ihop-net.org/UniPub/iHOP/) [Hoffman and Valencia 2004] presents a network of genes and proteins that co-occur in the scientific literature.
Limitations of Computationally Predicted Protein Interactions. Despite their promise in enhancing experimentally curated interaction datasets, computational methods have their own limitations. For example, methods that rely on (high-throughput) experimental datasets to predict new interactions—e.g., PPI topology-based methods—carry an inherent bias in their predictions toward proteins already accounted for or enriched in these datasets. Moreover, these predictions are also affected by biological and technical noise (spurious interactions) in experimental datasets. The methods that are based on genomic distances, fusion of genes, and co-transcription of (operonic) genes are applicable only to a selected subset of species (prokaryotes), since most eurkaryotic systems do not exhibit these properties so strongly. Knowledge-based methods—e.g., those based on the GO graph, 3D structures, and literature abstracts—are restricted by the amount and kind of available information, and are again applicable to only a subset of proteins. As a result of these limitations, accurate prediction of physical interactions between proteins is a challenging problem, and most methods in fact predict only functional associations instead of direct physical interactions between proteins. Depending on the application, these functional associations can turn out to be more useful or less useful.
However, despite these limitations, computational techniques are orthogonal to and can effectively complement experimental techniques. Computational predictions can be used to enhance the topologies of PPI networks—for example, by “densifying” sparse network regions or piecing together disconnected proteins or regions of the network—which in turn can improve downstream applications of PPI networks including protein complex prediction.
3
Computational Methods for Protein Complex Prediction from PPI Networks
All models are wrong, but some models are useful.
—George Box (1919–2013), British statistician (as quoted in Box [1979])
The process of identifying protein complexes from high-throughput interaction datasets involves the following steps [Spirin and Mirny 2003, Srihari and Leong 2012b, Srihari et al. 2015a] (Figure 1.1):
1. integrating high-throughput datasets from multiple sources and assessing the confidence of interactions;
2. constructing a reliable PPI network using only the high-confidence interactions;
3. identifying modular subnetworks from the network to generate a candidate list of complexes; and
4. evaluating the identified complexes against bona fide complexes, and validating and assigning roles to novel complexes.
In this chapter, we review some of the representative methods developed to date for computational prediction of protein complexes from PPI networks.
3.1 Basic Definitions and Terminologies
A protein-protein interaction (PPI) network is modeled as an undirected graph G = 〈V, E〉, where V is the set of proteins, and E ⊆ V × V is the set of interactions between the proteins. We also use V(G) and E(G) to refer to the set of proteins and interactions of a (sub-)network G, respectively. For a protein v ∈ V, N(v) or Nv is the set of immediate neighbors of v, deg(v) = |N(v)| = |Nv| is the degree of v, and E(v) is the set of interactions in the immediate neighborhood of v. The interaction density of G (or its subnetwork) is defined as:
which gives a real number in [0, 1] quantifying the “richness of interactions” within G: 0 for a network without any interactions and 1 for a fully connected network.
The clustering coefficient CC(v) measures the “cliquishness” for the neighborhood of v, and is given by:
If the interactions of the network are scored (weighted), i.e., G = 〈V, E, w〉, then the weighted versions—weighted degree, weighted interaction density, and weighted clustering coefficients—are given as follows:
There are other variants for these definitions proposed in the literature; see Kalna and Higham [2007] and Newman [2010] for a survey.
3.2 Taxonomy of Methods for Protein Complex Prediction
Although at a generic level most methods make the basic assumption that protein complexes are embedded among densely connected subsets of proteins within the PPI network, these methods vary considerably in their algorithmic strategies or in the biological information that they employ to detect complexes. Accordingly, we classify protein complex detection methods into the following two categories [Srihari and Leong 2012b, Srihari et al. 2015a, Li et al. 2010]:
1.