non-homologous proteins that co-evolve are under evolutionary pressure to conserve their interactions and to maintain their co-functioning ability [Shoemaker and Panchenko 2007, Sun et al. 2005].
Co-Evolution of Interacting Proteins. Interacting proteins often co-evolve: a change in one protein of a pair that would lead to a loss of function or interaction is compensated by a correlated change in the other protein [Shoemaker and Panchenko 2007]. This co-evolution is reflected in the similarity between the phylogenetic protein trees (or simply, protein trees) of non-homologous interacting protein families. A protein tree represents the evolutionary history of a protein family, i.e., of proteins that diverged from a common ancestor. When reconciled with their species trees, these protein trees have their internal nodes annotated with speciation and duplication events [Vilella et al. 2009]. TreeSoft (http://treesoft.sourceforge.net/treebest.shtml) provides a suite of tools to build and visualize protein trees. The similarity between two protein trees can be computed by aligning the corresponding distance matrices so as to minimize the difference between the matrix elements: the smaller the difference between the matrices, the stronger the co-evolution between the two protein families. Interactions are predicted between proteins corresponding to the aligned columns of the two matrices. The similarity between two protein trees is also influenced by the speciation process and, therefore, there is a certain background similarity between any two protein trees, irrespective of whether the proteins interact or not. Statistical approaches exist to correct for this background (phylogenetic subtraction) [Harvey and Pagel 1991, Harvey et al. 1995]. It is also worth noting that a protein can have multiple partners, and so taking into consideration its co-evolution with all its partners further enhances the accuracy of the interaction prediction [Juan et al. 2008].
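The matrix-alignment idea can be illustrated with a minimal sketch. It assumes two symmetric distance matrices of equal size (at least 2 x 2), given as nested lists, and it searches all column permutations exhaustively, which is feasible only for very small families; practical methods rely on heuristic search. The function names and the use of the root-mean-square difference as the cost are assumptions made for this illustration, not the procedure of any cited method.

```python
from itertools import permutations
from math import sqrt


def rmsd_after_permutation(d_a, d_b, perm):
    """Root-mean-square difference between d_a and d_b after relabeling.

    perm[i] is the row/column of d_b matched to row/column i of d_a.
    Only the upper triangle is compared, since the matrices are symmetric.
    """
    n = len(d_a)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = d_a[i][j] - d_b[perm[i]][perm[j]]
            total += diff * diff
            count += 1
    return sqrt(total / count)


def align_distance_matrices(d_a, d_b):
    """Return the permutation of d_b that best matches d_a, and its cost.

    A smaller cost suggests stronger co-evolution; protein i of family A
    is then paired with protein perm[i] of family B.
    """
    n = len(d_a)
    best = min(permutations(range(n)),
               key=lambda p: rmsd_after_permutation(d_a, d_b, p))
    return best, rmsd_after_permutation(d_a, d_b, best)
```

In this sketch the returned cost plays the role of the "difference between the matrices," and the returned permutation gives the aligned columns from which interactions would be predicted. No correction for background similarity is applied here.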
Gene Fusion. Gene fusion is a common event in evolution, wherein two or more genes in one species fuse into a single gene in another species. Gene fusion is a result of duplication, translocation, or inversion events that affect coding sequences during the evolution of genomes. Therefore, gene fusions play an important role in determining the gene (and genomic) architecture of species. Gene fusions may occur to optimize co-transcription of genes involved in the fusion: by fusing two or more genes, it may be easier to transcribe these genes as a single entity, thus resulting in a single protein product. Typically, proteins coded by these fused genes in a species carry multiple functional domains, which originate from different proteins (genes) in the ancestor species. Therefore, one may infer interactions between these individual proteins in the ancestor species: it is likely that these proteins are partners in performing a particular function and they interact in the ancestor species, and that gene fusion has occurred in another species to optimize the transcription and to produce a single multidomain protein [Marcotte et al. 1999]. These fused proteins are referred to as chimeric or Rosetta Stone proteins [Marcotte et al. 1999]. The Rosetta Stone approach [Enright and Ouzounis 2001, Suhre 2007] infers protein interactions by detecting fusion events between protein sequences across species. In E. coli, this approach identified 6,809 putative interacting pairs of proteins, wherein both proteins from each pair had significant sequence similarity to a single (fused) protein from at least one other species (genome). The analysis of these interacting pairs revealed that, for more than half of these pairs, both the proteins were functionally related [Marcotte et al. 1999].
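The detection step behind the Rosetta Stone idea can be sketched as follows. The sketch assumes that sequence-similarity hits have already been computed by some external search tool and are supplied as tuples (query protein, candidate fused protein in another genome, start, end), where start and end delimit the aligned region on the candidate. The tuple layout, the function name, and the non-overlap criterion are assumptions for this illustration; requiring non-overlapping regions is one simple way to avoid pairing two proteins that merely match the same domain.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_stone_pairs(hits, max_overlap=0):
    """Find query-protein pairs that align to distinct regions of one protein.

    hits: iterable of (query, candidate, start, end), with 1-based inclusive
          coordinates of the aligned region on the candidate protein.
    Returns {(query_1, query_2): set of candidate proteins linking them}.
    """
    by_candidate = defaultdict(list)
    for query, candidate, start, end in hits:
        by_candidate[candidate].append((query, start, end))

    pairs = defaultdict(set)
    for candidate, regions in by_candidate.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Number of shared positions; zero or negative means disjoint.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs[tuple(sorted((q1, q2)))].add(candidate)
    return dict(pairs)
```

Each key in the result is a putative interacting pair in the query genome, and the associated set lists the candidate fused proteins that support it.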
PPI Network Topology. The pattern of interactions between proteins in a PPI network is informative about how proteins interact, and provides a way to predict new interactions. For example, if a pair of proteins have many common neighbors in the PPI network, then most likely the two proteins in the pair and their common neighbors are involved in the same or similar function(s). Therefore, one may infer a direct physical interaction between the two proteins based on the number of neighbors and/or functions the two proteins share. Chua et al. [2006] used the FS Weight interaction-scoring approach in this manner to predict interactions between level-2 neighbors (connected via one other protein) in the PPI network. This is based on the observation that level-2 neighbors in the PPI network show the same or similar annotations for functions and/or cellular compartment, and therefore these are more likely to interact compared to random pairs of proteins in the network. These FS-weighted predicted interactions between level-2 neighbors are added back to the PPI network after removing low-weighted interactions. Using the same rationale, one can predict new interactions using other topology-based (common-neighbor counting) schemes including the Dice coefficient [Zhang et al. 2008] and Iterative CD [Liu et al. 2008]. Likewise, the geometric embedding model [Pržulj et al. 2004, Higham et al. 2008] can also be used to predict new interactions: proteins that are ϵ-close in the geometric embedding of the PPI network are more likely to interact compared to random pairs of proteins and proteins that are farther than ϵ-distance away in the embedding.
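As an illustration of common-neighbor scoring, the following minimal sketch scores level-2 neighbors with a Dice coefficient, taken here as 2|N(u) ∩ N(v)| / (|N(u)| + |N(v)|), where N(x) is the set of direct neighbors of x. This is the plain Dice coefficient and not the FS Weight scheme; some formulations also include the proteins themselves in their neighborhoods, which this sketch does not. The function names and the threshold parameter are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set of neighbors}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def dice(adj, u, v):
    """Dice coefficient of the direct neighborhoods of u and v."""
    total = len(adj[u]) + len(adj[v])
    return 2 * len(adj[u] & adj[v]) / total if total else 0.0


def predict_level2(edges, threshold):
    """Score non-adjacent pairs that share a neighbor; keep high scorers."""
    adj = build_adjacency(edges)
    scores = {}
    for hub in list(adj):
        # Any two neighbors of the same hub are at most two steps apart.
        for u, v in combinations(sorted(adj[hub]), 2):
            if v not in adj[u]:
                scores[(u, v)] = dice(adj, u, v)
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

For a network with edges A-C, A-D, B-C, B-D, and B-E, calling `predict_level2(edges, 0.7)` keeps the pairs (A, B) and (C, D), whose neighborhoods overlap heavily, and discards (C, E) and (D, E).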
Functional Features. Interacting proteins are often involved in the same or similar functions. Therefore, if a pair of proteins are annotated with the same or similar functions, one could, with some degree of accuracy, infer a physical interaction between the two proteins. This is often referred to as “guilt by association,” which refers to the principle that genes or proteins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. This inference can be further enhanced by combining other evidence that supports their functional similarity—for example, if the genes coding for the two proteins are located close by on the genome or are transcribed as an operonic unit (for prokaryotes) [Dandekar et al. 1998, Kumar et al. 2002], or the coding genes are co-transcribed or co-expressed [Huynen et al. 2000, Bowers et al. 2004, Jansen et al. 2002], or show similar phylogenetic profiles [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. Proteins within the same protein complex (co-complexed proteins) show a strong tendency to share functions and cellular localization and therefore physically interact. On the other hand, proteins from different cellular compartments most likely do not meet and therefore do not interact in vivo during their lifetimes. Jansen et al. [2003] used interactions between co-complexed proteins from the MIPS protein complex catalog [Mewes et al. 2006] as the positive training set, and non-interacting pairs of proteins as the negative training set, in a Bayesian framework, to predict new interactions in yeast. Blohm et al. [2014] present a dataset, the “Negatome,” of protein pairs that are highly unlikely to interact, which can be used as a negative training set.
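The way a Bayesian framework combines such evidence can be sketched with likelihood ratios. The sketch assumes a naive Bayes setting in which features are treated as conditionally independent, each feature takes a small number of discrete values, and the positive and negative training sets are given simply as lists of observed feature values. The function names, the Laplace-style pseudocount, and the feature names in the usage note are assumptions for the illustration and are not taken from the cited study.

```python
from collections import Counter


def likelihood_ratios(pos_values, neg_values, pseudocount=1.0):
    """Estimate P(value | interacting) / P(value | non-interacting).

    pos_values / neg_values: feature values observed for the protein pairs
    in the positive and negative training sets, respectively.
    """
    values = set(pos_values) | set(neg_values)
    pos_counts, neg_counts = Counter(pos_values), Counter(neg_values)
    k = len(values)
    ratios = {}
    for v in values:
        p_pos = (pos_counts[v] + pseudocount) / (len(pos_values) + pseudocount * k)
        p_neg = (neg_counts[v] + pseudocount) / (len(neg_values) + pseudocount * k)
        ratios[v] = p_pos / p_neg
    return ratios


def posterior_odds(prior_odds, feature_ratios, observed):
    """Multiply the prior odds by the likelihood ratio of each observation.

    feature_ratios: {feature_name: {value: likelihood_ratio}}
    observed:       {feature_name: value} for the pair being scored
    """
    odds = prior_odds
    for feature, value in observed.items():
        odds *= feature_ratios[feature][value]
    return odds
```

A pair would be predicted to interact when its posterior odds exceed a chosen cutoff, for example after calling `posterior_odds(0.01, ratios, {"coexpressed": True, "same_compartment": True})` with ratios estimated from the two training sets.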
The Gene Ontology graph [Ashburner et al. 2000] integrates information on the functional and localization properties of proteins, and therefore provides a way to predict new interactions. For example, the TCSS approach by Jain and Bader [2010] can be used to compute similarity between pairs of proteins using the GO graph, and protein pairs showing high GO-semantic similarity can be predicted to physically interact. Likewise, multiple pieces of experimental and functional information can be combined to predict new interactions. For example, GeneMANIA (http://www.genemania.org/) [Warde-Farley et al. 2010] combines experimentally detected interactions from BioGRID [Stark et al. 2011, Chatr-Aryamontri et al. 2015], pathway annotations from Pathway Commons (http://www.pathwaycommons.org/) [Cerami et al. 2011], and information on evolutionary conservation of interactions from the Interologous Interaction Database (I2D) [Brown and Jurisica 2005], along with GO-based similarity, to predict new interactions (GeneMANIA and I2D are also listed in Table 2.5). HumanNet [Lee et al. 2011] is a human functional interaction network which includes predicted interactions based on guilt by association for genes involved in human diseases.
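To give a feel for graph-based functional similarity, the following sketch computes a deliberately simple measure: the Jaccard overlap between the ancestor-closed sets of terms annotating two proteins. This is not the TCSS algorithm, which is considerably more refined; it is a stand-in chosen for brevity. The term identifiers in the usage note are hypothetical placeholders rather than real GO identifiers, and the `parents` mapping is assumed to have been extracted from the ontology beforehand.

```python
def ancestor_closure(terms, parents):
    """Return the given terms together with all of their ancestors.

    parents: {term: iterable of direct parent terms} describing the DAG.
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def go_overlap_similarity(terms_a, terms_b, parents):
    """Jaccard overlap of the ancestor-closed annotation sets (0.0 to 1.0)."""
    a = ancestor_closure(terms_a, parents)
    b = ancestor_closure(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With a toy graph in which hypothetical terms `a1` and `a2` share the parent `a`, and `a` and `b` share the parent `root`, two proteins annotated with `a1` and `a2` score higher than two proteins annotated with `a1` and `b`, reflecting their closer position in the graph.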
Structural Information on Proteins. 3D structures of proteins provide first-hand evidence for protein interaction sites and binding surfaces of proteins. Therefore, by assessing the compatibility of the binding surfaces of two proteins, one can predict whether the two proteins interact or not. For example, Zhang et al. [2012, 2013] analyzed 3D structures of proteins from the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do) [Berman et al. 2000], a database which stores 3D structures for over 600 of the ∼6,000 characterized yeast proteins (∼10%), to predict new interactions between proteins in yeast. However, since the 3D structures are available for