Sriganesh Srihari

Computational Prediction of Protein Complexes from Protein Interaction Networks


Скачать книгу

from each protein in the network. Let pr(s) be the steady-state probability distribution of a random walk with restart probability α, where the starting vector s gives the probability distribution upon restarting. Then, pr(s) is the unique solution for the linear system given by

Image

      The starting vector here is set as follows:

Image

      Therefore, pr(su) is the steady-state probability distribution of a random walk that always restarts at u. Let pr(su)[v] represent the steady-state probability value that a protein v has in the distribution vector pr(su). This can be thought of as the probability contribution that u makes to v, and is denoted as pr(uv). This provides the affinity between u and v when the walks always restart at u. The final affinity between u and v is computed as the minimum of pr(uv) and pr(vu).

      Pržulj et al. [2004] and Higham et al. [2008] argue that PPI networks are best modeled by geometric random graphs instead of scale-free or small-world graphs, as suggested by earlier works [Watts and Strogatz 1998, Barabási 1999]. A geometric random graph G = 〈V, ϵ〉 with radius ϵ is a graph with node set V of points in a metric space and edge set E = {(u, v) : u, vV ; 0 < ||uv|| ≤ ϵ}, where ||.|| is an arbitrary distance norm in this space. The authors show that a PPI network can be represented as a geometric random graph by embedding the proteins of the PPI network in metric spaces R2, R3, or R4, and finding an ϵ such that any two proteins are connected in the PPI network if and only if the proteins are ϵ-close in the chosen metric space. This embedding can be used to not only weight all interactions, but also predict new interactions between protein pairs and remove spurious interactions from the PPI network: Interactions between proteins that are farther than ϵ-distance away in the metric embedding are pruned off as noise, whereas non-interacting protein pairs that are ϵ-close are predicted to be interacting.

      The embedding of the PPI network in a chosen metric space is described briefly as follows. Given a PPI network of N proteins, the interaction weights for protein pairs (i, j) are interpreted as pairwise distances dij, and the task is to find locations in m-dimensional Euclidean space (vectors Image in Rm) for these proteins so that the pairwise distances are preserved, i.e., Image for all i, j. In PPI networks with only {0, 1} connectivity information, the length of the shortest path between proteins in the network is used in lieu of the Euclidean distance. Przulj et al. suggest that using the square root of the path length, Image, where pathij denotes the path length between i and j, is a good function for the purpose.

      Conditional probabilities p(dij|interact) and p(dij|noninteract) are learned for distances dij from the embedding. Here, p(dij|interact) is the probability density function describing the distances between pairs of proteins which are known to interact, and p(dij|noninteract) is the probability density function describing the distances between pairs of proteins which do not interact in the dataset (and are known to not interact). Given a distance threshold δ, these probabilities are used to compute the posterior probabilities p(interact|dijδ) and p(noninteract|dijδ) for protein pairs to interact or not interact. For each protein pair (i, j) within the δ-threshold, the weight of the interaction between i and j is then estimated as

Image

      A threshold on this estimated weight is applied to remove false-positive interactions from the network.

       Combining Confidence Scores

      Interactions scored high by all or a majority of confidence-scoring schemes (consensus interactions) are likely to be true interactions. Therefore, a simple majority-voting scheme counts the number of times each interaction is assigned a high score (above the recommended cut-off) by each scheme, and considers only the interactions scored high by a majority of the schemes.

      Chua et al. [2009] integrated multiple scoring schemes using a naïve Bayesian approach. For an interaction (u, v) that is assigned scores pi(u, v) by different schemes i (assuming the scores are in the same range [0,1]), the combined score can be computed as 1 − Πi(1 − pi(u, v)). However, even within the same range [0,1], usually different scoring schemes tend to have different distribution of scores they assign to the interactions. Some schemes assign high scores (close to 1) to most or a sizeable fraction of the interactions, whereas other schemes are more conservative and assign low scores to many interactions. To account for the variability in distributions of scores, it is important to consider the relative ranking of interactions within each scoring scheme instead of their absolute scores.

      Chua et al. [2009] therefore proposed a ranked-based combination scheme, which works as follows. For each scheme i, the scored interactions are first binned in increasing order of their scores: The first 100 interactions are placed in the first bin, the second 100 interactions in the second bin, and so on. For each bin k in scheme i, a weight p(i, k) is assigned based on the number of interactions from the bin that match known interactions from an independent dataset:

Image

      While combining the scores for interaction (u, v), the Bayesian weighting is modified to: 1 − Π(i,k)∈D(u,v)(1 − p(i, k)), where D(u, v) is the list of scheme-bin pairs (i, k) that contain (u, v) across all schemes. This ensures that, irrespective of the scoring distribution of the schemes, if an interaction belongs to reliable bins, it is assigned a high final score.

      Yong et al. [2012] present a supervised maximum-likelihood weighting scheme (SWC) to combine PPI datasets and to infer co-complexed protein pairs. The method uses a naïve Bayes maximum-likelihood model to derive the posterior probability that an interaction (u, v) is a co-complexed interaction based on the scores assigned to (u, v) across multiple data sources. These data sources include PPI databases, namely BioGrid [Stark et al. 2011], IntAct [Hermjakob et al. 2004, Kerrien et al. 2012], MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and STRING [Von Mering et al. 2003, Szklarczyk et al. 2011], and evidence from cooccurrence of proteins in the PubMed literature abstracts (http://www.ncbi.nlm.nih.gov/pubmed). The set of features is the set of these data sources, and a feature F has value f if proteins u and v are related by data source F with score f, else f = 0. The features are discretized using minimum-description length supervised discretization [Fayyad and Irani 1993]. Using a reference set of protein complexes, each (u, v) in the training set is given a class label co-complex if both u and v are in the same complex, otherwise it is given the class label non-co-complex. The maximum-likelihood parameters are learned for the two classes,

Image

      where Nc is the number of interactions with label co-complex, Nc, F = f is the number of interactions with label co-complex and feature value F = f, and likewise for interactions with label non-co-complex, N¬c, F = f′. After learning the maximum-likelihood model, the score for each interaction is computed as the posterior probability of being a co-complex interaction based on the naïve Bayes assumption