and the prior probability πT of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their “interaction abundance” is proportional to the product XiXj of their spectral counts Xi and Xj. To compute P(Xij|True), the spectral counts Xi and Xj are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same principle is applied to compute P(Xij|False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction P(True|Xij). The interactions are then ranked in decreasing order of their probabilities, and a threshold is used to select the most likely true interactions.
Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a comparative methodology to assign scores to proteins identified within parallel proteomic datasets. It constructs a stats table X[k × m] where each cell X[i, j] = Xi, j is the total spectral count (TSC) for an interactor j (arranged as m rows) in an experiment i (arranged as k columns). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins to be candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally regardless of the number of replicates or their TSCs. Let X̄j be the average TSC for interactor j across all the experiments,
The Z-score is computed as
where σj is the standard deviation for the spectral counts of interactor j across the experiments. The D-score improves the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales Xi, j as
and p is the number of replicate runs in which the interactor is present. A D-score distribution is generated using a simulated random dataset, and a D-score threshold DT is determined below which 95% of this randomized data falls. A normalized D-score is then computed using this threshold as
Evidence-Based Schemes
These schemes use external or independence evidence to estimate confidence of interactions in the PPI dataset. For example, these evidence may include Gene Ontology (GO) annotations [Ashburner et al. 2000, Mi et al. 2013] for protein functions and localization (compartmentalization), and co-complex memberships in validated complexes. Some of the methods are learning-based; for example, given a (training) set of known interacting pairs and GO annotations, the methods learn (train on) the conditional probability distribution for interacting pairs with and without similar GO annotations (e.g., similar functions or localization), and using this learned distribution, the methods estimate the probability of interaction for the proteins in each pair. In Krogan et al.’s study [Krogan et al. 2006], a machine learning approach using Bayesian networks and C4.5-decision trees trained on validated physical interactions and functional evidence—co-occurrence in manually curated complexes from MIPS—was used to estimate confidence for protein pairs in a spoke-modeled experimental dataset. Collins et al. [2007] developed Purification Enrichment (PE) scoring to generate a “Consolidated network” using the matrix-modeled relationships from Gavin et al. and Krogan et al. datasets. The PE scoring is based on a Bayes classifier trained on manually curated co-complexed protein pairs, GO annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles. The Consolidated network was shown to be of high quality, comparable to that of PPIs derived from low-throughput experiments.
In other methods, explicit learning may not be involved, but instead the evidence is directly used to assess the interaction confidence of protein pairs. For example, Resnick’s measure [Resnick 1995] for computing the semantic similarity between annotation terms has been adopted to compute confidence based on GO annotations between the proteins [Xu et al. 2008, Pesquita et al. 2008, Jain and Bader 2010]. Specifically, the semantic similarity between two ontology terms (S, T) having a set A of common ancestors in the GO graph is given as the information content,
where p(A) is the fraction of proteins annotated to term A and all its descendants in the GO graph. Suppose that proteins u and v are annotated to sets of GO terms S and T, respectively. Then the semantic similarity between u and v is defined as the maximum information content (Resnick’s measure) of the set S × T,
The interaction confidence between u and v is then estimated as sim(u, v).
GO graphs tend to be unbalanced with some paths containing more details (depth) compared to others, which stems from the complex biological structure of the GO annotations. However, this creates a bias against terms that do not represent such complex structures, i.e., terms that do not have sufficient depth in the GO graph. To account for this topological imbalance of the GO graph, Jain and Bader [2010] developed Topological Clustering Semantic Similarity (TCSS) which collapses subgraphs that define similar concepts. Terms that are lower down the GO tree have higher information content (i.e., more specific) than the terms at higher levels (i.e., less specific). A cut-off is used to identify subgraphs—subgraph root terms and all their children—with high information content. Since GO terms often have multiple parents, it is likely that this results in overlapping subgraphs. Overlaps between subgraphs are removed in two steps: edge removal by transitive reduction and term duplication. In the edge-removal step, a reduction is performed on the subgraphs: If nodes u and v are connected both via a directed edge u → v as well as a directed path u → w1, …, wk → v, then a transitive reduction is performed to preserve only u → v. After this step, if a term still belongs to more than one subgraph, then the term and all its descendents are duplicated across the subgraphs. The similarity between two proteins is measured using this reduced GO graph using Resnick’s similarity between their GO terms, as described above.
Topology-Based Schemes
These schemes analyze the topology of the PPI network—usually in the immediate neighborhood of each protein pair—to estimate interaction confidence for the protein pairs. If the proteins in an interacting pair have many common neighbors in the network, the proteins and their shared neighbors have similar functions and/or are co-localized [Batada et al. 2004], and it is likely that the observed interaction between the proteins is true. This can be understood from the following simple example. Two proteins need to be localized to the same compartment to interact physically. Let us assume that a PPI screen with a false-positive rate of p reports