Mol Cell Proteomics. 2011 Dec;10(12):M111.012500.
doi: 10.1074/mcp.M111.012500. Epub 2011 Aug 29.

Categorizing biases in high-confidence high-throughput protein-protein interaction data sets

Xueping Yu et al. Mol Cell Proteomics. 2011 Dec.

Abstract

We characterized and evaluated the functional attributes of three yeast high-confidence protein-protein interaction data sets derived from affinity purification/mass spectrometry, protein-fragment complementation assay, and yeast two-hybrid experiments. The interacting proteins retrieved from these data sets formed distinct, partially overlapping sets with different protein-protein interaction characteristics. These differences were primarily a function of the experimental technologies used to recover these interactions. This affected the total coverage of interactions and was especially evident in the recovery of interactions among different functional classes of proteins. We found that the interaction data obtained by the yeast two-hybrid method were the least biased toward any particular functional characterization. In contrast, interacting proteins in the affinity purification/mass spectrometry and protein-fragment complementation assay data sets were over- and under-represented among distinct and different functional categories. We delineated how these differences affected protein complex organization in the network of interactions, in particular for strongly interacting complexes (e.g. RNA and protein synthesis) versus weakly and transiently interacting complexes (e.g. protein transport). We quantified methodological differences in detecting protein interactions from larger protein complexes, in the correlation of protein abundance among interacting proteins, and in the connectivity of essential proteins. In the latter case, we showed that minimizing inherent methodological biases removed many of the ambiguous conclusions about protein essentiality and protein connectivity. We used these findings to rationalize how biological insights obtained by analyzing data sets originating from different sources sometimes do not agree or may even contradict each other.
An important corollary of this work was that discrepancies in biological insights did not necessarily imply that one detection methodology was better or worse, but rather that, to a large extent, the insights reflected the methodological biases themselves. Consequently, interpreting the protein interaction data within their experimental or cellular context provided the best avenue for overcoming biases and inferring biological knowledge.

Figures

Fig. 1.
Functional diversity among proteins and interactions present in the high-confidence data sets. A, The relative distribution of all proteins from Saccharomyces cerevisiae, as annotated according to the Munich Information Center for Protein Sequences functional categories, is shown by the gray line labeled “Expected.” It denotes the relative fraction (coverage) of all yeast proteins that belong to a given category. The deviations of the AP/MS, PCA, and Y2H high-confidence data sets from this distribution are shown by the different colored bars, e.g. proteins labeled “Metabolism” are relatively under-represented in the AP/MS data set. B, The average degree of the proteins in a given functional category for each high-confidence data set. C, The relative distribution of protein-protein interactions in the different functional categories for each high-confidence data set.
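The comparison in panel A can be illustrated with a minimal sketch. It assumes that each protein carries exactly one functional category and that the deviation is the simple difference between the observed and expected fractions; the caption does not state either point, and the function names and toy data below are hypothetical.

```python
from collections import Counter

def category_fractions(proteins, annotation):
    """Fraction of annotated proteins that fall into each category."""
    counts = Counter(annotation[p] for p in proteins if p in annotation)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}

def coverage_deviation(dataset_proteins, annotation):
    """Observed minus expected fraction per category.

    "Expected" is the distribution over all annotated proteins
    (assumption: the annotation dict stands in for the whole proteome).
    Negative values mean under-representation in the data set.
    """
    expected = category_fractions(annotation.keys(), annotation)
    observed = category_fractions(dataset_proteins, annotation)
    return {cat: observed.get(cat, 0.0) - expected[cat] for cat in expected}

# Hypothetical toy proteome with one category per protein.
toy_annotation = {
    "P1": "Metabolism",
    "P2": "Metabolism",
    "P3": "Transcription",
    "P4": "Protein synthesis",
}
toy_dataset = {"P2", "P3", "P4"}
deviation = coverage_deviation(toy_dataset, toy_annotation)
```

In this toy case "Metabolism" comes out negative, which corresponds to the kind of under-representation the caption describes for the AP/MS data set.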
Fig. 2.
High-confidence data set coverage of the protein-protein interaction matrix. We have mapped each interaction in the AP/MS, PCA, and Y2H data sets to the yeast proteome protein-protein interaction matrix defined by all possible binary interactions. The proteins were ordered according to their Munich Information Center for Protein Sequences functional categories. We have indicated the location of proteins belonging to the categories of metabolism (ME), cell cycle (CC), transcription (TR), protein synthesis (PS), protein fate (PF), protein binding (PB), and cellular transport (CT). We have enlarged the symbols of each interaction to make the differences among the data sets and functional categories more visible. The different methodologies retrieved different interaction sets influenced by the underlying experimental platform, e.g. AP/MS recovered tightly bound protein complexes associated with transcription, protein synthesis, and protein binding, whereas PCA recovered many more weakly bound interactions, e.g. those involved in cellular transport.
Fig. 3.
Organization of interactions and biological processes in the AP/MS data set. The top row illustrates the effect of increasing the fidelity of the network representation by decreasing the false positive rate (FPR) associated with the interactions. The data set commensurate with a 5% false positive rate was designated as the high-confidence AP/MS data set in this work. The bottom row shows the projection of Munich Information Center for Protein Sequences (MIPS) high-level annotations color coded for different “Function,” “Location,” and “Complex” categories. We assigned each protein in the network only one of its MIPS function annotation item(s) to maximize the number of homogeneous interactions of the network using a Monte Carlo algorithm (see Methods). We also outlined selected major biological processes in these interaction maps. The complete color scheme and annotations for the “Complex” annotations are provided as an interactive and viewable map in the Supporting Material.
Fig. 4.
Complex size and intra- and interfunction detection. A, The relative distribution of sizes of Munich Information Center for Protein Sequences complexes in the Saccharomyces cerevisiae proteome is shown by the gray line labeled “Expected.” The y axis shows the fraction of proteins in a given data set that is associated with a specific complex size. We have indicated the corresponding distributions for proteins found in the complexes of each high-confidence data set by different symbols. Whereas both the PCA and Y2H data sets lacked representations among the large-sized clusters, the AP/MS data set roughly followed the “Expected” distribution derived from all yeast proteins. B, The influence of the false positive rate on the fraction of intra- and interfunction protein interactions detected in the AP/MS data set. For reference, we have indicated by arrows the PCA and Y2H values for intra- and interfunction annotation fractions from Table I on the y axis.
Fig. 5.
The “Complex” annotated protein interaction map. Known complexes from the Munich Information Center for Protein Sequences (MIPS) database and their biological relations were apparent in the reconstructed protein-protein interaction network derived from the high-confidence AP/MS data set. We gave components of the same MIPS complex the same color, with gray nodes representing unannotated proteins. This reconstructed network captured a global organization of protein complexes, in particular for protein assemblies related to RNA synthesis, chromatin remodeling, DNA replication and repair, protein synthesis, and protein and RNA degradation (proteasome and exosome). Proteins in transport complexes were connected among themselves, but the proteins that they transport were not captured in this mapping. Some labels have been removed for clarity, and the complete annotations are provided in the Supplementary Material.
Fig. 6.
Connectivity among and between Munich Information Center for Protein Sequences complexes. A, The number of interactions between proteins among the constituent functional complexes associated with RNA synthesis (rows and columns 1–5), Chromatin remodeling (rows and columns 6–9), and Other DNA interactions (rows and columns 10–15) is color-coded from none (dark) to three or more (bright red). The number of proteins in each of the 15 complexes is given in square brackets below the graph. Abbreviations: TFIIF, Transcription factor complexes II F; TFIIIC, Transcription factor complexes III C; SWI/SNF, SWItch/Sucrose NonFermentable; TAFIIs, TATA-binding protein associated factors; SAGA, Spt/Ada/GCN5/acetyltransferase; NuA4, nucleosome acetyltransferase of H4. B, The link between the cellular locations of proteins and the different protein transport assemblies that interact with these proteins. The figure shows the connection Z-score (see Methods) associated with co-occurring protein labels of interacting proteins as a function of their locations (rows and columns 1–5) and association with different transporter complexes (rows and columns 6–16). Abbreviations: ER, Endoplasmic reticulum; TIM, the inner mitochondrial membrane protein translocase; TOM, transport across the outer membrane; GIM, prefoldin protein complex; t-SNARE, target SNAP (Soluble NSF Attachment Protein) Receptor; v-SNARE, vesicle SNAP (Soluble NSF Attachment Protein) Receptor; AP-2, Adaptor protein complex-2; AP-3, Adaptor protein complex-3; ERV25, ER Vesicle 25.
Fig. 7.
Protein abundance correlation between interacting protein pairs. A, Spearman rank-correlation coefficients (r_Spearman) between abundance ranks of interacting proteins. The histograms are shown for all interactions (All interactions), the subset that encompasses interactions that share the same intrafunction annotations (Intrafunction), and all remaining ones that do not (Other). We indicated correlations that were not statistically significant by a star (*). B, Abundance map of the AP/MS data set. Abundance values (50) were divided into 11 classes from 0 (smallest) to 10 (largest), and each class was represented by a color. Class 0 is the collection of proteins whose abundances were too small to be detected, and classes 1–10 were equally divided among proteins whose abundance could be detected. Interacting proteins in each visible cluster tended to have the same abundance values.
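The correlation in panel A can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' procedure: each undirected interaction is counted in both orientations so the result does not depend on which partner is listed first, pairs with a missing abundance are skipped, and tied values receive average ranks. All function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5

def spearman_over_interactions(edges, abundance):
    """Spearman correlation between abundances of interacting partners."""
    xs, ys = [], []
    for a, b in edges:
        if a in abundance and b in abundance:
            # Assumption: add both orientations of the undirected pair.
            xs += [abundance[a], abundance[b]]
            ys += [abundance[b], abundance[a]]
    if not xs:
        return float("nan")
    # Spearman is the Pearson correlation of the rank-transformed values.
    return pearson(average_ranks(xs), average_ranks(ys))
```

A positive coefficient on a real data set would indicate that interacting partners tend to have similar abundance ranks; assessing statistical significance, as marked by the star in the figure, is outside the scope of this sketch.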
Fig. 8.
Distribution of essential proteins. A, The fraction of essential proteins among hub proteins as a function of hub threshold, represented as the fraction of (hub) proteins with degrees larger than a given degree among all proteins of the studied protein interaction network. For example, a hub threshold of 0.1 means that we selected the top 10% most connected proteins as hubs. B, The essential fraction of proteins in the AP/MS data is shown for several different false positive rates, indicating the sensitivity of the essentiality-connectivity correlation to the confidence level of the data. C, The fraction of all yeast proteins that were essential as a function of the Munich Information Center for Protein Sequences (MIPS) function categories. D, More connected proteins within the same MIPS function category tended to be essential for all five data sets. For each MIPS complex, we calculated the average degree and extracted proteins of above-average degrees (Higher-connectivity proteins) into a group and those with below-average degrees (Lower-connectivity proteins) into another group. After scanning all MIPS complexes, we calculated the fraction of essential proteins in each group. The consensus conclusion indicated that higher-connectivity proteins were indeed more essential than lower-connectivity ones.
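The hub-threshold calculation in panel A can be sketched as below. This is a minimal illustration, assuming that hubs are chosen by ranking proteins by degree and taking the top fraction, with ties broken alphabetically; the caption defines the threshold through a degree cutoff, so the handling of ties here is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

def degrees(edges):
    """Number of distinct interaction partners per protein."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}

def essential_fraction_among_hubs(edges, essential, hub_threshold):
    """Fraction of essential proteins among the top-connected proteins.

    hub_threshold is the fraction of network proteins treated as hubs,
    e.g. 0.1 selects the 10% most connected proteins.
    """
    deg = degrees(edges)
    # Sort by decreasing degree; the name breaks ties (assumption).
    ranked = sorted(deg, key=lambda p: (-deg[p], p))
    n_hubs = max(1, round(hub_threshold * len(ranked)))
    hubs = ranked[:n_hubs]
    return sum(p in essential for p in hubs) / n_hubs

# Hypothetical toy network: protein A is the most connected node.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C")]
toy_essential = {"A", "D"}
curve = {t: essential_fraction_among_hubs(toy_edges, toy_essential, t)
         for t in (0.2, 0.6, 1.0)}
```

Evaluating the function over a range of thresholds gives a curve of the kind plotted in panel A; repeating it on networks filtered at different false positive rates corresponds to the comparison in panel B.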
