Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2005;6(5):R40.
doi: 10.1186/gb-2005-6-5-r40. Epub 2005 Apr 15.

Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome

Affiliations

Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome

Arun K Ramani et al. Genome Biol. 2005.

Abstract

Background: Extensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins.

Results: We established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets.

Conclusion: These interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Overlap between existing human protein interaction sets. A Venn diagram shows the overlap is small among the existing, publicly available human protein interaction datasets (specifically, Reactome, BIND, and HPRD protein interaction data). The small overlap (< 0.1% in common in all three datasets) implies that the number of protein interactions described in the literature is actually quite large and that the individual datasets carry specific biases.
Figure 2
Figure 2
Comparison of precision and accuracy of the algorithms. The conditional random fields (CRF) algorithm considerably outperforms other approaches for identifying human protein names in Medline abstracts, such as the simple matching of words to a dictionary of protein names, as well as the other available protein name-tagging algorithms in [32], Kex [34] and Abgene [35]. The tests are performed on 200 manually annotated Medline abstracts [33]. The precision (the number of correct protein names among all identified names) in identifying proteins is plotted against the recall (the number of correct protein names among all possible correct protein names). Higher scores on both precision and recall are preferable; however, for this purpose, we seek to maximize precision and can tolerate lower recall.
Figure 3
Figure 3
The performance of the co-citation algorithm at identifying protein interactions. (a) The probabilistic score effectively ranks co-cited proteins by their tendency to participate in the same pathway, as measured on the functional annotation training benchmark. As the probability of random co-citation decreases, the functional relatedness of the co-cited proteins increases. This tendency is robust to changes in the CRF confidence threshold chosen (data not shown). Each point represents 3,000 protein pairs. (b) An examination of the number of protein pairs identified at different CRF thresholds (0.8, 0.6, and 0.4) shows that the recall of the method is increased with lowered thresholds. Re-ranking the 15,000 top-scoring protein pairs (CRF threshold = 0.8) by the tendency of the abstracts to discuss physical protein interactions shows their consistent performance in the annotation benchmark.
Figure 4
Figure 4
A comparison of the available human protein interaction data on the two benchmarks. (a) An examination of the initial performance of the datasets on the functional annotation test benchmark reveals the relative quality of each dataset. The interactions extracted using co-citation analysis filtered by the Bayesian estimator show a robust behavior in terms of their scores. (b) Comparison of the performance of the interactions retrieved from the co-citation analysis after incorporating the Bayesian filter and the interactions from HPRD and orthology transfer, as assessed on the physical interaction benchmark. The Bayesian filter effectively ranks the co-citation-derived interactions in terms of their correspondence to physical protein interactions.
Figure 5
Figure 5
Comparison of extracted interactions with existing interactions. A comparison of interactions inferred from orthology [21] and those recovered by co-citation with the other existing human protein interaction datasets reveals that the overlap is small. The trend implies that the different methods are sampling relatively exclusive sets of interactions although, with the exception of the orthology-derived interactions, they are all derived directly from the primary biological literature.
Figure 6
Figure 6
Visualization of the final consolidated network of protein interactions. A view of the composite interaction network (31,609 interactions between 7,748 proteins). Of these, 6,706 proteins (87%) are connected by at least one interaction into the central, connected network component. The modularity in the network can be seen in the superimposed three-dimensional visualization, a histogram in which higher peaks correspond to larger numbers of edges per unit area. The network coordinates were generated by LGL [46] and visualized with Zlab by Zack Simpson.

Similar articles

Cited by

References

    1. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001;98:4569–4574. doi: 10.1073/pnas.061034498. - DOI - PMC - PubMed
    1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000;403:623–627. doi: 10.1038/35001009. - DOI - PubMed
    1. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415:141–147. doi: 10.1038/415141a. - DOI - PubMed
    1. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415:180–183. doi: 10.1038/415180a. - DOI - PubMed
    1. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 2001;294:2364–2368. doi: 10.1126/science.1065810. - DOI - PubMed

Publication types

LinkOut - more resources