Microb Inform Exp. 2012 Mar 21;2:4.
doi: 10.1186/2042-5783-2-4.

Mycobacterium tuberculosis and Clostridium difficille interactomes: demonstration of rapid development of computational system for bacterial interactome prediction

Seshan Ananthasubramanian et al. Microb Inform Exp. 2012.

Abstract

Background: Protein-protein interaction (PPI) networks (interactomes) of most organisms, except for some model organisms, are largely unknown. Experimental methods, including high-throughput techniques, are highly resource intensive. Computational discovery of PPIs can therefore accelerate biological discovery by presenting the "most-promising" pairs of proteins that are likely to interact. For many bacteria, the genome sequence, and thereby the genomic context of the proteome, is readily available; for some of these proteomes, localization and functional annotations are also available, but interactomes are not. We present here a method for rapid development of a computational system to predict the interactome of a bacterial proteome. While other studies have presented methods to transfer interologs across species, we propose transferring computational models to benefit from cross-species annotations, thereby predicting many more novel interactions even in the absence of interologs. Mycobacterium tuberculosis (Mtb) and Clostridium difficile (CD) are used to demonstrate the approach.

Results: We developed a random forest classifier over features derived from Gene Ontology annotations and genetic context scores provided by the STRING database, and used it to predict Mtb and CD interactions independently. The Mtb classifier gave a precision of 94% and a recall of 23% on a held-out test set. The Mtb model was then run on all 8 million protein pairs of the Mtb proteome, yielding 708 new interactions at 94% expected precision or 1,595 new interactions at 80% expected precision. The CD classifier gave a precision of 90% and a recall of 16% on a held-out test set. The CD model was run on all 8 million protein pairs of the CD proteome, yielding 143 new interactions at 90% expected precision or 580 new interactions at 80% expected precision. We compared the overlap of our predictions with STRING database interactions for CD and Mtb, and with interactions recently identified by a bacterial 2-hybrid system for Mtb. To demonstrate the utility of transferring computational models, we used the Mtb model to predict CD protein pairs. This cross-species model yielded a precision of 88% at a recall of 8%. To demonstrate transfer of features from other organisms in the absence of feature-based and interaction-based information, we transferred missing feature values from Mtb orthologs into the CD data. By transferring these data from orthologs (not interologs), we showed that a large number of interactions can be predicted.
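
The precision and recall figures quoted above are computed on a held-out test set at a chosen cutoff on the classifier output. The following is a minimal sketch of that calculation, not the authors' code: the function names, the toy scores and labels, and the proteome size of 4,000 proteins are all assumptions made for illustration.

```python
from math import comb


def count_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs in a proteome of n_proteins."""
    return comb(n_proteins, 2)


def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are called interacting.

    scores: classifier outputs in [0, 1]
    labels: 1 for a known interacting pair, 0 for a non-interacting pair
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy held-out set (made-up numbers, not data from the paper).
scores = [0.95, 0.90, 0.85, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 1]

p, r = precision_recall(scores, labels, threshold=0.8)
print(f"precision={p:.2f} recall={r:.2f}")

# Assumed proteome size of 4,000 proteins (hypothetical round number).
print(count_pairs(4000))
```

A high cutoff keeps precision high at the cost of recall, which is the trade-off the abstract reports. Under the assumed proteome size, the pair count comes out close to the 8 million pairs mentioned above.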

Conclusions: A (partial) bacterial interactome can be discovered rapidly using the existing GO and STRING features associated with an organism. Cross-species interactome development can be used when there are not enough known interactions to develop a computational prediction system: the computational model of one or more well-studied organisms can be employed to make the initial interactome prediction for the target organism. We have also demonstrated that annotations can be transferred from orthologs in well-studied organisms, enabling accurate predictions for organisms with no annotations. These approaches can serve as building blocks for addressing the challenges of feature coverage and missing interactions on the way to rapid interactome discovery for bacterial organisms.

Availability: The predictions for all Mtb and CD proteins are made available at: http://severus.dbmi.pitt.edu/TB and http://severus.dbmi.pitt.edu/CD respectively for browsing as well as for download.

Figures

Figure 1
Wide gap in the availability of annotations and protein-protein interactions. The blue line shows the number of proteins with annotations available in Gene Ontology [1], and the red line shows the number of interactions available in BioGrid (thebiogrid.org) [2], for each organism.
Figure 2
Approach. Block diagram showing the approach for different cases of availability of features and interactions in bacterial genomes.
Figure 3
Precision-recall curves for intraspecies prediction (Case I). Precision versus recall for the random forest is compared with that of each STRING score; these scores rate functional interactions based on genomic context. The random forest predictor achieves a higher recall than the individual STRING scores at similar values of precision. The "combined score" provided by STRING is also shown to have high false positives in text for biophysical interactions. (A) shows the results for training and testing with Mtb, while (B) shows training and testing with CD.
Figure 4
Mean values of feature elements for interacting and non-interacting pairs. The figure shows the mean value of each of the 7 features for interacting protein-pairs and for random protein-pairs.
Figure 5
Number of novel interactions at different levels of precision. (A) Observed number of novel predictions in the Mtb interactome at varying levels of estimated precision. For example, at 90% precision, the algorithm uncovers 786 novel interactions and 1,068 known interactions; at 80% expected precision, it uncovers 1,595 novel interactions and 1,488 known interactions. The thresholds on the random forest output that correspond to the precisions shown in the figure are 0.88 (90% precision), 0.79 (80%), 0.67 (70%) and 0.49 (60%). (B) Observed number of novel predictions in the CD interactome at varying levels of estimated precision. For example, at 90% precision, the algorithm uncovers 143 novel interactions and 812 known interactions; at 80% expected precision, it uncovers 580 novel interactions and 1,587 known interactions. The thresholds on the random forest output that correspond to the precisions shown in the figure are 0.98 (90% precision), 0.86 (80%), 0.66 (70%) and 0.58 (60%).
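
The caption above maps each target precision to a cutoff on the classifier output, then counts novel and known pairs above that cutoff. The sketch below shows one way such a mapping could be derived from a labelled test set. It is an illustration under assumptions, not the procedure used in the paper: the function names, the toy scores, and the protein identifiers A to D are hypothetical.

```python
def threshold_for_precision(scores, labels, target):
    """Lowest score cutoff whose test-set precision is >= target, or None.

    Precision is not monotone in the cutoff, so every candidate is checked.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        called = [y for s, y in zip(scores, labels) if s >= t]
        precision = sum(called) / len(called)
        if precision >= target:
            best = t  # a lower cutoff that still meets the target
    return best


def split_predictions(pair_scores, known_pairs, threshold):
    """Split pairs scoring >= threshold into (novel, already known)."""
    predicted = {pair for pair, s in pair_scores.items() if s >= threshold}
    return predicted - known_pairs, predicted & known_pairs


# Toy labelled test set (made-up values).
test_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
test_labels = [1, 1, 1, 0, 1, 0]
cutoff = threshold_for_precision(test_scores, test_labels, target=0.9)

# Toy proteome-wide scores keyed by unordered pair (hypothetical protein IDs).
pair_scores = {
    frozenset({"A", "B"}): 0.95,
    frozenset({"A", "C"}): 0.92,
    frozenset({"B", "D"}): 0.50,
}
known_pairs = {frozenset({"A", "B"})}

novel, known_hits = split_predictions(pair_scores, known_pairs, cutoff)
print(cutoff, len(novel), len(known_hits))
```

Storing each pair as a `frozenset` makes (A, B) and (B, A) the same key, so a pair is never counted twice.
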
Figure 6
Overlap of interactome data. For Mtb, the figure shows the overlap between interactions predicted by the random forest at 94% precision and those given by the STRING combined score, STRING interactions with experimental evidence, and interactions identified by Wang et al. using the bacterial 2-hybrid method [36]. (A) Overlap for the entire datasets. For the STRING combined score, all interactions with a non-zero combined score are considered. (B) Similar to (A), but interactions with a STRING combined score greater than 700 are shown inside the rectangle, while those with a score below 700 are shown outside it. For CD, the figure shows the overlap between interactions predicted by the random forest at 90% precision and those given by the STRING combined score and STRING interactions with experimental evidence. (C) Overlap for the entire datasets. For the STRING combined score, all interactions with a non-zero combined score are considered. (D) Similar to (C), but interactions with a STRING combined score greater than 700 are shown inside the rectangle, while those with a score below 700 are shown outside it.
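
The overlaps described in this caption amount to set operations on unordered protein pairs, with an optional cutoff on the STRING combined score. The following is a small sketch of that bookkeeping. The helper names, protein identifiers and scores are hypothetical, and the score table is a plain dictionary rather than any real STRING file format.

```python
def as_pair_set(pairs):
    """Normalise (a, b) tuples to unordered pairs so (a, b) == (b, a)."""
    return {frozenset(p) for p in pairs}


def string_pairs(string_scores, min_score=0):
    """Pairs whose combined score is strictly greater than min_score."""
    return {pair for pair, s in string_scores.items() if s > min_score}


def overlap_counts(a, b):
    """Sizes of the three regions of a two-set Venn diagram."""
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


# Toy data (hypothetical protein IDs and scores).
predicted = as_pair_set([("A", "B"), ("B", "C"), ("C", "D")])
string_scores = {
    frozenset(("B", "A")): 900,
    frozenset(("C", "B")): 450,
    frozenset(("D", "E")): 800,
    frozenset(("E", "F")): 0,
}

all_nonzero = overlap_counts(predicted, string_pairs(string_scores))
high_conf = overlap_counts(predicted, string_pairs(string_scores, min_score=700))
print(all_nonzero)
print(high_conf)
```

The default `min_score=0` corresponds to the "non-zero combined score" case in panels (A) and (C), and `min_score=700` to the rectangle in panels (B) and (D).
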
Figure 7
Network view of predicted interactions. (A) Complete network of the high-confidence protein-protein interactions. (B) Sub-cluster depicting some of the proteins involved in the histidine biosynthesis pathway. The node for the HisG protein is colored dark green. The blue node (Rv2584c) is involved in purine biosynthesis.
Figure 8
Precision-recall curve for cross-species prediction (Case II). Precision versus recall observed for cross-species evaluation. The CD test set was used for both evaluations. The purple line shows the classifier trained on the CD features and the red line shows the classifier trained on the Mtb features.
Figure 9
Precision and recall for ortholog transfer (Case III). The data in which missing values were substituted with orthologous Gene Ontology values (experimental) consistently performed better than the data with the values removed (blank), and reached an accuracy equal to that of the control, which used all available data for CD.
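
Case III fills in missing annotation values for a target-organism protein using its ortholog in a better-annotated organism. The sketch below shows the substitution step only. It is an illustration under assumptions, not the authors' pipeline: the function name, the feature names, the protein identifiers and the placeholder annotation values are all hypothetical, and missing values are represented as `None`.

```python
def fill_from_orthologs(target_features, ortholog_of, source_features):
    """Copy a value from the ortholog only where the target value is missing.

    target_features: {protein: {feature_name: value or None}}
    ortholog_of:     {target protein: source protein}
    source_features: {source protein: {feature_name: value}}
    Values already present in the target are never overwritten.
    """
    filled = {}
    for protein, feats in target_features.items():
        # Proteins without a mapped ortholog get an empty donor record.
        donor = source_features.get(ortholog_of.get(protein), {})
        filled[protein] = {
            name: value if value is not None else donor.get(name)
            for name, value in feats.items()
        }
    return filled


# Toy data (hypothetical IDs and placeholder annotation values).
cd_features = {
    "CD_a": {"go_process": None, "go_function": "GO:A"},
    "CD_b": {"go_process": None, "go_function": None},
}
ortholog_of = {"CD_a": "Mtb_x"}  # CD_b has no known ortholog here
mtb_features = {"Mtb_x": {"go_process": "GO:B", "go_function": "GO:C"}}

filled = fill_from_orthologs(cd_features, ortholog_of, mtb_features)
print(filled)
```

Proteins with no ortholog keep their missing values, which corresponds to the "blank" condition in the caption.
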

References

    1. Sears CL. A dynamic partnership: celebrating our gut flora. Anaerobe. 2005;11(5):247–251. doi: 10.1016/j.anaerobe.2005.05.001.
    2. World Health Organization. Multidrug and extensively drug-resistant TB (M/XDR-TB): 2010 global report on surveillance and response. 2010.
    3. Koch R, Brock TD, Fred EB. The etiology of tuberculosis. Rev Infect Dis. 1982;4(6):1270–1274. doi: 10.1093/clinids/4.6.1270.
    4. Walzl G, Ronacher K, Hanekom W, Scriba TJ, Zumla A. Immunological biomarkers of tuberculosis. Nat Rev Immunol. 2011;11(5):343–354. doi: 10.1038/nri2960.
    5. Rachman H, Kaufmann SHE. Exploring functional genomics for the development of novel intervention strategies against tuberculosis. Int J Med Microbiol. 2007;297(7-8):559–567. doi: 10.1016/j.ijmm.2007.03.003.
