Microb Inform Exp. 2012 Mar 21:2:4.

doi: 10.1186/2042-5783-2-4.

Mycobacterium tuberculosis and Clostridium difficille interactomes: demonstration of rapid development of computational system for bacterial interactome prediction

Seshan Ananthasubramanian¹,², Rahul Metri³, Ankur Khetan^#⁴, Aman Gupta^#⁵, Adam Handen^#⁶, Nagasuma Chandra³, Madhavi Ganapathiraju¹,²

Affiliations

¹ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh 15260, USA.
² Intelligent Systems Program, University of Pittsburgh, Pittsburgh 15260, USA.
³ Department of Biochemistry, Indian Institute of Science, Bangalore 560012, India.
⁴ Indian Institute of Technology, Roorkee, India.
⁵ Birla Institute of Technology and Science, Pilani, India.
⁶ Rochester Institute of Technology, Henrietta, USA.

^# Contributed equally.

PMID: 22587966
PMCID: PMC3353838
DOI: 10.1186/2042-5783-2-4

Seshan Ananthasubramanian et al. Microb Inform Exp. 2012.

Abstract

Background: Protein-protein interaction (PPI) networks (interactomes) of most organisms, except for some model organisms, are largely unknown. Experimental methods, including high-throughput techniques, are highly resource intensive. Computational discovery of PPIs can therefore accelerate biological discovery by presenting the "most-promising" pairs of proteins that are likely to interact. For many bacteria, the genome sequence, and thereby the genomic context of the proteome, is readily available; for some of these proteomes, localization and functional annotations are also available, but interactomes are not. We present here a method for rapid development of a computational system to predict the interactome of a bacterial proteome. While other studies have presented methods to transfer interologs across species, we propose transferring computational models to benefit from cross-species annotations, thereby predicting many more novel interactions even in the absence of interologs. Mycobacterium tuberculosis (Mtb) and Clostridium difficile (CD) are used to demonstrate the approach.

Results: We developed a random forest classifier over features derived from Gene Ontology annotations and the genetic context scores provided by the STRING database, and used it to predict Mtb and CD interactions independently. The Mtb classifier gave a precision of 94% and a recall of 23% on a held-out test set. The Mtb model was then run on all 8 million protein pairs of the Mtb proteome, resulting in 708 new interactions at 94% expected precision, or 1,595 new interactions at 80% expected precision. The CD classifier gave a precision of 90% and a recall of 16% on a held-out test set. The CD model was run on all 8 million protein pairs of the CD proteome, resulting in 143 new interactions at 90% expected precision, or 580 new interactions at 80% expected precision. We also compared the overlap of our predictions with STRING database interactions for CD and Mtb, and with interactions recently identified by a bacterial 2-hybrid system for Mtb. To demonstrate the utility of transferring computational models, we used the Mtb model to predict interactions among CD protein pairs. This cross-species model yielded a precision of 88% at a recall of 8%. To demonstrate transfer of features from other organisms in the absence of feature-based and interaction-based information, we transferred missing feature values from Mtb orthologs into the CD data. By transferring this data from orthologs (not interologs), we showed that a large number of interactions can be predicted.
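The precision and recall figures above depend on where the classifier's output score is cut off. The following is a minimal sketch of that trade-off, not the authors' code: it assumes each protein pair has a score in [0, 1] and a binary label, and the function names and toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_recall_at(scores: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when pairs scoring >= threshold are called interacting."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores: List[float], labels: List[int],
                                   target: float) -> Optional[float]:
    """Lowest observed score whose cutoff reaches the target precision, else None."""
    for t in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, t)
        if precision >= target:
            return t
    return None


# Toy held-out set (hypothetical values): classifier scores and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

cutoff = lowest_threshold_for_precision(scores, labels, target=0.9)
print(cutoff, precision_recall_at(scores, labels, cutoff))
```

Choosing the lowest cutoff that still meets the target precision keeps recall as high as possible at that precision, which is why a stricter precision target yields fewer predicted interactions.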

Conclusions: A (partial) bacterial interactome can be discovered rapidly by using the existing set of GO and STRING features associated with the organism. When there are not enough known interactions to develop a computational prediction system, cross-species interactome development can be used: the computational model of a well-studied organism can be employed to make the initial interactome prediction for the target organism. We have also demonstrated that annotations can be transferred from orthologs in well-studied organisms, enabling accurate predictions for organisms with no annotations. These approaches can serve as building blocks to address the challenges of feature coverage and missing interactions on the way to rapid interactome discovery for bacterial organisms.
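The ortholog-based annotation transfer can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's implementation: the protein identifiers, the placeholder GO labels, and the `transfer_annotations` helper are all hypothetical, and the ortholog mapping is taken as given.

```python
from typing import Dict, Set


def transfer_annotations(target_go: Dict[str, Set[str]],
                         source_go: Dict[str, Set[str]],
                         ortholog_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Fill empty annotation sets in the target organism from its orthologs.

    Proteins that already have annotations are left unchanged; proteins with
    no annotations and no ortholog remain empty.
    """
    filled: Dict[str, Set[str]] = {}
    for protein, terms in target_go.items():
        if terms:
            filled[protein] = set(terms)
            continue
        donor = ortholog_of.get(protein)
        filled[protein] = set(source_go.get(donor, set())) if donor else set()
    return filled


# Hypothetical example: one target protein lacks annotations but has an ortholog.
target_go = {"cd1": {"GO:A"}, "cd2": set(), "cd3": set()}
source_go = {"mtb1": {"GO:B", "GO:C"}}
ortholog_of = {"cd2": "mtb1"}

print(transfer_annotations(target_go, source_go, ortholog_of))
```

Only the annotations are borrowed here, not interactions, which matches the distinction the abstract draws between transferring from orthologs and transferring interologs.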

Availability: The predictions for all Mtb and CD proteins are made available at: http://severus.dbmi.pitt.edu/TB and http://severus.dbmi.pitt.edu/CD respectively for browsing as well as for download.

Figures

**Figure 1**
**Wide gap in the availability of annotations and protein-protein interactions**. The blue line shows the number of proteins whose annotations are available in Gene Ontology [1], and the red line shows the number of interactions available in BioGrid (thebiogrid.org) [2], for the organisms.

**Figure 2**
**Approach**. Block diagram showing the approach for different cases of availability of features and interactions in bacterial genomes.

**Figure 3**
**Precision-recall curves for intraspecies prediction (Case I)**. Precision versus recall achieved by the random forest is compared with that achieved by each of the STRING scores, which score functional interactions based on genomic context. The random forest predictor achieves a higher recall than the individual STRING scores at similar values of precision. The "combined score" provided by STRING is also shown; as discussed in the text, it has a high rate of false positives for *biophysical interactions*. (A) shows the results for training and testing on Mtb, while (B) shows training and testing on CD.

**Figure 4**
**Mean values of feature elements for interacting and non-interacting pairs**. The figure shows the mean value of each of the 7 features for interacting protein-pairs and for random protein-pairs.

**Figure 5**
**Number of novel interactions at different levels of precision**. (A) Observed number of novel predictions in the Mtb interactome at varying levels of estimated precision. For example, at 90% precision, the algorithm uncovers 786 novel interactions and 1,068 known interactions; at 80% expected precision, it uncovers 1,595 novel interactions and 1,488 known interactions. The thresholds on the random forest output that correspond to the precisions shown in the figure are 0.88 (90% precision), 0.79 (80%), 0.67 (70%) and 0.49 (60%). (B) Observed number of novel predictions in the CD interactome at varying levels of estimated precision. For example, at 90% precision, the algorithm uncovers 143 novel interactions and 812 known interactions; at 80% expected precision, it uncovers 580 novel interactions and 1,587 known interactions. The thresholds on the random forest output that correspond to the precisions shown in the figure are 0.98 (90% precision), 0.86 (80%), 0.66 (70%) and 0.58 (60%).

**Figure 6**
**Overlap of interactome data**. For Mtb, the overlap between interactions predicted by the random forest at 94% precision, those given by the STRING combined score, STRING interactions with experimental evidence, and interactions identified by Wang et al. using the bacterial 2-hybrid method [36]. (A) Overlap for the entire datasets. For the STRING combined score, all interactions with a non-zero combined score are considered. (B) Similar to (A), but only interactions with a STRING *combined score* greater than 700 are shown inside the rectangle, while those with a score less than 700 are shown outside it. For CD, the overlap between interactions predicted by the random forest at 90% precision, those given by the STRING combined score, and STRING interactions with experimental evidence. (C) Overlap for the entire datasets. For the STRING combined score, all interactions with a non-zero combined score are considered. (D) Similar to (C), but only interactions with a STRING *combined score* greater than 700 are shown inside the rectangle, while those with a score less than 700 are shown outside it.

**Figure 7**
**Network view of predicted interactions**. (A) Complete network of the high-confidence protein-protein interactions. (B) Sub-cluster depicting some of the proteins involved in the histidine biosynthesis pathway. The node for the HisG protein is colored dark green. The blue node (Rv2584c) is involved in purine biosynthesis.

**Figure 8**
**Precision-recall curve for cross-species prediction (Case II)**. Precision versus recall observed for the cross-species evaluation. The CD test set was used for both evaluations. The purple line shows the classifier trained on the CD features, and the red line shows the classifier trained on the Mtb features.

**Figure 9**
**Precision and recall for ortholog transfer (Case III)**. The data in which orthologous Gene Ontology values were substituted (experimental) consistently performed better than the data with the values removed (blank), and reached an accuracy equal to that of our control, which has all available data for CD.

Cited by

Research prioritization through prediction of future impact on biomedical science: a position paper on inference-analytics.
Ganapathiraju MK, Orii N. Gigascience. 2013 Aug 30;2(1):11. doi: 10.1186/2047-217X-2-11. PMID: 24001106.
Computational Network Inference for Bacterial Interactomics.
James K, Muñoz-Muñoz J. mSystems. 2022 Apr 26;7(2):e0145621. doi: 10.1128/msystems.01456-21. Epub 2022 Mar 30. PMID: 35353009. Review.
Identifying Protein Complexes With Clear Module Structure Using Pairwise Constraints in Protein Interaction Networks.
Liu G, Liu B, Li A, Wang X, Yu J, Zhou X. Front Genet. 2021 Aug 27;12:664786. doi: 10.3389/fgene.2021.664786. eCollection 2021. PMID: 34512712.
Comparative genomics of 274 Vibrio cholerae genomes reveals mobile functions structuring three niche dimensions.
Dutilh BE, Thompson CC, Vicente AC, Marin MA, Lee C, Silva GG, Schmieder R, Andrade BG, Chimetto L, Cuevas D, Garza DR, Okeke IN, Aboderin AO, Spangler J, Ross T, Dinsdale EA, Thompson FL, Harkins TT, Edwards RA. BMC Genomics. 2014 Aug 5;15(1):654. doi: 10.1186/1471-2164-15-654. PMID: 25096633.
Transferring knowledge of bacterial protein interaction networks to predict pathogen targeted human genes and immune signaling pathways: a case study on M. tuberculosis.
Mei S, Flemington EK, Zhang K. BMC Genomics. 2018 Jun 28;19(1):505. doi: 10.1186/s12864-018-4873-9. PMID: 29954330.

References

1. Sears CL. A dynamic partnership: celebrating our gut flora. Anaerobe. 2005;11(5):247–251. doi: 10.1016/j.anaerobe.2005.05.001.
2. World Health Organisation. Multidrug and extensively drug-resistant TB (M/XDR-TB): 2010 global report on surveillance and response. 2010.
3. Koch R, Brock TD, Fred EB. The etiology of tuberculosis. Rev Infect Dis. 1982;4(6):1270–1274. doi: 10.1093/clinids/4.6.1270.
4. Walzl G, Ronacher K, Hanekom W, Scriba TJ, Zumla A. Immunological biomarkers of tuberculosis. Nat Rev Immunol. 2011;11(5):343–354. doi: 10.1038/nri2960.
5. Rachman H, Kaufmann SHE. Exploring functional genomics for the development of novel intervention strategies against tuberculosis. International Journal of Medical Microbiology. 2007;297(7-8):559–567. doi: 10.1016/j.ijmm.2007.03.003.

Grants and funding

R01 MH094564/MH/NIMH NIH HHS/United States
