BMC Bioinformatics. 2020 Oct 19;21(1):466.
doi: 10.1186/s12859-020-03794-x.

Implementation of homology based and non-homology based computational methods for the identification and annotation of orphan enzymes: using Mycobacterium tuberculosis H37Rv as a case study

Swati Sinha et al. BMC Bioinformatics. 2020.

Abstract

Background: Homology-based methods are among the most important and widely used approaches for functional annotation of high-throughput microbial genome data. A major limitation of these methods is the absence of well-characterized sequences for certain functions. Non-homology methods, which rely on the context and the interactions of a protein, are very useful for identifying missing metabolic activities and for functional annotation in the absence of significant sequence similarity. In the current work, we employ homology-based and context-based methods incrementally to identify local holes and chokepoints: enzymes that have not been annotated in the Mycobacterium tuberculosis genome but whose presence is indicated by their interactions with known proteins in a metabolic network context. We have developed two computational procedures using network theory: one to identify orphan enzymes ('Hole finding protocol') and one to identify candidate proteins for each predicted orphan enzyme ('Hole filling protocol'). We propose an integrated interaction score, based on scores from the STRING database, to identify the candidate protein sequences most likely to perform the missing function of each orphan enzyme, using M. tuberculosis as a case study.

Results: Applying an automated homology-based enzyme identification protocol, ModEnzA, to the M. tuberculosis genome yielded 56 novel enzyme predictions. Using the 'Hole finding protocol', we further predicted 74 putative local holes, 6 chokepoints, and 3 high-confidence local holes in the genome. The 'Hole filling protocol' was validated on the E. coli genome using artificial in-silico enzyme knockouts, where our method showed 25% higher accuracy than other methods in assigning the correct sequence for the knocked-out enzyme among the top 10 ranks. The method was further validated on 8 additional genomes.
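The knockout validation described above can be illustrated with a minimal sketch. It assumes that each in-silico knockout yields a ranked list of candidate proteins and that success means the knocked-out enzyme's own sequence appears within the top k ranks. The function names (`self_rank`, `top_k_fraction`) and the toy identifiers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def self_rank(ranked: List[str], true_protein: str) -> Optional[int]:
    """Return the 1-based rank of the knocked-out protein, or None if absent."""
    try:
        return ranked.index(true_protein) + 1
    except ValueError:
        return None


def top_k_fraction(knockouts: Dict[str, List[str]], k: int = 10) -> float:
    """Fraction of knockouts whose true sequence is ranked within the top k."""
    if not knockouts:
        return 0.0
    hits = 0
    for true_protein, ranked in knockouts.items():
        rank = self_rank(ranked, true_protein)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(knockouts)


# Toy data (hypothetical): key = knocked-out protein, value = ranked candidates.
toy_knockouts = {
    "P1": ["P1", "P7", "P3"],
    "P2": ["P9", "P2", "P4"],
    "P3": ["P5", "P6", "P8"],
}

if __name__ == "__main__":
    for k in (1, 2, 10):
        print(k, top_k_fraction(toy_knockouts, k))
```

Sweeping k over a range of thresholds gives a curve of the kind a self-rank comparison would plot.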

Conclusions: We have developed methods that can be generalized to augment homology-based annotation, both to identify missing enzyme-coding genes and to predict a candidate protein for each of them. For pathogens such as M. tuberculosis, this work is significant because it increases the known protein repertoire and, thereby, the potential for identifying novel drug targets.
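As a rough illustration of the hole-finding idea, the sketch below flags unannotated enzymes whose neighbors in an enzyme-enzyme dependency graph are annotated in the organism. The criterion used here (fraction of annotated neighbors at or above a threshold) is an assumption made for illustration; the abstract does not state the paper's actual criteria for local holes, high-confidence local holes, or chokepoints. The graph encoding, the function name `find_candidate_holes`, and the node labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_candidate_holes(
    graph: Dict[str, Set[str]],
    annotated: Set[str],
    min_fraction: float = 1.0,
) -> List[Tuple[str, float]]:
    """List unannotated enzymes whose annotated-neighbor fraction >= min_fraction.

    graph maps each enzyme to the enzymes it is linked to in the dependency
    graph; annotated is the set of enzymes already assigned in the organism.
    The threshold rule is an illustrative assumption, not the paper's rule.
    """
    holes = []
    for enzyme, neighbours in graph.items():
        if enzyme in annotated or not neighbours:
            continue
        fraction = len(neighbours & annotated) / len(neighbours)
        if fraction >= min_fraction:
            holes.append((enzyme, fraction))
    # Best-supported candidates first; ties broken by name for determinism.
    return sorted(holes, key=lambda item: (-item[1], item[0]))


# Toy linear pathway E1 - E2 - E3 - E4 - E5 (hypothetical labels).
toy_graph = {
    "E1": {"E2"},
    "E2": {"E1", "E3"},
    "E3": {"E2", "E4"},
    "E4": {"E3", "E5"},
    "E5": {"E4"},
}
toy_annotated = {"E1", "E3"}

if __name__ == "__main__":
    print(find_candidate_holes(toy_graph, toy_annotated, min_fraction=1.0))
    print(find_candidate_holes(toy_graph, toy_annotated, min_fraction=0.5))
```

In this toy pathway, E2 is flanked on both sides by annotated enzymes, so it is the best-supported candidate, while E4 has only one of its two neighbors annotated.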

Keywords: Chokepoints; Genome context-based annotation; Global hole; Homology based method; Local hole; Missing enzyme; ModEnzA; Non-homology based methods.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Stepwise systematic approach for the implementation of homology-based and non-homology-based computational methods
Fig. 2
Schematic representation of the workflow to map novel enzymes in M. tb using ModEnzA profiles. The ModEnzA enzyme profiles were built with the 31 January 2018 release of the ENZYME database. Both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL were used as the sequence search space to scan for novel M. tb enzymes
Fig. 3
Schematic representation of the 'Hole finding protocol' to identify local holes and chokepoints in an organism. The flowchart shows the workflow for identifying local holes and chokepoints using an enzyme-enzyme dependency graph of all known metabolic reactions. ModEnzA [14] is a profile HMM-based method used to scan the proteome of a given organism for the accurate classification of its enzymes
Fig. 4
Mapping of known and predicted enzymes in M. tb onto KEGG pathways. a–c The mappings of some of these enzymes onto the Porphyrin and Chlorophyll metabolism, Drug metabolism - other enzymes, and Glycolysis/Gluconeogenesis KEGG pathways, respectively. The enzymes already annotated in M. tb are shown in red, the enzymes predicted by the homology-based method ModEnzA in blue, the local holes in green, the high-confidence local holes in brown, and the chokepoints in yellow
Fig. 5
Schematic representation of a metabolic hole and candidate protein set. The figure shows (a) an unknown protein (?) surrounded by known neighbors n1–n6 and (b) a set of candidate proteins from the target organism, which is its entire proteome except for the known neighboring proteins. For each neighbor, we find its interaction with all the candidate proteins. If protein P1 has an interaction score with n1, n2, and n4, then we combine these scores in a naive Bayes manner using Bayesian score integration as shown in the equation (see "Methods"). All the candidate proteins are then sorted by their respective scores, and the one with the highest score is taken as the most likely to perform the desired function
Fig. 6
Comparison of the self-rank thresholds after in-silico enzyme knockouts. (a) Performance of the 'Hole filling protocol' on the E. coli genome (blue curve), where the combined scores of functional associations from STRING were used to obtain the new functional association score. *Reference values for individual and combined association scores were digitized from Fig. 4 of Kharchenko et al. [28] for comparison. (b) Similar knockouts were performed for all the metabolic proteins from eight other genomes: Saccharomyces cerevisiae (sce), Dictyostelium discoideum (ddi), Arabidopsis thaliana (ath), Drosophila melanogaster (dme), Danio rerio (dre), Salmonella enterica (sen), Shigella flexneri (sfl) and Vibrio cholerae (vch)
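The score integration described in the Fig. 5 caption can be sketched as follows. The paper's exact equation is in its Methods and is not reproduced on this page, so the sketch assumes a common naive-Bayes-style combination of independent evidence, S = 1 - prod(1 - s_i), with each neighbor score s_i scaled to the range 0 to 1. The function names, protein labels, and score values are hypothetical.

```python
from typing import Dict, List, Tuple


def integrate_scores(scores: List[float]) -> float:
    """Combine scores as 1 - prod(1 - s_i), treating them as independent.

    Assumed formula for illustration; scores must lie in [0, 1].
    """
    remaining = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        remaining *= 1.0 - score
    return 1.0 - remaining


def rank_candidates(
    interactions: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by integrated score over the hole's known neighbors.

    interactions maps candidate -> {neighbor: interaction score}.
    """
    scored = [
        (candidate, integrate_scores(list(by_neighbour.values())))
        for candidate, by_neighbour in interactions.items()
    ]
    # Highest integrated score first; ties broken by name for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy scores (hypothetical): P1 interacts with n1, n2 and n4, as in the caption.
toy_interactions = {
    "P1": {"n1": 0.5, "n2": 0.5, "n4": 0.2},
    "P2": {"n3": 0.7},
    "P3": {},
}

if __name__ == "__main__":
    for candidate, score in rank_candidates(toy_interactions):
        print(candidate, round(score, 3))
```

With this combination rule, several moderate interactions with different neighbors can outrank a single stronger interaction, which is the intended effect of pooling evidence across the neighborhood of the hole.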

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
    2. Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211.
    3. Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960. doi: 10.1093/bioinformatics/bti125.
    4. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
    5. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763. doi: 10.1093/bioinformatics/14.9.755.
