HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences

doi:10.1186/1471-2105-8-104

BMC Bioinformatics. 2007 Mar 27;8:104.

Prashant K Srivastava¹, Dhwani K Desai, Soumyadeep Nandi, Andrew M Lynn

PMID: 17389042
PMCID: PMC1852395
DOI: 10.1186/1471-2105-8-104

Prashant K Srivastava et al. BMC Bioinformatics. 2007.

Affiliation

¹ School of Information Technology, Jawaharlal Nehru University, New Delhi, India. prashant.k.srivastava@gmail.com

Abstract

Background: Profile hidden Markov models (HMMs) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments, and they have been used with considerable success to identify remote homologues. These conservation patterns arise from fold-specific signals, shared across multiple families, and function-specific signals unique to individual families. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimising the threshold cutoff and by modifying emission probabilities to minimise the influence of fold-specific signals. A protocol to generate family-specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross-validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of profile alignments to align the true positive and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model.
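The threshold-optimisation step can be illustrated with a minimal sketch. It assumes that bit scores for positive (family) and negative (other-class) sequences are already available as plain lists, and it picks the cutoff that maximises the Matthews correlation coefficient (MCC), the measure plotted in Figure 3. The function names, the choice of observed scores as candidate cutoffs, and the averaging over folds are illustrative assumptions, not the authors' implementation; in particular, the sketch does not retrain an HMM per fold.

```python
import math
import random

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def confusion(pos_scores, neg_scores, threshold):
    """Count TP, FP, TN, FN when sequences scoring >= threshold are accepted."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn

def best_threshold(pos_scores, neg_scores):
    """Return the observed score that gives the highest MCC as a cutoff."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(candidates,
               key=lambda t: mcc(*confusion(pos_scores, neg_scores, t)))

def cross_validated_threshold(pos_scores, neg_scores, k=10, seed=0):
    """Average the best cutoff over k folds (assumption: scores are fixed)."""
    rng = random.Random(seed)
    pos, neg = list(pos_scores), list(neg_scores)
    rng.shuffle(pos)
    rng.shuffle(neg)
    thresholds = []
    for i in range(k):
        # Hold out every k-th score and optimise on the remainder.
        train_pos = [s for j, s in enumerate(pos) if j % k != i]
        train_neg = [s for j, s in enumerate(neg) if j % k != i]
        thresholds.append(best_threshold(train_pos, train_neg))
    return sum(thresholds) / k
```

For well-separated scores the chosen cutoff is simply the lowest positive score; the interesting case is when false positives score above some true members, where the MCC trades sensitivity against specificity.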

Results: The protocol, called HMM-ModE, was validated on a set of sequences belonging to six sub-families of the AGC family of kinases. These sequences have an average sequence similarity of 63% across the group, although each sub-family has a different substrate specificity. Optimising the discrimination threshold using negative sequences scored against the model improves the average specificity in test cases from 21% to 98%. Modifying the model probabilities using negative training sequences provides further discrimination in a few cases, raising the average specificity to 99%. Similar improvements were obtained with a sample of G-protein coupled receptors sub-classified with respect to their substrate specificity, even though the average sequence identity across the sub-families is just 20.6%. The protocol is also applied in a high-throughput classification exercise on protein kinases.
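The abstract does not give the formula used to modify emission probabilities, so the following is only a hypothetical illustration of the idea. For one match column, it estimates residue probabilities from the true positive and false positive alignments, then reweights each true positive probability by its ratio to the false positive probability and renormalises. The reweighting rule, the pseudocount, and all names are assumptions made for this sketch.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def column_probs(column, pseudocount=1.0):
    """Residue probabilities for one alignment column, with additive smoothing."""
    counts = Counter(c for c in column if c in ALPHABET)  # gaps are ignored
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

def modified_emissions(tp_column, fp_column, pseudocount=1.0):
    """Hypothetical rule: scale p_tp by p_tp / p_fp, then renormalise."""
    p_tp = column_probs(tp_column, pseudocount)
    p_fp = column_probs(fp_column, pseudocount)
    raw = {a: p_tp[a] * p_tp[a] / p_fp[a] for a in ALPHABET}
    norm = sum(raw.values())
    return {a: raw[a] / norm for a in ALPHABET}

# A column conserved in both sets (a fold-type signal) is left unchanged,
# while a column that differs between the sets is sharpened.
fold_like = modified_emissions("GGGG", "GGGG")
family_like = modified_emissions("DDDD", "EEEE")
```

Under this rule, residues that false positives share with the family gain no extra weight, whereas residues that distinguish the family from its false positives are emphasised, which matches the stated aim of reducing the influence of fold-specific signals.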

Conclusion: The protocol has the potential to maximise the contributions of discriminating residues to classify proteins based on their molecular function, using pre-classified positive and negative sequence training data. The high specificity of the method and the increasing availability of pre-classified sequence data give it potential for application in sequence annotation.

Figures

**Figure 1**
**A multiple alignment showing the common fold-specific signals, along with the function-specific signals of each sub-family**. Alscript [36] figure showing a portion of the alignment of representatives of the six protein kinase families discussed in the text. The alignment is coloured based on residue conservation: red and pink (identical and conserved across all families) correspond to fold signals; blue and green mark residues identical and conserved within a family. Positions predicted to confer specificity for the family [35] are highlighted in yellow. Deleted regions are indicated by dashes (- - -). Numbers below the alignment correspond to the PDB structure 2f7z.

**Figure 2**
**A Receiver Operating Characteristic (ROC) curve of HMM-d and HMM-ModE for the PVPK sub-family**. HMM-ModE: blue; HMM-d: red.

**Figure 3**
**Determination of optimal discrimination threshold**. The average MCC (bold black) distribution is overlaid on the sensitivity and specificity plots for each of the 10-fold cross-validation samples of the PVPK sub-family. Figures are plotted for the default profile HMM-d (top, A), HMM-ModE (centre, B) and HMM-Sub (bottom, C).

**Figure 4**
**Six subfamilies of the AGC family of protein kinases**.

**Figure 5**
**An outline of some Level 1 and Level 2 subfamilies of the GPCR Class A proteins**. The level-2 sub-families used in this study are marked in bold.

**Figure 6**
**An outline of some Level 1 and Level 2 subfamilies of the GPCR Class C proteins**. The level-2 sub-families used in this study are marked in bold.

**Figure 7**
**An outline of the S/T-Y kinase/atypical kinase/lipid kinase/ATP-grasp Fold Group as categorized in [23]**. The EC numbers for which training sequences were available in the ENZYME database are marked in bold.

Cited by

Rational mutational analysis of a multidrug MFS transporter CaMdr1p of Candida albicans by employing a membrane environment based computational approach.
Kapoor K, Rehan M, Kaushiki A, Pasrija R, Lynn AM, Prasad R. PLoS Comput Biol. 2009 Dec;5(12):e1000624. doi: 10.1371/journal.pcbi.1000624. PMID: 20041202.
Genomics-driven discovery of a biosynthetic gene cluster required for the synthesis of BII-Rafflesfungin from the fungus Phoma sp. F3723.
Sinha S, Nge CE, Leong CY, Ng V, Crasta S, Alfatah M, Goh F, Low KN, Zhang H, Arumugam P, Lezhava A, Chen SL, Kanagasundaram Y, Ng SB, Eisenhaber F, Eisenhaber B. BMC Genomics. 2019 May 14;20(1):374. doi: 10.1186/s12864-019-5762-6. PMID: 31088369.
Employing information theoretic measures and mutagenesis to identify residues critical for drug-proton antiport function in Mdr1p of Candida albicans.
Kapoor K, Rehan M, Lynn AM, Prasad R. PLoS One. 2010 Jun 10;5(6):e11041. doi: 10.1371/journal.pone.0011041. PMID: 20548793.
ModEnzA: Accurate Identification of Metabolic Enzymes Using Function Specific Profile HMMs with Optimised Discrimination Threshold and Modified Emission Probabilities.
Desai DK, Nandi S, Srivastava PK, Lynn AM. Adv Bioinformatics. 2011;2011:743782. doi: 10.1155/2011/743782. PMID: 21541071.
Evolutionary history of calcium-sensing receptors unveils hyper/hypocalcemia-causing mutations.
Bircan A, Kuru N, Dereli O, Selçuk B, Adebali O. PLoS Comput Biol. 2024 Nov 12;20(11):e1012591. doi: 10.1371/journal.pcbi.1012591. PMID: 39531485.

References

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
2. Pearson WR, Lipman DJ. Improved Tools for Biological Sequence Comparison. Proc Natl Acad Sci U S A. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444.
3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
4. Eddy SR. HMMER: Profile hidden Markov models for biological sequence analysis. 1998. http://hmmer.janelia.org/
5. Wollenberg KR, Atchley WR. Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci U S A. 2000;97:3288–3291. doi: 10.1073/pnas.070154797.
