HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors

Yuan Zhang¹, Yanni Sun

PMID: 21609463
PMCID: PMC3115854
DOI: 10.1186/1471-2105-12-198

Yuan Zhang et al. BMC Bioinformatics. 2011 May 24;12:198.

Affiliation

¹ Computer Science and Engineering Department, Michigan State University, East Lansing, USA.

Abstract

Background: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors.

Results: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families.
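The idea of letting an alignment absorb a frameshift can be illustrated with a much simpler dynamic program than the augmented Viterbi algorithm named above. The sketch below is not the HMM-FRAME algorithm: it aligns a DNA read against a single known peptide rather than a profile HMM, and the function name `align_with_frameshifts` and the scores `MATCH`, `MISMATCH`, and `FRAMESHIFT` are assumptions chosen only for illustration. Each amino acid may consume three bases (a normal codon), two bases (one base was deleted from the read), or four bases (one base was inserted into the read).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Illustrative scores (assumptions for this sketch, not HMM-FRAME parameters).
MATCH, MISMATCH, FRAMESHIFT = 2, -1, -4


def codon_score(codon, aa):
    """Score one codon against one amino acid."""
    return MATCH if CODON.get(codon) == aa else MISMATCH


def align_with_frameshifts(dna, peptide):
    """Globally align a DNA read to a peptide, tolerating single-base indels.

    Returns (best_score, events), where events lists (read_position, kind)
    for every 2-base ("deletion") or 4-base ("insertion") step, or None if
    the read cannot be aligned to the whole peptide.
    """
    n, m = len(dna), len(peptide)
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    step = [[0] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            aa = peptide[j - 1]
            # Two bases for one residue: a base is missing from the read.
            options = [(score[i - 2][j - 1] + FRAMESHIFT, 2)]
            if i >= 3:
                # Ordinary codon.
                options.append(
                    (score[i - 3][j - 1] + codon_score(dna[i - 3:i], aa), 3))
            if i >= 4:
                # Four bases for one residue: drop each base in turn, keep the best.
                quad = dna[i - 4:i]
                best = max(codon_score(quad[:k] + quad[k + 1:], aa)
                           for k in range(4))
                options.append((score[i - 4][j - 1] + FRAMESHIFT + best, 4))
            score[i][j], step[i][j] = max(options)
    if score[n][m] == neg_inf:
        return None
    # Trace back from the end to recover where the frameshifts were placed.
    events, i = [], n
    for j in range(m, 0, -1):
        s = step[i][j]
        if s != 3:
            events.append((i - s, "insertion" if s == 4 else "deletion"))
        i -= s
    return score[n][m], events[::-1]


print(align_with_frameshifts("ATGGCTAAAGGTGAACTG", "MAKGEL"))   # error-free read
print(align_with_frameshifts("ATGGCTAAAAGGTGAACTG", "MAKGEL"))  # one extra A
print(align_with_frameshifts("ATGGCTAAGGTGAACTG", "MAKGEL"))    # one missing A
```

When the extra or missing base lies inside a homopolymer run, several placements of the frameshift score equally, so the reported position is only one of the tied choices.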

Conclusions: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.

Figures

**Figure 1**
**Frameshifts cause short alignments with marginal scores**. *X_i* is the *i*th base of a DNA sequence. Every codon is underscored. A second symbol (rendered as an image in the source and not recoverable here) denotes the *j*th amino acid of the peptide sequence derived under reading frame *i*. The correct peptide sequence can be derived from the error-free sequence (shown at the top of the figure) under reading frame 1. Because two nucleotides have been inserted (bolded X and Y), the correct peptide sequence is the concatenation of three short peptide sequences, each derived under a different reading frame. Thus, a peptide sequence derived under any single reading frame can generate only short alignments with insignificant scores.
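The effect described in this caption can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the 18-base sequence and the `translate` helper are invented for illustration and are not taken from the paper, and a single inserted base is used rather than the two insertions drawn in the figure.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate dna in forward reading frame 1, 2, or 3; '*' marks a stop codon."""
    start = frame - 1
    codons = (dna[k:k + 3] for k in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Hypothetical error-free fragment and the same fragment with one extra 'A'
# appended to its AAA homopolymer run.
clean = "ATGGCTAAAGGTGAACTG"
shifted = clean[:9] + "A" + clean[9:]

print(translate(clean, 1))    # MAKGEL
print(translate(shifted, 1))  # MAKR*T
print(translate(shifted, 2))  # WLKGEL
```

Frame 1 of the error-containing read preserves only the prefix `MAK`, and frame 2 preserves only the suffix `KGEL`, so neither single-frame translation covers the full peptide.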

**Figure 2**
**Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences**. HMMER 3.0 alignments of sequences before and after error correction by HMM-FRAME. The changes of alignments are presented for 256 sequences in which HMM-FRAME detects errors. "Original" refers to HMMER 3.0 alignments on sequences before error correction. "Corrected" refers to HMMER 3.0 alignments on sequences after error correction by HMM-FRAME. As a comparison, we also plot the length of the original sequence reads (with the legend "sequence read"). They largely overlap with the length of corrected alignments, indicating that complete sequence reads can be aligned with the nifH profile HMM after error correction.

**Figure 3**
**Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample**. The data set was sequenced from bacterial aromatic dioxygenase genes in a soil sample. All alignments were generated by HMMER 3.0 for a fair comparison. "Original" refers to HMMER 3.0 alignments on sequences before error correction. "Corrected" refers to HMMER 3.0 alignments on sequences after error correction by HMM-FRAME.

**Figure 4**
**Protein domain classification results for the black sample in the deep mine data set**. The sets of sequences that can be classified by HMM-FRAME, HMMER, and FragGeneScan+HMMER are denoted A, B, and C, respectively. *|A|* = 17,496. *|B|* = 13,544. *|C|* = 12,328. *|B − C|* = 2,224. *|C − B|* = 1,008. *|C − A|* = 4. *|A − (B ∪ C)|* = 2,948.
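The remaining regions of this three-set comparison follow from the reported counts by simple set arithmetic. The sketch below assumes that "X-Y" in the caption denotes set difference and "B+C" denotes the union of B and C; the variable names are illustrative.

```python
# Counts reported in the Figure 4 caption.
size_a, size_b, size_c = 17496, 13544, 12328
b_minus_c, c_minus_b = 2224, 1008
a_only = 2948  # |A - (B u C)|, assuming "B+C" means the union

# |B n C| can be derived two ways; both must agree.
b_and_c = size_b - b_minus_c
assert b_and_c == size_c - c_minus_b

b_or_c = size_b + size_c - b_and_c      # |B u C|
a_and_b_or_c = size_a - a_only          # |A n (B u C)|
missed_by_a = b_or_c - a_and_b_or_c     # |(B u C) - A|
classified_by_any = size_a + missed_by_a

print(b_and_c, b_or_c, a_and_b_or_c, missed_by_a, classified_by_any)
# 11320 14552 14548 4 17500
```

Under these assumptions, the derived value of 4 for sequences found by B or C but not by A matches the reported *|C − A|* = 4.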
