BMC Bioinformatics. 2012 May 14:13:97.
doi: 10.1186/1471-2105-13-97.

Exploration of multivariate analysis in microbial coding sequence modeling

Tahir Mehmood et al. BMC Bioinformatics. 2012.

Abstract

Background: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.

Results: The multivariate CPPLS approach classified coding sequences substantially better than the commonly used IMM on the same set of sequences. We also found that CPPLS with codon representation gave significantly better classification results than IMM with either protein (p < 0.001) or DNA (p < 0.001) representation. Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than that of IMM (p < 0.001).

Conclusions: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
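The codon-frequency representation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `codon_frequencies`, the alphabetical codon ordering and the handling of ambiguous bases are all assumptions made for the example.

```python
from itertools import product

# All 64 codons in a fixed order, so every ORF maps to a vector of the same length.
# (The ordering is an assumption; any fixed ordering would do.)
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(orf):
    """Return the relative codon frequencies of an in-frame ORF as 64 floats."""
    orf = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Read non-overlapping triplets from position 0. A trailing partial codon
    # is never reached, and triplets with ambiguous bases (e.g. N) are skipped.
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]
```

Each ORF then becomes one row of a predictor matrix with 64 columns, which is the kind of input a multivariate classifier such as CPPLS operates on.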

Figures

Figure 1
Visualization of an undirected graph. The clusters of highly conserved ORFs are shown for a very small subset taken from Acinetobacter baumannii. Nodes represent genes, and identical colors indicate genes from the same genome. The numbers are identifiers within each genome. First, clusters with fewer than 3 genes are discarded, i.e. red:6-yellow:7. Next, the medoid genes of the remaining clusters form the set of Positives.
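The two selection steps in the Figure 1 caption (drop small clusters, then keep each cluster's medoid) can be sketched as below. The distance function is not specified in this abstract, so `dist` is a hypothetical caller-supplied function, and `medoid`, `select_positives` and `min_size` are illustrative names rather than anything taken from the paper.

```python
def medoid(members, dist):
    """Return the member with the smallest total distance to all other members."""
    return min(members, key=lambda g: sum(dist(g, h) for h in members))


def select_positives(clusters, dist, min_size=3):
    """Discard clusters with fewer than min_size genes, then take each medoid."""
    return [medoid(c, dist) for c in clusters if len(c) >= min_size]


def hamming(a, b):
    """Toy distance for the example only: mismatches between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# A two-gene cluster is dropped; the three-gene cluster contributes its medoid.
clusters = [["ACGT", "ACGA"], ["AAAA", "AAAT", "TTTT"]]
positives = select_positives(clusters, hamming)
```

In this toy input the second cluster's medoid is "AAAT", since its summed distance to the other two members (1 + 3) is the smallest of the three.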
Figure 2
Performance on test data. The box-and-whisker plots show the distributions of performance (% correctly classified) on test data for each species, using IMM (upper panels) or CPPLS (lower panels) on ORFs represented as codon, protein or DNA sequences. The dotted red line indicates the maximum possible performance (100%). For most of the species, the performance of CPPLS on codon sequences is 100%.
Figure 3
Sensitivity and specificity. The distributions of sensitivity and specificity for each species, obtained with IMM and CPPLS on codon sequences only. Sensitivity is defined as the ability to detect Positives and specificity as the ability to detect Negatives; both are given in percent.
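The two measures in the Figure 3 caption can be computed as in the following sketch. It uses the standard definitions (true positives over all Positives, true negatives over all Negatives) expressed in percent; the function name and the 1/0 label encoding are assumptions made for the example.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) in percent for binary labels (1 = Positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # Positives detected
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # Positives missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # Negatives detected
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # Negatives missed
    sens = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    spec = 100.0 * tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Three Positives (two detected) and two Negatives (one detected).
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```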
Figure 4
IMM and CPPLS scores. For Sulfolobus islandicus, the densities of the IMM scores and CPPLS scores are plotted. For each test sequence, the IMM score is computed as the difference between the Positive log-probability and the Negative log-probability, and the CPPLS scores are simply the fitted values.
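The score described in the Figure 4 caption, a difference of two log-probabilities, can be illustrated with the sketch below. Note the substitution: for brevity it uses a fixed-order Markov chain with add-one smoothing in place of a true interpolated Markov model, so it shows only the scoring step, not the IMM itself. All function names, the model order and the smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, alphabet="ACGT"):
    """Count context -> next-symbol transitions (fixed order, NOT interpolated)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {"order": order, "alphabet": alphabet, "counts": counts}


def log_prob(model, seq):
    """Log-probability of seq under the model, with add-one smoothing."""
    k, alphabet, counts = model["order"], model["alphabet"], model["counts"]
    lp = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})  # unseen context -> uniform after smoothing
        total = sum(ctx.values())
        lp += math.log((ctx.get(seq[i], 0) + 1) / (total + len(alphabet)))
    return lp


def score(pos_model, neg_model, seq):
    """Positive log-probability minus Negative log-probability."""
    return log_prob(pos_model, seq) - log_prob(neg_model, seq)


# Toy training sets, for illustration only.
pos_model = train_markov(["ATGATGATGATG"], order=1)
neg_model = train_markov(["AAAAAAAAAAAA"], order=1)
```

A score above zero means the sequence is more probable under the Positive model than under the Negative model, and a score below zero means the reverse.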
