Bioinform Biol Insights. 2009 Oct 28;3:141-54.

doi: 10.4137/bbi.s3030.

Classifying coding DNA with nucleotide statistics

Nicolas Carels¹, Diego Frías

PMID: 20140062
PMCID: PMC2808172
DOI: 10.4137/bbi.s3030

Nicolas Carels et al. Bioinform Biol Insights. 2009.

Affiliation

¹ Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

Abstract

In this report, we compare the success rate of classifying coding sequences (CDS) vs. introns by the Codon Structure Factor (CSF) and by a method that we call the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification is higher with UFM than with CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method returns coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.

Keywords: ancestral codon; coding features; genomics; open reading frame; purines bias; universal correlation.
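
The position-specific quantities listed in the abstract can be illustrated with a minimal Python sketch. It only computes the raw per-frame statistics named in items (i) to (v); how UFM combines them into a score is not stated here, so no score is computed. The function name `triplet_features` and the dictionary keys are hypothetical, chosen for illustration.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def triplet_features(seq, frame=0):
    """Position-specific nucleotide statistics for one reading frame.

    Hypothetical helper: it returns raw quantities only, not the UFM score.
    """
    seq = seq.upper()
    # Split the sequence into complete triplets starting at `frame`.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        raise ValueError("sequence too short for the requested frame")
    n = len(codons)
    # counts[k][base] = occurrences of `base` at triplet position k + 1
    counts = [Counter(codon[k] for codon in codons) for k in range(3)]

    def p(base, k):
        """Probability of `base` at triplet position k + 1."""
        return counts[k][base] / n

    # Purine (A or G) probability at each of the three positions.
    purine = [p("A", k) + p("G", k) for k in range(3)]
    return {
        "stop_codons": sum(codon in STOP_CODONS for codon in codons),
        "purine_product": purine[0] * purine[1] * purine[2],
        "c1_g2_a3_product": p("C", 0) * p("G", 1) * p("A", 2),
        "g1": p("G", 0),
        "g2": p("G", 1),
        "gc2": 100.0 * (p("G", 1) + p("C", 1)),  # GC level (%) at position 2
        "gc3": 100.0 * (p("G", 2) + p("C", 2)),  # GC level (%) at position 3
    }
```

Calling `triplet_features(orf, frame)` for each of the three frames of both strands would give the per-frame values that a classifier of this kind could compare.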

Figures

**Figure 1.**
Plot of F-score for CDS/intron classification by CSF (black symbols) and UFM (white symbols) in *H. sapiens* (Hs), *D. melanogaster* (Dm) and *A. thaliana* (At).

**Figure 2.**
Classification of coding sequences (bold line) and introns (thin line) of *Homo sapiens* (A,D,G), *Drosophila melanogaster* (B,E,H) and *Arabidopsis thaliana* (C,F,I) at 300 (A,B,C), 400 (D,E,F) and 500 bp (G,H,I) by CSF. The dashed lines (CSF = 75) indicate the classification threshold (τ_CSF). The sample size was 500 in both the introns and coding sequences.

**Figure 3.**
Classification of coding sequences (solid line) and introns (dashed line) of *Homo sapiens* (A,D,G,J), *Drosophila melanogaster* (B,E,H,K) and *Arabidopsis thaliana* (C,F,I,L) at 150 (A,B,C), 200 (D,E,F), 250 (G,H,I) and 300 bp (J,K,L) by UFM. The number on the upper left of each panel indicates the proportion of introns (%) that do not have any ORF with the purine bias of coding sequences for the size threshold considered. The sample size was 1000 in both the introns and coding sequences.
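
The size thresholds in this caption apply to ORFs found in each sequence. As a rough sketch of the size criterion only, the Python below looks for a stop-free stretch of at least a given length in any of the six reading frames. Treating an ORF as a run of consecutive non-stop triplets is an assumption made here, and the purine-bias test mentioned in the caption is not implemented. The names `longest_stop_free_run` and `has_orf` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def longest_stop_free_run(seq, frame):
    """Length (bp) of the longest run of consecutive non-stop triplets."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            current = 0          # a stop codon ends the current run
        else:
            current += 3
            best = max(best, current)
    return best

def has_orf(seq, min_bp):
    """True if any of the six frames holds a stop-free run of >= min_bp."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return any(
        longest_stop_free_run(strand, frame) >= min_bp
        for strand in strands
        for frame in range(3)
    )
```

For example, `has_orf(intron, 250)` would report whether an intron contains any candidate ORF at the 250 bp threshold under this simplified definition.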

**Figure 4.**
Plots of GC3 vs. GC2 in true positive (CDS, panels A,C,E) and false positive (intron, panels B,D,F) ORFs ≥ 250 bp after classification by UFM without filters. The sequence samples of *H. sapiens* (A,B), *D. melanogaster* (C,D) and *A. thaliana* (E,F) are the same as those used for Figure 3 and Table 3. The gray areas mark ORFs with GC2 levels larger than (GC3 + 120)/3 when GC > 60%, which are removed by filter 1. The gray line, y = 7.14x − 241.5, is the *universal correlation*. The black line, y = 3x − 120, matches the left border of the gray zone.
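
The filter 1 condition given in this caption can be written as a one-line predicate. The sketch below assumes that GC, GC2 and GC3 are expressed in percent, as in the figures; the function name `filter1_rejects` and the `gc_min` parameter are hypothetical.

```python
def filter1_rejects(gc, gc2, gc3, gc_min=60.0):
    """True if an ORF falls in the gray zone removed by filter 1.

    All levels are percentages. The condition GC2 > (GC3 + 120) / 3 is
    equivalent to GC3 < 3 * GC2 - 120, i.e. the ORF lies to the right of
    the black border line in the GC3 vs. GC2 plot.
    """
    return gc > gc_min and gc2 > (gc3 + 120.0) / 3.0
```

An ORF with GC = 65%, GC2 = 70% and GC3 = 80% would be rejected under this reading, since (80 + 120)/3 ≈ 66.7 is below 70, whereas the same ORF with GC2 = 50% would be kept.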

**Figure 5.**
Compositional properties of CDSs (bold) and introns (dashed) in *H. sapiens* (A,D,G), *D. melanogaster* (B,E,H) and *A. thaliana* (C,F,I). Panels A,B,C show the relative amount of sequences (%) from Figure 3 and Table 3 classified by GC level (%). Panels D,E,F show the distribution of false positives (intronic ORFs classified as coding) resulting from ORF (≥250 bp) classification by filters 1 + 3. Panels G,H,I show the distribution of false positives (intronic ORFs classified as coding) resulting from ORF (≥250 bp) classification by filters 2 + 3. The numbers on the panels’ upper left indicate the proportion of intron sequences (%) that did not have any ORF with the purine bias of coding sequences for the size threshold considered.

**Figure 6.**
Relationship of GC2 (gray, y axis) and GC3 (black, y axis) vs. GC (x axis) in human CDSs (>600 bp). The solid line (GC3 = 1.5 × GC − 27) indicates the threshold of false positive filtering. This threshold has the same rate of false positive and true positive filtering as the threshold GC3 = 3 × GC2 − 120 (Fig. 4). False positives of coding ORFs would stand on the diagonal of this plot (GC2 ≈ GC3).

Cited by

The Purine Bias of Coding Sequences is Determined by Physicochemical Constraints on Proteins.
Ponce de Leon M, de Miranda AB, Alvarez-Valin F, Carels N. Bioinform Biol Insights. 2014 May 20;8:93-108. doi: 10.4137/BBI.S13161. eCollection 2014. PMID: 24899802.
Common and phylogenetically widespread coding for peptides by bacterial small RNAs.
Friedman RC, Kalkhof S, Doppelt-Azeroual O, Mueller SA, Chovancová M, von Bergen M, Schwikowski B. BMC Genomics. 2017 Jul 21;18(1):553. doi: 10.1186/s12864-017-3932-y. PMID: 28732463.
A Metagenomic Analysis of Bacterial Microbiota in the Digestive Tract of Triatomines.
Carels N, Gumiel M, da Mota FF, de Carvalho Moreira CJ, Azambuja P. Bioinform Biol Insights. 2017 Sep 27;11:1177932217733422. doi: 10.1177/1177932217733422. eCollection 2017. PMID: 28989277.
An Interpretation of the Ancestral Codon from Miller's Amino Acids and Nucleotide Correlations in Modern Coding Sequences.
Carels N, Ponce de Leon M. Bioinform Biol Insights. 2015 Apr 15;9:37-47. doi: 10.4137/BBI.S24021. eCollection 2015. PMID: 25922573.
A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences.
Carels N, Frías D. Bioinform Biol Insights. 2013;7:35-54. doi: 10.4137/BBI.S10053. Epub 2013 Jan 23. PMID: 23400232.
