Nucleic Acids Res. 2003 Dec 15;31(24):7271-9.

doi: 10.1093/nar/gkg905.

Gene structure prediction in syntenic DNA segments

Jonathan E Moore¹, James A Lake

PMID: 14654703
PMCID: PMC291857
DOI: 10.1093/nar/gkg905

Jonathan E Moore et al. Nucleic Acids Res. 2003.

Affiliation

¹ Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA.

Abstract

The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.

Figures

**Figure 1**
(a) The simple distance function of the 1-D map. Each alignment position is given a value of 0 or 1 depending on whether the nucleotides are matched or mismatched, respectively. (b) The first step in construction of the 5-D map. For each position of the alignment, a joint probability matrix is constructed. These matrices are ordered by corresponding alignment position. Alignment positions gapped in the reference sequence are omitted from the analysis. Positions gapped in the first sequence are omitted only in multiples of three in order to conserve the potential coding frames; the joint probability matrices of any remaining gapped positions are filled with 0s. (c) To construct the 5-D map, each dimension of length four is rearranged into two dimensions of length two.
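
The two maps in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the gap handling in `one_d_map`, the one-hot form of the per-column joint matrix, and the two-bit encoding in `BITS` are all hypothetical choices made for the example.

```python
# Hypothetical two-bit encoding (an assumption, not taken from the paper):
# the first bit separates purines (A, G) from pyrimidines (C, T),
# the second bit separates {A, C} from {G, T}.
BITS = {"A": (0, 0), "C": (1, 0), "G": (0, 1), "T": (1, 1)}


def one_d_map(ref, other):
    """1-D map: 0 for a match, 1 for a mismatch at each alignment column.

    Assumption: columns gapped in either sequence are simply skipped.
    """
    return [
        0 if a == b else 1
        for a, b in zip(ref.upper(), other.upper())
        if a != "-" and b != "-"
    ]


def joint_2x2x2x2(a, b):
    """One alignment column as a 2x2x2x2 nested list.

    Assumption: the 4x4 joint matrix of a column is one-hot, with a single 1
    at (a, b). Splitting each length-4 axis into two length-2 axes moves that
    1 to index [a_bit0][a_bit1][b_bit0][b_bit1]. A column that contains a gap
    is left as all zeros, matching the zero-filling described in the caption.
    """
    cell = [[[[0 for _ in range(2)] for _ in range(2)]
             for _ in range(2)] for _ in range(2)]
    if a in BITS and b in BITS:
        i, j = BITS[a]
        k, l = BITS[b]
        cell[i][j][k][l] = 1
    return cell


print(one_d_map("ACGT-A", "ACTTGA"))  # [0, 0, 1, 0, 0]
```

Stacking the per-column arrays in alignment order supplies the fifth dimension of the 5-D map.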

**Figure 2**
(a) The one-sided power spectrum resulting from the 1-D map of the alignment between the human and mouse CD4 regions. Note the non-zero floor of the trace stemming from the noise in the data, and the two signal peaks near frequencies of 0/(bp) and 1/(3 bp) corresponding to alternating long conserved and unconserved elements and to the codon triplets of coding regions, respectively. The peak at 1/(3 bp) and the region around it are magnified in the inset.

(b) The power spectrum from the 5-D map of the same alignment. The left and right ends of each of the 16 1-D segments correspond to frequencies of 0/(bp) and 1/(2 bp), respectively. The arrows show the frequency 1/(3 bp). Each gray line indicates a spectral density of 0 for the four traces immediately above it. The abbreviations are as follows: R = (A or G), Y = (C or T), S = (G or C), W = (A or T), K = (A or C), M = (G or T). Subscripts indicate the first (human) or second (mouse) sequence. The plot can be divided into four conceptual regions: the trace in the bottom left corner, the remaining traces in the left column, the remaining traces in the bottom row, and the other nine traces.

The bottom left corner trace tells us only about the distribution of gaps in the alignment and nothing about the sequences’ compositions or comparative relationship. Unsurprisingly, there is little high-frequency information in this trace, indicating that most gaps are relatively long.

The remaining traces in the left column tell us only about the composition of the mouse sequence, and nothing about the human sequence or their comparative relationship; if the mouse sequence were aligned to any sequence, these three traces would be the same. The all₁·RY₂ trace describes how the mouse purines and pyrimidines are distributed relative to random. The large low-frequency peak indicates there are long relatively purine-rich regions and long relatively pyrimidine-rich regions. The peak at 1/(2 bp) shows the tendency of a purine to be followed by a pyrimidine, and vice versa. Purine–pyrimidine patterns of length three generate the triplet peak. Finally, note the general upward slope of the remainder, showing that once large-scale purine–pyrimidine composition effects are taken into account, a DNA segment tends to be more mixed than one would expect at random. The all₁·SW₂ trace describes how Gs and Cs are distributed relative to As and Ts. Note that this trace has the same peaks as the all₁·RY₂ trace, but now the remainder slopes downward, indicating that once the effects from the peaks are accounted for, Gs and Cs tend to be more clustered than one would expect at random. The all₁·KM₂ trace describes how the remaining pair of pairs, AC and GT, are distributed.

The remaining traces in the bottom row are identical to those in the left column except that these describe the composition of the human sequence.

The other nine traces tell us how the sequences relate to one another. For example, the RY₁·RY₂ trace tells us about the distribution of purine–pyrimidine conservation. It has a low-frequency peak indicating that there are long regions where purines and pyrimidines are more conserved and long regions where they are less conserved. Purine–pyrimidine conservation patterns of length three, which come largely from the coding regions, create the triplet peak. Finally, the very flat remainder indicates that all other perceived purine–pyrimidine conservation patterns stem from randomness or are a very small effect. One can interpret the other eight traces in a similar fashion.
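
To make the triplet peak in panel (a) concrete, the sketch below computes a one-sided power spectrum of a toy 1-D map with a plain discrete Fourier transform. It is an illustration only: the function name, the normalization by N, the mean subtraction, and the synthetic input are assumptions for the example, and the real CD4 alignment is not used.

```python
import cmath


def one_sided_power_spectrum(signal):
    """Return |DFT|^2 / N at frequencies k/N for k = 0 .. N//2 (naive O(N^2) DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    # Assumption: subtract the mean so the k = 0 term does not dominate.
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Toy 1-D map: a mismatch at every third position, mimicking fast-evolving
# third codon positions in a coding region (synthetic data, not from the paper).
toy_map = [0, 0, 1] * 20
power = one_sided_power_spectrum(toy_map)
peak_k = max(range(len(power)), key=power.__getitem__)
peak_frequency = peak_k / len(toy_map)  # cycles per bp
print(peak_k, peak_frequency)
```

For this perfectly periodic input, all of the power sits at the frequency 1/(3 bp); a real alignment adds the noise floor and the low-frequency peak described in the caption.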

**Figure 3**
Accuracy measures comparing pattern filtering with GENSCAN for each of the sequences. Sn/nuc, sensitivity per nucleotide; Sp/nuc, specificity per nucleotide; Sn/exon, sensitivity per exon; Sp/exon, specificity per exon. For explanations of these measures, see Table 1. Note the better performance in all statistics for pattern filtering, in particular that the sensitivities and specificities per nucleotide are very close to 1.0 and also that the fraction of wrong exons is 0.
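
Table 1 is not reproduced here, so the sketch below assumes the definitions conventionally used in gene-prediction benchmarks: sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP), an exon counts as correct only when both boundaries match, and a wrong exon is a predicted exon that overlaps no annotated exon. The function names and the toy intervals are hypothetical.

```python
def positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end)}


def nucleotide_accuracy(actual, predicted):
    """Per-nucleotide sensitivity and specificity from two sets of positions."""
    tp = len(actual & predicted)
    sn = tp / len(actual) if actual else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


def exon_accuracy(actual, predicted):
    """Per-exon sensitivity, specificity and fraction of wrong exons."""
    exact = len(set(actual) & set(predicted))
    sn = exact / len(actual) if actual else 0.0
    sp = exact / len(predicted) if predicted else 0.0
    wrong = sum(
        1
        for p_start, p_end in predicted
        if not any(p_start < a_end and a_start < p_end
                   for a_start, a_end in actual)
    )
    wrong_fraction = wrong / len(predicted) if predicted else 0.0
    return sn, sp, wrong_fraction


# Toy annotation and prediction (made-up coordinates).
annotated = [(10, 20), (30, 40)]
predicted = [(10, 20), (32, 40), (50, 55)]
print(nucleotide_accuracy(positions(annotated), positions(predicted)))
print(exon_accuracy(annotated, predicted))
```

In the toy data, the second predicted exon has a shifted start, so it counts toward nucleotide-level accuracy but not exon-level accuracy, and the third predicted exon is a wrong exon.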

**Figure 4**
(a) A 1 kb region with a confirmed sequencing error, as it would be seen in the GeneGrabber viewer. The horizontal axis represents the position in the mouse sequence (19). The vertical axis of the graph represents the relative filtered distance averaged across a three-nucleotide window (black trace), the relative difference between the filtered distances and these averages (the trace that alternates red, green and blue), and the filtered hexanucleotide bias (teal trace). Note that the black trace results primarily from the low-frequency peaks, while the multicolored trace stems primarily from the triplet peaks. Below the plot is a symbolic diagram of the two preferred peptide translations, where stop codons are indicated by breaks in the continuity; the preferred frames are determined by assuming that the fastest evolving position of a putative codon is the third. Below this diagram is another indicating potential splice sites (triangles), and start and stop codons (Ts) with the ones above the gray line in the human sequence and the ones below in the mouse. The left–right mirror symmetry of the symbols is designed so that sites that could delimit a coding region will point toward one another, e.g. the putative 3′ splice sites in the forward direction point right, and their putative 5′ counterparts point left. In order to simplify the plot, sites in the reverse direction are not shown.

(b) A view of the same 1 kb region, with the sequencing error corrected. Note how the crossing that originally occurred at approximately sequence position 39 590 has now disappeared, making that region resemble a typical, though long, exon.

(c) The same 1 kb region plus an adjacent 3 kb. The additional 3 kb is shown to provide examples of what typical exons look like. The bars at the top of the plot indicate the coding regions as annotated by the following techniques: purple, cDNA by Ansari-Lari *et al*. (19); green, pattern filtering and GeneGrabber; red, GENSCAN using the human sequence; and blue, GENSCAN using the mouse sequence. All annotations of this gene continue to the right; only the GENSCAN annotation using the mouse sequence continues to the left, including a segment not in the cDNA used for annotation by Ansari-Lari *et al*. (19).
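
The black and multicolored traces described in panel (a) amount to a three-nucleotide moving average and the residual around it, and the preferred frame follows from which codon position appears to evolve fastest. The sketch below is a rough illustration with hypothetical function names; the filtered distances themselves come from pattern filtering, which is not reproduced here, and the edge handling and the toy input are assumptions.

```python
def window_average(distances):
    """Centered three-nucleotide moving average.

    Assumption: at the two ends, only the values that exist are averaged.
    """
    out = []
    for i in range(len(distances)):
        window = distances[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def residuals(distances):
    """Difference between each distance and its local three-nucleotide average."""
    return [d - a for d, a in zip(distances, window_average(distances))]


def third_position_phase(distances):
    """Offset (0, 1 or 2) whose positions have the largest mean residual.

    Under the caption's assumption that the fastest-evolving codon position
    is the third, this offset marks the putative third codon positions.
    Requires at least three values.
    """
    res = residuals(distances)
    means = []
    for phase in range(3):
        values = res[phase::3]
        means.append(sum(values) / len(values))
    return max(range(3), key=means.__getitem__)


# Toy filtered distances: every third position diverges more (made-up values).
toy = [0.1, 0.1, 0.9] * 5
print(third_position_phase(toy))  # 2
```

In the viewer, the residual trace is what alternates red, green and blue by position, so a consistent ordering of the three colors across a region is the visual counterpart of a stable phase.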

**Figure 5**
Views of a 1 kb region containing exons 9 and 10 of the gene PTPN6. The axes and plots are as in Figure 4. The left plot shows the filtered distances resulting from the human and mouse sequences using the 5-D map; the right plot shows the same distances but from the human, rat and mouse sequences using the 7-D map. Note how the characteristic splitting pattern is substandard for the first exon in the left plot. As shown in the right plot, the addition of the third sequence rectifies this situation.

References

1. Stormo G. (2000) Gene-finding approaches for eukaryotes. Genome Res., 10, 394–397.
2. Claverie J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
3. Fickett J.W. (1996) Finding genes by computer: the state of the art. Trends Genet., 12, 316–320.
4. Haussler D. (1998) Computational genefinding. Trends Guide Bioinformatics (Suppl.), 12–15.
5. Burge C.B. and Karlin S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.
