Nucleic Acids Res. 2003 Dec 15;31(24):7271-9.
doi: 10.1093/nar/gkg905.

Gene structure prediction in syntenic DNA segments

Jonathan E Moore et al. Nucleic Acids Res.

Abstract

The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.

Figures

Figure 1
(a) The simple distance function of the 1-D map. Each alignment position is given a value of 0 or 1 depending on whether the nucleotides are matched or mismatched, respectively. (b) The first step in construction of the 5-D map. For each position of the alignment, a joint probability matrix is constructed. These matrices are ordered by corresponding alignment position. Alignment positions gapped in the reference sequence are omitted from the analysis. Positions gapped in the first sequence are omitted only in multiples of three in order to conserve the potential coding frames; the joint probability matrices of any remaining gapped positions are filled with 0s. (c) To construct the 5-D map, each dimension of length four is rearranged into two dimensions of length two.
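The 1-D map in panel (a) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `one_d_map` is hypothetical, dropping columns gapped in the reference is borrowed from the rule stated for panel (b), and scoring a gap in the other sequence as a mismatch is an assumption the caption does not confirm.

```python
def one_d_map(ref: str, other: str) -> list[int]:
    """Map an aligned pair to 0 (match) or 1 (mismatch) per alignment column.

    Assumptions (not stated for the 1-D map in the caption):
    - columns where the reference has a gap ('-') are dropped;
    - a gap in the other sequence counts as a mismatch.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    values = []
    for a, b in zip(ref.upper(), other.upper()):
        if a == "-":
            continue  # column gapped in the reference: omitted
        values.append(0 if a == b else 1)
    return values


if __name__ == "__main__":
    print(one_d_map("ACG-T", "ACCAT"))
```

The resulting list of 0s and 1s is the kind of one-dimensional signal whose power spectrum is described in Figure 2(a).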
Figure 2
(a) The one-sided power spectrum resulting from the 1-D map of the alignment between the human and mouse CD4 regions. Note the non-zero floor of the trace stemming from the noise in the data, and the two signal peaks near frequencies of 0/(bp) and 1/(3 bp) corresponding to alternating long conserved and unconserved elements and to the codon triplets of coding regions, respectively. The peak at 1/(3 bp) and the region around it are magnified in the inset. (b) The power spectrum from the 5-D map of the same alignment. The left and right ends of each of the 16 1-D segments correspond to frequencies of 0/(bp) and 1/(2 bp), respectively. The arrows show the frequency 1/(3 bp). Each gray line indicates a spectral density of 0 for the four traces immediately above it. The abbreviations are as follows: R = (A or G), Y = (C or T), S = (G or C), W = (A or T), K = (G or T), M = (A or C). Subscripts indicate the first (human) or second (mouse) sequence. The plot can be divided into four conceptual regions: the trace in the bottom left corner, the remaining traces in the left column, the remaining traces in the bottom row, and the other nine traces. The bottom left corner trace tells us only about the distribution of gaps in the alignment and nothing about the sequences’ compositions or comparative relationship. Unsurprisingly, there is little high-frequency information in this trace, indicating that most gaps are relatively long. The remaining traces in the left column tell us only about the composition of the mouse sequence, and nothing about the human sequence or their comparative relationship; if the mouse sequence were aligned to any sequence, these three traces would be the same. The all1·RY2 trace describes how the mouse purines and pyrimidines are distributed relative to random. The large low-frequency peak indicates there are long relatively purine-rich regions and long relatively pyrimidine-rich regions.
The peak at 1/(2 bp) shows the tendency of a purine to be followed by a pyrimidine, and vice versa. Purine–pyrimidine patterns of length three generate the triplet peak. Finally, note the general upward slope of the remainder, showing that once large-scale purine–pyrimidine composition effects are taken into account, a DNA segment tends to be more mixed than one would expect at random. The all1·SW2 trace describes how Gs and Cs are distributed relative to As and Ts. Note that this trace has the same peaks as the all1·RY2 trace, but now the remainder slopes downward, indicating that once the effects from the peaks are accounted for, Gs and Cs tend to be more clustered than one would expect at random. The all1·KM2 trace describes how the remaining pair of pairs, AC and GT, are distributed. The remaining traces in the bottom row are identical to those in the left column except that these describe the composition of the human sequence. The other nine traces tell us how the sequences relate to one another. For example, the RY1·RY2 trace tells us about the distribution of purine–pyrimidine conservation. It has a low-frequency peak indicating that there are long regions where purines and pyrimidines are more conserved and long regions where they are less conserved. Purine–pyrimidine conservation patterns of length three, which come largely from the coding regions, create the triplet peak. Finally, the very flat remainder indicates that all other perceived purine–pyrimidine conservation patterns stem from randomness or are a very small effect. One can interpret the other eight traces in a similar fashion.
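To illustrate how a triplet peak such as the one in panel (a) arises, the sketch below computes a one-sided power spectrum of a 0/1 signal with a direct discrete Fourier transform. It is only an illustration under assumptions: the function name `one_sided_power_spectrum` and the normalisation by signal length are hypothetical choices, and the toy input (a mismatch at every third position) stands in for a coding region rather than reproducing the CD4 data.

```python
import cmath


def one_sided_power_spectrum(signal: list[float]) -> list[float]:
    """Return |DFT|^2 / n for frequency indices k = 0 .. n // 2.

    Index k corresponds to a frequency of k / n per base pair, so the
    last index is (close to) 1/(2 bp). Dividing by n is an assumed
    normalisation chosen for this sketch.
    """
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


if __name__ == "__main__":
    # Toy signal: positions 1 and 2 of each codon match, position 3 differs.
    toy = [0, 0, 1] * 20
    power = one_sided_power_spectrum(toy)
    peak = max(range(1, len(power)), key=power.__getitem__)
    print(f"strongest non-zero frequency: {peak}/{len(toy)} per bp")
```

For a signal of length 60, frequency index 20 corresponds to 1/(3 bp), which is where the toy signal concentrates its power apart from the zero-frequency term.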
Figure 3
Accuracy measures comparing pattern filtering with GENSCAN for each of the sequences. Sn/nuc, sensitivity per nucleotide; Sp/nuc, specificity per nucleotide; Sn/exon, sensitivity per exon; Sp/exon, specificity per exon. For explanations of these measures, see Table 1. Note the better performance in all statistics for pattern filtering, in particular that the sensitivities and specificities per nucleotide are very close to 1.0 and also that the fraction of wrong exons is 0.
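As a rough illustration of the per-nucleotide measures, the sketch below uses the definitions conventional in gene-prediction benchmarks: sensitivity is the fraction of true coding nucleotides that were predicted, and specificity is the fraction of predicted coding nucleotides that are truly coding. Table 1 is not reproduced here, so these definitions are an assumption, and the function name `nucleotide_accuracy` is hypothetical.

```python
def nucleotide_accuracy(
    actual: set[int], predicted: set[int]
) -> tuple[float, float]:
    """Return (Sn/nuc, Sp/nuc) for sets of coding nucleotide positions.

    Assumed definitions (conventional in gene-finder evaluation):
      Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    An empty denominator yields 0.0 by choice in this sketch.
    """
    true_positives = len(actual & predicted)
    sensitivity = true_positives / len(actual) if actual else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    annotated_exon = set(range(10, 20))  # positions 10..19 are coding
    predicted_exon = set(range(12, 20))  # prediction misses the first two
    print(nucleotide_accuracy(annotated_exon, predicted_exon))
```

Under these definitions, a prediction that covers only part of an exon but nothing outside it loses sensitivity while keeping a specificity of 1.0.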
Figure 4
(a) A 1 kb region with a confirmed sequencing error, as it would be seen in the GeneGrabber viewer. The horizontal axis represents the position in the mouse sequence (19). The vertical axis of the graph represents the relative filtered distance averaged across a three-nucleotide window (black trace), the relative difference between the filtered distances and these averages (the trace that alternates red, green and blue), and the filtered hexanucleotide bias (teal trace). Note that the black trace results primarily from the low-frequency peaks, while the multicolored trace stems primarily from the triplet peaks. Below the plot is a symbolic diagram of the two preferred peptide translations, where stop codons are indicated by breaks in the continuity; the preferred frames are determined by assuming that the fastest evolving position of a putative codon is the third. Below this diagram is another indicating potential splice sites (triangles), and start and stop codons (Ts) with the ones above the gray line in the human sequence and the ones below in the mouse. The left–right mirror symmetry of the symbols is designed so that sites that could delimit a coding region will point toward one another, e.g. the putative 3′ splice sites in the forward direction point right, and their putative 5′ counterparts point left. In order to simplify the plot, sites in the reverse direction are not shown. (b) A view of the same 1 kb region, with the sequencing error corrected. Note how the crossing that originally occurred at approximately sequence position 39 590 has now disappeared, making that region resemble a typical, though long, exon. (c) The same 1 kb region plus an adjacent 3 kb. The additional 3 kb is shown to provide examples of what typical exons look like. The bars at the top of the plot indicate the coding regions as annotated by the following techniques: purple, cDNA by Ansari-Lari et al. (19); green, pattern filtering and GeneGrabber; red, GENSCAN using the human sequence; and blue, GENSCAN using the mouse sequence. All annotations of this gene continue to the right; only the GENSCAN annotation using the mouse sequence continues to the left, including a segment not in the cDNA used for annotation by Ansari-Lari et al. (19).
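The black and multicolored traces described in panel (a), and the rule that the fastest evolving codon position is taken to be the third, can be sketched as follows. This is an illustrative guess at the arithmetic, not GeneGrabber's code: the function names are hypothetical, the window is assumed to be centered and truncated at the sequence ends, and the residual is a plain difference because the caption does not say how the "relative" difference is scaled.

```python
def window_average(values: list[float], width: int = 3) -> list[float]:
    """Centered moving average; the window is truncated at both ends (assumed)."""
    half = width // 2
    averaged = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged


def residuals(values: list[float], width: int = 3) -> list[float]:
    """Difference between each value and its window average (unscaled)."""
    return [v - a for v, a in zip(values, window_average(values, width))]


def fastest_codon_position(values: list[float]) -> int:
    """Return the offset (0, 1 or 2) whose positions have the largest mean value.

    Under the caption's assumption, this offset marks the third codon position.
    """
    totals, counts = [0.0] * 3, [0] * 3
    for i, v in enumerate(values):
        totals[i % 3] += v
        counts[i % 3] += 1
    means = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    return max(range(3), key=means.__getitem__)


if __name__ == "__main__":
    # Toy filtered distances: every third position evolves fastest.
    distances = [0.1, 0.1, 0.9] * 4
    print(window_average(distances)[:3])
    print(residuals(distances)[:3])
    print(fastest_codon_position(distances))
```

In the toy input the residual is positive at every third position and negative elsewhere, which is the alternating pattern the caption attributes to the triplet peaks.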
Figure 5
Views of a 1 kb region containing exons 9 and 10 of the gene PTPN6. The axes and plots are as in Figure 4. The left plot shows the filtered distances resulting from the human and mouse sequences using the 5-D map; the right plot shows the same distances but from the human, rat and mouse sequences using the 7-D map. Note how the characteristic splitting pattern is substandard for the first exon in the left plot. As shown in the right plot, the addition of the third sequence rectifies this situation.

References

    1. Stormo G. (2000) Gene-finding approaches for eukaryotes. Genome Res., 10, 394–397.
    2. Claverie J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
    3. Fickett J.W. (1996) Finding genes by computer: the state of the art. Trends Genet., 12, 316–320.
    4. Haussler D. (1998) Computational genefinding. Trends Guide Bioinformatics (Suppl.), 12–15.
    5. Burge C.B. and Karlin S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.
