Proc Natl Acad Sci U S A. 2001 Sep 25;98(20):11193-8.
doi: 10.1073/pnas.201407298.

A computational analysis of sequence features involved in recognition of short introns

L P Lim et al. Proc Natl Acad Sci U S A. 2001.

Abstract

Splicing of short introns by the nuclear pre-mRNA splicing machinery is thought to proceed via an "intron definition" mechanism, in which the 5' and 3' splice sites (5'ss, 3'ss, respectively) are initially recognized and paired across the intron. Here, we describe a computational analysis of sequence features involved in recognition of short introns by using available transcript data from five eukaryotes with complete or nearly complete genomic sequences. The information content of five different transcript features was measured by using methods from information theory, and Monte Carlo simulations were used to determine the amount of information required for accurate recognition of short introns in each organism. We conclude: (i) that short introns in Drosophila melanogaster and Caenorhabditis elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs that simulate splicing specificity can predict the exact boundaries of approximately 95% of short introns in both organisms; (ii) that in yeast, the 5'ss, branch signal, and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron; and (iii) that the 5'ss, branch signal, and 3'ss are not sufficient to accurately identify short introns in plant and human transcripts, but that specific subsets of candidate intronic enhancer motifs can be identified in both human and Arabidopsis that contribute dramatically to the accuracy of splicing simulators.

Figures

Figure 1
Intron length distributions. Histograms of the lengths of introns from each organism are plotted, using a log scale for the abscissa. Each histogram was fitted as a mixture of two lognormal distributions by using the R statistical package (curved lines). The position of the point of intersection of these distributions is indicated. S. ce., S. cerevisiae; C. el., C. elegans; D. me., D. melanogaster; A. th., A. thaliana; H. sa., Homo sapiens.
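The intersection point mentioned in this caption can be illustrated with a minimal sketch. It assumes the two lognormal components have already been fitted (the fitting itself, done in the original work with a statistical package, is not reproduced here), and it solves for the lengths at which the two weighted densities are equal. The function names and all parameter values are hypothetical and are not taken from the paper.

```python
import math

def weighted_lognormal_pdf(x, w, mu, sigma):
    """Density at x of one mixture component: weight w times lognormal(mu, sigma)."""
    z = (math.log(x) - mu) / sigma
    return w * math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def mixture_crossings(w1, mu1, sigma1, w2, mu2, sigma2):
    """Return the lengths x at which the two weighted components have equal density.

    With y = ln(x), equating the log-densities gives a*y^2 + b*y + c = 0
    (the 1/x factor is common to both components and cancels).
    """
    a = 1.0 / (2.0 * sigma2 ** 2) - 1.0 / (2.0 * sigma1 ** 2)
    b = mu1 / sigma1 ** 2 - mu2 / sigma2 ** 2
    c = (mu2 ** 2 / (2.0 * sigma2 ** 2) - mu1 ** 2 / (2.0 * sigma1 ** 2)
         + math.log(w1 / sigma1) - math.log(w2 / sigma2))
    if abs(a) < 1e-12:
        # Equal spreads: the equation is linear in y.
        if abs(b) < 1e-12:
            return []
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            return []  # the curves never cross
        r = math.sqrt(disc)
        roots = [(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)]
    return sorted(math.exp(y) for y in roots)

# Hypothetical parameters (natural-log scale), chosen only for illustration.
crossings = mixture_crossings(0.6, 4.0, 0.2, 0.4, 7.0, 1.0)
```

When the two components have different spreads there can be two crossings; the one lying between the two component modes is the natural candidate for a boundary between "short" and "long" introns, though how the original analysis selected it is not stated in this caption.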
Figure 2
Splice signal motifs. Sequence motifs for the 5′ss (A), branch site (B), and 3′ss (C) are displayed by using the pictogram program (http://genes.mit.edu/pictogram.html). The height of each letter is proportional to the frequency of the corresponding base at the given position, and bases are listed in descending order of frequency from top to bottom. The RelEnt (in bits) of the motif model used in our analyses (I1M or WMM) relative to the background transcript base composition is also shown. The splice junctions and branch point are marked by inverted triangles.
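As an illustration of the RelEnt quantity in this caption, the following sketch computes the relative entropy (in bits) of a weight matrix model against a background base composition, summing p·log2(p/q) over bases and positions. It covers only the position-independent (WMM) case; the function name, the toy motif, and the uniform background are assumptions for illustration, not values from the paper.

```python
import math

def wmm_relative_entropy(columns, background):
    """Relative entropy, in bits, of a weight matrix model versus a background.

    columns: list of dicts mapping base -> frequency at that motif position.
    background: dict mapping base -> background frequency.
    Positions are treated as independent, so per-position values add up.
    """
    total = 0.0
    for column in columns:
        for base, p in column.items():
            if p > 0.0:  # a zero-frequency base contributes nothing
                total += p * math.log2(p / background[base])
    return total

# Toy example: a hypothetical 3-position motif against a uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_motif = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},      # invariant position: 2 bits
    {"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.5},      # two equally likely bases: 1 bit
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # matches background: 0 bits
]
toy_bits = wmm_relative_entropy(toy_motif, uniform)
```

An invariant position against a uniform background contributes the maximum of 2 bits, which is why strongly conserved positions dominate a motif's total.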
Figure 3
Monte Carlo estimation of information required for short intron recognition. EAc of prediction of short introns by pairscan in randomized transcripts is plotted versus the sum of the RelEnts of the splice signal motifs used. Dotted gray line indicates 98% EAc. Each curve is the best-fit from 130 simulations. Brackets indicate 1 SD above and below the best-fit curve for three chosen RelEnt values. Solid circles represent EAc for intronscan in real transcripts versus the sum of the RelEnts of the transcript features used.
Figure 4
Relative contributions of five transcript features to intron detection. The area of each wedge represents the relative contribution to intron detection accuracy of the corresponding transcript feature, calculated as described in Methods. The sizes of the wedges are scaled so that the complete circle represents the RelEnt per intron required to achieve 98% detection accuracy in each organism, derived from Fig. 3.
Figure 5
Contribution of subsets of pentamers to intron prediction. Exact prediction accuracies are shown for intronscan by using the 5′ss and 3′ss signals and specialized intron composition models that score particular subsets of pentamers (see the supporting information) as a function of the number of pentamers used. Circles represent accuracy calculated by using 0, 10, 20, 40, 60, and 100 pentamers, with pentamers chosen in order from high values of f log(f/g) to low, where f and g are the pentamer frequency in introns and exons, respectively, using a protocol that avoids choosing overlapping pentamers (see the supporting information). (A) Drosophila, (B) Arabidopsis, (C) human. (D) The first ten intron-biased pentamers chosen from each organism. The dashed black line represents average accuracy for 25 random orderings of pentamers. The solid gray line represents accuracy by using all 1,024 pentamers; dashed gray lines are described in the text.
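The pentamer ordering described in this caption can be sketched as follows: count pentamers in intron and exon sequences, convert the counts to frequencies f and g, and sort by f·log(f/g). This is a simplified illustration under stated assumptions. The pseudocount, the choice of log base 2, and the function names are hypothetical, and the protocol for avoiding overlapping pentamers (described only in the supporting information) is not implemented.

```python
import math
from collections import Counter
from itertools import product

def kmer_frequencies(seqs, k=5, pseudocount=1.0):
    """Frequency of every possible k-mer over A/C/G/T in a set of sequences.

    The pseudocount is an assumption added here so that no frequency is zero.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
                counts[kmer] += 1
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {km: (counts[km] + pseudocount) / total for km in all_kmers}

def rank_intron_biased_kmers(introns, exons, k=5, pseudocount=1.0):
    """Return (k-mer, score) pairs sorted from high f*log2(f/g) to low."""
    f = kmer_frequencies(introns, k, pseudocount)
    g = kmer_frequencies(exons, k, pseudocount)
    scores = {km: f[km] * math.log2(f[km] / g[km]) for km in f}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy sequences, invented for illustration only.
toy_introns = ["TTTTTTTTTT"] * 5
toy_exons = ["GCGCGCGCGC"] * 5
ranked = rank_intron_biased_kmers(toy_introns, toy_exons)
top_kmer = ranked[0][0]
```

Weighting the log-ratio by f favors pentamers that are both enriched in introns and common there, so a rare pentamer with a large ratio does not outrank a frequent one with a moderate ratio.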
