Comparative Study
Genome Res. 2003 Dec;13(12):2637-50.
doi: 10.1101/gr.1679003.

Sequence information for the splicing of human pre-mRNA identified by support vector machine classification

Xiang H-F Zhang et al. Genome Res. 2003 Dec.

Abstract

Vertebrate pre-mRNA transcripts contain many sequences that resemble splice sites on the basis of agreement to the consensus, yet these more numerous false splice sites are usually completely ignored by the cellular splicing machinery. Even at the level of exon definition, pseudo exons defined by such false splice sites outnumber real exons by an order of magnitude. We used a support vector machine (SVM) to discover sequence information that could be used to distinguish real exons from pseudo exons. This machine learning tool led to the definition of potential branch points, an extended polypyrimidine tract, and C-rich and TG-rich motifs in a region limited to 50 nt upstream of constitutively spliced exons. C-rich sequences were also found in a region extending to 80 nt downstream of exons, along with G-triplet motifs. In addition, combinations of three bases within the splice donor consensus sequence were shown to be more effective than consensus values in distinguishing real from pseudo splice sites; two-way base combinations were optimal for distinguishing 3' splice sites. These data also suggest that interactions between two or more of these elements may contribute to exon recognition, and they provide candidate sequences for assessment as intronic splicing enhancers.
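To make the idea of "combinations of three bases" concrete, the following is a minimal sketch of how such features could be enumerated for a single candidate donor site. The abstract does not describe the authors' actual encoding, so the 9-nt window, the function name `three_way_features`, and the feature naming scheme are all assumptions made for illustration.

```python
from itertools import combinations


def three_way_features(site: str) -> set[str]:
    """Enumerate every (position, base) triple present in one candidate site.

    Each feature names three positions and the bases found there, so a
    classifier can weight base co-occurrences rather than single positions.
    The naming scheme is hypothetical, not taken from the paper.
    """
    features = set()
    for i, j, k in combinations(range(len(site)), 3):
        features.add(f"p{i}{site[i]}_p{j}{site[j]}_p{k}{site[k]}")
    return features


# Assumed 9-nt donor window (3 exonic + 6 intronic bases); illustrative only.
donor = "CAGGTAAGT"
feats = three_way_features(donor)
print(len(feats))  # one feature per choice of 3 positions out of 9
```

In a setup like this, each feature would become one binary input dimension for a linear classifier such as an SVM, and the learned weight of a dimension indicates how strongly that base combination favors real over pseudo sites.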

Figures

Figure 1
Three-way combinations of bases within the splice donor site. (A,B) False positive rate as a function of sensitivity in discriminating real and pseudo exon donor sites. Different sensitivities were obtained by systematically varying the threshold. Classification scores were from SVM (heavy black lines); multiple dependence decomposition (MDD, light black lines); consensus values (CV) calculated according to Shapiro and Senapathy (heavy gray lines); and consensus values calculated by the log likelihood method (LLH, light gray lines). (A) The data set contained all of the real exons. Pseudo exons were defined as containing a simple GT as a potential donor site (no consensus value filter). (B) The data set contained all of the real exons. Pseudo exons were defined as having consensus values of at least 78. (C) Three-way combinations weighted most highly by SVM in distinguishing real from pseudo exons. The training set consisted of approximately 3400 real exons and 3200 pseudo exons, all of which exhibited donor site consensus values of at least 78. Positive and negative weights are listed separately, in descending order of absolute weight. Asterisks denote agreement with the consensus. These 64 combinations allow SVM to perform at 92% of the accuracy achieved with the full set.
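The curves in panels A and B plot false positive rate against sensitivity as the score threshold is varied. A minimal sketch of that calculation is shown here; the function name `fpr_vs_sensitivity` and the toy scores are hypothetical, and the convention that a site is called real when its score is at or above the threshold is an assumption.

```python
def fpr_vs_sensitivity(real_scores, pseudo_scores):
    """Sweep a threshold over the scores of real sites.

    Returns a list of (sensitivity, false_positive_rate) pairs, one per
    distinct threshold, ordered from the strictest threshold to the loosest.
    A site is called "real" when its score is >= the threshold (assumed).
    """
    points = []
    for threshold in sorted(set(real_scores), reverse=True):
        sensitivity = sum(s >= threshold for s in real_scores) / len(real_scores)
        fpr = sum(s >= threshold for s in pseudo_scores) / len(pseudo_scores)
        points.append((sensitivity, fpr))
    return points


# Toy classifier scores, invented purely for illustration.
real = [0.9, 0.8, 0.7, 0.4]
pseudo = [0.85, 0.5, 0.3, 0.1]
points = fpr_vs_sensitivity(real, pseudo)
for sensitivity, fpr in points:
    print(f"sensitivity={sensitivity:.2f}  false positive rate={fpr:.2f}")
```

Running the same sweep on scores from different methods (for example SVM versus a consensus value) gives curves that can be compared at matched sensitivity.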
Figure 2
Distribution around exons of pentamer weights assigned by SVM. The top 256 pentamers were divided into four groups according to their origins (downstream or upstream) and the signs of their weights (positive or negative). For each group, the SVM weights assigned to each pentamer were summed for pentamers that started in nonoverlapping windows of 5 nt on either side of 15,000 real exons, 24,000 pseudo exons, and 12,000 repeat-free pseudo exons. Values for upstream pentamers only are shown on the left and for downstream pentamers only on the right (i.e., values derived from exclusively upstream pentamers are not plotted on the downstream side and vice versa). Top: positively weighted pentamers; bottom: negatively weighted pentamers.
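The windowed summation described in this caption can be sketched for a single flanking sequence as follows. The function name `windowed_weight_sums`, the toy flank, and the toy weights are hypothetical; the caption's totals would correspond to adding such per-flank lists across all exons in a set, which this sketch leaves out.

```python
def windowed_weight_sums(flank: str, weights: dict[str, float],
                         window: int = 5, k: int = 5) -> list[float]:
    """Sum k-mer weights by the nonoverlapping window in which each k-mer starts.

    `weights` maps a k-mer to its classifier weight; k-mers absent from the
    mapping contribute 0. Only complete windows are reported.
    """
    n_windows = len(flank) // window
    sums = [0.0] * n_windows
    for start in range(len(flank) - k + 1):
        w = start // window  # index of the window containing the start position
        if w < n_windows:
            sums[w] += weights.get(flank[start:start + k], 0.0)
    return sums


# Toy flank and weights, invented purely for illustration.
flank = "TTTTTCCCCCGGGGG"
toy_weights = {"TTTTT": 1.0, "TTTTC": 0.25, "CCCCC": 0.5, "GGGGG": -2.0}
sums = windowed_weight_sums(flank, toy_weights)
print(sums)
```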
Figure 3
Grouping and distribution of the top positively scoring flanking pentamers. A subset of 121 positively weighted pentamers that contributed most to the ability of SVM to distinguish real from pseudo exons was grouped according to similarity in the positional distribution of their prevalence around exons, as measured by a z-score (see text). Z-scores with an absolute value greater than 2 have a P-value of less than 0.05. Values were summed for pentamers starting in windows of 5 nt, beginning just upstream of the acceptor site (-15) and just downstream of the donor site (+7); an exception is panel H, in which upstream windows up to the exon (-1) are shown. Light gray lines represent individual pentamers listed on the right; the heavy dark line is the average. The red line shows the average for the distribution of these pentamers around pseudo exons. Pentamers in each flank were treated separately for extraction from SVM and for clustering. However, their prevalence is shown both upstream and downstream of the exons regardless of their origin.
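The stated correspondence between a z-score beyond 2 and a P-value below 0.05 can be checked with a short calculation. This sketch assumes a two-sided test against a standard normal distribution, which the caption does not state explicitly; the function name `two_sided_p` is hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided tail probability of a standard normal at |z| (assumed null)."""
    return math.erfc(abs(z) / math.sqrt(2))


p_at_2 = two_sided_p(2.0)
print(f"P(|Z| > 2) = {p_at_2:.4f}")
```

Under that assumption the probability at |z| = 2 is a little under 0.05, consistent with the caption.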
Figure 4
Grouping and distribution of the top negatively scoring flanking pentamers. A subset of 140 pentamers that contributed most with a negative weight to the ability of SVM to distinguish real from pseudo exons was grouped according to similarity in the positional distribution of their prevalence around exons, as measured by a z-score (see text). Z-scores with an absolute value greater than 2 have a P-value of less than 0.05. (A,C,D) Light gray lines represent individual pentamers listed to the right; the heavy dark line is the average. The red line shows the average for the distribution of these pentamers around pseudo exons; the blue line shows this average for repeat-free pseudo exons. Pentamers in each flank were treated separately for extraction from SVM and for clustering. However, their prevalence is shown both upstream and downstream of the exons regardless of their origin. (D) Distribution of the acceptor splice consensus sequence CAGG and related tetramers.
