Nucleic Acids Res. 2010 Aug;38(14):4740-54.
doi: 10.1093/nar/gkq197. Epub 2010 Apr 12.

Unconstrained mining of transcript data reveals increased alternative splicing complexity in the human transcriptome

I G Mollet et al. Nucleic Acids Res. 2010 Aug.

Abstract

Mining massive amounts of transcript data for alternative splicing information is paramount to help understand how the maturation of RNA regulates gene expression. We developed an algorithm to cluster transcript data to annotated genes to detect unannotated splice variants. A higher number of alternatively spliced genes and isoforms were found compared to other alternative splicing databases. Comparison of human and mouse data revealed a marked increase, in human, of splice variants incorporating novel exons and retained introns. Previously unannotated exons were validated by tiling array expression data and shown to correspond preferentially to novel first exons. Retained introns were validated by tiling array and deep sequencing data. The majority of retained introns were shorter than 500 nt and had weak polypyrimidine tracts. A subset of retained introns matching small RNAs and displaying a high GC content suggests a possible coordination between splicing regulation and production of noncoding RNAs. Conservation of unannotated exons and retained introns was higher in horse, dog and cow than in rodents, and 64% of exon sequences were only found in primates. This analysis highlights previously bypassed alternative splice variants, which may be crucial to deciphering more complex pathways of gene regulation in human.

Figures

Figure 1.
Effect of number of ESTs on estimates of levels of alternative splicing. Data generated from random sets of ESTs at intervals of 0.5 million ESTs, ranging from 0.5 to 2 million for mouse and 0.5 to 4 million for human. (A) Percent of genes with more than one splicing pattern. (B) Average number of exons per gene. (C) Average number of splicing patterns per gene.
Figure 2.
Relative amounts of first, internal and terminal exons. (A) Relative amounts of first, internal and terminal exons in known (RefSeq annotated) exons. (B) Relative amounts of first, internal and terminal new (unannotated) exons. (C) Percent of first known and new exons containing transcription start sites (TSSs) within the exons or 200 nt upstream [DBTSS (24), Version: 7.0, 15 September 2009]. (D) Percent of terminal exons containing the poly-A signal or 1-nt variants of the consensus AATAAA sequence (23). First exons are those for which no 3′ splice site was detected. Terminal exons are those for which no 5′ splice site was detected. Internal exons have both a 3′ splice site and a 5′ splice site. This analysis includes only exons of at least 25 nt and excludes chimeric transcript products.
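The exon classes and the poly-A criterion in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the boolean splice-site flags and the handling of exons with neither splice site (returned here as "unclassified") are assumptions made for illustration. Only the three class definitions and the "AATAAA or 1-nt variant" rule come from the caption.

```python
POLYA_CONSENSUS = "AATAAA"  # consensus poly-A signal named in the caption


def classify_exon(has_3ss: bool, has_5ss: bool) -> str:
    """Classify an exon from the splice sites detected at its ends.

    Caption rules: first = no 3' splice site detected, terminal = no 5'
    splice site detected, internal = both detected. An exon with neither
    site is not covered by the caption; "unclassified" is an assumption.
    """
    if has_3ss and has_5ss:
        return "internal"
    if has_5ss:
        return "first"
    if has_3ss:
        return "terminal"
    return "unclassified"


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_polya_signal(seq: str, max_mismatches: int = 1) -> bool:
    """True if any 6-nt window is AATAAA or differs from it by at most 1 nt."""
    seq = seq.upper()
    k = len(POLYA_CONSENSUS)
    return any(
        mismatches(seq[i:i + k], POLYA_CONSENSUS) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )
```

For example, `has_polya_signal("GGATTAAAGG")` is true because the window ATTAAA is a 1-nt variant of the consensus.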
Figure 3.
Tiling array data for a fragment of gene ZRSR2. Coordinates for signal and transcribed fragments of tiling array data [GEO accession GSE7576 (25)] of all eight cell lines used in that study for human genome assembly hg17 were lifted to assembly hg18 and matched to ExonMine data (August 2008 update). The graph represents data for the 5′-end of gene ZRSR2. Nuclear signal (yellow) and cytoplasmic signal (red) are shown. Exon positions from our ExonMine data (blue) and transcribed fragments from tiling array data are superimposed on the negative axis: cytoplasmic (red), nuclear (yellow), short RNA top strand (green) and short RNA bottom strand (cyan). The figure shows that probe coverage on the tiling array is absent or too low for the Alu-containing unannotated exons 2A, 3A and 3B. For unannotated exon 1A, however, there is a clear nuclear and cytoplasmic signal as well as correspondence to short RNA transcribed fragments in that region. The figure also shows expression that is not detected in ExonMine, including: on the 5′-end of the intron downstream of exon 2; several transcribed fragments between exons 2 and 2A likely to correspond to a gene on the opposite strand for which there is only EST evidence (AA284226); and a transcribed fragment just upstream of exon 3 with a low signal.
Figure 4.
Distribution of intron size range and presence of short RNAs. (A) The plot represents the distribution of size range of three sets of introns. AIt: total non-retained introns (222 721 introns); RTt: total retained introns with 50% of surface matching tiling array (25) transcribed fragments in the cytoplasm (50CytoTF set, 7381 introns); and RIt: total retained introns not confirmed by tiling array data (8907 introns); and their corresponding subsets matching short transcribed fragments detected in tiling arrays (AIs: 60 090 introns, RTs: 2663 introns and RIs: 1908 introns). Tiling array data for short RNAs (short transcribed fragments, 22–200 nt) were taken from (25). (B) Percentage of total introns in each set containing short RNAs plotted against log10 of intron length. % AIs/AIt: non-retained introns; % RTs/RTt: 50CytoTF retained introns; % RIs/RIt: retained introns not detected in tiling array data. Within the size range of retained introns, this plot reveals that the 50CytoTF set of introns carries more short RNAs than non-retained introns. The calculated two-tailed P-value for the observed difference is <0.0001 (see ‘Materials and Methods’ section).
Figure 5.
GC-content in retained and non-retained introns. GC content in three sets of 2500 introns each. Set Rs: small retained introns validated by tiling array data (25) in the cytoplasm and also matching short transcribed fragments; set Rns: small retained introns not matching short transcribed fragments; and set NR: small non-retained introns. Small introns are <1029 nt as defined in (28). (A) Boxplots of all three sets Rs, Rns and NR. Introns in sets Rns and NR are composed of a random selection of the same number of introns in each quartile and outliers as in set Rs: lower hinge = 30, extreme lower whisker = 102, median = 185, upper hinge = 337, extreme upper whisker = 687, lower extreme of notch = 177.6, upper extreme of notch = 192.4, 153 outliers. (B) Percent GC content in each of the three sets Rs, Rns and NR.
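As a rough illustration of the quantity plotted in panel B, percent GC content of an intron sequence can be computed as below. This is a sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, and ignoring ambiguous bases such as N is a choice made here. The <1029 nt cutoff for "small" introns is the one quoted in the caption.

```python
SMALL_INTRON_MAX_NT = 1029  # caption: small introns are <1029 nt


def gc_percent(seq: str) -> float:
    """Percent of G and C among unambiguous bases (A, C, G, T) in seq.

    Ambiguous symbols such as N are ignored; that handling is an
    assumption, since the caption does not say how they were treated.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(b in "GC" for b in bases)
    return 100.0 * gc / len(bases)


def is_small_intron(length_nt: int) -> bool:
    """True if the intron length falls under the 'small intron' cutoff."""
    return length_nt < SMALL_INTRON_MAX_NT
```

Applying `gc_percent` to every intron in sets Rs, Rns and NR would give per-set distributions comparable in kind to those summarised in the figure.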
Figure 6.
Frequency of nucleotide occurrence at splice sites. Logos representing the frequency of occurrence of nucleotides at each position at the 5′ splice site (3 nt upstream and 20 nt downstream) and at the 3′ splice site (30 nt upstream and 3 nt downstream) were produced using WebLogo (39). Uridines are represented by Ts. (A) Set Rs, 2500 small retained introns matching short RNAs, as described in Figure 5. (B) Random set of 2500 introns of all sizes. (C) Set Rns, 2500 small retained introns not matching short RNAs, as described in Figure 5. (D) Set NR, 2500 small non-retained introns, as described in Figure 5.
Figure 7.
Conservation of unannotated exons and retained introns. Conservation estimated using discontiguous megablast (see ‘Materials and Methods’ section) against eight species: chimp, rhesus, mouse, rat, dog, horse, cow and chicken. The bars represent the percentage of exons or introns with a minimum of 70% sequence conservation over a minimum of 80% sequence coverage. (A) Conservation in intron sets Rs, Rns and NR as described in Figure 5, with the same size distribution. (B) Conservation of known (RefSeq annotated) and new (unannotated) exons. The known exon set consists of a selection of 2500 random exons. The unannotated exon set consists of the 9371 previously unannotated exons validated by more than 50% coverage of tiling array transcribed fragments in the cytoplasm. The error associated with the use of random sets of 2500 sequences was estimated at less than ± 1%.
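The bar heights in this figure follow from a simple threshold rule, which the sketch below makes concrete. It assumes each query sequence has already been reduced to one best hit summarised as (percent identity, aligned length, query length); how the authors summarised their megablast output is not stated in the caption, so that representation and the function names are hypothetical. Only the 70% and 80% cutoffs come from the caption.

```python
from typing import Iterable, Tuple

MIN_IDENTITY_PCT = 70.0  # caption: minimum 70% sequence conservation
MIN_COVERAGE_PCT = 80.0  # caption: over a minimum of 80% sequence coverage

# (percent identity, aligned length in nt, query length in nt); assumed format
Hit = Tuple[float, int, int]


def is_conserved(identity_pct: float, aligned_len: int, query_len: int) -> bool:
    """True if a hit meets both the identity and the coverage cutoff."""
    coverage_pct = 100.0 * aligned_len / query_len
    return identity_pct >= MIN_IDENTITY_PCT and coverage_pct >= MIN_COVERAGE_PCT


def percent_conserved(hits: Iterable[Hit]) -> float:
    """Percentage of sequences whose best hit passes both cutoffs."""
    hits = list(hits)
    if not hits:
        raise ValueError("no sequences supplied")
    passed = sum(is_conserved(*h) for h in hits)
    return 100.0 * passed / len(hits)
```

A sequence with no hit in a given species would need to be counted in the denominator as well, for instance by passing it with an aligned length of 0.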

References

    1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2008;36:D25–D30.
    2. Matlin AJ, Clark F, Smith CW. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell. Biol. 2005;6:386–398.
    3. Blencowe BJ. Alternative splicing: new insights from global analyses. Cell. 2006;126:37–47.
    4. Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 2007;8:749–761.
    5. Modrek B, Lee C. A genomic view of alternative splicing. Nat. Genet. 2002;30:13–19.