Nucleic Acids Res. 2010 Aug;38(14):4740-54.

doi: 10.1093/nar/gkq197. Epub 2010 Apr 12.

Unconstrained mining of transcript data reveals increased alternative splicing complexity in the human transcriptome

I G Mollet¹, Claudia Ben-Dov, Daniel Felício-Silva, A R Grosso, Pedro Eleutério, Ruben Alves, Ray Staller, Tito Santos Silva, Maria Carmo-Fonseca

PMID: 20385588
PMCID: PMC2919708
DOI: 10.1093/nar/gkq197

I G Mollet et al. Nucleic Acids Res. 2010 Aug.

Affiliation

¹ Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal. ines.mollet@med.lu.se

Abstract

Mining massive amounts of transcript data for alternative splicing information is paramount to help understand how the maturation of RNA regulates gene expression. We developed an algorithm to cluster transcript data to annotated genes to detect unannotated splice variants. A higher number of alternatively spliced genes and isoforms were found compared to other alternative splicing databases. Comparison of human and mouse data revealed a marked increase, in human, of splice variants incorporating novel exons and retained introns. Previously unannotated exons were validated by tiling array expression data and shown to correspond preferentially to novel first exons. Retained introns were validated by tiling array and deep sequencing data. The majority of retained introns were shorter than 500 nt and had weak polypyrimidine tracts. A subset of retained introns matching small RNAs and displaying a high GC content suggests a possible coordination between splicing regulation and production of noncoding RNAs. Conservation of unannotated exons and retained introns was higher in horse, dog and cow than in rodents, and 64% of exon sequences were only found in primates. This analysis highlights previously bypassed alternative splice variants, which may be crucial to deciphering more complex pathways of gene regulation in human.

Figures

**Figure 1.**
Effect of number of ESTs on estimates of levels of alternative splicing. Data generated from random sets of ESTs at intervals of 0.5 million ESTs, ranging from 0.5 to 2 million for mouse and 0.5 to 4 million for human. (A) Percent of genes with more than one splicing pattern. (B) Average number of exons per gene. (C) Average number of splicing patterns per gene.

**Figure 2.**
Relative amounts of first, internal and terminal exons. (A) Relative amounts of first, internal and terminal exons in known (RefSeq annotated) exons. (B) Relative amounts of first, internal and terminal new (unannotated) exons. (C) Percent of first known and new exons containing transcription start sites (TSSs) within the exons or 200 nt upstream [DBTSS (24), Version 7.0, 15 September 2009]. (D) Percent of terminal exons containing the poly-A signal or 1-nt variants of the consensus AATAAA sequence (23). First exons are those for which no 3′ splice site was detected. Terminal exons are those for which no 5′ splice site was detected. Internal exons have both a 3′ splice site and a 5′ splice site. This analysis includes only exons of at least 25 nt and excludes chimeric transcript products.
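The exon classes and the poly-A signal criterion in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the `Exon` record, its field names and the `"excluded"`/`"unclassified"` labels are hypothetical, and reading "1-nt variants" as a single substitution relative to AATAAA is an assumption.

```python
from dataclasses import dataclass

POLY_A_CONSENSUS = "AATAAA"  # consensus poly-A signal named in the caption
MIN_EXON_LENGTH = 25         # nt, minimum exon length stated in the caption


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record; the field names are illustrative only."""
    sequence: str
    has_3p_splice_site: bool  # acceptor site detected at the upstream boundary
    has_5p_splice_site: bool  # donor site detected at the downstream boundary


def classify_exon(exon: Exon) -> str:
    """Apply the caption's definitions of first, internal and terminal exons."""
    if len(exon.sequence) < MIN_EXON_LENGTH:
        return "excluded"
    if exon.has_3p_splice_site and exon.has_5p_splice_site:
        return "internal"
    if exon.has_5p_splice_site:
        return "first"      # no 3' splice site detected
    if exon.has_3p_splice_site:
        return "terminal"   # no 5' splice site detected
    # Neither site detected: the caption does not cover this case (assumption).
    return "unclassified"


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_poly_a_signal(sequence: str, max_mismatches: int = 1) -> bool:
    """True if any 6-nt window is within max_mismatches substitutions of AATAAA."""
    seq = sequence.upper()
    k = len(POLY_A_CONSENSUS)
    return any(
        hamming_distance(seq[i:i + k], POLY_A_CONSENSUS) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )
```

For example, an exon that has only a 5′ splice site is reported as `"first"`, and a sequence containing the common variant ATTAAA passes `has_poly_a_signal` because it differs from the consensus at one position.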

**Figure 3.**
Tiling array data for a fragment of gene *ZRSR2.* Coordinates for signal and transcribed fragments of tiling array data [GEO accession GSE7576 (25)] of all eight cell lines used in that study for human genome assembly hg17 were lifted to assembly hg18 and matched to ExonMine data (August 2008 update). The graph represents data for the 5′-end of gene *ZRSR2*. Nuclear signal (yellow) and cytoplasmic signal (red) are shown. Exon positions from our ExonMine data (blue) and transcribed fragments from tiling array data are superimposed on the negative axis: cytoplasmic (red), nuclear (yellow), short RNA top strand (green) and short RNA bottom strand (cyan). The figure shows that probe coverage on the tiling array is absent or too low for the Alu-containing unannotated exons 2A, 3A and 3B. For unannotated exon 1A, however, there is a clear nuclear and cytoplasmic signal as well as correspondence to short RNA transcribed fragments in that region. The figure also shows expression that is not detected in ExonMine, including: expression at the 5′-end of the intron downstream of exon 2; several transcribed fragments between exons 2 and 2A, likely to correspond to a gene on the opposite strand for which there is only EST evidence (AA284226); and a transcribed fragment just upstream of exon 3 with a low signal.

**Figure 4.**
Distribution of intron size range and presence of short RNAs. (A) Distribution of the size range of three sets of introns. AIt: total non-retained introns (222 721 introns); RTt: total retained introns with 50% of surface matching tiling array (25) transcribed fragments in the cytoplasm (50CytoTF set, 7381 introns); and RIt: total retained introns not confirmed by tiling array data (8907 introns); together with their corresponding subsets matching short transcribed fragments detected in tiling arrays (AIs: 60 090 introns, RTs: 2663 introns and RIs: 1908 introns). Tiling array data for short RNAs (short transcribed fragments, 22–200 nt) were taken from (25). (B) Percentage of total introns in each set containing short RNAs plotted against Log10 of intron length. % AIs/AIt: non-retained introns; % RTs/RTt: 50CytoTF retained introns; % RIs/RIt: retained introns not detected in tiling array data. Within the size range of retained introns, this plot reveals that the 50CytoTF set of introns carries more short RNAs than non-retained introns. The calculated two-tailed P-value for the difference observed is <0.0001 (see ‘Materials and Methods’ section).
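The percentages plotted in panel B can be sketched as follows. This is an illustration only: the input layout (pairs of intron length and a short-RNA flag), the function names and the bin width of 0.5 log10 units are assumptions, and the statistical test behind the reported P-value is not reproduced here because the caption does not specify it.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def percent(part: int, total: int) -> float:
    """Percentage of `total` represented by `part`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


def percent_with_short_rna_by_log_length(
    introns: Iterable[Tuple[int, bool]], bin_width: float = 0.5
) -> Dict[float, float]:
    """Group introns by log10(length) and report, per bin, the percentage
    that match a short RNA. Keys are the lower edges of the bins.

    `introns` yields (length_nt, matches_short_rna) pairs (hypothetical layout).
    """
    totals: Dict[int, int] = defaultdict(int)
    matches: Dict[int, int] = defaultdict(int)
    for length, has_short_rna in introns:
        if length <= 0:
            continue  # log10 is undefined for non-positive lengths
        bin_index = math.floor(math.log10(length) / bin_width)
        totals[bin_index] += 1
        if has_short_rna:
            matches[bin_index] += 1
    return {
        round(b * bin_width, 10): percent(matches[b], totals[b])
        for b in sorted(totals)
    }
```

Applied to the aggregate counts given in the caption, `percent` gives about 27.0% for non-retained introns (60 090 of 222 721), 36.1% for the 50CytoTF set (2663 of 7381) and 21.4% for retained introns not confirmed by tiling array data (1908 of 8907). These are whole-set figures, not the size-matched comparison shown in the plot.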

**Figure 5.**
GC-content in retained and non-retained introns. GC content in three sets of 2500 introns each. Set Rs: small retained introns validated by tiling array data (25) in the cytoplasm and also matching short transcribed fragments; set Rns: small retained introns not matching short transcribed fragments; and set NR: small non-retained introns. Small introns are <1029 nt as defined in (28). (A) Boxplots of all three sets Rs, Rns and NR. Introns in sets Rns and NR are composed of a random selection of the same number of introns in each quartile and outliers as in set Rs: lower hinge = 30, extreme lower whisker = 102, median = 185, upper hinge = 337, extreme upper whisker = 687, lower extreme of notch = 177.6, upper extreme of notch = 192.4, 153 outliers. (B) Percent GC content in each of the three sets Rs, Rns and NR.
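As a minimal sketch of the quantity compared in panel B, percent GC content can be computed per intron sequence as below. The function names are hypothetical, and skipping ambiguous bases such as N is an assumption rather than something stated in the caption. The threshold of 1029 nt for small introns is taken from the caption.

```python
SMALL_INTRON_MAX_LENGTH = 1029  # nt; small introns are shorter than this


def is_small_intron(sequence: str) -> bool:
    """True if the intron falls under the caption's small-intron threshold."""
    return len(sequence) < SMALL_INTRON_MAX_LENGTH


def gc_percent(sequence: str) -> float:
    """Percent of G and C among unambiguous bases (A, C, G, T, U).

    Ambiguous symbols such as N are skipped (an assumption of this sketch).
    """
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)
```

Averaging `gc_percent` over each of the three sets (Rs, Rns and NR) would give one value per set for comparison.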

**Figure 6.**
Frequency of nucleotide occurrence at splice sites. Logos representing the frequency of occurrence of nucleotides at each position at the 5′ splice site (3 nt upstream and 20 nt downstream) and at the 3′ splice site (30 nt upstream and 3 nt downstream) were produced using WebLogo (39). Uridines are represented by Ts. (A) Set Rs, 2500 small retained introns matching short RNAs, as described in Figure 5. (B) Random set of 2500 introns of all sizes. (C) Set Rns, 2500 small retained introns not matching short RNAs, as described in Figure 5. (D) Set NR, 2500 small non-retained introns, as described in Figure 5.

**Figure 7.**
Conservation of unannotated exons and retained introns. Conservation was estimated using discontiguous megablast (see ‘Materials and Methods’ section) against eight species: chimp, rhesus, mouse, rat, dog, horse, cow and chicken. The bars represent the percentage of exons or introns with a minimum of 70% sequence conservation over a minimum of 80% sequence coverage. (A) Conservation in intron sets Rs, Rns and NR as described in Figure 5, with the same size distribution. (B) Conservation of known (RefSeq annotated) and new (unannotated) exons. The known exon set consists of a selection of 2500 random exons. The unannotated exon set consists of the 9371 previously unannotated exons validated by more than 50% coverage of tiling array transcribed fragments in the cytoplasm. The error associated with the use of random sets of 2500 sequences was estimated at less than ± 1%.
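The conservation criterion in this caption (at least 70% sequence conservation over at least 80% sequence coverage) can be sketched as a simple filter on alignment results. The `Hit` record and its fields are hypothetical, and using a single best hit per query sequence is an assumption, since the caption does not say how multiple alignments were combined.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical best alignment of a query sequence in one species."""
    percent_identity: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def is_conserved(
    hit: Optional[Hit],
    query_length: int,
    min_identity: float = 70.0,
    min_coverage: float = 80.0,
) -> bool:
    """True if the hit meets both thresholds stated in the caption."""
    if hit is None or query_length <= 0:
        return False
    aligned = hit.query_end - hit.query_start + 1
    coverage = 100.0 * aligned / query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def percent_conserved(results: Iterable[Tuple[Optional[Hit], int]]) -> float:
    """Percentage of (hit, query_length) pairs that count as conserved."""
    results = list(results)
    if not results:
        return 0.0
    conserved = sum(is_conserved(hit, length) for hit, length in results)
    return 100.0 * conserved / len(results)
```

Running `percent_conserved` once per species and per sequence set would give values comparable in kind to the bars in the figure. A sequence with no hit in a species is passed as `None` and counts as not conserved.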

Cited by

Intron Retention and Alzheimer's Disease (AD): A Review of Regulation Genes Implicated in AD.
El-Seedy A, Ladevèze V. Genes (Basel). 2025 Jun 30;16(7):782. doi: 10.3390/genes16070782. PMID: 40725435. Review.
ExoLocator--an online view into genetic makeup of vertebrate proteins.
Khoo AA, Ogrizek-Tomas M, Bulovic A, Korpar M, Gürler E, Slijepcevic I, Šikic M, Mihalek I. Nucleic Acids Res. 2014 Jan;42(Database issue):D879-81. doi: 10.1093/nar/gkt1164. Epub 2013 Nov 23. PMID: 24271393.
Dynamic usage of alternative splicing exons during mouse retina development.
Wan J, Masuda T, Hackler L Jr, Torres KM, Merbs SL, Zack DJ, Qian J. Nucleic Acids Res. 2011 Oct;39(18):7920-30. doi: 10.1093/nar/gkr545. Epub 2011 Jun 30. PMID: 21724604.
Evolution of gene structural complexity: an alternative-splicing-based model accounts for intron-containing retrogenes.
Zhang C, Gschwend AR, Ouyang Y, Long M. Plant Physiol. 2014 May;165(1):412-23. doi: 10.1104/pp.113.231696. Epub 2014 Feb 11. PMID: 24520158.
Alternative splicing of clock transcript mediates the response of circadian clocks to temperature changes.
Cai YD, Chow GK, Hidalgo S, Liu X, Jackson KC, Vasquez CD, Gao ZY, Lam VH, Tabuloc CA, Zheng H, Zhao C, Chiu JC. bioRxiv [Preprint]. 2024 May 12:2024.05.10.593646. doi: 10.1101/2024.05.10.593646. Update in: Proc Natl Acad Sci U S A. 2024 Dec 10;121(50):e2410680121. doi: 10.1073/pnas.2410680121. PMID: 38766142. Preprint.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2008;36:D25–D30.
2. Matlin AJ, Clark F, Smith CW. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell. Biol. 2005;6:386–398.
3. Blencowe BJ. Alternative splicing: new insights from global analyses. Cell. 2006;126:37–47.
4. Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 2007;8:749–761.
5. Modrek B, Lee C. A genomic view of alternative splicing. Nat. Genet. 2002;30:13–19.
