Genome Biol. 2016 Dec 30;17(1):266.
doi: 10.1186/s13059-016-1118-6.

Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive


Abhinav Nellore et al. Genome Biol.

Abstract

Background: Gene annotations, such as those in GENCODE, are derived primarily from alignments of spliced cDNA sequences and protein sequences. The impact of RNA-seq data on annotation has been confined to major projects like ENCODE and Illumina Body Map 2.0.

Results: We aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. We found 56,861 junctions (18.6%) in at least 1000 samples that were not annotated, and their expression associated with tissue type. Junctions well expressed in individual samples tended to be annotated. Newer samples contributed few novel well-supported junctions, with the vast majority of detected junctions present in samples before 2013. We compiled junction data into a resource called intropolis available at http://intropolis.rail.bio. We used this resource to search for a recently validated isoform of the ALK gene and characterized the potential functional implications of unannotated junctions with publicly available TRAP-seq data.

Conclusions: Considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. The rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote. Now is perhaps an appropriate time to update incomplete annotations to include splicing present in the now-stable snapshot provided by the SRA.

Keywords: Intron; RNA-seq; Splicing.


Figures

Fig. 1
Displayed is the number of exon-exon junctions J found by Rail-RNA and other alignment protocols in at least S of the 1720 brain and universal human reference RNA-seq samples also studied by the SEQC/MAQC-III consortium [11] (i.e., SEQC). “2 aligners” (red), “3 aligners” (green), and “4 aligners” (orange) refer to junctions we found with Rail-RNA that were also found by, respectively, 1, 2, and 3 of the alignment protocols used by SEQC.
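The quantity plotted in Fig. 1 (and again in Fig. 2a) is the number of junctions J seen in at least S samples. A minimal Python sketch of that count follows. It assumes a hypothetical in-memory mapping from sample IDs to sets of junction keys; the function name and data layout are illustrative and are not taken from Rail-RNA or intropolis.

```python
from collections import Counter


def junctions_in_at_least(sample_junctions, thresholds):
    """Return {S: J}, where J is the number of junctions found in >= S samples.

    sample_junctions: dict mapping a sample ID to the set of junctions
    detected in that sample (hypothetical layout, for illustration only).
    thresholds: iterable of sample-count thresholds S.
    """
    # Count, for every junction, how many samples it appears in.
    samples_per_junction = Counter()
    for junctions in sample_junctions.values():
        samples_per_junction.update(set(junctions))

    # For each threshold S, count junctions whose sample count is >= S.
    return {
        s: sum(1 for n in samples_per_junction.values() if n >= s)
        for s in thresholds
    }


# Toy data: three samples and three distinct junctions.
toy_samples = {
    "s1": {("chr1", 100, 200), ("chr1", 300, 400)},
    "s2": {("chr1", 100, 200)},
    "s3": {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 5, 9)},
}
toy_curve = junctions_in_at_least(toy_samples, [1, 2, 3])
```

Because the curve is cumulative, J can only decrease or stay flat as S grows.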
Fig. 2
a shows how many exon-exon junctions J are found in at least S samples of the 21,504 human RNA-seq samples on the SRA aligned here. It also shows how much evidence for these junctions is found in gene annotation: “fully annotated” (orange) means the junction is in an annotated transcript, “exon skip” (green) means a called junction’s donor and acceptor sites are annotated in distinct junctions, “alternative donor/acceptor” (red) means only one of a called junction’s donor and acceptor sites is in a junction from annotation, and “novel” (blue) means neither donor nor acceptor site is annotated. b and c restrict attention to the 10,311 samples for which 100,000 junctions are discovered in each. b refers to overlaps, where an overlap is any instance where a read maps across a junction.
Fig. 3
Displayed is the number of exon-exon junctions J found in at least P projects of the 929 human RNA-seq projects on the SRA considered in this paper. It also shows how much evidence for these junctions is found in gene annotation: “fully annotated” (orange) means the junction is in an annotated transcript, “exon skip” (green) means a called junction’s donor and acceptor sites are annotated in distinct junctions, “alternative donor/acceptor” (red) means only one of a called junction’s donor and acceptor sites is in a junction from annotation, and “novel” (blue) means neither donor nor acceptor site is annotated
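The four annotation categories used in Figs. 2 and 3 can be read as a simple decision rule on a junction's donor and acceptor sites. The Python sketch below is one way to express that rule. It assumes junctions are represented as hypothetical (chromosome, strand, donor, acceptor) tuples; the function names and the representation are illustrative, not the authors' implementation.

```python
def build_site_index(annotated_junctions):
    """Collect the donor and acceptor sites that occur in any annotated junction."""
    donors = {(chrom, strand, donor)
              for chrom, strand, donor, _ in annotated_junctions}
    acceptors = {(chrom, strand, acceptor)
                 for chrom, strand, _, acceptor in annotated_junctions}
    return donors, acceptors


def classify_junction(junction, annotated_junctions, donors, acceptors):
    """Assign a called junction to one of the four categories in the captions."""
    if junction in annotated_junctions:
        # The exact donor-acceptor pairing appears in annotation.
        return "fully annotated"
    chrom, strand, donor, acceptor = junction
    donor_known = (chrom, strand, donor) in donors
    acceptor_known = (chrom, strand, acceptor) in acceptors
    if donor_known and acceptor_known:
        # Both sites are annotated, but only in distinct junctions.
        return "exon skip"
    if donor_known or acceptor_known:
        # Exactly one of the two sites is annotated.
        return "alternative donor/acceptor"
    return "novel"


# Toy annotation with two junctions on the same strand.
toy_annotation = {("chr1", "+", 100, 200), ("chr1", "+", 300, 400)}
toy_donors, toy_acceptors = build_site_index(toy_annotation)
toy_label = classify_junction(("chr1", "+", 100, 400),
                              toy_annotation, toy_donors, toy_acceptors)
```

In the toy data, the junction from 100 to 400 pairs an annotated donor with an annotated acceptor from a different junction, so it falls in the “exon skip” category.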
Fig. 4
Displayed is the first principal component (PC1) vs. the second principal component (PC2) for a principal component analysis (PCA) with a coverage data matrix where rows are junctions and columns are samples. (See Methods for technical details.) Each point corresponds to a distinct sample. Gray points are unlabeled samples, red points are blood samples, magenta points are lymphoblastoid cell line samples, and cyan points are brain samples. GEUVADIS (GEU) is a sizable cluster of magenta points. The ABRF and SEQC consortia each sequenced mixtures of universal human reference RNA (UHRR) and human brain reference RNA (HBRR) in four sample ratios UHRR:HBRR that form distinct clusters in the shaded regions: 0:1 (green), 1:3 (blue), 3:1 (brown), and 1:0 (yellow)
Fig. 5
The 3,211,228 junctions found in at least 20 reads across samples are accumulated by their “discovery dates.” Here, discovery date of a junction is taken to be the earliest submission date to the BioSample database from among the samples in which the junction was found. 96.1% of the junctions were discovered before 1 January 2013, although only 34.7% of samples depicted in the figure had been submitted by then, and afterwards discovery levels off. Demanding higher levels of confidence (the red, green, and orange curves) gives rise to earlier asymptotes. Ranked from 1 to 5 are the dominant contributing projects from dates on which the most junctions are discovered. “Che” refers to a study of 41 Coriell cell lines by Cheung et al. [21], “Pic” refers to a study of 69 LCLs by Pickrell et al. [20], “UWE” refers to the University of Washington Human Reference Epigenome Mapping Project [17], “BM2” refers to Illumina Body Map 2.0 [6], and “ENC” refers to ENCODE [19]. “GEU” refers to GEUVADIS [37], whose 464 LCLs uncovered few junctions that had not already been discovered
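The “discovery date” defined in the Fig. 5 caption is a per-junction minimum over sample submission dates. The Python sketch below illustrates that definition under stated assumptions: the two input mappings are hypothetical, and the filter to junctions found in at least 20 reads is assumed to have been applied beforehand.

```python
from datetime import date


def discovery_dates(sample_dates, sample_junctions):
    """Map each junction to the earliest submission date among its samples.

    sample_dates: dict of sample ID -> submission date (hypothetical input).
    sample_junctions: dict of sample ID -> set of junctions found in it.
    """
    discovered = {}
    for sample, junctions in sample_junctions.items():
        submitted = sample_dates[sample]
        for junction in junctions:
            # Keep the earliest date seen so far for this junction.
            if junction not in discovered or submitted < discovered[junction]:
                discovered[junction] = submitted
    return discovered


def fraction_discovered_before(discovered, cutoff):
    """Fraction of junctions whose discovery date is strictly before cutoff."""
    if not discovered:
        return 0.0
    early = sum(1 for d in discovered.values() if d < cutoff)
    return early / len(discovered)


# Toy data: two samples submitted in different years, sharing one junction.
toy_dates = {"a": date(2011, 5, 1), "b": date(2014, 2, 1)}
toy_junctions = {"a": {"j1", "j2"}, "b": {"j2", "j3"}}
toy_discovered = discovery_dates(toy_dates, toy_junctions)
toy_fraction = fraction_discovered_before(toy_discovered, date(2013, 1, 1))
```

With the toy data, two of three junctions are first seen in the 2011 sample, so the fraction discovered before 1 January 2013 is 2/3.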
Fig. 6
Displayed is a summary of the evolution of junctions from the GENCODE annotation of hg19 through its 18 releases compared to the evolution of confidently called junctions across the SRA. Every junction considered here is “confidently called,” that is, found in at least 20 reads across the SRA samples we analyzed. a shows that most junctions (80.0%) annotated by GENCODE first appeared in the first release. b shows that junctions in GENCODE tend to have early discovery dates. This is also evident from c, which shows that while only 20.3% of junctions are discovered by late January 2010, almost three-quarters of junctions appearing in at least one GENCODE release are discovered by the same date. Also shown in b is that junctions first appearing in GENCODE’s first release have noticeably earlier discovery dates than junctions first appearing in later releases. This is because junctions first appearing in GENCODE’s first release tend to be found in many more samples (median = 5825) than junctions first appearing in later releases (median = 602 samples), as shown in d. In every box plot, the red diamond corresponds to the median, and the blue triangle corresponds to the mean.
Fig. 7
Displayed in the UCSC Genome Browser (http://genome.ucsc.edu) are tracks corresponding to CAGE data for normal human melanocyte cell cultures NHEM_M2 and NHEM.f_M2 studied by ENCODE as well as TSSes predicted with hidden Markov models from pooled replicates in the ALK gene for hg19. Observe that one model predicts a TSS in the region chr2:29,446,803–29,446,696 and the other predicts a TSS in the region chr2:29,446,882–29,446,687, both of which contain the TSS region identified for ALK ATI in [29], chr2:29,446,768–29,446,744


References

    1. Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014;42(D1):756–63. doi: 10.1093/nar/gkt1114.
    2. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 2012;22(9):1760–74. doi: 10.1101/gr.135350.111.
    3. Thibaud-Nissen F, Souvorov A, Murphy T, DiCuccio M, Kitts P. Eukaryotic genome annotation pipeline. 2013. https://www.ncbi.nlm.nih.gov/books/NBK169439/.
    4. Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M. The Ensembl automatic gene annotation system. Genome Res. 2004;14(5):942–50. doi: 10.1101/gr.1858004.
    5. Consortium EP, et al. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636–40. doi: 10.1126/science.1105136.
