Nat Commun. 2021 Mar 12;12(1):1652.
doi: 10.1038/s41467-021-21894-x.

Aptardi predicts polyadenylation sites in sample-specific transcriptomes using high-throughput RNA sequencing and DNA sequence

Ryan Lusk et al. Nat Commun.

Abstract

Annotation of polyadenylation sites from short-read RNA sequencing alone is a challenging computational task. Other algorithms rooted in DNA sequence predict potential polyadenylation sites; however, in vivo expression of a particular site varies with a myriad of conditions. Here, we introduce aptardi (alternative polyadenylation transcriptome analysis from RNA-Seq data and DNA sequence information), which leverages both DNA sequence and RNA sequencing in a machine learning paradigm to predict expressed polyadenylation sites. Specifically, aptardi takes as input DNA nucleotide sequence, genome-aligned RNA-Seq data, and an initial transcriptome. The program evaluates these initial transcripts to identify expressed polyadenylation sites in the biological sample and refines transcript 3'-ends accordingly. The average precision of the aptardi model is twice that of a standard transcriptome assembler. In particular, the recall of the aptardi model (the proportion of true polyadenylation sites detected by the algorithm) is improved by over three-fold. Also, the model, trained using the Human Brain Reference RNA commercial standard, performs well when applied to RNA-sequencing samples from different tissues and different mammalian species. Finally, aptardi's input is simple to compile and its output is easily amenable to downstream analyses such as quantitation and differential expression.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview for using aptardi.
Aptardi requires three files as input: (1) a FASTA file of DNA sequence with headers by chromosome, (2) a sorted Binary Alignment Map (BAM) file of reads aligned to the genome, and (3) a General Feature Format (GTF) file of transcript structures. Blue boxes represent software. Yellow text and boxes indicate where aptardi is incorporated. Note that transcript structures can be derived from a reference transcriptome (e.g., Ensembl annotation) in lieu of the original transcriptome generated from a transcriptome assembler.
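As a rough illustration of how the FASTA and GTF inputs relate, the sketch below checks that every chromosome named in a GTF also appears as a FASTA header. This is not part of aptardi; the function names (`fasta_headers`, `gtf_seqnames`, `missing_chromosomes`) and the toy file contents are hypothetical, and the check only assumes the standard conventions that FASTA headers start with `>` and that the first tab-separated GTF column is the sequence name.

```python
def fasta_headers(fasta_text):
    """Collect sequence names (first word after '>') from FASTA text."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].split():
            names.add(line[1:].split()[0])
    return names


def gtf_seqnames(gtf_text):
    """Collect sequence names (column 1) from GTF text, skipping comments."""
    names = set()
    for line in gtf_text.splitlines():
        if line and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


def missing_chromosomes(fasta_text, gtf_text):
    """Return GTF sequence names that have no matching FASTA header."""
    return sorted(gtf_seqnames(gtf_text) - fasta_headers(fasta_text))


# Toy inputs (hypothetical), held in memory so the sketch is self-contained.
fasta = ">chr1 example\nACGT\n>chr2\nTTTT\n"
gtf = (
    "# toy annotation\n"
    'chr1\ttoy\ttranscript\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr3\ttoy\ttranscript\t1\t4\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n'
)
missing = missing_chromosomes(fasta, gtf)
print(missing)
```

A non-empty result would suggest that the two files use different chromosome naming schemes or different assemblies.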
Fig. 2. DNA sequence and RNA-sequencing (RNA-Seq) features are individually associated with polyadenylation (polyA) sites.
a The percentage of 100 base bins containing each of the three strong polyA signals, stratified by the bin not containing (blue) or containing (orange) a polyA site. b Distribution of the inter-bin RNA-Seq features for each 100 base bin, stratified by the bin not containing (blue) or containing (orange) a polyA site (RNA-Seq ratio features were standardized using the training set). c RNA-Seq features and DNA sequence features display little correlation (two-sided Pearson product-moment correlation) across omics type. The combination of RNA-Seq information and DNA sequence information improves d average precision and e precision and recall at a specific prediction threshold (probability >0.50) over each separately. For both d and e, data are presented as mean values ± standard deviation on the test set (n = 5 random train-validate-test splits). Data shown are from the Human Brain Reference data set.
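To make the threshold-based metrics in panel e concrete, here is a minimal sketch of precision and recall when every bin with predicted probability above 0.50 is called a polyA site. The function name `precision_recall` and the example probabilities and labels are hypothetical and are not taken from the paper's data.

```python
def precision_recall(probabilities, labels, threshold=0.50):
    """Precision and recall when bins with probability > threshold are called polyA sites."""
    tp = fp = fn = 0
    for prob, is_site in zip(probabilities, labels):
        predicted = prob > threshold
        if predicted and is_site:
            tp += 1  # predicted site that is a true site
        elif predicted and not is_site:
            fp += 1  # predicted site that is not a true site
        elif not predicted and is_site:
            fn += 1  # true site that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical per-bin probabilities and true labels (1 = bin contains a polyA site).
probs = [0.9, 0.6, 0.4, 0.2, 0.55, 0.1]
truth = [1, 0, 1, 0, 1, 0]
precision, recall = precision_recall(probs, truth)
print(precision, recall)
```

Recall here matches the abstract's definition: the proportion of true polyadenylation sites detected by the algorithm.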
Fig. 3. The machine learning pipeline used to build aptardi is robust to different data sets and the aptardi prediction model generated from the Human Brain Reference data set is applicable across diverse data sets.
Blue bars indicate the performance of the data set-specific prediction model on its own data set, i.e., the model was built and evaluated on a single data set. Orange bars represent the performance of the aptardi prediction model—built from the Human Brain Reference data set—on the given data set (x axis).
Fig. 4. Incorporating aptardi transcripts into the original transcriptome improves the ratio of true positive to false positive 3′ termini compared with the original transcriptome and compared with the Tool for Alternative Polyadenylation site AnalysiS (TAPAS) analysis on the original transcriptome.
Results from transcripts added by aptardi to the original transcriptome are shaded in dark. Transcripts whose 3′ terminus was plus or minus 100 bases of a true polyadenylation site from PolyA-Seq data were considered a true positive and otherwise counted as a false positive. Data shown are from the Human Brain Reference data set.
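The true positive rule in this caption can be sketched as follows: a 3′ terminus counts as a true positive if it lies within 100 bases of any true polyadenylation site. The function name `classify_termini` and the coordinates are hypothetical, and the sketch assumes all positions are on the same chromosome and strand.

```python
from bisect import bisect_left


def classify_termini(termini, true_sites, window=100):
    """Label each 3' terminus True if it lies within `window` bases of a true site."""
    sites = sorted(true_sites)
    labels = []
    for pos in termini:
        # First true site at or after the lower edge of the window around pos.
        i = bisect_left(sites, pos - window)
        labels.append(i < len(sites) and sites[i] <= pos + window)
    return labels


# Hypothetical coordinates on a single chromosome and strand.
termini = [1000, 1250, 5000]
true_sites = [1090, 4899]
labels = classify_termini(termini, true_sites)
true_positives = sum(labels)
false_positives = len(labels) - true_positives
print(labels, true_positives, false_positives)
```

In this toy example the terminus at 5000 is 101 bases from the nearest true site, so it falls just outside the window and counts as a false positive.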
Fig. 5. Aptardi displays sample-specific sensitivity when annotating transcription stop sites.
RNA-sequencing (RNA-Seq) read densities for a CCND1, b DICER1, and c TIMP2 after control (Control) siRNA treatment and CFIm25 knockdown (KD) in HeLa cells. Numbers on the y axis indicate RNA-Seq read coverage. After knockdown, each gene preferentially expresses a proximal alternative polyadenylation (APA) site compared with control conditions. Transcript structures shown are from RefSeq annotation (dark blue), where boxes and lines indicate exons and introns, respectively. Black vertical lines indicate transcript stop sites identified in the original transcriptome, red vertical lines indicate transcript stop sites identified only in the aptardi modified transcriptome that match the original study’s findings, and blue vertical lines indicate transcript stop sites identified only in the aptardi modified transcriptome that are not described in the original study. Graphics were generated using the UCSC Genome Browser (https://genome.ucsc.edu/) with the hg38 human genome assembly.
Fig. 6. Incorporation of aptardi into differential expression analyses.
RNA-sequencing (RNA-Seq) read densities for six genes in BNLx and SHR inbred rat strains. Numbers on the y axis indicate RNA-Seq read coverage. Read coverage represents the aggregate of three biological samples for each strain. Transcript structures shown are from Ensembl annotation (dark red), where boxes and lines indicate exons and introns, respectively. Black vertical lines denote transcript stop sites identified in the original transcriptome derived using StringTie, and red vertical lines indicate transcript stop sites identified in the aptardi modified transcriptome only. No transcripts were identified as differentially expressed between strains in the original transcriptome (p > 0.001), but at least one differentially expressed transcript for each gene was identified in the aptardi modified transcriptome (p ≤ 0.001). For a Unc79, b Sf3b1, c Ptn, and d Ap3b1, the original transcript isoform (black line) was differentially expressed in the aptardi modified transcriptome, and for e Zdhhc22 and f RGD1559441, the aptardi transcript (red line) was differentially expressed. Graphics were generated using the UCSC Genome Browser (https://genome.ucsc.edu/) with the rn6 rat genome assembly.
