Bioinformatics. 2018 Sep 1;34(17):2889-2898.
doi: 10.1093/bioinformatics/bty211.

Inference of the human polyadenylation code

Michael K K Leung et al. Bioinformatics.

Abstract

Motivation: Processing of transcripts at the 3'-end involves cleavage at a polyadenylation site followed by the addition of a poly(A)-tail. By selecting which site is cleaved, the process of alternative polyadenylation enables genes to produce transcript isoforms with different 3'-ends. To facilitate the identification and treatment of disease-causing mutations that affect polyadenylation and to understand the sequence determinants underlying this regulatory process, a computational model that can accurately predict polyadenylation patterns from genomic features is desirable.

Results: Previous work has focused on identifying candidate polyadenylation sites and classifying tissue-specific sites. By training on how multiple sites in genes are competitively selected for polyadenylation from 3'-end sequencing data, we developed a deep learning model that can predict the tissue-specific strength of a polyadenylation site in the 3' untranslated region of the human genome given only its genomic sequence. We demonstrate the model's broad utility on multiple tasks, without any application-specific training. The model can be used to predict which polyadenylation site is more likely to be selected in genes with multiple sites. It can be used to scan the 3' untranslated region to find candidate polyadenylation sites. It can be used to classify the pathogenicity of variants near annotated polyadenylation sites in ClinVar. It can also be used to anticipate the effect of antisense oligonucleotide experiments to redirect polyadenylation. We provide an analysis of how different features affect the model's predictive performance and a method to identify, at single-base resolution, sensitive regions of the genome that can affect polyadenylation regulation.
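As an illustration of how per-site strength scores could be turned into a prediction of which of two competing sites is selected, the following is a minimal sketch. It assumes a two-way softmax (a logistic function of the score difference) and a cross-entropy loss against observed relative usage; the abstract does not state the exact formulation, and the function names `selection_probability` and `pairwise_loss` are hypothetical.

```python
import math


def selection_probability(strength_a, strength_b):
    """Probability that site A is chosen over site B.

    Assumption: a two-way softmax over the two strength scores, which
    reduces to a logistic function of their difference.
    """
    d = strength_a - strength_b
    # Numerically stable logistic: avoid overflow for large |d|.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def pairwise_loss(strength_a, strength_b, observed_fraction_a):
    """Cross-entropy between observed usage of site A and the prediction.

    observed_fraction_a is the fraction of reads (between 0 and 1) that
    support site A out of the reads supporting either site.
    """
    p = selection_probability(strength_a, strength_b)
    eps = 1e-12  # guards against log(0)
    return -(
        observed_fraction_a * math.log(p + eps)
        + (1.0 - observed_fraction_a) * math.log(1.0 - p + eps)
    )


# Example: a stronger site A should be predicted as the more likely choice.
p_a = selection_probability(3.0, 1.0)
loss = pairwise_loss(3.0, 1.0, observed_fraction_a=0.9)
```

Under this assumed formulation only the difference between the two scores matters, which matches the idea of learning relative rather than absolute site strength.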

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
(Left) A schematic of the components of the neural network that represent the polyadenylation model. The genomic sequence surrounding a polyadenylation site is the input to the strength predictor, which outputs eight tissue-specific scores describing the efficiency of the site for cleavage and polyadenylation. The model is trained on the relative strength between pairs of competing sites. (Right) Two architectures are compared for the sequence model: a convolutional neural network that operates directly on sequences and a fully connected neural network that takes in a feature vector produced by a feature extraction pipeline.
Fig. 2.
Classification performance of ClinVar variants near polyadenylation sites. (Left) ROC curves comparing the variant classification performance of the Conv-Net and the Feature-Net. The shaded region shows the one-standard-deviation zone computed by bootstrapping. (Right) ROC curves comparing our model’s performance against other predictors. AUC values are shown in the figure legend.
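The bootstrapped uncertainty mentioned in the caption can be sketched as follows. This is an illustrative example only: it resamples variants with replacement and reports the mean and standard deviation of the AUC, whereas the figure shades a band around the ROC curves themselves. The resampling scheme, the number of resamples, and the function names `roc_auc` and `bootstrap_auc` are assumptions, not details taken from the paper.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all positive-negative pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def bootstrap_auc(labels, scores, n_resamples=1000, seed=0):
    """Mean and standard deviation of the AUC over bootstrap resamples."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if labels.all() or not labels.any():
        raise ValueError("both classes must be present")
    rng = np.random.default_rng(seed)
    n = labels.size
    aucs = []
    while len(aucs) < n_resamples:
        idx = rng.integers(0, n, size=n)  # sample with replacement
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that lost one of the classes
        aucs.append(roc_auc(labels[idx], scores[idx]))
    aucs = np.array(aucs)
    return float(aucs.mean()), float(aucs.std(ddof=1))


# Example with made-up labels (1 = pathogenic) and made-up scores.
example_labels = [0, 0, 1, 0, 1, 1, 0, 1]
example_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9]
mean_auc, sd_auc = bootstrap_auc(example_labels, example_scores, n_resamples=200)
```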
Fig. 3.
A mutation map of the genomic region chr11: 5,246,678–5,246,777. Each square represents the change in the model’s score when the original base is substituted. The substituted base is represented in each row in the order ‘ACGT’. Red/blue denotes a mutation that would increase/decrease the likelihood that the polyadenylation site (PAS) is used for cleavage and polyadenylation.
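A mutation map of this kind can be computed by exhaustive single-base substitution. The sketch below shows the general procedure with rows in the order ‘ACGT’, as in the caption. The scoring function `toy_score` is a hypothetical stand-in that simply counts occurrences of the canonical AATAAA poly(A) signal; it is not the paper's model, and any function mapping a sequence to a number could be passed in its place.

```python
import numpy as np

BASES = "ACGT"  # row order of the mutation map


def toy_score(seq):
    """Hypothetical stand-in scorer: number of AATAAA hexamers in seq."""
    return float(sum(seq[i:i + 6] == "AATAAA" for i in range(len(seq) - 5)))


def mutation_map(seq, score_fn):
    """Return a 4 x len(seq) array of score changes.

    Entry [row, pos] is score(mutant) - score(reference), where the mutant
    has BASES[row] substituted at position pos. Entries for the original
    base are left at zero.
    """
    reference = score_fn(seq)
    delta = np.zeros((len(BASES), len(seq)))
    for pos, original in enumerate(seq):
        for row, base in enumerate(BASES):
            if base == original:
                continue
            mutated = seq[:pos] + base + seq[pos + 1:]
            delta[row, pos] = score_fn(mutated) - reference
    return delta


# Example: substitutions inside the hexamer lower the toy score.
example_map = mutation_map("GGAATAAAGG", toy_score)
```

Positive entries correspond to the red squares in the figure (substitutions that raise the score) and negative entries to the blue squares.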
Fig. 4.
Predicting the effect of an antisense oligonucleotide experiment. (Left) Schematic of human E-selectin 3′-UTR and the possible transcripts from polyadenylation site selection, reproduced from Vickers et al. (2001). The regions targeted by the oligonucleotides are shown. (Center) The figure from the original paper is reproduced here for ease of comparison. (Right) Predicted PAS strength, simulating the effects of blocked nucleotides due to oligonucleotide treatment. The oligonucleotides applied are shown on top of each column.
Fig. 5.
Saliency map from the Conv-Net of a section of the oligo-targeted mRNA from Vickers et al. (2001). The base is represented in each row in the order ‘ACGT’. Red means the base increases the likelihood of the sequence for cleavage and polyadenylation. Blue is the reverse. The sum of the magnitude of the gradient is shown above the saliency map to suggest how sensitive the strength of the polyadenylation site is to each nucleotide. The position of the oligonucleotide used in the study is shown at the top. The Type 4 Poly(A) signal is also labeled, but it was not targeted in the original study.
