Bioinformatics. 2018 Sep 1;34(17):2889-2898.

doi: 10.1093/bioinformatics/bty211.

Inference of the human polyadenylation code

Michael K K Leung¹,², Andrew Delong¹,², Brendan J Frey¹,²,³

Affiliations

¹ Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada.
² Deep Genomics, MaRS Centre, Toronto, Canada.
³ Banting and Best Department of Medical Research, University of Toronto, Toronto, Canada.

PMID: 29648582
PMCID: PMC6129302
DOI: 10.1093/bioinformatics/bty211

Michael K K Leung et al. Bioinformatics. 2018.

Abstract

Motivation: Processing of transcripts at the 3'-end involves cleavage at a polyadenylation site followed by the addition of a poly(A)-tail. By selecting which site is cleaved, the process of alternative polyadenylation enables genes to produce transcript isoforms with different 3'-ends. To facilitate the identification and treatment of disease-causing mutations that affect polyadenylation and to understand the sequence determinants underlying this regulatory process, a computational model that can accurately predict polyadenylation patterns from genomic features is desirable.

Results: Previous work has focused on identifying candidate polyadenylation sites and classifying tissue-specific sites. By training on how multiple sites in genes are competitively selected for polyadenylation from 3'-end sequencing data, we developed a deep learning model that can predict the tissue-specific strength of a polyadenylation site in the 3' untranslated region of the human genome given only its genomic sequence. We demonstrate the model's broad utility on multiple tasks, without any application-specific training. The model can be used to predict which polyadenylation site is more likely to be selected in genes with multiple sites. It can be used to scan the 3' untranslated region to find candidate polyadenylation sites. It can be used to classify the pathogenicity of variants near annotated polyadenylation sites in ClinVar. It can also be used to anticipate the effect of antisense oligonucleotide experiments to redirect polyadenylation. We provide an analysis of how different features affect the model's predictive performance and a method to identify, at single-base resolution, sensitive regions of the genome that can affect polyadenylation regulation.
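
The scanning task mentioned above can be pictured as sliding a fixed-length window along the 3' untranslated region and keeping the windows that score above a cutoff. The sketch below is illustrative only: `scan_utr`, `toy_signal_score`, the window length and the threshold are all hypothetical names and values, and the toy scorer (presence of the canonical AATAAA hexamer) merely stands in for the trained strength predictor.

```python
def toy_signal_score(seq):
    # Hypothetical stand-in for the trained model: 1.0 if the window
    # contains the canonical AATAAA hexamer, else 0.0.
    return 1.0 if "AATAAA" in seq else 0.0


def scan_utr(utr, score_fn, window, threshold):
    """Slide a window over the sequence and report high-scoring windows.

    Returns a list of (center_index, score) for every window whose score
    is at least `threshold`.
    """
    hits = []
    for start in range(0, len(utr) - window + 1):
        score = score_fn(utr[start:start + window])
        if score >= threshold:
            hits.append((start + window // 2, score))
    return hits
```

In practice the scorer would be the trained network and neighbouring hits would typically be merged into a single candidate site; both steps are left out here.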

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
(Left) A schematic of the components of the neural network that represent the polyadenylation model. The genomic sequence surrounding a polyadenylation site is an input to the strength predictor, which outputs eight tissue-specific scores describing the efficiency of the site for cleavage and polyadenylation. The model is trained from the relative strength between pairs of competing sites. (Right) Two architectures are compared for the sequence model, a convolutional neural network that operates directly on sequences and a fully connected neural network that takes in a feature vector processed by a feature extraction pipeline
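
One way to read "trained from the relative strength between pairs of competing sites" is that the two per-tissue strength scores are compared through a two-way softmax, and the result is matched against the observed relative usage of the sites. The sketch below assumes that reading. The function names, the cross-entropy loss and the example numbers are illustrative assumptions, not details taken from the caption.

```python
import math


def relative_strength(score_a, score_b):
    """Two-way softmax over a pair of site scores: P(site a is selected)."""
    m = max(score_a, score_b)  # subtract the max for numerical stability
    ea = math.exp(score_a - m)
    eb = math.exp(score_b - m)
    return ea / (ea + eb)


def pair_loss(scores_a, scores_b, observed_a):
    """Mean cross-entropy over tissues.

    scores_a, scores_b: per-tissue strength scores of the two sites.
    observed_a: per-tissue observed fraction of usage going to site a.
    """
    total = 0.0
    for sa, sb, y in zip(scores_a, scores_b, observed_a):
        p = relative_strength(sa, sb)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(scores_a)
```

With eight tissues, `scores_a` and `scores_b` would each hold eight values, one per tissue-specific output of the strength predictor.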

**Fig. 2.**
Classification performance of ClinVar variants near polyadenylation sites. (Left) ROC curves comparing the variant classification performance of the Conv-Net and the Feature-Net. The shaded region shows the one standard deviation zone computed by bootstrapping. (Right) ROC curves comparing our model’s performance against other predictors. AUC values are shown in the figure legend

**Fig. 3.**
A mutation map of the genomic region chr11: 5,246,678–5,246,777. Each square represents the change in the model’s score if the original base is substituted. The substituted base is represented in each row in the order ‘ACGT’. Red/blue denotes a mutation that would increase/decrease the likelihood that the polyadenylation site (PAS) is used for cleavage and polyadenylation.
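
A mutation map of this kind can be computed by in silico mutagenesis: substitute each position with each alternative base, rescore, and record the difference from the reference score. The sketch below assumes a generic `score_fn` that maps a sequence to a number. `toy_hexamer_score` is a hypothetical stand-in for the trained model, not the model itself.

```python
BASES = "ACGT"


def toy_hexamer_score(seq):
    # Hypothetical stand-in for the trained model: counts AATAAA hexamers.
    return float(seq.count("AATAAA"))


def mutation_map(seq, score_fn):
    """Score change for every single-base substitution.

    Returns a dict keyed by substituted base ('A', 'C', 'G', 'T'); each
    value is a list with one entry per position. Entries where the
    substituted base equals the original base are 0.0.
    """
    ref = score_fn(seq)
    rows = {b: [] for b in BASES}
    for i, original in enumerate(seq):
        for b in BASES:
            if b == original:
                rows[b].append(0.0)
            else:
                mutated = seq[:i] + b + seq[i + 1:]
                rows[b].append(score_fn(mutated) - ref)
    return rows
```

Positive entries correspond to the red squares (the substitution raises the score) and negative entries to the blue squares.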

**Fig. 4.**
Predicting the effect of an antisense oligonucleotide experiment. (Left) Schematic of human E-selectin 3′-UTR and the possible transcripts from polyadenylation site selection, reproduced from Vickers *et al.* (2001). The regions targeted by the oligonucleotides are shown. (Right) Predicted PAS strength, simulating the effects of blocked nucleotides due to oligonucleotide treatment. (Center) The figure from the original paper is reproduced here for ease of comparison. The oligonucleotides applied are shown on top of each column

**Fig. 5.**
Saliency map from the Conv-Net of a section of the oligo-targeted mRNA from Vickers *et al.* (2001). The base is represented in each row in the order ‘ACGT’. Red means the base increases the likelihood of the sequence for cleavage and polyadenylation. Blue is the reverse. The sum of the magnitudes of the gradient is shown above the saliency map to indicate how sensitive the strength of the polyadenylation site is to each nucleotide. The position of the oligonucleotide used in the study is shown at the top. The Type 4 Poly(A) signal is also labeled, but was not targeted in the original study.
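
The saliency computation described here amounts to taking the gradient of the model's score with respect to the one-hot encoded input, then summing the gradient magnitudes over the four bases at each position. The sketch below illustrates the idea with central finite differences on a small differentiable toy scorer. `toy_tanh_score` and its weights are hypothetical, and a real implementation would obtain the gradient from the network by automatic differentiation.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a sequence as a list of 4-element rows in 'ACGT' order."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def toy_tanh_score(x, weights):
    # Hypothetical differentiable scorer: tanh of a position-weighted sum.
    z = sum(w * v for wrow, xrow in zip(weights, x) for w, v in zip(wrow, xrow))
    return math.tanh(z)


def saliency(x, score_fn, eps=1e-5):
    """Central finite-difference gradient of score_fn for each input entry."""
    grad = [[0.0] * 4 for _ in x]
    for i in range(len(x)):
        for j in range(4):
            original = x[i][j]
            x[i][j] = original + eps
            up = score_fn(x)
            x[i][j] = original - eps
            down = score_fn(x)
            x[i][j] = original  # restore the input
            grad[i][j] = (up - down) / (2.0 * eps)
    return grad


def position_sensitivity(grad):
    """Sum of gradient magnitudes over the four bases at each position."""
    return [sum(abs(g) for g in row) for row in grad]
```

The signed gradient rows give the red/blue map, and `position_sensitivity` gives the per-nucleotide trace drawn above it.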
