PLoS Comput Biol. 2023 Oct 12;19(10):e1011526.
doi: 10.1371/journal.pcbi.1011526. eCollection 2023 Oct.

Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task

Joseph D Valencia et al. PLoS Comput Biol.

Abstract

Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
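The three-nucleotide periodicity mentioned in the abstract can be illustrated with a small, self-contained sketch. It follows one common signal-processing approach (assumed here for illustration, not taken from the paper): build a binary indicator sequence per base and measure the Fourier power at frequency 1/3. The function name `period3_power`, the normalization by sequence length, and the toy sequences are all hypothetical choices, and this is not LFNet itself.

```python
import cmath
import math


def period3_power(seq: str) -> float:
    """Length-normalized Fourier power at frequency 1/3, summed over base indicators."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 of the 0/1 indicator sequence for this base
        coeff = sum(
            cmath.exp(-2j * math.pi * i / 3)
            for i, nt in enumerate(seq)
            if nt == base
        )
        total += abs(coeff) ** 2
    return total / n


# Toy inputs (assumptions): a perfectly codon-periodic repeat and a period-4 repeat
codon_periodic = "GCT" * 30
period_four = "ACGT" * 21

strong = period3_power(codon_periodic)
weak = period3_power(period_four)
```

In this toy setting `strong` is far larger than `weak`, because every base in the first sequence always falls on the same codon position. Real coding sequences show a weaker but still detectable version of the same bias.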

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of problem setting and computational method.
(A) Summary of messenger RNA functional regions and known elements regulating translation. See [48] for a review of known regulatory elements. (B) Neural network sequence-to-sequence architecture. We designed LFNet (left) to apply a learned filter matrix W to a 1D short-time Fourier transform (spectrogram) of the hidden representations, enabling frequency-domain filtering of the 3-base periodicity present in coding sequences. We trained this architecture for multiple problem settings: bioseq2class outputs a classification token, bioseq2seq also predicts the protein translation, and bioseq2start predicts the position of the start codon for mRNAs.
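
The frequency-domain filtering described for LFNet in panel B can be sketched in a heavily simplified form. The example below is a single-channel toy, not the authors' implementation. It assumes non-overlapping rectangular windows, whereas a short-time Fourier transform often uses overlapping, tapered windows. It also uses hand-set real weights in place of the learned complex matrix W, and the helper names `dft`, `idft`, and `filter_windows` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a short sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def filter_windows(signal, weights):
    """Multiply each window's spectrum by per-bin weights, then transform back.

    The real part is returned, which is exact when the weights are
    conjugate-symmetric (as the toy weights below are).
    """
    win = len(weights)
    if len(signal) % win != 0:
        raise ValueError("signal length must be a multiple of the window size")
    out = []
    for start in range(0, len(signal), win):
        spectrum = dft(signal[start:start + win])
        filtered_spectrum = [w * c for w, c in zip(weights, spectrum)]
        out.extend(v.real for v in idft(filtered_spectrum))
    return out


# Toy channel: constant offset + period-3 component + period-2 component
signal = [1 + math.cos(2 * math.pi * t / 3) + math.cos(math.pi * t) for t in range(12)]

# With a window of 6, period 3 corresponds to bins 2 and 4 (6 / 3 = 2 and its mirror)
keep_period3 = [0, 0, 1, 0, 1, 0]
filtered = filter_windows(signal, keep_period3)
```

Here `filtered` retains only the period-3 cosine. In the architecture described in the caption, the weights would instead be learned and would vary across hidden dimensions.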
Fig 2. Comparison of training tasks and neural architectures.
Model names are shortened by omitting the “bioseq2” prefix. (A) F1 score across five replicates of bioseq2seq, bioseq2seq-wt, bioseq2class, and bioseq2start using both LFNet and CNN architectures. (B) Analysis of CDS detection abilities by bioseq2seq variants. Rate at which predicted protein sequence aligns better to the CDS than alternative ORFs (left), and alignment percent identity with the CDS (right).
Fig 3. Frequency-domain content in model representations.
LFNet filters from selected layers, with complex filter weights visualized in terms of magnitude (bioseq2seq-wt in panel A, bioseq2class in B) and phase (bioseq2seq-wt in C, bioseq2class in D). For each layer heatmap, the x-axis represents the hidden embedding dimension, and the y-axis refers to a discrete frequency bin, with annotations for the equivalent nucleotide periodicity. Both model types learned weights with a pronounced structure around 3-nt periodicity, visible most clearly in the phase for bioseq2seq-wt and in the magnitude for bioseq2class. (E) A nucleotide-resolution metagene consisting of average encoder-decoder attention scores from mRNAs aligned relative to their start codons. Attention distributions for this plot were taken from head 5 of the lower bioseq2seq-wt (LFN) decoder layer, which primarily attends to the start codon and places attention downstream of the start in a periodic fashion. (F) The equivalent plot for the same attention head applied to lncRNAs aligned relative to the start of the longest ORF.
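
As a reading aid for the heatmap axes, the sketch below shows how a discrete frequency bin maps to a nucleotide periodicity and how one complex weight splits into magnitude and phase. It assumes the standard DFT convention that bin k of a length-N window corresponds to k/N cycles per nucleotide. The window size of 12 and the example weight are hypothetical values, not taken from the paper.

```python
import cmath


def bin_to_period(k: int, window: int) -> float:
    """Periodicity in nucleotides for frequency bin k of a length-`window` DFT."""
    if k <= 0:
        raise ValueError("bin 0 is the constant component and has no finite period")
    return window / k


# Hypothetical window of 12 positions: bin 4 corresponds to 3-nt periodicity
period = bin_to_period(4, 12)

# Hypothetical complex filter weight, split as in the magnitude and phase panels
weight = complex(0.0, 2.0)
magnitude = abs(weight)
phase = cmath.phase(weight)  # radians, in the range [-pi, pi]
```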
Fig 4. Predicted mutation effects by model type on a subset of testing data.
Model names are shortened by omitting the “bioseq2” prefix. (A) Inter-replicate agreement according to Pearson correlation of saturated in silico mutagenesis (ISM) ΔS scores, i.e. the difference in log(P(〈PC〉)/P(〈NC〉)) between single-nucleotide variants and their wild-type sequence. Correlation of ISM scores is computed pairwise across replicates and averaged into a single value per transcript. (B) Metagene plots of ISM in which the absolute value of ΔS was averaged within each of 25 positional bins and across all three possible mutations in each position, with mRNAs and lncRNAs depicted separately for bioseq2seq-wt (LFN), bioseq2seq-wt (CNN), and bioseq2class (LFN). Vertical dashed lines denote the first and last bin of the CDS for mRNAs and the longest ORF for lncRNAs. Metagenes from all five replicates are shown, with the best-performing model colored using the darkest hue. (C) Changes in coding score for mutations that introduce a premature stop codon, in fifty-codon bins along the length of the CDS. (D) Changes in score for mRNAs from nucleotide substitutions that knock out a start codon. (E) Changes in score relative to wild type for mRNAs shuffled within each functional region. UTRs were shuffled to preserve dinucleotide frequencies. Codon shuffling excluded the start and stop codons to preserve CDS length.
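
The ΔS definition in panel A can be made concrete with a minimal sketch of saturated in silico mutagenesis. The scoring function `toy_log_odds` is a hypothetical stand-in for a trained model's log(P(PC)/P(NC)). It simply rewards the longest ATG-to-stop open reading frame and is not the scoring used in the paper. The example sequence is also invented.

```python
STOPS = {"TAA", "TAG", "TGA"}


def toy_log_odds(seq: str) -> float:
    """Hypothetical stand-in for log(P(PC)/P(NC)): grows with the longest ATG..stop ORF."""
    best = 0
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk in-frame codons after the ATG until the first stop codon
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                best = max(best, (pos - start) // 3)
                break
    return 0.5 * best - 1.0


def ism_delta_s(seq: str, score=toy_log_odds) -> dict:
    """Return ΔS = score(variant) - score(wild type) for every single-nucleotide variant."""
    wild_type = score(seq)
    deltas = {}
    for i, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                variant = seq[:i] + alt + seq[i + 1:]
                deltas[(i, alt)] = score(variant) - wild_type
    return deltas


# Toy transcript (assumption): 5' flank, ATG GCC AAA TAA, 3' flank
toy_transcript = "CCATGGCCAAATAACC"
delta_s = ism_delta_s(toy_transcript)

start_knockout = delta_s[(2, "C")]   # ATG -> CTG removes the only start codon
premature_stop = delta_s[(8, "T")]   # AAA -> TAA shortens the ORF
```

Both example mutations lower the toy score, which mirrors the kinds of perturbations examined in panels C and D. The size of the effect in a real model would depend on what it has learned.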
Fig 5. Detailed analysis of in silico mutagenesis (ISM) on the full test set.
(A) Plots of ISM metagenes for selected amino acids lysine (left) and glycine (right). Mean ΔS is shown for 25 positional bins across mRNA CDS regions with mutations listed based on the resulting codon. The red line represents the average across all missense/nonsynonymous mutations. For amino acids with more than two codons, the blue dashed line depicts the average synonymous mutation for comparison. (B) Mean ISM for synonymous point mutations by codon position and nucleotide. X’s denote substitutions which do not exist as synonymous changes. (C) An example protein-coding transcript with NCBI accession NM_001206605.1. Signed ISM scores for the transcript are depicted as a heatmap and the RNA sequence is portrayed with characters scaled according to the ↑ PC importance strategy, i.e. regions with highly negative ISM weights depicted in dark blue. The subregions shown are windows around the start codon, the position of maximum importance, and the stop codon. (D) Same as panel C with an example long noncoding RNA with NCBI accession NR_109777.1. The endogenous sequence is scaled according to ↑ NC, or highly positive ISM values drawn in dark red. (E) mRNA motifs discovered in our test set with STREME using ISM importance values from bioseq2seq to determine sequence regions in which to search for enriched signals. Annotations denote the importance and control strategy for each trial, with boldfaced annotations signifying that importance values were not masked and ordinary typeface indicating that feature importance at start and stop codons and nonsense mutations were excluded. Motifs are positioned near the regions in which they were enriched. (F) Same as panel E showing discovered lncRNA motifs.
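
Panels A and B group point mutations by their effect on the encoded amino acid. A minimal sketch of that grouping under the standard genetic code is shown below. The function name `classify_substitution`, the DNA-alphabet input, and the example CDS are hypothetical, and the paper's own bookkeeping may differ.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(cds: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based CDS position `pos` (CDS starts in frame)."""
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Toy CDS (assumption): ATG AAA GGA TAA, i.e. Met-Lys-Gly-stop
toy_cds = "ATGAAAGGATAA"
labels = [
    classify_substitution(toy_cds, 5, "G"),  # AAA -> AAG, still lysine
    classify_substitution(toy_cds, 3, "T"),  # AAA -> TAA, stop codon
    classify_substitution(toy_cds, 6, "C"),  # GGA -> CGA, glycine to arginine
]
```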
Fig 6. Gradient-based approximation performance.
(A) Summary results from tuning of the β hyperparameter for MDIG alongside baseline methods. The intra-replicate agreement according to Pearson correlation of each gradient-based approximation with ISM is summarized using the median across transcripts as a point estimate. (B) Scatter plot of ΔS for all possible synonymous point mutations, i.e. every wildtype>variant pair differing at one position, from MDIG (x-axis) versus ISM (y-axis) on the test set. (C) mRNA motifs discovered in our training set with STREME using MDIG importance values from bioseq2seq to determine sequence regions in which to search for enriched signals. Results from unmasked importance are shown above the transcript diagram and those from the masked trials are shown below. (D) lncRNA motifs discovered in the training set using MDIG importance values from bioseq2seq, depicted in the same manner as panel C.
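
The summary statistic in panel A, a per-transcript Pearson correlation reduced to a median, can be sketched as follows. The transcript identifiers and score values are invented placeholders, and `median_agreement` is a hypothetical helper name. Nothing here reproduces MDIG itself.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def median_agreement(approx_scores: dict, ism_scores: dict) -> float:
    """Median over transcripts of the correlation between approximate and ISM scores."""
    return median(
        pearson(approx_scores[tx], ism_scores[tx]) for tx in ism_scores
    )


# Placeholder scores for three hypothetical transcripts
approx = {"tx1": [1, 2, 3], "tx2": [1, 2, 3], "tx3": [1, 2, 3]}
ism = {"tx1": [2, 4, 6], "tx2": [3, 2, 1], "tx3": [1, 3, 2]}

agreement = median_agreement(approx, ism)
```

Using the median rather than the mean keeps a few poorly approximated transcripts from dominating the point estimate.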

References

    1. Iyer MK, Niknafs YS, Malik R, Singhal U, Sahu A, Hosono Y, et al. The landscape of long noncoding RNAs in the human transcriptome. Nature Genetics. 2015;47(3):199–208. doi: 10.1038/ng.3192
    2. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Research. 2012;22(9):1775–1789. doi: 10.1101/gr.132159.111
    3. Statello L, Guo CJ, Chen LL, Huarte M. Gene regulation by long non-coding RNAs and its biological functions. Nature Reviews Molecular Cell Biology. 2021;22(2):96–118. doi: 10.1038/s41580-020-00315-9
    4. Ransohoff JD, Wei Y, Khavari PA. The functions and unique features of long intergenic non-coding RNA. Nature Reviews Molecular Cell Biology. 2018;19(3):143–157. doi: 10.1038/nrm.2017.104
    5. Sallam T, Sandhu J, Tontonoz P. Long Noncoding RNA Discovery in Cardiovascular Disease. Circulation Research. 2018;122(1):155–166.