Nucleic Acids Res. 2019 Apr 8;47(6):e36.
doi: 10.1093/nar/gkz061.

DeepRibo: a neural network for precise gene annotation of prokaryotes by combining ribosome profiling signal and binding site patterns

Jim Clauwaert et al. Nucleic Acids Res. 2019.

Abstract

The annotation of expressed genes in prokaryotes is often corrected because of small variations in the annotated gene regions observed between different (sub)species. It has become apparent that traditional sequence alignment algorithms, used for the curation of genomes, are not able to map the full complexity of the genomic landscape. We present DeepRibo, a novel neural network that uses features extracted from ribosome profiling information and binding site sequence patterns, and that proves to be a precise tool for the delineation and annotation of expressed genes in prokaryotes. The neural network combines recurrent memory cells and convolutional layers, integrating the information gained from both the high-throughput ribosome profiling data and the ribosome binding and translation initiation sequence region into one model. DeepRibo is designed as a single model trained on a variety of ribosome profiling experiments and is used for the identification of open reading frames in prokaryotes without a priori knowledge of the translational landscape. The effectiveness of DeepRibo is highlighted through extensive validation of the model trained on various sets of data, using multiple-species sequence similarity and proteins verified by mass spectrometry and Edman degradation.

Figures

Figure 1.
The architecture of the DeepRibo neural network. For each candidate ORF, two types of data are processed and fed into their respective parts of the network. The convolutional layers train on a 30-nucleotide DNA sequence ranging from 20 nucleotides upstream to 10 nucleotides downstream of the TIS. The recurrent neural network covers the complete ORF, from 50 nucleotides upstream of the start codon (including the SD region) to 20 nucleotides downstream of the stop codon. The DNA sequence is first translated into a binary image before being processed by four 1 × 1 and 32 1 × 12 convolutional kernels, respectively. The ribosome profiling data are processed by a two-layer bidirectional GRU with 128 hidden nodes. The outputs of both networks are flattened, concatenated and fed into three consecutive fully connected layers of size 1024, 512 and 2.
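The two inputs described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the A/C/G/T row order of the binary image, the 0-based indexing, and the choice to count the first nucleotide of the start codon as the first downstream position are all assumptions made for illustration.

```python
NUCLEOTIDES = "ACGT"  # row order of the binary image (an assumption)


def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) binary matrix.

    Bases outside A/C/G/T (for example N) give an all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == nt else 0 for base in seq] for nt in NUCLEOTIDES]


def tis_window(genome, tis, upstream=20, downstream=10):
    """Return the sequence window around a translation initiation site.

    `tis` is the 0-based index of the first nucleotide of the start codon
    (hypothetical convention). The defaults give the 30-nucleotide window
    from the caption.
    """
    if tis - upstream < 0 or tis + downstream > len(genome):
        raise ValueError("window falls outside the sequence")
    return genome[tis - upstream:tis + downstream]


def orf_signal(coverage, start, stop_end, upstream=50, downstream=20):
    """Slice a per-nucleotide ribosome profiling signal for one ORF.

    `start` is the index of the first start-codon nucleotide and `stop_end`
    the index just past the stop codon. The slice is clipped at the ends of
    the signal instead of being padded.
    """
    lo = max(0, start - upstream)
    hi = min(len(coverage), stop_end + downstream)
    return coverage[lo:hi]
```

Under these assumptions, `one_hot(tis_window(genome, tis))` yields a 4 × 30 image for the convolutional part, while `orf_signal` yields a variable-length sequence of the kind a recurrent network can consume.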
Figure 2.
Bend point estimation on the S-curves fitted to the coverage as a function of the log RPKM for the E. coli (left) and S. aureus (right) datasets. For each dataset, the positive samples (red) are plotted together with the values predicted by the fitted S-curve (blue). The lower bend point of each fitted curve is estimated using the bent-cable function to obtain the minimum cut-off values.
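The two quantities plotted in Figure 2 can be computed per ORF as in the sketch below. It uses the standard RPKM definition and takes coverage to be the fraction of ORF positions with at least one read, which is an assumption about the exact definition used. The S-curve fit and bend point estimation are not reproduced here; `min_rpkm` and `min_coverage` are hypothetical parameters standing in for the cut-offs obtained from that fit.

```python
def rpkm(read_count, orf_length, total_mapped_reads):
    """Reads per kilobase of ORF per million mapped reads."""
    return read_count * 1e9 / (orf_length * total_mapped_reads)


def coverage(per_position_counts):
    """Fraction of ORF positions with at least one read (assumed definition)."""
    if not per_position_counts:
        raise ValueError("empty ORF")
    covered = sum(1 for count in per_position_counts if count > 0)
    return covered / len(per_position_counts)


def is_expressed(per_position_counts, total_mapped_reads, min_rpkm, min_coverage):
    """Apply minimum cut-offs on RPKM and coverage to one ORF.

    The read count is approximated by the sum of the per-position counts,
    a simplification made for this sketch.
    """
    value = rpkm(sum(per_position_counts), len(per_position_counts), total_mapped_reads)
    return value >= min_rpkm and coverage(per_position_counts) >= min_coverage
```

An ORF would then be kept as expressed only when it passes both cut-offs for its dataset.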
Figure 3.
The precision-recall curves of the different networks on the E. coli dataset. The precision-recall curves are given for both the multiple start site and the single start site set-up. The full model (solid line), which combines the RNN and CNN, outperforms both the CNN-only (dashed) and RNN-only (dotted) architectures.
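For readers unfamiliar with the metric in Figure 3, a precision-recall curve can be derived from ranked model outputs roughly as follows. This is a generic sketch, not the authors' evaluation code: `scores` and `labels` are hypothetical inputs, and tied scores are simply processed in sorted order.

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision) points obtained by ranking on score.

    `scores` are model outputs (higher means more likely positive) and
    `labels` are 1 for annotated positives and 0 otherwise.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    points = []
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / total_positives
        precision = true_pos / (true_pos + false_pos)
        points.append((recall, precision))
    return points
```

Each point corresponds to accepting every candidate ranked at or above a given position, so a curve that stays closer to a precision of 1 as recall grows indicates a better ranking.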
Figure 4.
Venn diagrams displaying the distribution of the proteins verified by Edman sequencing (left) and mass spectrometry (right) within the annotations provided by DeepRibo and the NCBI RefSeq database (labels). The distributions include only expressed ORFs, as determined using the S-curve methodology.
Figure 5.
E value distributions of the pBLAST results for newly predicted proteins (left) and proteoforms (right) for the different datasets. For each false positive, the E value of the best hit (if one exists) is given. The dashed line indicates an E value of 1.
Figure 6.
DeepRibo example annotations displayed alongside the ribo-seq input signal and RefSeq annotations. The data are formatted using the GWIPS-viz browser (43) and are hosted publicly (see Supplementary Data). The tracks display (from top to bottom): the ribo-seq signal (sense: orange, antisense: blue), the TISs of all ORF samples present in the test set, the annotations predicted by DeepRibo that are not in agreement with the RefSeq assembly (Predicted ORF) and the RefSeq genome annotations used to label the data (Labeled ORF). (A) The highest ranking proteoform prediction (gene: PqqL, rank: 231) for E. coli. (B) The highest ranking proteoform prediction (gene: UbiE, rank: 131) for S. aureus. (C) The highest ranking novel protein for E. coli with no pBLAST alignments (rank: 1302). (D) An example of a predicted proteoform in a region with overlapping genes (gene: ybhF, rank: 941).
