Bioinformatics. 2024 Nov 28;40(12):btae685.
doi: 10.1093/bioinformatics/btae685.

Tiberius: end-to-end deep learning with an HMM for gene prediction

Lars Gabriel et al. Bioinformatics. 2024.

Abstract

Motivation: For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs) that took a DNA sequence directly as input. Recently, Holst et al. demonstrated with their program Helixer that the accuracy of ab initio eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.

Results: We present Tiberius, a novel deep learning-based ab initio gene predictor that integrates convolutional and long short-term memory layers with a differentiable HMM layer in a single end-to-end model. Tiberius uses a custom gene prediction loss. It was trained for prediction in mammalian genomes and evaluated on human and two other genomes. It significantly outperforms existing ab initio methods, achieving a gene-level F1 score of 62% for the human genome, compared to 21% for the next best ab initio method. In de novo mode, Tiberius predicts the exon-intron structure of two out of three human genes without error. Remarkably, even Tiberius's ab initio accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius's highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.

Availability and implementation: https://github.com/Gaius-Augustus/Tiberius.

Figures

Figure 1.
Illustration of the CNN-LSTM architecture of the Tiberius model for gene structure classification at each base position. The HMM layer computes posterior probabilities or complete gene structures (Viterbi sequences). The model has approximately 8 million trainable parameters. It was trained on sequences of length T = 9999, while sequences of length T = 500,004 were used for inference.
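To make the role of Viterbi decoding concrete, the following is a minimal sketch of how an HMM can turn per-position class scores into a structurally valid label sequence. It is not the Tiberius implementation: the three-state model (IR, exon, intron), the transition probabilities, and the per-position scores are all invented for illustration, and the function name `viterbi` is hypothetical. The actual model uses more states and a differentiable, parallelized HMM layer.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(log_emit, log_trans, log_start):
    """Return the most probable state path (list of state indices).

    log_emit[t][s]  : log-score of state s at position t
    log_trans[r][s] : log-probability of moving from state r to state s
    log_start[s]    : log-probability of starting in state s
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []  # backptr[t-1][s] = best predecessor of state s at position t
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy model (assumed values): 0 = IR, 1 = exon, 2 = intron.
# IR -> intron and intron -> IR are forbidden (probability 0).
STATES = ["IR", "exon", "intron"]
trans = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
start = [1.0, 0.0, 0.0]
# Made-up per-position class probabilities, standing in for network output.
emit = [[0.9, 0.05, 0.05],
        [0.1, 0.4, 0.5],
        [0.05, 0.05, 0.9],
        [0.05, 0.9, 0.05],
        [0.9, 0.05, 0.05]]

log_emit = [[log(p) for p in row] for row in emit]
log_trans = [[log(p) for p in row] for row in trans]
log_start = [log(p) for p in start]

# Independent per-position argmax can produce a forbidden IR -> intron step.
argmax_path = [max(range(3), key=lambda s: row[s]) for row in emit]
viterbi_path = viterbi(log_emit, log_trans, log_start)

print([STATES[s] for s in argmax_path])   # ['IR', 'intron', 'intron', 'exon', 'IR']
print([STATES[s] for s in viterbi_path])  # ['IR', 'exon', 'intron', 'exon', 'IR']
```

In this toy case, the per-position argmax jumps straight from IR to intron, which the transition matrix forbids, while the Viterbi path respects the allowed transitions.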
Figure 2.
The states of the HMM used for inference with Tiberius and the transitions between them. The 11 coding-exon position states are subdivided by reading frame i: Exon-i represents non-border positions within an exon, while the ASS-i (acceptor splice site) and DSS-i (donor splice site) states mark the first and last positions of an exon that starts and ends with reading frame i, respectively. The four non-coding position states represent the intergenic region (IR) or positions within an intron.
Figure 3.
Gene- and exon-level precision and recall for Tiberius, BRAKER3, GALBA, Helixer, BRAKER2, and AUGUSTUS. Tiberius, Helixer, and AUGUSTUS performed ab initio predictions, while the other methods additionally incorporated extrinsic evidence: GALBA used proteins from related species, BRAKER2 a large protein database, and BRAKER3 a large protein database and RNA-seq. For the human genome, Tiberius was also run de novo.
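As a rough illustration of how gene- and exon-level precision, recall, and F1 can be computed, here is a minimal sketch. It assumes the common convention that a predicted gene counts as correct only if its full exon chain matches a reference gene exactly, and that an exon counts as correct only if both boundaries match. The evaluation protocol actually used for this figure is not described here, and the coordinates and helper names (`precision_recall_f1`, `exons_of`) are invented for the example.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall, and F1 for two collections of hashable items."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exons_of(genes):
    """Flatten genes into a set of (seqname, strand, start, end) exons."""
    return {(seq, strand, start, end)
            for seq, strand, exons in genes
            for start, end in exons}


# A gene is (seqname, strand, tuple of (start, end) exons); toy coordinates.
reference_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((1000, 1200),)),
    ("chr2", "+", ((50, 150), (250, 350), (450, 500))),
]
predicted_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),             # exact match
    ("chr1", "-", ((1000, 1190),)),                      # wrong exon boundary
    ("chr2", "+", ((50, 150), (250, 350), (450, 500))),  # exact match
    ("chr2", "+", ((900, 950),)),                        # false positive
]

gene_p, gene_r, gene_f1 = precision_recall_f1(predicted_genes, reference_genes)
exon_p, exon_r, exon_f1 = precision_recall_f1(exons_of(predicted_genes),
                                              exons_of(reference_genes))

print(f"gene level: P={gene_p:.3f} R={gene_r:.3f} F1={gene_f1:.3f}")
print(f"exon level: P={exon_p:.3f} R={exon_r:.3f} F1={exon_f1:.3f}")
```

In this toy set, a single wrong exon boundary makes the whole gene count as incorrect at gene level, while most of its neighbours still count at exon level. That is why gene-level scores are typically the stricter of the two.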
Figure 4.
Tiberius accuracy for test species, including non-mammalian species, plotted against the median time from the most recent common ancestor (MRCA) with Mus musculus, generated with TimeTree (Kumar et al. 2022).

References

    1. Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience 2022;11:giac104.
    1. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999;27:573–80.
    1. Brůna T, Hoff KJ, Lomsadze A. et al. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 2021;3:lqaa108.
    1. Brůna T, Li H, Guhlin J. et al. Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 2023;24:327.
    1. Brůna T, Lomsadze A, Borodovsky M. GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. Genome Res 2024;34:757–68.