[Preprint]. 2025 Sep 15:2025.05.07.652745.
doi: 10.1101/2025.05.07.652745.

Efficient evidence-based genome annotation with EviAnn

Aleksey V Zimin et al. bioRxiv.

Abstract

For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as "eviann".

Figures

Figure 1.
Accuracy of gene-level annotations by EviAnn, MAKER2, BRAKER3, and FINDER in six species. A gene locus was counted as correct if at least one transcript or CDS at that locus matched a reference transcript at the same locus, where a match required exact matches of all intron boundaries. Transcripts assembled with StringTie2 were used as input evidence for MAKER2, BRAKER3 and EviAnn. Data points representing the StringTie-assembled transcripts were added to illustrate how well those transcripts without further processing matched the references.
Figure 2.
Accuracy of annotations of protein-coding sequences (CDSs) by EviAnn, MAKER2, BRAKER3, and FINDER in six species. A protein-coding annotation was considered correct if all coordinates of the protein-coding regions (CDSs) precisely matched the reference annotation. Noncoding exons were not considered in this evaluation.
Figure 3.
Accuracy of annotations of transcript sequences by EviAnn, MAKER2, BRAKER3, and FINDER in six species. A transcript was considered correct if all introns precisely matched the reference annotation and if both the transcription start and end sites were within 100 bp of the reference values. MAKER2 results are not shown for D. rerio and M. musculus because the MAKER2 run had not completed after one month on a 24-core Intel Xeon Gold server.
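The matching criteria described in the captions of Figures 1 and 3 can be sketched in Python. This is a minimal illustration and not the evaluation code used in the paper: the `Transcript` class, the function names, the 1-based inclusive coordinate convention, and the reading of "within 100 bp" as "at most 100 bp" are all assumptions made for the example. Strand and chromosome are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


@dataclass(frozen=True)
class Transcript:
    # Hypothetical container: exons sorted by start coordinate.
    exons: Tuple[Interval, ...]

    def introns(self) -> Tuple[Interval, ...]:
        # An intron is the gap between two consecutive exons.
        return tuple(
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(self.exons, self.exons[1:])
        )


def intron_chain_match(pred: Transcript, ref: Transcript) -> bool:
    # Exact match of all intron boundaries. Note that two single-exon
    # transcripts both have an empty intron chain and therefore compare equal.
    return pred.introns() == ref.introns()


def locus_correct(preds: Iterable[Transcript], refs: Iterable[Transcript]) -> bool:
    # Gene-level criterion (Figure 1): at least one predicted transcript at
    # the locus matches a reference transcript at the same locus.
    refs = list(refs)
    return any(intron_chain_match(p, r) for p in preds for r in refs)


def transcript_match(pred: Transcript, ref: Transcript, tolerance: int = 100) -> bool:
    # Transcript-level criterion (Figure 3): exact intron chain, and both
    # ends within `tolerance` bp of the reference (inclusive bound assumed).
    start_ok = abs(pred.exons[0][0] - ref.exons[0][0]) <= tolerance
    end_ok = abs(pred.exons[-1][1] - ref.exons[-1][1]) <= tolerance
    return intron_chain_match(pred, ref) and start_ok and end_ok
```

Under these assumptions, a prediction can count toward gene-level accuracy while failing the stricter transcript-level test, which is one reason the two figures report different numbers for the same annotation.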
Figure 4.
Annotation comparisons for A. thaliana using proteins from a different taxonomic tribe, Brassiceae, as input to all pipelines. Accuracies are shown at the gene, CDS, and transcript levels, defined as in Figures 1–3.
Figure 5.
Sensitivity and precision for annotations of the mouse (M. musculus) genome with varying amounts of RNA-seq data. The individual points are labeled with the number of RNA-seq experiments used. The protein set is the same for all experiments.
Figure 6.
A simplified diagram of the EviAnn genome annotation pipeline.
Figure 7.
Resolution of conflicts between intron chains of aligned coding sequences (CDSs) from related species’ proteins (green) and transcripts assembled from the RNA-seq data (blue). In all cases, EviAnn uses the frame specified by the start of the aligned CDS and only annotates a transcript/CDS pair if it can locate a complete ORF on the transcript. (a) If there are no conflicts in the intron chain, use the transcript/CDS pair as is, and look for a complete ORF. (b) If the intron chain fails to match, trust the transcript intron chain (blue), look for a complete ORF on the transcript, and annotate if found. If no complete ORF is found, delete the transcript and use the CDS if it contains a complete ORF. (c) With a partial intron chain match, look for the first matching splice junction (arrow) and produce a consensus pseudo-transcript using exons from the transcript and CDS. Not pictured here: if the transcript is contained in the CDS, simply discard the transcript and use the exons from the CDS.
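The decision rules in the Figure 7 caption can be written out as a small Python sketch. It is an interpretation of the caption, not EviAnn's implementation: the function names, the case labels, and the way each case is detected from intron chains (contiguous sub-chain for "no conflict" and "contained", at least one shared intron for "partial match") are assumptions. ORF detection is left out and passed in as booleans.

```python
def is_subchain(short, long):
    # True if `short` occurs as a contiguous run inside `long`.
    n = len(short)
    return any(tuple(long[i:i + n]) == tuple(short)
               for i in range(len(long) - n + 1))


def classify(transcript_introns, cds_introns):
    # Assumed mapping from intron chains to the cases in the caption.
    if is_subchain(cds_introns, transcript_introns):
        return "a"          # no conflict in the intron chain
    if is_subchain(transcript_introns, cds_introns):
        return "contained"  # transcript contained in the CDS (not pictured)
    if set(transcript_introns) & set(cds_introns):
        return "c"          # partial intron chain match
    return "b"              # intron chains fail to match


def resolve(case, transcript_has_orf, cds_has_orf):
    # Returns a label for the action described in the caption.
    if case == "a":
        return "annotate transcript/CDS pair" if transcript_has_orf else "skip"
    if case == "b":
        if transcript_has_orf:
            return "annotate transcript"
        return "use CDS" if cds_has_orf else "skip"
    if case == "c":
        # The consensus pseudo-transcript would still need its own ORF check.
        return "build consensus pseudo-transcript"
    if case == "contained":
        return "use CDS exons"
    raise ValueError(f"unknown case: {case}")
```

For example, `classify(((201, 299),), ((210, 280),))` falls into case (b), and `resolve("b", False, True)` then falls back to the aligned CDS, mirroring the rule that the transcript is deleted when no complete ORF is found on it.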
