[Preprint]. 2023 Apr 10:2023.04.10.536199.
doi: 10.1101/2023.04.10.536199.

GALBA: Genome Annotation with Miniprot and AUGUSTUS

Tomáš Brůna et al. bioRxiv.

Update in

  • Galba: genome annotation with miniprot and AUGUSTUS.
    Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. BMC Bioinformatics. 2023 Aug 31;24(1):327. doi: 10.1186/s12859-023-05449-z. PMID: 37653395.

Abstract

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a Docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Figures

Figure 1:
The GALBA pipeline.
Figure 2:
Gene prediction F1-scores of GALBA across development steps using two different reference proteomes: dsim = D. simulans, combo = D. ananassae, D. grimshawi, D. pseudoobscura, D. virilis, and D. willistoni.
Figure 3:
Introns predicted by miniprot, characterized by miniprothint-derived IMC and IBA scores. The predictions originate from running miniprot on D. melanogaster with reference proteomes of five other Drosophila species (see Figure 4 for the list of reference species). A small random offset was added to each item to reduce the number of overlapping data points. Miniprothint discards all introns with IBA < 0.1 (the blue dotted line). This step improved the prediction Specificity from 80.0% to 89.8% at the cost of a Sensitivity decrease from 80.3% to 78.8%. Miniprothint also defines a set of high-confidence hints characterized by IBA >= 0.25 and IMC >= 4 (the red dashed lines). This further improved the Specificity to 98.5% while reducing the Sensitivity to 68.9%.
Figure 4:
Gene prediction of GALBA provided with either a proteome of a single reference species (corresponding to the phylogenetic tree from [25]), or executed with a combination of the species listed on the right. BRAKER2 can only be executed with a certain level of redundancy in the protein reference set, and results are therefore only provided for the combined protein input set.
Figure 5:
Sensitivity and Specificity on gene level in 14 genomes.
Figure 6:
Mono-exonic to multi-exonic gene ratios of the reference annotations, GALBA, BRAKER2, and a combination of both with TSEBRA in 14 model species.
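
The intron filtering thresholds quoted in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not miniprothint's actual code: the `Intron` class, the label strings, and the function names are hypothetical, and only the three threshold values (IBA < 0.1 discarded; IBA >= 0.25 and IMC >= 4 high-confidence) come from the caption. The `f1_score` helper assumes the convention common in gene prediction where "Specificity" means precision, TP / (TP + FP), so that F1 is the harmonic mean of Sensitivity and Specificity; the caption itself does not define these terms.

```python
from dataclasses import dataclass

# Threshold values taken from the Figure 3 caption.
IBA_DISCARD = 0.1   # introns with IBA below this are discarded
IBA_HIGH = 0.25     # minimum IBA for a high-confidence hint
IMC_HIGH = 4        # minimum IMC for a high-confidence hint


@dataclass(frozen=True)
class Intron:
    """Hypothetical container for the two scores of a predicted intron."""
    iba: float
    imc: float


def classify_intron(intron: Intron) -> str:
    """Return 'discarded', 'high_confidence', or 'kept' (illustrative labels)."""
    if intron.iba < IBA_DISCARD:
        return "discarded"
    if intron.iba >= IBA_HIGH and intron.imc >= IMC_HIGH:
        return "high_confidence"
    return "kept"


def f1_score(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of sensitivity and specificity.

    Assumption: 'specificity' is used in the precision sense.
    """
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2 * sensitivity * specificity / total


if __name__ == "__main__":
    for intron in (Intron(0.05, 10), Intron(0.15, 6), Intron(0.30, 4)):
        print(intron, classify_intron(intron))
    # Values after the IBA < 0.1 filter, as reported in the caption.
    print(round(f1_score(0.788, 0.898), 3))
```

Under that assumption, the helper makes it possible to compare the three operating points in the caption (unfiltered, filtered, high-confidence) on a single scale.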

References

    1. A comparative genomics multitool for scientific discovery and conservation. Nature, 587(7833):240–245, 2020.
    2. Uniprot: the universal protein knowledgebase in 2023. Nucleic Acids Research, 51(D1):D523–D531, 2023.
    3. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2):573–580, 1999.
    4. Bruna T., Hoff K. J., Lomsadze A., Stanke M., and Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics, 3(1):lqaa108, 2021.
    5. Bruna T., Lomsadze A., and Borodovsky M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics and Bioinformatics, 2(2):lqaa026, 2020.
