BMC Bioinformatics. 2023 Aug 31;24(1):327.
doi: 10.1186/s12859-023-05449-z.

Galba: genome annotation with miniprot and AUGUSTUS

Tomáš Brůna et al. BMC Bioinformatics. 2023.

Abstract

Background: The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes.

Results: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments.

Conclusions: Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Keywords: AUGUSTUS; Gene prediction; Miniprot; Protein coding gene.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The GALBA pipeline. Miniprot performs rapid spliced alignment of proteins against the genome. Subsequently, miniprothint (2) scores and classifies these alignments. Training genes for AUGUSTUS are generated from the best high-quality miniprot alignment per locus (1). After training, AUGUSTUS predicts genes using the alignment evidence generated by miniprothint. AUGUSTUS parameters are refined by one iteration of training (3). The numbering of steps corresponds to the order in which the steps were introduced into GALBA during development; see Additional file 1: Results section S4.1
Fig. 2
Gene prediction of GALBA provided with either a proteome of a single reference species (corresponding to phylogenetic tree from [57]), or executed with a combination of the species listed on the right. BRAKER2 can only be executed with a certain level of redundancy in the protein reference set, and results are therefore only provided for the combined protein input set
Fig. 3
Gene-level sensitivity and specificity in 7 genomes smaller than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of GALBA and TSEBRA (labelled as TSEBRA G+B)
Fig. 4
Gene-level sensitivity and specificity in 7 genomes larger than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of GALBA and TSEBRA (labelled as TSEBRA G+B)
Fig. 5
Network plot of gene F1 accuracy for (clockwise starting from the top, increasing genome sizes) insects, metazoa, plants, and vertebrates. We show accuracy of GALBA and its intermediate product miniprot, and of BRAKER2 and its intermediate GeneMark-ES and GeneMark-EP gene sets. Accuracy of the combiner TSEBRA combining the final gene sets of both GALBA and BRAKER2 is also shown as TSEBRA G+B
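The gene-level sensitivity, specificity, and F1 values named in the captions of Figs. 3 to 5 can be illustrated with a minimal Python sketch. It assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of reference genes that are predicted correctly, specificity is the fraction of predicted genes that are correct, and F1 is their harmonic mean. The function name gene_level_accuracy, the tuple representation of a gene, the exact-match criterion, and the toy coordinates are all hypothetical and are not taken from the GALBA evaluation code.

```python
def gene_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity, F1) for two collections of genes.

    Assumption: a predicted gene counts as correct only if an identical
    gene structure is present in the reference set.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)

    # Sensitivity: share of reference genes that were recovered.
    sensitivity = true_pos / len(reference) if reference else 0.0
    # Specificity (as used in gene finding): share of predictions that are correct.
    specificity = true_pos / len(predicted) if predicted else 0.0

    # F1: harmonic mean of sensitivity and specificity.
    total = sensitivity + specificity
    f1 = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f1


# Toy genes encoded as (sequence, strand, coding exon intervals).
reference = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1200),)),
    ("chr2", "+", ((50, 150),)),
    ("chr2", "+", ((500, 650), (700, 800))),
}
predicted = {
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((900, 1100),)),            # wrong exon boundary
    ("chr2", "+", ((50, 150),)),              # exact match
}

sn, sp, f1 = gene_level_accuracy(predicted, reference)
print(f"Sn={sn:.3f} Sp={sp:.3f} F1={f1:.3f}")  # Sn=0.500 Sp=0.667 F1=0.571
```

In this toy case two of four reference genes are recovered and two of three predictions are correct, so a single wrong exon boundary lowers both measures at once.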

References

    1. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, Durbin R, Edwards SV, Forest F, Gilbert MTP, et al. Earth BioGenome project: sequencing life for the future of life. Proc Natl Acad Sci. 2018;115(17):4325–33. doi: 10.1073/pnas.1720115115.
    2. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–1365. doi: 10.1038/s41587-021-01108-x.
    3. Lawniczak MK, Durbin R, Flicek P, Lindblad-Toh K, Wei X, Archibald JM, Baker WJ, Belov K, Blaxter ML, Marques Bonet T, et al. Standards recommendations for the Earth BioGenome Project. Proc Natl Acad Sci. 2022;119(4):2115639118. doi: 10.1073/pnas.2115639118.
    4. Hope H, Willis S, Markie M, Elliott L. Wellcome Open Research. https://wellcomeopenresearch.org/browse/articles Accessed 10 April 2023.
    5. National Center for Biotechnology Information. NCBI Genomes. https://www.ncbi.nlm.nih.gov/genome/browse#!/eukaryotes/ Accessed 10 April 2023.
