Genome Res. 2024 Jun 25;34(5):769-777.

doi: 10.1101/gr.278090.123.

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA

Lars Gabriel^{1,2}, Tomáš Brůna³, Katharina J Hoff^{4,2}, Matthis Ebel^{1,2}, Alexandre Lomsadze⁵, Mark Borodovsky^#^{6,7}, Mario Stanke^#^{1,2}

Affiliations

¹ Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany.
² Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.
³ U.S. Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
⁴ Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany; katharina.hoff@uni-greifswald.de.
⁵ Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA; alexandre.lomsadze@bme.gatech.edu.
⁶ Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA.
⁷ School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA.

^# Contributed equally.

PMID: 38866550
PMCID: PMC11216308
DOI: 10.1101/gr.278090.123

Lars Gabriel et al. Genome Res. 2024.

Abstract

Gene prediction has long been an active area of bioinformatics research. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. The recently released GeneMark-ETP achieved a further significant performance improvement by integrating all three data types (genomic, transcriptomic, and protein data). Here we present the BRAKER3 pipeline, which builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2: the transcript-level F1-score increases by about 20 percentage points on average, and the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools: MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
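To make the transcript-level F1-score mentioned in the abstract concrete, the following is a minimal Python sketch of how sensitivity, precision, and F1 are commonly computed for a set of predicted transcripts against a reference annotation. It is not the authors' evaluation code. The function name `transcript_level_scores` and the representation of a transcript as a tuple of sequence name, strand, and coding-exon coordinates are assumptions made for illustration, as is the exact-match criterion for counting a prediction as correct.

```python
def transcript_level_scores(predicted, reference):
    """Return (sensitivity, precision, F1) for two collections of transcripts.

    Assumption: a transcript is a hashable key such as
    (seqid, strand, ((start1, end1), (start2, end2), ...)), and a predicted
    transcript is correct only if it matches a reference transcript exactly.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)

    # Sensitivity: fraction of reference transcripts that were predicted.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Precision: fraction of predicted transcripts found in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0

    # F1 is the harmonic mean of sensitivity and precision.
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f1


# Hypothetical toy data: four reference transcripts, three predictions,
# two of which match the reference exactly.
reference = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((100, 200), (350, 400))),
    ("chr1", "-", ((900, 1000),)),
    ("chr2", "+", ((50, 150),)),
]
predicted = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1000),)),
    ("chr2", "+", ((60, 150),)),  # wrong start coordinate, so no match
]
sensitivity, precision, f1 = transcript_level_scores(predicted, reference)
```

Under these assumptions, a difference of "20 percentage points" refers to the F1 value expressed as a percentage, for example a change from 0.40 to 0.60.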

Figures

**Figure 1.**
Schematic view of the BRAKER3 pipeline. Required inputs are genomic sequences, short-read RNA-seq data, and a protein database. The RNA-seq data can be provided in three different forms: IDs of libraries available at the Sequence Read Archive (Leinonen et al. 2010), unaligned reads, or aligned reads. If library IDs are given, BRAKER3 downloads the raw RNA-seq reads using the SRA Toolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and aligns them to the genome using HISAT2 (Kim et al. 2019). It is also possible to use a combination of these formats when using more than one library.

**Figure 2.**
Average precision and sensitivity of gene predictions made by BRAKER1, BRAKER2, TSEBRA, GeneMark-ETP, and BRAKER3 for the genomes of 11 different species (listed in Supplemental Table S1). Inputs were the genomic sequences, short-read RNA-seq libraries, and protein databases (*order excluded*).

**Figure 3.**
Gene-level precision and sensitivity of gene predictions made by BRAKER1, BRAKER2, TSEBRA, GeneMark-ETP, and BRAKER3 for the genomes of 11 different species: well-annotated and compact genomes (first and second row), well-annotated and large genomes (third row), other genomes (fourth row). The fourth column shows the average for each group. Inputs were the genomic sequences, short-read RNA-seq libraries, and protein databases (*order excluded*).

**Figure 4.**
Lowly, medium, and highly expressed transcripts are in the first, second, and third terciles of expression levels, respectively.
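As an illustration of the tercile grouping named in this caption, the sketch below labels transcripts as low, medium, or high by expression. It assumes per-transcript expression estimates (for example TPM values) are already available; the function name `expression_terciles`, the transcript IDs, and the handling of values that fall exactly on a cut point are hypothetical choices, not taken from the paper.

```python
from statistics import quantiles


def expression_terciles(expression_by_transcript):
    """Label each transcript 'low', 'medium', or 'high' by expression tercile.

    Assumption: values equal to a cut point are assigned to the lower group.
    """
    values = sorted(expression_by_transcript.values())
    # quantiles(..., n=3) returns the two cut points separating three groups.
    lower_cut, upper_cut = quantiles(values, n=3)

    labels = {}
    for transcript_id, level in expression_by_transcript.items():
        if level <= lower_cut:
            labels[transcript_id] = "low"
        elif level <= upper_cut:
            labels[transcript_id] = "medium"
        else:
            labels[transcript_id] = "high"
    return labels


# Hypothetical expression values for nine transcripts t1..t9.
labels = expression_terciles({f"t{i}": float(i) for i in range(1, 10)})
```

With nine evenly spaced values, each tercile receives three transcripts; with real data, ties at a cut point can make the groups unequal in size.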

**Figure 5.**
Average precision and sensitivity of gene predictions made by MAKER2, Funannotate, and BRAKER3 for a subset of eight species (excluding the mouse, spider, and fish genomes). Inputs were the genomic sequences, short-read RNA-seq libraries, and protein databases (*close relatives included*). The accuracy of MAKER2 reported here can be regarded as an upper limit of what can be expected when annotating a previously unannotated genome (see “Experiments” section).

**Figure 6.**
The execution time of BRAKER3. The time required for aligning the RNA-seq to the genome and thus producing the BAM input files is not included.

Update of

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA.
Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. bioRxiv [Preprint]. 2024 Feb 29:2023.06.10.544449. doi: 10.1101/2023.06.10.544449. Update in: Genome Res. 2024 Jun 25;34(5):769-777. doi: 10.1101/gr.278090.123. PMID: 37398387.

References

1. Banerjee S, Bhandary P, Woodhouse M, Sen TZ, Wise RP, Andorf CM. 2021. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics 22: 205. doi: 10.1186/s12859-021-04120-9
2. Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34: 525–527. doi: 10.1038/nbt.3519
3. Brůna T, Lomsadze A, Borodovsky M. 2020. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom Bioinform 2: lqaa026. doi: 10.1093/nargab/lqaa026
4. Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. 2021. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 3: lqaa108. doi: 10.1093/nargab/lqaa108
5. Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. 2023. Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 24: 327. doi: 10.1186/s12859-023-05449-z

Grants and funding

R01 GM128145/GM/NIGMS NIH HHS/United States
