Appl Plant Sci. 2023 Aug 8;11(4):e11533.

doi: 10.1002/aps3.11533. eCollection 2023 Jul-Aug.

Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes

PMID: 37601314
PMCID: PMC10439824
DOI: 10.1002/aps3.11533

Vidya S Vuruputoor et al. Appl Plant Sci. 2023.

Affiliation

¹ Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269, USA.

Erratum in

Correction to Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes.
[No authors listed] Appl Plant Sci. 2023 Nov 7;11(6):e11553. doi: 10.1002/aps3.11553. eCollection 2023 Nov-Dec. PMID: 38106536. Free PMC article.

Abstract

Premise: Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions.

Methods: The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.

Results: Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. As implemented in the current workflows, adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives. Post-processing with functional and structural filters is highly recommended.
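As an illustration of the mono-exonic/multi-exonic benchmark mentioned above, the following is a minimal sketch of how such a ratio could be computed from a GFF3 annotation. It is not the authors' implementation. The function name `mono_multi_ratio` and the example records are hypothetical, and the sketch assumes that exon features carry a `Parent` attribute and that counting per transcript (rather than per gene) is an acceptable simplification.

```python
from collections import Counter


def mono_multi_ratio(gff3_lines):
    """Return (mono, multi, ratio) from exon features in GFF3 text lines.

    Assumption: each exon row names its transcript in the Parent attribute.
    Transcripts, not genes, are counted here as a simplification.
    """
    exons_per_transcript = Counter()
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1

    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    ratio = mono / multi if multi else float("nan")
    return mono, multi, ratio


# Hypothetical records: t1 and t3 have two exons each, t2 has one.
example = [
    "##gff-version 3",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsrc\texon\t200\t300\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=e3;Parent=t2",
    "chr1\tsrc\texon\t1000\t1100\t.\t-\t.\tID=e4;Parent=t3",
    "chr1\tsrc\texon\t1200\t1300\t.\t-\t.\tID=e5;Parent=t3",
]

print(mono_multi_ratio(example))
```

To apply the sketch to a real annotation, the lines of an open GFF3 file can be passed in place of `example`. A per-gene count would additionally require mapping each transcript to its gene through the `Parent` attribute of the mRNA rows.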

Discussion: While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.

Keywords: BRAKER; MAKER; StringTie2; TSEBRA; gene identification; genome annotation; plant genomes.

Figures

**Figure 1**
Genome size, repeat content, and BUSCO completeness for the five plant genomes: *Arabidopsis thaliana*, *Populus trichocarpa*, *Funaria hygrometrica*, *Rosa chinensis*, and *Liriodendron chinense*. Each pie represents the BUSCO completeness. Green denotes the completeness score, orange indicates the fragmented score, and blue indicates the missing score from BUSCO. (A) BUSCO scores estimated from the published assemblies. (B) BUSCO scores estimated from protein‐coding gene predictions from the published annotations.

**Figure 2**
Comparison of BUSCO, sensitivity, and false positive rates between the *Arabidopsis* and *Populus* annotations (Appendix S8). (A) BUSCO completeness scores for the MK (SR/RM2+) and BR (SR/RM2+) runs of *Arabidopsis* and *Populus*. Green denotes the completeness score, orange indicates the fragmented score, and blue indicates the missing score from BUSCO. (B) False positive rates and sensitivity scores, assessed with Mikado against the published annotations, for *Arabidopsis* (red) and *Populus* (gold) for the MAKER, BRAKER, TSEBRA, Trinity, and StringTie2 runs. Multiple points per run reflect differences in input read type and repeat masking.

**Figure 3**
Comparing metrics between BRAKER (blue) and StringTie2 (red) predictions. (A) Mono:multi ratios, (B) BUSCO comparisons, and (C) EnTAP annotation rates of the gene models. The yellow region indicates the ideal value for each of the metrics.

**Figure 4**
Comparison of scores across all species between the runs of different input types and software. (A) BUSCO completeness scores. (B) Mono:multi ratios. (C) EnTAP annotation rates. MAKER is shown in green, BRAKER is light blue, TSEBRA is dark blue, and StringTie2 is red. The yellow rectangle represents the target scores for each benchmark. RM2+, RepeatModeler2 with LTRStruct.

**Figure 5**
The effect of soft masking on gene prediction in *Liriodendron* (Appendix S11). (A) Performing structural annotation on the unmasked *Liriodendron* genome results in the identification of more mono‐exonic genes as opposed to multi‐exonic genes. Blue denotes the BRAKER (BR) runs for both genomes, SR denotes short reads, and LR denotes long reads. The lighter shade represents mono‐exonics, and the darker shade represents the multi‐exonics. (B) More genes are predicted in this region using the unmasked genome (blue), whereas only one gene is predicted using the masked genome (red). The green track shows the long terminal repeat elements in the genome as identified by RepeatModeler2. The RNA alignment reads show a read pile‐up at the predicted gene (masked track).

Cited by

Galba: genome annotation with miniprot and AUGUSTUS.
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. BMC Bioinformatics. 2023 Aug 31;24(1):327. doi: 10.1186/s12859-023-05449-z. PMID: 37653395. Free PMC article.

Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response.
Lo T, Coombe L, Gagalova KK, Marr A, Warren RL, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Pavy N, Jones SJM, Bohlmann J, Bousquet J, Birol I, Thomson A. G3 (Bethesda). 2023 Dec 29;14(1):jkad247. doi: 10.1093/g3journal/jkad247. PMID: 37875130. Free PMC article.

Crossroads of assembling a moss genome: navigating contaminants and horizontal gene transfer in the moss Physcomitrellopsis africana.
Vuruputoor VS, Starovoitov A, Cai Y, Liu Y, Rahmatpour N, Hedderson TA, Wilding N, Wegrzyn JL, Goffinet B. G3 (Bethesda). 2024 Jul 8;14(7):jkae104. doi: 10.1093/g3journal/jkae104. PMID: 38781445. Free PMC article.

Annotation of protein-coding genes in 49 diatom genomes from the Bacillariophyta clade.
Nenasheva N, Pitzschel C, Webster CN, Hart AJ, Wegrzyn JL, Bengtsson MM, Hoff KJ. Sci Data. 2025 Jun 11;12(1):985. doi: 10.1038/s41597-025-05306-z. PMID: 40500266. Free PMC article.

GALBA: Genome Annotation with Miniprot and AUGUSTUS.
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. bioRxiv [Preprint]. 2023 Apr 10:2023.04.10.536199. doi: 10.1101/2023.04.10.536199. Update in: BMC Bioinformatics. 2023 Aug 31;24(1):327. doi: 10.1186/s12859-023-05449-z. PMID: 37090650. Free PMC article. Preprint.
