Appl Plant Sci. 2023 Aug 8;11(4):e11533.
doi: 10.1002/aps3.11533. eCollection 2023 Jul-Aug.

Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes

Vidya S Vuruputoor et al. Appl Plant Sci.

Erratum in

Abstract

Premise: Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions.

Methods: The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.

Results: Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended.
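The mono-exonic/multi-exonic benchmark mentioned above can be illustrated with a minimal sketch. It assumes the annotation is available as GFF3 text in which each `exon` feature carries a `Parent` attribute naming its transcript, and it counts at the transcript level; the study's own tooling and counting unit may differ. The function name `mono_multi_ratio` and the example records are hypothetical and are not taken from the paper.

```python
from collections import Counter

def mono_multi_ratio(gff3_text):
    """Return (mono, multi, ratio) counted per transcript from GFF3 exon rows.

    Assumption: every exon row has a Parent attribute naming its transcript.
    """
    exons_per_transcript = Counter()
    for line in gff3_text.splitlines():
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # GFF3 rows have nine tab-separated columns; column 3 is the feature type.
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    ratio = mono / multi if multi else float("nan")
    return mono, multi, ratio

# Hypothetical records: t1 has two exons, t2 has one, t3 has three.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "chr1\tsketch\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsketch\texon\t300\t400\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsketch\texon\t900\t1200\t.\t-\t.\tID=e3;Parent=t2",
    "chr1\tsketch\texon\t2000\t2100\t.\t+\t.\tID=e4;Parent=t3",
    "chr1\tsketch\texon\t2200\t2300\t.\t+\t.\tID=e5;Parent=t3",
    "chr1\tsketch\texon\t2400\t2500\t.\t+\t.\tID=e6;Parent=t3",
])

mono, multi, ratio = mono_multi_ratio(EXAMPLE_GFF3)
print(f"mono-exonic={mono} multi-exonic={multi} mono:multi={ratio:.2f}")
```

Counting per transcript keeps the sketch short. A gene-level count would first need to select one representative transcript per gene, which depends on how the annotation encodes isoforms.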

Discussion: While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.

Keywords: BRAKER; MAKER; StringTie2; TSEBRA; gene identification; genome annotation; plant genomes.

Figures

Figure 1
Genome size, repeat content, and BUSCO completeness for the five plant genomes: Arabidopsis thaliana, Populus trichocarpa, Funaria hygrometrica, Rosa chinensis, and Liriodendron chinense. Each pie represents the BUSCO completeness. Green denotes the completeness score, orange indicates the fragmented score, and blue indicates the missing score from BUSCO. (A) BUSCO scores estimated from the published assemblies. (B) BUSCO scores estimated from protein‐coding gene predictions from the published annotations.
Figure 2
Comparison of BUSCO, sensitivity, and false positive rates between the Arabidopsis and Populus annotations (Appendix S8). (A) BUSCO completeness scores for the MK (SR/RM2+) and BR (SR/RM2+) runs of Arabidopsis and Populus. Green denotes the completeness score, orange indicates the fragmented score, and blue indicates the missing score from BUSCO. (B) False positive rates and sensitivity scores, assessed using Mikado against the published annotations, for Arabidopsis (red) and Populus (gold) for the MAKER, BRAKER, TSEBRA, Trinity, and StringTie2 runs. Multiple points per run reflect differences in input read type and repeat masking.
Figure 3
Comparing metrics between BRAKER (blue) and StringTie2 (red) predictions. (A) Mono:multi ratios, (B) BUSCO comparisons, and (C) EnTAP annotation rates of the gene models. The yellow region indicates the ideal value for each of the metrics.
Figure 4
Comparison of scores across all species between the runs of different input types and software. (A) BUSCO completeness scores. (B) Mono:multi ratios. (C) EnTAP annotation rates. MAKER is shown in green, BRAKER is light blue, TSEBRA is dark blue, and StringTie2 is red. The yellow rectangle represents the target scores for each benchmark. RM2+, RepeatModeler2 with LTRStruct.
Figure 5
The effect of soft masking on gene prediction in Liriodendron (Appendix S11). (A) Performing structural annotation on the unmasked Liriodendron genome results in the identification of more mono‐exonic genes than multi‐exonic genes. Blue denotes the BRAKER (BR) runs for both genomes, SR denotes short reads, and LR denotes long reads. The lighter shade represents mono‐exonic genes, and the darker shade represents multi‐exonic genes. (B) More genes are predicted using the unmasked genome (blue), compared with only one gene predicted in this region with the masked genome (red). The green track shows the long terminal repeat elements in the genome as identified by RepeatModeler2. The RNA alignment reads show a read pile‐up at the predicted gene (masked track).
