Genome Biol. 2019 Dec 16;20(1):275.
doi: 10.1186/s13059-019-1905-y.

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

Shujun Ou et al. Genome Biol.

Abstract

Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations.

Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species.

Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

Keywords: Annotation; Benchmarking; Genome; Pipeline; Transposable element.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic representation of benchmarking metrics. a Definition of TP, true positive; FP, false positive; FN, false negative; and TN, true negative. b Definition of sensitivity, specificity, accuracy, precision, F1 measure, and false discovery rate (FDR). Each metric is calculated based on genomic sequence length in bp
Fig. 2
Annotation performance of general repeat annotators compared to the rice curated annotation. a Annotation and b classification performance of various methods. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F1 measure
Fig. 3
Annotation performance of retrotransposon-related programs as compared to the rice curated annotation. a Various methods to identify LTR retrotransposons. GRF-LTR_FINDER combines the terminal direct repeat search engine in GRF and the filtering engine in a modified version of LTR_FINDER for detection of LTR retrotransposons. The LTR_FINDER result was generated by the parallel version. b LTR_retriever-specific results, which were generated using LTR_retriever to process results from other programs specified in each of the names in the figure. c Non-LTR retrotransposon annotation methods. d Short interspersed nuclear element (SINE) annotation methods. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F1 measure
Fig. 4
Annotation performance of DNA transposon-related programs as compared to the rice curated annotation. a General methods and c structure-based methods to identify TIR elements. The TIR-Learner_rmLTR and TIRvish_rmLTR libraries had LTR-related sequences removed using the curated library. b Structure-based methods and specialized database to identify miniature inverted transposable elements (MITEs). d Annotation performance of Helitron-related methods as compared to the rice curated annotation. The HelitronScanner_clean result had non-Helitron TE sequences removed using the curated library. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F1 measure
Fig. 5
The Extensive de-novo TE Annotator (EDTA) pipeline. a The EDTA workflow. LTR retrotransposons, TIR elements, and Helitron candidates are identified from the genome sequence. Sublibraries (such as LTR library, TIR library, etc.) are filtered using EDTA library filtering scripts (including both basic filters and advanced filters, see the “Methods” section for details) for removal of misclassified TEs and are then used to mask TEs in the genome. The unmasked part of the genome is processed by RepeatModeler to identify non-LTR retrotransposons and any unclassified TEs that are missed by the structure-based library. Nested insertions and protein-coding sequences are removed in the final step to generate the final TE library. Performance of b EDTA stage 0 sublibraries and c EDTA stage 1 sublibraries after basic filtering and advanced filtering, respectively. Annotation of the rice genome using d the curated library and e the final EDTA-generated library
Fig. 6
Benchmarking of the EDTA pipeline. Misclassification rate of whole-genome TEs annotated by a our curated rice library, b the Maize TE Consortium curated maize library (Maize_MTEC), c the community curated Drosophila library (Dmel_std6.28), d the EDTA-generated rice library, e the EDTA-generated maize library, f the EDTA-generated Drosophila library, and g the EDTA-generated stage 0 library with only basic filtering. Benchmarking of EDTA-generated maize (h) and Drosophila (i) libraries using Maize_MTEC and Dmel_std6.28 libraries, respectively
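
The six metrics named in the abstract and in Fig. 1 are computed from base-pair counts rather than from element counts. The following is a minimal sketch of that idea, not the authors' benchmarking code. It assumes the standard textbook definitions of each metric, a hypothetical representation of annotations as half-open `(start, end)` intervals on a single sequence, and hypothetical function names (`covered_bp`, `confusion_bp`, `metrics`). The set-based overlap is only practical for small examples.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: half-open [start, end) coordinates in bp.
Interval = Tuple[int, int]


def covered_bp(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of base positions covered by any interval."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def confusion_bp(
    reference: Iterable[Interval],
    test: Iterable[Interval],
    genome_length: int,
) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN in bp for a test annotation against a reference."""
    ref = covered_bp(reference)
    tst = covered_bp(test)
    tp = len(ref & tst)   # annotated as TE in both
    fp = len(tst - ref)   # annotated only by the test method
    fn = len(ref - tst)   # annotated only by the reference
    tn = genome_length - tp - fp - fn  # annotated by neither
    return tp, fp, fn, tn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard definitions of the six metrics, from bp-level counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "FDR": _ratio(fp, tp + fp),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```

For example, on a toy 100 bp sequence with a reference annotation of `[(0, 40)]` and a test annotation of `[(20, 50)]`, `confusion_bp` gives 20 bp TP, 10 bp FP, 20 bp FN, and 50 bp TN, so `metrics` reports a sensitivity of 0.5 and an accuracy of 0.7. Counting in bp means a long, partially recovered element contributes to both TP and FN in proportion to the sequence that was and was not annotated.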
