Genome Biol. 2019 Dec 16;20(1):275.

doi: 10.1186/s13059-019-1905-y.

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

Shujun Ou¹, Weijia Su², Yi Liao³, Kapeel Chougule⁴, Jireh R A Agda⁵, Adam J Hellinga⁵, Carlos Santiago Blanco Lugo⁵, Tyler A Elliott⁵, Doreen Ware⁴,⁶, Thomas Peterson², Ning Jiang⁷, Candice N Hirsch⁸, Matthew B Hufford⁹

Affiliations

¹ Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, 50011, USA.
² Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, 50011, USA.
³ Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, 92697, USA.
⁴ Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
⁵ Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, N1G 2W1, Canada.
⁶ USDA-ARS NEA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, NY, 14853, USA.
⁷ Department of Horticulture, Michigan State University, East Lansing, MI, 48824, USA. jiangn@msu.edu.
⁸ Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, 55108, USA. cnhirsch@umn.edu.
⁹ Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, 50011, USA. mhufford@iastate.edu.

PMID: 31843001
PMCID: PMC6913007
DOI: 10.1186/s13059-019-1905-y

Shujun Ou et al. Genome Biol. 2019.

Abstract

Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations.

Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F₁. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species.

Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

Keywords: Annotation; Benchmarking; Genome; Pipeline; Transposable element.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Schematic representation of benchmarking metrics. a Definition of TP, true positive; FP, false positive; FN, false negative; and TN, true negative. b Definition of sensitivity, specificity, accuracy, precision, F₁ measure, and false discovery rate (FDR). Each metric is calculated based on genomic sequence length in bp
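
As an illustration of the bp-based metrics described in Fig. 1, the following is a minimal sketch, not the authors' benchmarking code. It assumes the standard confusion-matrix definitions of each metric (the exact forms used in the paper are those shown in Fig. 1b). The half-open interval representation, the function names, and the toy coordinates are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: half-open [start, end) coordinates in bp.
Interval = Tuple[int, int]


def covered_bases(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of base positions covered by a list of intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def confusion_bp(reference: Iterable[Interval],
                 test: Iterable[Interval],
                 genome_length: int) -> Dict[str, int]:
    """Count TP, FP, FN, and TN in bp for a test annotation vs. a reference."""
    ref = covered_bases(reference)
    tst = covered_bases(test)
    tp = len(ref & tst)   # annotated as TE in both
    fp = len(tst - ref)   # annotated only by the test method
    fn = len(ref - tst)   # annotated only in the reference
    tn = genome_length - tp - fp - fn  # annotated by neither
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def _ratio(num: float, den: float) -> float:
    """Divide safely; return NaN when the denominator is zero."""
    return num / den if den else float("nan")


def metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Standard definitions (assumed here) of the six metrics."""
    tp, fp, fn, tn = c["TP"], c["FP"], c["FN"], c["TN"]
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "precision": _ratio(tp, tp + fp),
        "FDR": _ratio(fp, tp + fp),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Toy 100-bp "genome" with made-up annotations.
    counts = confusion_bp(reference=[(10, 30), (50, 60)],
                          test=[(20, 40), (50, 55)],
                          genome_length=100)
    print(counts)
    print(metrics(counts))
```

The set-based bookkeeping keeps the sketch short but stores one entry per annotated base, so it is suitable only for small examples; a real genome would call for interval arithmetic instead.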

**Fig. 2**
Annotation performance of general repeat annotators compared to the rice curated annotation. a Annotation and b classification performance of various methods. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F₁ measure

**Fig. 3**
Annotation performance of retrotransposon-related programs as compared to the rice curated annotation. a Various methods to identify LTR retrotransposons. GRF-LTR_FINDER combines the terminal direct repeat search engine in GRF and the filtering engine in a modified version of LTR_FINDER for detection of LTR retrotransposons. The LTR_FINDER result was generated by the parallel version. b LTR_retriever-specific results, which were generated using LTR_retriever to process results from other programs specified in each of the names in the figure. c Non-LTR retrotransposon annotation methods. d Short interspersed nuclear element (SINE) annotation methods. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F₁ measure

**Fig. 4**
Annotation performance of DNA transposon-related programs as compared to the rice curated annotation. a General methods and c structure-based methods to identify TIR elements. The TIR-Learner_rmLTR and TIRvish_rmLTR libraries had LTR-related sequences removed using the curated library. b Structure-based methods and specialized database to identify miniature inverted transposable elements (MITEs). d Annotation performance of *Helitron*-related methods as compared to the rice curated annotation. The HelitronScanner_clean result had non-*Helitron* TE sequences removed using the curated library. Sens, sensitivity; Spec, specificity; Accu, accuracy; Prec, precision; FDR, false discovery rate; F1, F₁ measure

**Fig. 5**
The Extensive *de-novo* TE Annotator (EDTA) pipeline. a The EDTA workflow. LTR retrotransposons, TIR elements, and *Helitron* candidates are identified from the genome sequence. Sublibraries (such as LTR library, TIR library, etc.) are filtered using EDTA library filtering scripts (including both basic filters and advanced filters, see the “Methods” section for details) for removal of misclassified TEs and are then used to mask TEs in the genome. The unmasked part of the genome is processed by RepeatModeler to identify non-LTR retrotransposons and any unclassified TEs that are missed by the structure-based library. Nested insertions and protein-coding sequences are removed in the final step to generate the final TE library. Performance of b EDTA stage 0 sublibraries and c EDTA stage 1 sublibraries after basic filtering and advanced filtering, respectively. Annotation of the rice genome using d the curated library and e the final EDTA-generated library

**Fig. 6**
Benchmarking of the EDTA pipeline. Misclassification rate of whole-genome TEs annotated by a our curated rice library, b the Maize TE Consortium curated maize library (Maize_MTEC), c the community curated Drosophila library (Dmel_std6.28), d the EDTA-generated rice library, e the EDTA-generated maize library, f the EDTA-generated Drosophila library, and g the EDTA-generated stage 0 library with only basic filtering. Benchmarking of EDTA-generated maize (h) and Drosophila (i) libraries using Maize_MTEC and Dmel_std6.28 libraries, respectively

Comment in

Accounting for diverse transposable element landscapes is key to developing and evaluating accurate de novo annotation strategies.
Gozashti L, Hoekstra HE. Genome Biol. 2024 Jan 2;25(1):4. doi: 10.1186/s13059-023-03118-1. PMID: 38166955.

