Proc Natl Acad Sci U S A. 2020 Apr 28;117(17):9451-9457.
doi: 10.1073/pnas.1921046117. Epub 2020 Apr 16.

RepeatModeler2 for automated genomic discovery of transposable element families

Jullien M Flynn et al. Proc Natl Acad Sci U S A.

Abstract

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).

Keywords: genome annotation; mobile genetic elements; transposon families.

Conflict of interest statement

Competing interest statement: C.F. and M.C.G.H. are coauthors on a 2018 review article: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1577-z.

Figures

Fig. 1.
RepeatModeler2 flow diagram.
Fig. 2.
Benchmarking of RepeatModeler2 on three model species. (Top) Genome composition (Upper) and number of families (Lower) of each TE subclass for the reference libraries. (Bottom) Genome composition (Upper) and number of families (Lower) of each TE subclass for the RepeatModeler2 library.
Fig. 3.
Evaluation family by family for RepeatModeler1 and RepeatModeler2. (A) Definitions of “Perfect,” “Good,” and “Present” families. “Perfect” families are those for which one sequence in our de novo library matches >95% in sequence identity and coverage to a family in the reference library. “Good” families are those in which multiple overlapping library sequences with alignments >95% similar to the reference consensus make up the >95% sequence coverage of the element. Finally, a family is considered “present” if one or multiple library sequences align with >80% similarity to the reference consensus sequence and cover >80% of the sequence. Otherwise, we consider a family “not found” (although there may be fragments present). (B) Summary of families found by the last release of RepeatModeler (RM1) and RepeatModeler2 (RM2). (C) Number of perfect families by subclass for each benchmark species.
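The "Perfect," "Good," and "Present" definitions in the Fig. 3 caption can be read as a small decision procedure. The following is a minimal sketch of that reading, not the authors' evaluation code. The `Hit` record, the function names, and the input layout are hypothetical. The sketch also assumes that each library sequence contributes one alignment per reference family, treats "identity" and "similarity" as the same fraction, and does not check that the sequences in a "Good" family overlap one another; it only checks their combined coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a de novo library sequence to a reference consensus.

    Hypothetical record for illustration; coordinates are on the reference.
    """
    library_id: str
    identity: float  # fraction in [0, 1]
    ref_start: int   # 0-based, inclusive
    ref_end: int     # 0-based, exclusive


def covered_fraction(hits, ref_length):
    """Fraction of the reference covered by the union of the hit intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted((h.ref_start, h.ref_end) for h in hits):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / ref_length


def classify_family(hits, ref_length):
    """Label a reference family using the thresholds from the caption."""
    # Perfect: a single library sequence with >95% identity and >95% coverage.
    for h in hits:
        if h.identity > 0.95 and (h.ref_end - h.ref_start) / ref_length > 0.95:
            return "perfect"
    # Good: several >95%-identity sequences that together cover >95%.
    high = [h for h in hits if h.identity > 0.95]
    if high and covered_fraction(high, ref_length) > 0.95:
        return "good"
    # Present: one or more >80%-identity sequences that cover >80%.
    medium = [h for h in hits if h.identity > 0.80]
    if medium and covered_fraction(medium, ref_length) > 0.80:
        return "present"
    return "not found"
```

For example, under these assumptions two library sequences with identities of 0.97 and 0.96 that span positions 0 to 600 and 500 to 990 of a 1,000-bp reference would be labeled "good": neither covers more than 95% alone, but together they cover 99%.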
