Review. Genes (Basel). 2022 Apr 17;13(4):709.
doi: 10.3390/genes13040709.

Methodologies for the De novo Discovery of Transposable Element Families

Jessica M Storer et al. Genes (Basel).

Abstract

The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approach to the development of complete libraries remains elusive. In this review, we cover established methods and recent developments in de novo TE analysis. We also present various methodologies used to assess these tools and discuss opportunities for further advancement of the field.

Keywords: curation; de novo methods; genome annotation; repeats; signature-based methods; transposable element; transposon.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Spectrum of methodologies for the discovery of TE sequences.
Figure 2
K-mer-based approaches on sequence assemblies. Once the k-mer composition of the assembly has been characterized, the word counts are used in one of four ways: to annotate each base of the sequence (mer-engine), to discriminate regions of high repetitiveness (RAP, WindowMasker, RED), to form clusters (P-Clouds), or to serve as anchors in a seed-and-extension process (RepeatScout, phRAIDER, RepSeek).
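The first two uses of word counts described in this caption can be illustrated with a minimal sketch. It is not the implementation of mer-engine, RAP, WindowMasker, or RED; the function names and the `min_count` threshold are hypothetical, and only the forward strand is counted for brevity.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (forward strand only, an assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def per_position_kmer_frequency(seq, k):
    """Annotate each k-mer start position with the assembly-wide count of that k-mer."""
    counts = kmer_counts(seq, k)
    return [counts[seq[i:i + k]] for i in range(len(seq) - k + 1)]


def repetitive_positions(seq, k, min_count):
    """Return start positions whose k-mer occurs at least min_count times.

    min_count is a hypothetical cutoff; real tools derive thresholds statistically.
    """
    freqs = per_position_kmer_frequency(seq, k)
    return [i for i, count in enumerate(freqs) if count >= min_count]


if __name__ == "__main__":
    toy_assembly = "ACGTAGGCACGTA"
    print(per_position_kmer_frequency(toy_assembly, 4))
    print(repetitive_positions(toy_assembly, 4, min_count=2))
```

On the toy sequence, only the positions covered by the repeated `ACGT` and `CGTA` words are flagged; a real assembly would also need both strands and a much larger k.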
Figure 3
Self-comparison approaches. These methods attempt an all-vs.-all self-alignment using the whole assembly or a portion thereof. The self-alignments, viewed as a dot plot, will have many off-diagonal alignments representing dispersed similarities. These methods group the alignments into “piles”, defined by their distinct coverage across a region of the assembly. The primary difference between methods is in how they group piles into families. PILER and CARP require that elements are globally alignable, thereby identifying R1/R3 as a distinct family rather than fragments. Grouper and RECON apply single-linkage clustering, which, in this example, groups all fragments into a single family. RECON further attempts to identify composite families by looking for overrepresented internal edges—in this example, the internal edges were not deemed significant (red x’s).
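The single-linkage grouping attributed to Grouper and RECON in this caption can be sketched as finding connected components over pairwise alignments. This is an illustrative sketch only: the element labels and the `single_linkage_families` helper are hypothetical, and it omits RECON's analysis of internal edges.

```python
def single_linkage_families(elements, alignments):
    """Group elements into families: any alignment between two elements
    merges their groups (single linkage), implemented with union-find."""
    parent = {element: element for element in elements}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in alignments:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for element in elements:
        families.setdefault(find(element), []).append(element)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    # Hypothetical piles: R1..R4 are chained by pairwise alignments, X1/X2 are unrelated.
    piles = ["R1", "R2", "R3", "R4", "X1", "X2"]
    hits = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("X1", "X2")]
    print(single_linkage_families(piles, hits))
```

Because one alignment is enough to merge two groups, fragments that only share a partial overlap end up in the same family, which is the behavior the caption contrasts with the global-alignment requirement of PILER and CARP.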
Figure 4
Read-based de novo methodologies. Due to the overwhelming size of read datasets, methods often start by either downsampling or filtering low-coverage regions based upon read k-mer frequencies. At this stage, either the remaining reads or the k-mers themselves are assembled into contigs or clustered into distinct groups representing repetitive families.
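The filtering step in this caption can be sketched as follows, assuming a read is kept when the median count of its k-mers across the whole read set reaches a cutoff. The median rule, the `min_median_count` parameter, and the function name are assumptions made for illustration, not the criterion of any specific tool.

```python
from collections import Counter
from statistics import median


def filter_repetitive_reads(reads, k, min_median_count):
    """Keep reads whose k-mers are frequent across the read set.

    Reads dominated by low-frequency k-mers are treated as low-coverage
    (likely unique) sequence and discarded before assembly or clustering.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Reads shorter than k have no k-mers and are dropped.
        if kmers and median(counts[kmer] for kmer in kmers) >= min_median_count:
            kept.append(read)
    return kept


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "ACGTAC", "ACGTAC", "GGCCTT"]
    print(filter_repetitive_reads(toy_reads, k=3, min_median_count=2))
```

The surviving reads would then be passed to an assembler or a clustering step, as the caption describes.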
Figure 5
Examples of commonly used TE signatures for detection. Structural features: LTR/ERV elements, class II elements, non-LTR retrotransposable elements, and Helitrons can be identified by searching for LTRs (~100–1000 bp direct repeats), TIRs (~10–40 bp inverted repeats), TSDs (duplications of 6–21 bp on average), and hairpin structures, respectively. In addition, the A and B boxes seen in RNA polymerase III promoters and the 3′ terminal A/T-rich sequence can be used to identify SINE elements. Motifs/Protein Homology: the order, orientation, and similarity to protein domains are key to homology-based searches. Gag: group-specific antigen; PR: protease; RT: reverse transcriptase; EN: endonuclease; Env: envelope; RH: ribonuclease H; MT: methyltransferase; YR: tyrosine recombinase. Other sequence structures (not shown in the figure) observed in LINEs are their poly-A or simple-repeat tails, and the RT and apurinic–apyrimidinic EN (APE) domains of the Pol protein.
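One of the structural signatures named in this caption, the terminal inverted repeat, can be checked with a short sketch: the 5′ end of a candidate element is compared with the reverse complement of its 3′ end. The `tir_length` helper and the exact-match requirement are assumptions for illustration; published signature-based tools tolerate mismatches and also look for flanking TSDs.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tir_length(element, max_len=40):
    """Length of the exact terminal inverted repeat of a candidate element.

    The 5' end is an inverted repeat of the 3' end when the element and its
    reverse complement share a prefix. max_len follows the ~10-40 bp range
    given in the caption; exact matching is a simplifying assumption.
    """
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n].upper() == rc[n].upper():
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical candidate: 6 bp TIRs (CAGTGG ... CCACTG) around an internal region.
    candidate = "CAGTGG" + "ATATTTCA" + "CCACTG"
    print(tir_length(candidate))
```

A candidate whose returned length falls within the expected TIR range would then be examined for the other class II signatures, such as matching target site duplications in the flanking sequence.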
Figure 6
Workflow of select TE discovery pipelines. Each process in the pipeline has been categorized as classification (purple), signature-based TE detection (green), de novo TE detection (gold), homology-based detection (black), genome annotation (red), filter and/or refinement (blue) and clustering (grey). Arrows indicate the general workflow direction. NOTE: the above image is meant to describe the high-level organization of each pipeline, and does not reflect the inherent complexity contained within. Refer to Supplementary Materials for additional details.
