Curr Protoc. 2021 Jun;1(6):e154.

doi: 10.1002/cpz1.154.

Curation Guidelines for de novo Generated Transposable Element Families

Jessica M Storer¹, Robert Hubley¹, Jeb Rosen¹, Arian F A Smit¹

PMID: 34138525
PMCID: PMC9191830
DOI: 10.1002/cpz1.154

Jessica M Storer et al. Curr Protoc. 2021 Jun.

Affiliation

¹ Institute for Systems Biology, Seattle, Washington.

Abstract

Transposable elements (TEs) have the ability to alter individual genomic landscapes and shape the course of evolution for species in which they reside. Such profound changes can be understood by studying the biology of the organism and the interplay of the TEs it hosts. Characterizing and curating TEs across a wide range of species is a fundamental first step in this endeavor. This protocol employs techniques honed while developing TE libraries for a wide range of organisms and specifically addresses: (1) the extension of truncated de novo results into full-length TE families; (2) the iterative refinement of TE multiple sequence alignments; and (3) the use of alignment visualization to assess model completeness and subfamily structure. © 2021 Wiley Periodicals LLC.

Basic Protocol: Extension and edge polishing of consensi and seed alignments derived from de novo repeat finders

Support Protocol: Generating seed alignments using a library of consensi and a genome assembly

Keywords: RepeatMasker; RepeatModeler; alignment; curation; hidden Markov model; transposable elements.

Figures

**Figure 1.**
Terminal output of alignAndCallConsensus.pl run with A) the default 25% substitution matrix and B) a 14% substitution matrix. The orange box indicates the search engine used, while the blue box indicates the substitution matrix used for this alignment. 25p41g indicates that a 25% substitution matrix with a 41% CG background was used. The green box highlights the average Kimura-model substitution level of all the aligned copies compared to the consensus. The presented sequences are the newly calculated consensus (“consensus”) and the previous consensus or reference sequence (“ref:example1_com”) as they appear in the complete MSA (gaps in both sequences indicate that copies exist with an insertion there). A “v” between the sequences indicates a transversion, while an “i” indicates a transition. A “?” indicates a nucleotide aligned to an ambiguous base (such as “N” or “H”). The red boxes highlight the consensus of the bases aligned to the terminal “H-pads”.
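
The transition/transversion marks and the Kimura-model substitution level described in this caption can be illustrated with a minimal Python sketch. It assumes the standard Kimura two-parameter formula applied to a single pair of aligned sequences; the function names `classify` and `kimura_divergence` are hypothetical, and the exact calculation performed by alignAndCallConsensus.pl (for example, any special handling of CpG sites or averaging over copies) is not specified here and may differ.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a, b):
    """Mark one alignment column: ' ' match, 'i' transition,
    'v' transversion, '?' ambiguous base, '-' gap."""
    a, b = a.upper(), b.upper()
    if a == "-" or b == "-":
        return "-"
    if a not in BASES or b not in BASES:
        return "?"
    if a == b:
        return " "
    # A<->G and C<->T stay within one chemical class: transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "i" if same_class else "v"


def kimura_divergence(seq1, seq2):
    """Return (mark string, Kimura 2-parameter distance) for two
    equal-length aligned sequences. Gaps and ambiguous bases are
    skipped when computing the distance (an assumption of this sketch)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    marks = "".join(classify(a, b) for a, b in zip(seq1, seq2))
    compared = sum(m in " iv" for m in marks)
    if compared == 0:
        raise ValueError("no comparable columns")
    p = marks.count("i") / compared  # transition fraction
    q = marks.count("v") / compared  # transversion fraction
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    d = -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
    return marks, d


if __name__ == "__main__":
    marks, d = kimura_divergence("AAAAAAAAAA", "GAAAAAAAAC")
    print(repr(marks), round(d, 4))
```

In this toy pair, the first column (A/G) is a transition and the last column (A/C) is a transversion, so the mark string begins with “i” and ends with “v”.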

**Figure 2.**
Continued 5’ extension of the example1 consensus sequence. The red box highlights the option selected to extend the 5’ edge of the alignment, while the green box indicates how the Kimura divergence value changes as the number of aligned bp changes. The extension was terminated after an initial “x” selection with subsequent “5” extensions. Note that only the last 3 iterations of the program are shown.

**Figure 3.**
example1_con.ali alignment of the 5’ edge to the consensus sequence. The blue box indicates the position at which that sequence starts to align to the consensus sequence while the red line indicates a block of sequences that have a common start position.

**Figure 4.**
Example1_con.ali in the terminal after pruning. A) The 5’ alignment edge after pruning. The black bracket indicates the 5’ sequence past the conserved sequence. C) 3’ alignment edge. The blue box indicates the start position of the instance compared to the consensus sequence. The red arrows indicate two different 3’ edges. The black box highlights the lack of conservation past the consensus sequence.

**Figure 5.**
HTML visualization of example1_con.html. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. The length of the consensus sequence is 690 bp, including the H-pad.
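
The idea of scoring each aligned copy over 10bp non-overlapping windows can be sketched as follows. This is only an illustration: the caption does not say which quality measure drives the color gradient, so the sketch substitutes plain fractional identity to the consensus, and the function name `window_identity` is hypothetical.

```python
def window_identity(consensus, copy, window=10):
    """Score one aligned copy against the consensus in non-overlapping
    windows. Each score is the fraction of identical columns among
    columns where neither sequence has a gap; windows with no such
    columns get None (nothing to color)."""
    if len(consensus) != len(copy):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(consensus), window):
        pairs = [
            (c, s)
            for c, s in zip(consensus[start:start + window],
                            copy[start:start + window])
            if c != "-" and s != "-"
        ]
        if not pairs:
            scores.append(None)
        else:
            matches = sum(c.upper() == s.upper() for c, s in pairs)
            scores.append(matches / len(pairs))
    return scores


if __name__ == "__main__":
    cons = "ACGTACGTACGTACGTACGT"
    copy = "ACGTACGTACGTTCGTAC--"
    print(window_identity(cons, copy))
```

A renderer could then map each score onto a red-to-blue gradient, one row per copy, to produce a view like the one in this figure.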

**Figure 6.**
Alignment of example2 in HTML format. A) HTML alignment format. The bracket indicates a possible subfamily, as suggested by differences in the divergence pattern among sequences. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. The length of the HTML alignment is 242 bp, including the H-pad. B) Terminal output of bestwindow.pl. The highest scoring 10 sequences are shown. C) Terminal output of preprocessAlignments.pl. Note that the consensus range length may differ slightly from your terminal output.

**Figure 7.**
HTML format of the example2 TE subfamilies produced by COSEG: alignment of TE instances to the 5’ edge of the example2 subfamily consensi. A) subfamily0 alignment. The red bracket indicates a possible subfamily that may have been missed by COSEG. B) subfamily1 alignment. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. The lengths of the alignments for subfamily0 and subfamily1 are 220 and 201 bp, respectively.

**Figure 8.**
HTML alignment of example3. The alignment shows possible deletion products which may represent a subfamily structure. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. The total length of this alignment is 2284 bp.

**Figure 9.**
Terminal output of TSD.pl for cluster7.ali. The red box highlights the highest-scoring TSD length out of possible lengths 1–8 bp.

**Figure 10.**
HTML alignments of the 8 consensi produced for example1 by ClusterPartialMatchingSubs.pl. A) cluster0 – 586 bp; B) cluster2 – 572 bp; C) cluster5 – 502 bp; D) cluster6 – 522 bp; E) cluster7 – 513 bp; F) cluster8 – 503 bp; G) cluster10 – 481 bp; H) cluster12 – 449 bp. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. All sequences are in the 5’ to 3’ orientation.

**Figure 11.**
HTML alignments of the 3 consensi produced for example3 by ClusterPartialMatchingSubs.pl. A) cluster0; B) cluster1; C) cluster6. Each sequence is represented by a single row (sorted by start position) where the color gradient indicates alignment quality (red=low; blue=high) over 10bp non-overlapping windows. All sequences are in the 5’ to 3’ orientation. The consensus lengths for cluster0, cluster1, and cluster6 are 2124, 1154, and 192 bp, respectively.

**Figure 12.**
Length comparison of the example3 consensi to the original example3_con sequence. This image was generated using alignAndCallConsensus.pl and overlaying thicker, colored lines to highlight the difference in length between the derived consensi and the original consensus sequence. The purple line is cluster0, green is cluster1, and orange is cluster6. The colors do not correspond to alignment quality.

**Figure 13.**
AutoRunBlocker.pl analysis of the cluster12 subfamily derived from example1. A) Terminal output of AutoRunBlocker.pl. B) Cluster12.ali visualization. The red box highlights the sequence for which AutoRunBlocker.pl suggests an alternate length.
