Nucleic Acids Res. 2007;35(15):4964-76.
doi: 10.1093/nar/gkm515. Epub 2007 Jul 17.

Automated recognition of retroviral sequences in genomic data--RetroTector

Göran O Sperber et al. Nucleic Acids Res. 2007.

Abstract

Eukaryotic genomes contain many endogenous retroviral sequences (ERVs). ERVs are often severely mutated and therefore difficult to detect. A platform-independent (Java) program package, RetroTector (ReTe), was constructed. It has three basic modules: (i) detection of candidate long terminal repeats (LTRs), (ii) detection of chains of conserved retroviral motifs fulfilling distance constraints and (iii) attempted reconstruction of original retroviral protein sequences, combining alignment, codon statistics and properties of protein ends. Other features are prediction of additional open reading frames, automated database collection, graphical presentation and automatic classification. ReTe favors elements >1000 bp long because it depends on the order of, and the distances between, retroviral fragments. It detects single or low-copy-number elements. ReTe assigned a 'retroviral' score of 890-2827 to 10 exogenous retroviruses from seven genera, and accurately predicted their genes. In a simulated model, ReTe was robust against mutational decay. The human genome was analyzed in 1-2 days on a Linux cluster. Retroviral sequences were detected in divergent vertebrate genomes. Most chains detected by ReTe coincided with RepeatMasker output and the HERVd database. ReTe did not report most of the evolutionarily old HERV-L related and MalR sequences, and is not yet tailored for single LTR detection. Nevertheless, ReTe rationally detects and annotates many retroviral sequences.

Figures

Figure 1.
The principle of ‘fragment threading’. Three motif hits (RT1, RT2 and RT4) are within accepted distances from each other, whereas one (RT3) is not. Motifs RT1, RT2 and RT4 can therefore be utilized to build a proviral chain.
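The threading idea in this caption can be illustrated with a minimal Python sketch. It is not RetroTector's actual algorithm: the expected motif offsets, the tolerance and the hit positions below are hypothetical values chosen only to reproduce the situation in the figure, where RT3 falls outside the accepted distances.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int]  # (motif name, position in bp); illustrative type only


def thread_fragments(hits: List[Hit],
                     expected_offset: Dict[str, int],
                     tolerance: int) -> List[Hit]:
    """Return the longest chain of hits whose consecutive distances
    agree with the expected motif spacing within `tolerance` bp."""
    ordered = sorted(hits, key=lambda h: h[1])
    # best[j] holds the longest acceptable chain ending at ordered[j]
    best = [[h] for h in ordered]
    for j, (name_j, pos_j) in enumerate(ordered):
        for i in range(j):
            name_i, pos_i = ordered[i]
            expected = expected_offset[name_j] - expected_offset[name_i]
            if expected <= 0:
                continue  # motifs would be in the wrong order
            within = abs((pos_j - pos_i) - expected) <= tolerance
            if within and len(best[i]) + 1 > len(best[j]):
                best[j] = best[i] + [ordered[j]]
    return max(best, key=len) if best else []


# Hypothetical spacing model and hits (assumed values, not from the paper)
offsets = {"RT1": 0, "RT2": 150, "RT3": 300, "RT4": 450}
hits = [("RT1", 1000), ("RT2", 1160), ("RT3", 2400), ("RT4", 1440)]
chain = thread_fragments(hits, offsets, tolerance=50)
print([name for name, _ in chain])
```

With these assumed values, RT3 lies too far downstream to satisfy any distance constraint, so it is excluded from the chain.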
Figure 2.
Flow of events during a RetroTector© analysis.
Figure 3.
LTR features utilized by RetroTector©, in the proviral model context. A combination of obligate and alternative landmarks is used to select LTR candidates, which are further selected by pairing and proviral chain distance criteria. Upper panel: ERV structure overview, with standard terms and ReTe motif group names below. Lower panel: LTR features utilized by ReTe. Constraints between them are also shown. Motifs and their abbreviations are explained in Supplementary Data S3.
Figure 4.
Chainview picture of HIV and MLV. Symbols are explained below the proviral renditions.
Figure 5.
Simulation of mutation of an endogenous and an exogenous retrovirus. Average scores for 20 sequences at each level of mutation, divided by maximum score for unmutated HIVMNCG, when analyzed with ReTe and BLAST, are shown. (a) Normalized score for sequences in the endogenous (indel) model. ReTe is more tolerant to mutation than BLAST, when scoring a sequence as retroviral. (b) Normalized score for sequences in the exogenous model. These sequences receive high scores throughout the analysis when analyzed with ReTe. BLAST, on the other hand, does not as readily recognize the sequences as descendant from HIVMNCG. Further information is given in the Supplementary Data, S1.
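The normalization described in this caption (average score at one mutation level divided by the maximum score of the unmutated reference) amounts to a one-line calculation. The sketch below uses a hypothetical function name and made-up scores purely to show the arithmetic; none of the numbers are from the paper.

```python
from statistics import mean
from typing import Sequence


def normalized_score(scores: Sequence[float], reference_max: float) -> float:
    """Average the scores at one mutation level and divide by the
    maximum score obtained for the unmutated reference sequence."""
    if reference_max <= 0:
        raise ValueError("reference_max must be positive")
    return mean(scores) / reference_max


# Made-up example: two mutated sequences, reference maximum of 2000
example = normalized_score([500, 1500], 2000)
print(example)
```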
Figure 6.
Sequence similarity (percent identity, gap positions excluded) for the Pol puteins compared to the unmutated HIVMNCG Pol protein. Average for 20 puteins at each level of mutation. (a) Endogenous (indel) model puteins: sequence identity. (b) Exogenous model puteins: sequence identity. Further information is given in the Supplementary Data, S1.
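Percent identity with gap positions excluded can be computed from a pairwise alignment as sketched below. This assumes two already-aligned sequences of equal length with "-" as the gap character; the function name and the toy sequences are illustrative and do not come from the paper.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence
    has a gap; columns containing a gap are skipped entirely."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap positions are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 5 gap-free columns, 4 of them identical
identity = percent_identity("MKT-LV", "MRTALV")
print(identity)
```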
Figure 7.
Chain scores with retroviral, retroviruslike and random sequences. Retroviruslike (errantiviral; gypsy elements) sequences of slime molds, insects and plants are shown to the left. Epsilonretroviral sequences of amphibians and fish are shown to the right. Scores of chains detected in a 10^8 random sequence are shown below the cutoff. The chains from 10^8 random nucleotides were obtained before the changed settings described in the text.
Figure 8.
Frequencies of scores of the chains reported by ReTe version 1.0 from the human (hg18), chimpanzee (panTro2), rhesus (rheMac2), dog (canFam2), mouse (mm8) and chicken (galGal3) genome assemblies.
