Genome Biol. 2002 Sep 19;3(10):research0052.
doi: 10.1186/gb-2002-3-10-research0052. Epub 2002 Sep 19.

Molecular archeology of L1 insertions in the human genome

Suzanne T Szak et al. Genome Biol.

Abstract

Background: As the rough draft of the human genome sequence nears a finished product and other genome-sequencing projects accumulate sequence data exponentially, bioinformatics is emerging as an important tool for studies of transposon biology. In particular, L1 elements exhibit a variety of sequence structures after insertion into the human genome that are amenable to computational analysis. We carried out a detailed analysis of the anatomy and distribution of L1 elements in the human genome using a new computer program, TSDfinder, designed to identify transposon boundaries precisely.

Results: Structural variants of L1 elements shared similar trends in the length and quality of their target site duplications (TSDs) and poly(A) tails. Furthermore, we found no correlation between the composition and genomic location of the pre-insertion locus and the resulting anatomy of the L1 insertion. We verified that L1 insertions with TSDs have the 5'-TTAAAA-3' cleavage site associated with L1 endonuclease activity. In addition, the second target DNA cut required for L1 insertion weakly matches the consensus pattern TTAAAA. On the other hand, the L1-internal breakpoints of deleted and inverted L1 elements do not resemble L1 endonuclease cleavage sites. Finally, the genome sequence data indicate that whereas singly inverted elements are common, doubly inverted elements are almost never found.

Conclusions: The sequence data give no indication that the creation of L1 structural variants depends on characteristics of the insertion locus. In addition, the formation of 5' truncated and 5' inverted L1s is probably not due to the action of the L1 endonuclease.

Figures

Figure 1
Anatomy of the L1 element and its structural variants found in the human genome. (a) A full-length L1 transcript is approximately 6,000 nucleotides long. It has a 5' UTR, two ORFs separated by 63 nucleotides, and a 3' UTR followed by a poly(A) tail. An L1 insertion in the genome is flanked by TSDs; the 3' TSD is immediately preceded by a poly(A) tail. (b) Variations in the structure of L1 insertions are shown. Arrowheads indicate the orientation of the L1 sequence. Most L1s are 5' truncated. In addition, during the process of insertion, a 5' segment of L1 may become inverted with respect to the 3' end of the L1 (5' inversion). Alternatively or additionally, a weak poly(A) signal in the L1 transcript can result in a portion of the 3' flanking DNA being transposed to another locus in the genome along with the L1 element; this process is called 3' transduction. In this case, the 3' TSD can be located hundreds of nucleotides downstream from the end of the L1 element. The numbers in parentheses represent the percentage of all L1s with TSDs that fall into each category. (c) TSD sequences flanking an L1 insertion are underlined. The poly(A) tail has a line over it. Although the poly(A) tail could potentially be extended, this would require the length (and potentially the score) of the TSDs to be reduced; TSDfinder always finds the longest possible TSDs.
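The trade-off described in panel (c), where extending the poly(A) tail shortens the TSD, can be illustrated with a minimal Python sketch. This is not TSDfinder's algorithm: it assumes exact matches only (no mismatch scoring), and the function name `longest_tsd` and the toy sequences are hypothetical. For each way of splitting the leading run of A's between tail and TSD, it looks for the longest end of the upstream flank that reappears directly after the tail, and keeps the longest TSD found.

```python
def longest_tsd(upstream, downstream, min_len=2):
    """Return (tsd, tail_len) for the longest exact TSD (illustrative only).

    upstream:   genomic sequence ending at the 5' end of the L1 insertion.
    downstream: genomic sequence starting right after the L1 3' UTR,
                i.e. the poly(A) tail followed by the 3' flank.
    """
    # The leading run of A's is the longest the poly(A) tail could be.
    max_tail = len(downstream) - len(downstream.lstrip("A"))
    best = ("", max_tail)
    # Try every split of the A-run between poly(A) tail and TSD.
    for tail_len in range(max_tail, -1, -1):
        rest = downstream[tail_len:]
        # Only test lengths strictly longer than the best TSD so far.
        for k in range(min(len(upstream), len(rest)), len(best[0]), -1):
            if upstream[-k:] == rest[:k]:
                best = (rest[:k], tail_len)
                break
    if len(best[0]) < min_len:
        return ("", max_tail)  # no acceptable TSD found
    return best
```

With the toy inputs `longest_tsd("CCGGTAAGATTCGG", "AAAAAAAAAAGATTCGGTTCC")`, the run of ten A's is split into an 8-nucleotide tail and the 9-nucleotide TSD AAGATTCGG, rather than a 10-nucleotide tail and a 7-nucleotide TSD.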
Figure 2
Distribution of TSD lengths and scores. Lengths of the 16,266 TSDs found by TSDfinder for L1s in the genome are shown as a histogram. The line indicates the average TSD score for each TSD length. The x axis displays the range of TSD lengths in nucleotides; the y axis displays the number of TSDs. The bimodal distribution suggests that a significant fraction of the 9-11-nucleotide TSDs may be artifactual.
Figure 3
Nucleotide profile of L1 target loci. (a) Logo graphics were generated for the TSDs of length shown and 10 nucleotides of upstream and downstream flanking sequence. At each position, the highest-frequency base is at the top of the stack, and so the general consensus can be found by reading the top base at every position. The relative height of individual bases at each position is proportional to the frequency of the base at that position. The vertical axis represents the amount of information in the input data. The TSDs included in this analysis did not have any mismatches. Stars indicate the first position of the TSD. The nucleotides are colored as follows: A, green; T, red; C, blue; G, yellow. (b) The percentage of A, T, G, and C nucleotides is shown, calculated for 1,794 15-nucleotide TSDs (without mismatches) and their flanking regions (50 nucleotides on each side). The shaded region represents the TSD sequence.
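The quantities behind a logo graphic and a per-position base composition plot can be sketched in Python as follows. This is an illustrative calculation, not the software used for the figure: it assumes equal-length sequences containing only A, C, G, and T, it applies no small-sample correction, and the function name `column_profile` is hypothetical. For each aligned position it returns the base frequencies and the information content in bits (2 minus the Shannon entropy).

```python
import math
from collections import Counter

def column_profile(seqs):
    """Per-position base frequencies and information content in bits."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        # Frequency of each base at this position (assumes ACGT only).
        freqs = {b: counts.get(b, 0) / total for b in "ACGT"}
        # Shannon entropy; zero-frequency bases contribute nothing.
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        # A fully conserved position carries 2 bits; a uniform one carries 0.
        profile.append((freqs, 2.0 - entropy))
    return profile
```

In a logo, the stack height at a position corresponds to the information value, and each base's share of the stack corresponds to its frequency.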
Figure 4
Histogram of L1 start positions for L1 insertions complete at the 3' end. The L1 start positions of the 16,266 elements with TSDs were placed into 50-nucleotide bins. The schematic of L1 along the x axis indicates the L1 start position represented by each bin. The y axis shows the number found in each bin.
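The 50-nucleotide binning described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions: start positions are taken to be 1-based coordinates on the L1 reference, and the function name `bin_start_positions` is hypothetical.

```python
from collections import Counter

def bin_start_positions(starts, bin_size=50):
    """Count start positions per bin, keyed by each bin's first coordinate."""
    counts = Counter()
    for pos in starts:
        # Positions 1-50 fall in the bin starting at 1, 51-100 at 51, and so on.
        bin_start = ((pos - 1) // bin_size) * bin_size + 1
        counts[bin_start] += 1
    return dict(sorted(counts.items()))
```

For example, `bin_start_positions([1, 50, 51, 5990, 6000])` places positions 1 and 50 in the first bin, 51 in the second, and 5990 and 6000 together in the bin starting at 5951.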
Figure 5
L1 5' inversion structures. For each 5' inverted L1 element for which TSDs were found, the structure at the junction between the two L1 segments was analyzed. Arrows represent the orientation of L1 segments in the genome; an unrearranged L1 would consist of two tandem arrows. Vertical dotted lines indicate breakpoints in the L1 sequence and the alignment of the L1 segments on the genomic sequence. (a) The annotation of the two segments may result in a slight overlap or gap in the contig sequence. Alternatively, the genomic contig sequence may be flawless at the junction of the two fragments. The numbers on the figure for the flawless junction represent the median relative lengths of all of the 5' inverted and 3' direct segments. For those cases in which the 3' direct segment is longer than the 5' inverted segment, the median ratio was 2.3 (70% of cases). (b) In the L1 sequence itself, the two segments that make up the inverted element may overlap (iii) or suggest a small deletion in the L1 sequence (ii). 'X' represents any coordinate in the L1 sequence.
Figure 6
Characteristics of 5' inverted L1 segments. (a) Start and end positions (projected onto L1.3) of 3,157 5' inverted L1 segments are shown for the 5' inverted L1s with TSDs. (b) The 5' (inverted) segment length is plotted along the x axis; the 3' (direct) segment length is plotted along the y axis for all 5' inverted L1s with TSDs. The direction (inverted versus direct) is given relative to that of L1 transcription.
Figure 7
Twice-inverted L1 elements. L1 segments shown with solid lines were found when we used the L1.3 sequence as our library for RepeatMasker; the percentages indicate the percent identity of each segment to L1.3. L1 segments indicated with dashed lines were found when we further analyzed the sequence using the full set of RepeatMasker libraries; these segments were identified as L1MA9 family members. The 3' UTR of the L1MA9 subfamily of L1s is 160 nucleotides longer than that for the Ta family of L1s [23]; thus, we were unable to determine the percent identity of the L1MA9 segments to L1.3.
Figure 8
Distribution of L1s with TSDs and annotated genes on chromosomes. For each 500 kb bin along the chromosomes shown, the total number of L1s with TSDs (blue line) and the number of annotated genes (pink line) are indicated. Stars represent loci at which the L1 concentration is quite low and gene concentration is high. Among these loci are the histone gene cluster and MHCIII cluster on chromosome 6, a group of genes subject to X inactivation on the X chromosome, and a group of unrelated genes on chromosome 1. The corresponding cytogenetic staining pattern along the length of the chromosomes is shown below the graphs. Centromeric regions are white, and Giemsa-negative bands are indicated by the lightest gray. For the Giemsa-positive bands, the intensity of the grayscale reflects the staining intensity.

References

    1. International Human Genome Sequencing Consortium (IHGSC). Initial sequencing of the human genome. Nature. 2001;409:860–921.
    2. Jurka J. Subfamily structure and evolution of the human L1 family of repetitive sequences. J Mol Evol. 1989;29:496–503.
    3. Matera AG, Hellmann U, Hintz MF, Schmid CW. Recently transposed Alu repeats result from multiple source genes. Nucleic Acids Res. 1990;18:6019–6023.
    4. Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996;6:743–748.
    5. DeBerardinis RJ, Goodier JL, Ostertag EM, Kazazian HH Jr. Rapid amplification of a retrotransposon subfamily is evolving the mouse genome. Nat Genet. 1998;20:288–290.