Nucleic Acids Res. 2020 Feb 20;48(3):1146-1163.
doi: 10.1093/nar/gkz1173.

Identification and characterization of occult human-specific LINE-1 insertions using long-read sequencing technology

Weichen Zhou et al. Nucleic Acids Res.

Abstract

Long Interspersed Element-1 (LINE-1) retrotransposition contributes to inter- and intra-individual genetic variation and occasionally can lead to human genetic disorders. Various strategies have been developed to identify human-specific LINE-1 (L1Hs) insertions from short-read whole genome sequencing (WGS) data; however, they have limitations in detecting insertions in complex repetitive genomic regions. Here, we developed a computational tool (PALMER) and used it to identify 203 non-reference L1Hs insertions in the NA12878 benchmark genome. Using PacBio long-read sequencing data, we identified L1Hs insertions that were absent in previous short-read studies (90/203). Approximately 81% (73/90) of the L1Hs insertions reside within endogenous LINE-1 sequences in the reference assembly and the analysis of unique breakpoint junction sequences revealed 63% (57/90) of these L1Hs insertions could be genotyped in 1000 Genomes Project sequences. Moreover, we observed that amplification biases encountered in single-cell WGS experiments led to a wide variation in L1Hs insertion detection rates between four individual NA12878 cells; under-amplification limited detection to 32% (65/203) of insertions, whereas over-amplification increased false positive calls. In sum, these data indicate that L1Hs insertions are often missed using standard short-read sequencing approaches and long-read sequencing approaches can significantly improve the detection of L1Hs insertions present in individual genomes.
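The percentages quoted in the abstract can be re-derived from the stated counts. The short Python check below is only a reader-side sanity check, not part of the study; the helper name `percent` and the dictionary labels are illustrative.

```python
def percent(numerator: int, denominator: int) -> int:
    """Return the ratio as a percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


# (count, total) pairs taken directly from the abstract.
abstract_counts = {
    "absent from previous short-read studies": (90, 203),
    "within reference LINE-1 sequences": (73, 90),
    "genotypable in 1000 Genomes Project sequences": (57, 90),
    "detected under single-cell under-amplification": (65, 203),
}

for label, (count, total) in abstract_counts.items():
    print(f"{label}: {count}/{total} = {percent(count, total)}%")
```

The 81%, 63% and 32% figures match the abstract; 90/203 is not given as a percentage there and works out to roughly 44%.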

Figures

Figure 1.
PALMER identifies non-reference L1Hs insertions from PacBio data. Reference-aligned BAM files from long-read sequencing are used as input. Known repeats (L1s, Alus or SVAs in the reference) are used to pre-mask the portions of individual reads that align to these repeats. After pre-masking, PALMER searches PacBio subreads against an insertion sequence, L1.3 (GenBank: L19088), and identifies reads containing a putative insertion sequence (including 5′ inverted L1 sequence, if available) as candidate supporting reads. For each read, PALMER then searches bins 50 bp upstream (5′) and 3.5 kb downstream (3′) of the insertion sequence and identifies candidate TSD motifs, 5′ transduction and poly(A) sequence. All supporting reads are then clustered at each locus, and loci with a minimum number of supporting events are reported as putative insertions.
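The final clustering step in this caption can be illustrated with a minimal Python sketch. This is not PALMER's implementation: the function names, the `window` distance used to merge nearby candidate positions and the `min_support` threshold are all hypothetical, chosen only to show the idea of grouping supporting reads by locus and reporting loci with enough support.

```python
from collections import defaultdict


def _report(chrom, cluster, min_support):
    """Emit one call for a cluster if enough distinct reads support it."""
    reads = {read_id for _, read_id in cluster}
    if len(reads) < min_support:
        return []
    positions = sorted(pos for pos, _ in cluster)
    median_pos = positions[len(positions) // 2]
    return [(chrom, median_pos, len(reads))]


def cluster_supporting_reads(candidates, window=50, min_support=2):
    """Group candidate reads by locus (hypothetical parameters).

    candidates: iterable of (chrom, position, read_id) tuples, one per
    read that carries a putative insertion sequence.
    Returns a list of (chrom, representative_position, n_supporting_reads).
    """
    by_chrom = defaultdict(list)
    for chrom, pos, read_id in candidates:
        by_chrom[chrom].append((pos, read_id))

    calls = []
    for chrom in sorted(by_chrom):
        cluster = []
        for pos, read_id in sorted(by_chrom[chrom]):
            # Start a new cluster when the gap to the previous read is too large.
            if cluster and pos - cluster[-1][0] > window:
                calls.extend(_report(chrom, cluster, min_support))
                cluster = []
            cluster.append((pos, read_id))
        calls.extend(_report(chrom, cluster, min_support))
    return calls
```

With three reads near chr16: 31 950 972 and a single read elsewhere, only the chr16 locus would be reported at `min_support=2`; the single-read locus is dropped.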
Figure 2.
Validation of the PALMER L1Hs insertions using multiple strategies. (A) Error correction and local alignment of the supporting subreads were carried out to obtain high-quality sequence reads for each event. (B) A recurrence plot for a predicted insertion at chr16: 31 950 972. The structure of this event is shown on the Y-axis of the plot, including an 11 bp 5′TSD (purple arrow), a 639 bp 5′ inverted L1 sequence (light orange bar), a 1371 bp non-reference L1Hs sequence (dark orange bar), a 32 bp poly(A) tract (red bar), a 138 bp 3′ transduction (blue bar), a second 55 bp poly(A) tract (red bar), and an 11 bp 3′TSD (purple arrow). The Y-axis is an 8 kb segment of error-corrected sequence, and the X-axis is an 8 kb reference sequence on chr16 from 31 946 972 to 31 954 972. The RepeatMasker track is shown below at the same scale, demonstrating that this event is inserted into a 6 kb reference L1PA region (red arrow shows the insertion site). (C) Example of supporting sequences from searching the BLAST GenBank nr/nt database using error-corrected reads containing putative insertion sequences. The lower panel shows the hits in the BLAST GenBank nr/nt database for one event (chr6: 32 613 219), whose sequence is 445 bp (orange) in an error-corrected read (green). The red bars underneath represent supporting results with E-value = 0 in the database. (D) Distributions of the predicted insertion size versus the Δ length relative to the expected 40 kb insert size of fosmid clone (FC) reads, categorized by fosmid clone read pairs assigned to a different haplotype from the insertion sequence (left) and those assigned to the same haplotype as the insertion sequence (right). (E) L1Hs 5′ genomic DNA/L1 junction sequence k-mer analysis for 203 germline non-reference L1Hs insertions of NA12878 in short-read data. The green bar represents the genome with inserted L1Hs sequence (orange); the green arrows are the short paired-end reads mapped to the genome. (F) L1Hs 5′ genomic DNA/L1 junction sequence k-mer analysis in five distinct sets: WGS data for NA12878, NA12891 and NA12892, 10× Genomics data for NA12878, and the reference genome (hs37d5). The red frame indicates events that are supported in the specific genome. No: the number of real events observed in WGS Illumina samples; Ns: the number of simulated events observed in WGS Illumina samples; N10x: the number of real events observed in 10× Genomics data; Nref: the number of real events observed in the reference genome.
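The junction k-mer analysis in panels (E) and (F) rests on a simple idea: a k-mer that spans the boundary between flanking genomic DNA and the inserted L1 sequence should appear in short reads only when the insertion is present. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' pipeline. The function names, the default `k=25` and exact string matching on both strands are hypothetical choices, and a real analysis would also need to handle sequencing errors and k-mers that are not unique in the genome.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string made of A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def junction_kmers(flank: str, l1_start: str, k: int = 25) -> set:
    """Return all k-mers containing at least one base on each side of the junction.

    flank:    genomic sequence immediately 5' of the insertion (assumed input)
    l1_start: first bases of the inserted L1 sequence (assumed input)
    """
    joined = flank + l1_start
    junction = len(flank)  # index of the first inserted base
    kmers = set()
    for start in range(max(0, junction - k + 1), junction):
        kmer = joined[start:start + k]
        if len(kmer) == k:  # skip windows that run off the end
            kmers.add(kmer)
    return kmers


def count_supporting_reads(reads, kmers, k: int = 25) -> int:
    """Count reads that contain a junction k-mer on either strand."""
    supported = 0
    for read in reads:
        for seq in (read, reverse_complement(read)):
            if any(seq[i:i + k] in kmers for i in range(len(seq) - k + 1)):
                supported += 1
                break
    return supported
```

A read drawn from the empty (reference) allele covers the flank but not the L1 side, so it contains no junction-spanning k-mer and is not counted.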
Figure 3.
Comparison of PALMER L1Hs insertion calls with a high-quality structural variation set in NA12878. (A) Venn diagram of PALMER calls (orange) and L1Hs calls from Audano et al. 2019 (red) in NA12878. A subset of ‘INS’ calls (representing generic insertions not specifically annotated as L1Hs insertions) from Audano et al. 2019 that intersected with PALMER calls is also indicated (grey). (B) An example of a PALMER call that was reported by Audano et al. 2019 as a generic L1 insertion. We show the EN cleavage site sequence, the sequence at the empty site of insertion (bold font), the TSD motif (purple font/arrow), the poly(A) tract (red font/bar), and the detailed structure of the L1Hs insertion (orange font/bar) with 5′ inverted L1 sequence (brown font/arrow) and non-inverted L1Hs insertion sequence (dark brown font/arrow). The green arrow shows that the L1Hs insertion sequence has the L1Hs-Ta defining ‘ACA’ motif in the 3′UTR region. (C) An example of a PALMER call that was missed by Audano et al., containing a 3′ transduction sequence (light blue bar); other colors are as described in (B).
Figure 4.
Characteristics of the 203 germline non-reference L1Hs insertions from NA12878 PacBio data. (A) Ideogram of the PacBio call set. Four types of insertions are highlighted: insertions with 3′ transduction sequence (brown), insertions with 5′ inverted L1 sequence (dark blue), insertions located in reference LINE regions (red), and full-length events (purple). The black bar delineates all non-reference calls. (B) Venn diagram of non-reference L1Hs insertion sets of NA12878 from Illumina standard WGS by MELT and PacBio by PALMER. (C) Number of calls located in different RepeatMasker categories based on (B). We show the calls in three categories: WGS-only calls (light blue), PacBio-only calls (orange) and calls intersecting in the two call sets (gray). We delineate reference repeat information into six categories: LINE (e.g. L1, L2), SINE (e.g. Alu, MIR), LTR (e.g. ERV, ERVK), DNA (DNA transposons), TR (tandem repeats, e.g. simple repeat, satellite, low complexity region), N/A (regions with no reference repeats annotated). (D) Number of calls located in genomic regions of different short-read accessibility (non-strict: less accessible; strict: more accessible) in the left panel and in different gene regions (intergenic and intragenic) in the right panel, shown as a proportion of the two call sets combined. The figure legend is the same as in (C). (E) Distribution of truncation positions within the L1Hs sequence of PacBio calls, with the L1Hs structure annotated below. Orange bars depict PacBio-only calls, and gray bars depict PacBio calls that intersect with the WGS call set. The lower panel shows the detailed structure of a full-length L1Hs, including a 5′UTR, ORF1 (yellow), ORF2 containing endonuclease (EN) and reverse transcriptase (RT) domains (green), 3′UTR and a poly(A) tract. (F) Distribution of 5′ inverted L1 sequence possibly related to the twin priming mechanism. The dark blue bar shows the 5′ inverted L1 segments. (G) Length distributions of the 3′ transduction sequence of PacBio calls (left) and the TSD motif in the 5′ and 3′ flanking regions of PacBio calls (right). (H) Histogram of sample frequency of PacBio calls across all 1000 Genomes phase 3 samples, based on L1Hs 5′ genomic DNA/L1 junction sequence k-mer assessment. Upper panels show scatter plots of sample frequency by k-mer calculation (X-axis) versus sample frequency based on the 1000 Genomes L1Hs call set (Y-axis), for calls shared between the PacBio call set and the 1000 Genomes L1Hs call set, shown for all samples combined and for five super-populations: All (red), AFR (Africa, dark yellow), AMR (Americas, green), EAS (East Asia, sky blue), EUR (Europe, blue) and SAS (South Asia, pink). E, F, G and H share the same figure key.
Figure 5.
L1Hs insertion detection using 3′ targeted L1 capture in bulk experiments and WGS in single-cell experiments. (A) Venn diagram of non-reference L1Hs insertion sets of NA12878 from 3′ targeted L1 capture technology and PacBio by PALMER. (B) Number of calls located in different RepeatMasker categories based on (A) in three categories: 3′ targeted L1 capture technique-only calls (purple), PacBio-only calls (orange) and calls intersecting in the two call sets (grey). (C) Upset plot of the intersection between the PacBio call set and MELT call sets from four single-cell WGS data sets (batch id: scWGS59, scWGS9, scWGS2 and scWGS5). We delineate the calls into three sets: set1 (orange bracket and orange dots, events in the PacBio call set but not called in any single-cell experiment), set2 (light blue bracket and blue dots, events from the single-cell call sets but not in the PacBio call set) and set3 (green bracket). Set3 has two sub-sets: dark green dots show the intersection of the single-cell call sets and the PacBio call set, and light green dots show calls that were absent in a given single-cell experiment but were called in the others and intersected with the PacBio call set. (D) Number of calls located in different RepeatMasker categories based on the sets defined in (C). We delineate the calls into three categories: set1 (orange), set2 (light blue) and set3 (green). (E) Read depth analysis for four single-cell WGS experiments. Categories of sets are based on (C). The curves of normalized read depth in the ± 37.5 kb flanking regions of insertion sites are shown.
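Panel (E) plots normalized read depth around insertion sites. The caption does not say how the normalization was done, so the Python sketch below assumes one simple scheme: dividing each bin's depth in the flanking window by the median depth of a set of baseline bins from the same cell. The function name and both inputs are hypothetical.

```python
from statistics import median


def normalized_depth_profile(bin_depths, baseline_depths):
    """Scale per-bin read depths by the median baseline depth (assumed scheme).

    bin_depths:      depths for consecutive bins across the flanking window
    baseline_depths: depths for bins representing the cell's typical coverage
    A value near 1.0 means typical coverage; values well below or above 1.0
    would suggest under- or over-amplification at that position.
    """
    baseline = median(baseline_depths)
    if baseline == 0:
        raise ValueError("baseline depth is zero; cannot normalize")
    return [depth / baseline for depth in bin_depths]
```

For example, `normalized_depth_profile([10, 20, 5], [8, 10, 12])` divides each depth by a baseline median of 10.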
