[Preprint]. 2025 Oct 7:2025.10.07.680917.
doi: 10.1101/2025.10.07.680917.

Multi-platform framework for mapping somatic retrotransposition in human tissues

Seunghyun Wang et al. bioRxiv. 2025.

Abstract

Mobile element insertions (MEIs) shape the human genome in both germline and somatic tissues. While inherited MEIs are well characterized, mapping somatic MEIs (sMEIs) in non-cancer tissues remains challenging because of their low allelic fraction and repetitive nature. We established an integrative framework for sMEI analysis that leverages modern sequencing technologies and analytical innovations. Using a mixture of well-established cell lines, we first benchmarked sMEI detection and demonstrated the advantages of long-read and MEI-targeted sequencing for ultra-low-frequency events. We then showed, in in silico tumor-normal mixtures, that haplotype phasing and donor-specific assemblies refine sMEI detection by distinguishing true sMEIs from germline and false signals. We further developed a source-tracing strategy based on internal sequence variation, expanding the catalogue of active source elements beyond traditional transduction-based methods. Applying this framework to donor tissues, we identified 18 rare somatic L1 insertions, revealing structural and source diversity. Our work provides a foundational framework for, and biological insight into, sMEIs.
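To illustrate why a low allelic fraction makes detection hard, the following minimal sketch estimates the chance of sampling enough insertion-supporting reads at a given depth. It is not taken from the preprint: it assumes each read independently supports the insertion with probability equal to the VAF (a simple binomial model that ignores mapping and sequencing errors), and the function name and the example depths, VAF, and read threshold are hypothetical.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_reads: int) -> float:
    """Probability of seeing at least `min_reads` supporting reads.

    Assumption: the number of supporting reads follows Binomial(depth, vaf).
    """
    # Sum the probabilities of observing fewer than `min_reads` supporting reads.
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical settings: a 1% VAF event requiring two supporting reads.
    for depth in (30, 60, 200):
        print(depth, round(detection_probability(depth, 0.01, 2), 3))
```

Under this model the probability rises with depth, which is one way to read the motivation for deeper or insertion-enriched sequencing of ultra-low-frequency events.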

Conflict of interest statement

All other authors declare no conflict.

Figures

Figure 1 ∣ Overview of integrative framework for mapping sMEIs
A, Benchmarking sMEI detection methods using the HapMap mixture from the SMaHT network. A benchmarking set was generated from PacBio and haplotype assemblies and stratified into three tiers by MEI subfamily and TPRT features. Nine sMEI detection methods were evaluated across sequencing platforms, depths, call set concordance, genomic regions, and VAFs. The integrative strategy incorporates both calls and raw read signals to expand the set of high-confidence sMEI candidates. GCC, Genome Characterization Centre in the SMaHT network. B, Refinement of sMEI detection using haplotype phasing and donor-specific assembly (DSA), and tracing of source L1s using internal sequence variation. An in silico tumor-normal mixture dataset (CASTLE project) was generated to simulate sMEIs at various VAFs. Source L1s were traced using internal L1 sequence variation, and their CpG methylation status was analyzed. FL-L1, full-length L1. C, Discovery of sMEIs in normal human tissues by applying the integrative framework for sMEI detection. The steps established with the HapMap mixture (sMEI detection method benchmark and multi-platform integration) and the tumor-normal mixture (haplotype phasing/DSA and source L1 tracing) were integrated. The final sMEI candidates were confirmed by MEI-targeted sequencing and nested PCR.
Figure 2 ∣ sMEI detection method benchmark and multi-platform (short and long read) integration using HapMap mixture
A, F1 scores, recall, and precision across sequencing platforms (Illumina, PacBio, ONT, and MEI-targeted sequencing) at various sequencing depths (full-coverage WGS from BCM GCC and downsampled data). HAT-seq performance is the average of four experimental replicates; error bars show the standard deviation. B, F1 scores, recall, and precision in 151-mer unique vs. non-unique regions. C, F1 scores, recall, and precision in each VAF bin. Platforms are indicated by shape. Dots in the yellow area of the recall-precision scatter plot indicate that the corresponding method reports no calls in that VAF bin. For HAT-seq, only recall is shown because caller-level VAFs are not available for precision calculation. A-C, The same color key represents detection methods. D, Call set similarity (Jaccard index) of true positives (TPs) and false positives (FPs). Sequencing platforms are grouped by color. B-D, 200x and 60x callsets were used for short-read (Illumina) and long-read (PacBio and ONT) data, respectively. E, Schematic diagram of the integrative strategy combined with raw signal rescue, and the numbers of TPs and FPs for each support type (S1-5) before and after FP filtering (S3’-5’). Support types are noted on the x-axis. F, Percentages and counts of TPs across VAF bins, insertion length bins, and genomic regions, shown for each support type. G, F1 scores of multi-platform integrated callsets across different Illumina and PacBio coverage combinations.
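The metrics named in this caption can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' benchmarking code: calls and truth entries are compared as exact-match keys (a real benchmark would typically match insertion breakpoints within a tolerance window), and the function names are hypothetical.

```python
def precision_recall_f1(calls: set, truth: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a call set against a truth set."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical insertion keys, for illustration only.
    truth = {"ins_a", "ins_b", "ins_c", "ins_d"}
    calls = {"ins_a", "ins_b", "ins_x"}
    print(precision_recall_f1(calls, truth))
    print(jaccard({"ins_a", "ins_b"}, {"ins_b", "ins_c"}))
```

Computing the Jaccard index separately on the TP subsets and the FP subsets of two callers gives the kind of pairwise similarity described for panel D.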
Figure 3 ∣ Refinement of sMEI detection from haplotype phasing and DSA analysis
A, Schematic illustrating how phasing information and DSA are leveraged to mitigate errors caused by germline MEIs, conflicting signals, or segmental duplications (SDs). B, Precision of sMEI calls across four tumor spike-in mixtures (2%, 10%, 20% and 40%). Bars indicate precision across four detection methods (cuteSV, PALMER, Sniffles2, xTea_long) using GRCh38-based initial alignment, DSA-based alignment, and various refinement methods (phasing on GRCh38, phasing on DSA, GRCh38-phasing with germline site filter, and DSA-phasing with germline site filter); coloured points show per-caller values. C, Composition of true positives (TPs) and false positives (FPs) inside vs. outside SDs, and FPs due to germline signals in the population, aggregated across four callers. Results are shown for GRCh38-based alignment, phasing on GRCh38, and DSA-based alignment, based on unique observations. D, Mis-mapping in an SD pair generates false sMEI calls on GRCh38 but not on the DSA. Upper panel: GRCh38 shows two 4.96kb reference blocks with L1 signal (black) within the inverted SD pair (chr11:4,230,966, green, and chr11:4,285,816, orange). Reads carrying an additional, nearly identical 4.96kb L1 sequence (purple) align arbitrarily to the two blocks, yielding two apparent sMEIs at chr11:4,276,716 and chr11:4,291,927. Lower panel: In the DSA, a germline tandem duplicate (purple) was constructed in one SD; reads align consistently to both SD regions with no somatic signals.
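One way phasing information could separate the signal classes named in panel A is sketched below. It is a hypothetical illustration, not the refinement used in the preprint: it assumes each read overlapping a candidate site has already been assigned to haplotype 1, haplotype 2, or left unphased (None), and it uses strict rules with no tolerance for phasing or alignment errors. The function name, labels, and `min_support` threshold are assumptions.

```python
from collections import Counter


def classify_by_phasing(reads, min_support: int = 2) -> str:
    """Classify a candidate insertion from phased read evidence.

    `reads` is an iterable of (haplotype, carries_insertion) pairs, where
    haplotype is 1, 2, or None for unphased reads (which are ignored).
    """
    support = Counter()  # insertion-carrying reads per haplotype
    total = Counter()    # all phased reads per haplotype
    for hap, carries in reads:
        if hap is None:
            continue
        total[hap] += 1
        if carries:
            support[hap] += 1

    if sum(support.values()) < min_support:
        return "insufficient_support"
    carriers = [hap for hap in (1, 2) if support[hap] > 0]
    if len(carriers) == 2:
        # A single somatic event is not expected on both haplotypes.
        return "conflicting"
    hap = carriers[0]
    if support[hap] == total[hap]:
        # Every read on the haplotype carries it: consistent with germline.
        return "germline_like"
    # Only a subset of reads on one haplotype carries it.
    return "somatic_candidate"


if __name__ == "__main__":
    reads = [(1, True)] * 3 + [(1, False)] * 17 + [(2, False)] * 20
    print(classify_by_phasing(reads))
```

A practical version would replace the exact equality test with an error-tolerant threshold and would likely combine this with the germline site filter mentioned in panel B.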
Figure 4 ∣ L1 source tracing using internal sequence variations in tumor-normal mixture
A, Schematic overview of the transduction-based and internal sequence-based source tracing pipelines. Recall and precision are shown for the internal sequence-based source tracing method. B, Comprehensive characterization of L1 source loci with at least two offspring. The left bar chart shows the number of offspring from each source, with color indicating the prediction method (transduction, internal sequence, or both). The adjacent columns provide detailed annotations for each source locus, including reference (ref) or non-reference (nonref) status, unique allelic sequences, allele-specific activity, and source activity reported in human population studies (1KGP, HGSVC) and cancer studies (ICGC, PCAWG, Tubio). For allele-specific activity, sources detected only by the transduction method and containing only one unique allelic sequence are shown in white, as allele-specific activity cannot be calculated for them. For other sources inferred from L1 internal sequence variation, if two unique allelic sequences are present at the source locus, allele-specific activity was calculated and colored according to the major haplotype. C, Examples of two sources (7p15.3 and 2p22.2) and their offspring. Multiple sequence alignment of offspring and source sequences was performed using Clustal Omega. Bases differing among the four source sequences were extracted, with asterisks indicating haplotype-specific variants. D, Success rate of source tracing using L1HS internal sequence variations as a function of internal L1 sequence size. E, DNA methylation level at the 5’UTR (±2kb) of FL-L1HS for four categories of source loci in the H2009 tumor cell line.
Figure 5 ∣ Characterization of sMEIs and source L1s in normal human tissue homogenates
A, Example of the somatic L1 detection pipeline applied to ST003-brain, yielding one high-confidence insertion supported by a long-read call and a short-read alignment signal. This process was repeated for all samples from each of the 5 GCCs and for other donors. B, Number of somatic L1 insertions per sample. C, Length distribution of somatic L1 insertions across the tissue samples. D, Somatic L1 categorization by 5’ inversion, 3’ transduction, and intragenic vs. intergenic genomic regions. E, Schematic diagram of the U5/L1 chimeric insertion (top) and representative agarose gels (bottom) for the insertion in the ST003 brain sample. Due to low mosaicism, the L1 insertion band was not visible after the initial PCR (FL-PCR). A region of the gel at the expected size was excised (dashed green box) for subsequent DNA extraction and re-amplification. This repeated nested PCR approach isolated a clean band corresponding to the L1 insertion allele (green arrowheads). Red arrowheads indicate the band from the reference empty allele. F, Source L1s identified by transduction and internal L1 sequences, and schematic diagram of the L1 insertions in ST002 lung and colon samples that were traced to the same L1 source by internal L1 sequence variation.
