[Preprint]. 2025 Oct 11:2024.12.18.629274.
doi: 10.1101/2024.12.18.629274.

A personalized multi-platform assessment of somatic mosaicism in the human frontal cortex

Weichen Zhou et al. bioRxiv.

Abstract

Somatic mutations in individual cells create genomic mosaicism, influencing genetic disorders and cancers. While clonal mutations in cancers are well-studied, rarer somatic variants in normal tissues remain poorly characterized. This study systematically evaluates detection methods using a personalized donor-specific assembly (DSA) from a neurotypical individual's dorsolateral prefrontal cortex assessed with Oxford Nanopore, NovaSeq, linked-read sequencing, Cas9-targeted long-read sequencing (TEnCATS), and single-neuron MALBAC amplification. The haplotype-resolved DSA improved cross-platform analysis, dramatically increasing phasing rates. Germline SNVs, structural variations (SVs), and transposable elements (TEs) were recalled with 99.4%-99.7% accuracy in bulk tissue, and phased haplotype analysis reduced false positives by 15.4%-75.1% for putative somatic candidates. Long-read single-neuron sequencing detected nine somatic SV candidates, demonstrating enhanced sensitivity for rare variants, while TEnCATS identified eight low-frequency somatic TE candidates. These findings highlight advanced methodologies for precise somatic variant detection, critical for understanding mosaicism's role in health and disease.

Keywords: Multi-platform Sequencing; Personalized Genome Assembly; Single Cell; Somatic Mosaicism.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Figure 1. Diagram of multi-platform DNA sequencing data generation for the LIBD75 frontal cortex.
Black arrows correspond to relevant methods and data used for genome assembly, while blue arrows and dotted lines indicate methods and data used for variant calling. DLPFC, dorsolateral prefrontal cortex.
Figure 2. Construction of a haplotype-resolved donor-specific assembly to facilitate genetic variation calling in the LIBD75 DLPFC tissue.
a, Pipeline to generate a haplotype-resolved assembly for LIBD75 DLPFC tissue using bulk sequences (underlined). ONT reads were used to build the raw diploid assemblies. Illumina reads, which are highly accurate for point mutations, were used to refine the raw assemblies and the phased SNVs. Linked reads were used to bridge phase blocks for the two haplotypes. The three deliverables are highlighted in bold font within the diagram. b, Number and length distributions of assembly contig-based genetic variations. c, Refinement of phased contig-based SNVs, based on the allele frequency distribution in the Illumina bulk WGS. A 20% allele frequency cutoff is denoted by the red dotted line. d, Length distribution of phase blocks from the phased assembly (blue), from linked reads called by LongRanger2.0 (green), and from the final refined assembly (red), obtained by bridging the blocks from the phased assembly and linked reads. The N50 length is denoted by dotted lines, based on the phase blocks after filtering out reads without heterozygous phased SNVs in each category. e, An example (chromosome 4) showing the improvement of the final refined phase blocks over those from the phased assembly and linked reads. Adjacent blocks are colored in maize and blue. f, Phasing rates across the various platform sequences based on assembly information. Maize represents reads in haplotype 1 (H1), blue represents reads in haplotype 2 (H2), and grey represents non-phased reads.
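The N50 length reported in panel d can be illustrated with a minimal sketch. It uses the standard N50 definition (the block length at which blocks of that size or larger cover at least half of the total phased length); the function name and the toy block lengths are illustrative assumptions and are not taken from the study.

```python
def phase_block_n50(block_lengths):
    """Return the N50 of a collection of phase-block lengths (in bp).

    Standard definition: sort blocks from longest to shortest and return the
    length of the block at which the running total first reaches half of the
    summed length. The name and inputs are hypothetical, for illustration.
    """
    ordered = sorted(block_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one phase block is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up block lengths (not data from the study).
toy_blocks = [2, 3, 4, 5, 6]
print(phase_block_n50(toy_blocks))
```

Bridging short blocks into longer ones raises this value, which is the effect the red distribution in panel d is meant to show.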
Figure 3. Assessment of germline genetic variants in bulk tissue across sequencing platforms.
a, The recall rates (bar plots) and allele frequency distributions (histograms) in ONT bulk WGS sequencing (left) and Illumina bulk WGS sequencing (right). Orange represents SNVs, blue represents SVs, and green represents TEs. b, Recall rates of SV (left) and TE (right) subtypes in the analysis of ONT and Illumina bulk tissue. c, Recall rates of SNVs (left), SVs (middle), and TEs (right) in pseudobulk samples derived from pooled single-cell runs. The 115 cells (light purple) are single-cell libraries sequenced in batches of one, five, and ten cells on MinION flow cells; five cells (purple) were sequenced as a batch on one PromethION flow cell; and one cell (dark purple) was sequenced on one PromethION.
Figure 4. Haplotype-based analysis enables the removal of false positive somatic calls in bulk tissue.
a, Schematic illustrating i) the germline (white and black) and somatic (green) genetic variants in the diploid contigs and reads, and ii) the use of phasing information to eliminate false positive somatic calls due to unequal representation of haplotypes (hapErrors, orange), sequencing noise (seqErrors, red), or misalignment errors (mapErrors, purple). b, Heatmaps of germline homozygous SNVs (left) and heterozygous SNVs (right) in ONT WGS bulk tissue sequences. The X-axis represents the allele frequency (AF) in Haplotype A, the haplotype with the higher alternative allele frequency, and the Y-axis represents the AF in Haplotype B, the second haplotype. c, 2D kernel density plot (left, bin=0.1AF) for putative somatic SNVs showing the exclusion of false positives attributed to hapErrors (bottom right, orange block) and seqErrors (upper left, red block) using two dotted lines (x=0.8 and y=0.2x). The plot on the right (bin=0.02AF) offers a magnified view of the range between (0,0) and (0.3, 0.3). d, 2D kernel density plots (bin=0.03AF) for putative somatic variants: SVs (left) and TEs (right). mapErrors are indicated by purple blocks.
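The two dotted lines in panel c (x=0.8 and y=0.2x) suggest a simple rule for sorting putative somatic SNVs by their per-haplotype allele frequencies. The sketch below is one possible reading of the caption, not the authors' implementation: it assumes that calls above the line y=0.2x are treated as seqErrors, that calls to the right of x=0.8 are treated as hapErrors, and that the seqError check is applied first. The function name, the inequality directions, and the order of checks are all assumptions.

```python
def classify_somatic_candidate(af_hap1, af_hap2, hap_cutoff=0.8, slope=0.2):
    """Classify a putative somatic SNV from its per-haplotype allele frequencies.

    Haplotype A is taken to be whichever haplotype has the higher alternative
    allele frequency (x-axis); Haplotype B is the other one (y-axis).
    The thresholds mirror the dotted lines in the caption; how they are
    applied here is an assumption made for illustration.
    """
    x = max(af_hap1, af_hap2)  # AF in Haplotype A
    y = min(af_hap1, af_hap2)  # AF in Haplotype B

    # Assumed rule: signal on both haplotypes looks like sequencing noise.
    if y > slope * x:
        return "seqError"
    # Assumed rule: near-complete AF on one haplotype looks like a germline
    # variant exposed by unequal haplotype representation.
    if x > hap_cutoff:
        return "hapError"
    return "candidate"


# Toy inputs (not data from the study).
for afs in [(0.10, 0.00), (0.90, 0.00), (0.10, 0.05)]:
    print(afs, classify_somatic_candidate(*afs))
```

Under these assumptions, a retained candidate is a variant seen at low-to-moderate frequency on one haplotype and essentially absent from the other.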
Figure 5. TEnCATS methodology detects non-reference transposable elements (TEs) within donor DLPFC tissue.
a, Recall rates for targeted active TE subfamilies by TEnCATS based on the assembly-based TE callset. b, Number of supporting reads of non-reference TEs reported by NanoPal from TEnCATS versus PALMER from ONT WGS. The dotted line represents the mean value of each group. c, A semi-automatic pipeline with manual inspection was used to stringently refine somatic TE candidates into a final list. Low-quality regions include any segmental duplications, low-confidence mask regions used in this project, and reference Alu repeats or LINE-1 repeat regions in RepeatMasker for Alu candidates or L1Hs candidates, respectively. Non-TE-related sequences refer to any genomic content that is not TE sequence, polyA tracts, target site duplications, or potential transduction sequences between two polyA tract signals. BLAT and IGV were used for the final manual inspection. d, IGV screenshot of a somatic Alu element candidate captured by TEnCATS at chr8:82,726,776 (H1, haplotype 1; H2, haplotype 2). The red arrow indicates the insertion site. The middle illustration demonstrates the soft-clipped sequence in the supporting read representing the somatic signal at the 3’ end of AluYb8. This sequence aligns with the cut site of the guide RNA (marked in red) and the AluYb8 consensus sequence (orange bar).
Figure 6. Haplotype-aware detection of somatic CNVs and TEs in single neurons.
a, Length distribution of candidate somatic deletions (>1 Mb) identified using ONT long-read sequencing by GARLIC and Illumina short-read sequencing by Ginkgo. b, Swarm plot of the aggregated potential somatic deletions detected in each cell by the two methods. Cells with aggregated deletions of more than 10 Mb are highlighted by bold circles. c, Overlap of calls between the two methods. a, b, and c share the same color legend: ONT by GARLIC (blue) and Illumina by Ginkgo (maize). d, An example of two candidate somatic deletions on chr7 in a single neuron (9203) detected by the two methods. The main panel shows the read depth plots for the 9203 single neuron from ONT single-cell sequencing (above, blue frame) and Illumina single-cell sequencing (below for six random single neurons). The left panel shows the signal distribution from Rppc by GARLIC for chr7 in neuron 9203 (above) and the copy number states by Ginkgo (below). The bottom panel illustrates the signals from ONT single-cell sequences for two mutations, Rppc (yellow) and read coverage (blue). The plots for neuron 9203 are depicted above, while plots for three random cells are depicted below. Signals representing the three candidate somatic deletions across all panels are highlighted between red bars. e, Swarm plot of recall rates for high-confidence assembly-based germline TEs in individual single cells. Each point represents one cell, and dashed lines represent pooled single cells. The 115 cells (dark green) are single-cell libraries sequenced in batches of one, five, and ten cells on MinION flow cells; five cells (green) were sequenced as a batch on one PromethION flow cell; and one cell (light purple) was sequenced on one PromethION. The same color legend for dashed lines applies to f and g. f, Number of candidate somatic calls per individual single cell. Each point represents one cell. g, Number of cells in which each candidate somatic call is detected. Left bar, non-supportive cells with non-supportive go-through reads; right bar, supportive cells with supportive go-through reads, or left- or right-clipped reads. h, A candidate somatic Alu insertion on chromosome 3 was observed in two out of 121 cell samples and was not detected by ONT WGS in bulk tissue.
