Genome Res. 2024 Mar 20;34(2):189-200.
doi: 10.1101/gr.278556.123.

Genomic origin, fragmentomics, and transcriptional properties of long cell-free DNA molecules in human plasma

Huiwen Che et al. Genome Res.

Abstract

Recent studies have revealed an unexplored population of long cell-free DNA (cfDNA) molecules in human plasma using long-read sequencing technologies. However, the biological properties of long cfDNA molecules (>500 bp) remain largely unknown. To this end, we have investigated the origins of long cfDNA molecules from different genomic elements. Analysis of plasma cfDNA using long-read sequencing reveals an uneven distribution of long molecules across the genome. Long cfDNA molecules show overrepresentation in euchromatic regions of the genome, in sharp contrast to short DNA molecules. We observe a stronger relationship between the abundance of long molecules and mRNA gene expression levels, compared with short molecules (Pearson's r = 0.71 vs. -0.14). Moreover, long and short molecules show distinct fragmentation patterns surrounding CpG sites. Leveraging the cleavage preferences surrounding CpG sites, the combined cleavage ratios of long and short molecules can differentiate patients with hepatocellular carcinoma (HCC) from non-HCC subjects (AUC = 0.87). We also investigated knockout mice in which selected nuclease genes had been inactivated in comparison with wild-type mice. The proportion of long molecules originating from transcription start sites is lower in Dffb-deficient mice but higher in Dnase1l3-deficient mice compared with that of wild-type mice. This work thus provides new insights into the biological properties and potential clinical applications of long cfDNA molecules.

Figures

Figure 1.
Distribution of long and short molecules from human plasma DNA. (A) Comparison of genomic representation on Chromosome 10 between long and short DNA molecules in 15 nonpregnant controls using SMRT sequencing. The overrepresentation and underrepresentation of long cfDNA molecules with respect to short molecules are indicated in blue and red, respectively. The genomic representation was determined based on 100-kb bins and was further smoothed by a 1-Mb moving average sliding window. The horizontal solid lines indicate normalized median differences between long and short molecules. The dashed rectangular boxes indicate one euchromatic (i) and one heterochromatic (ii) region. The track in between overrepresentation and underrepresentation of long molecules shows the chromosome ideogram. The ideogram band colors correspond to cytogenetic bands in UCSC Genome Browser. Darker bands are AT-rich, and lighter bands are GC-rich. Centromeric regions are indicated in dark green. The bottom track displays gene densities estimated by number of genes in 100-kb windows. (B) Comparison of genomic representation on Chromosome 10 between long and short DNA molecules in 31 pregnant samples using ONT sequencing.
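The binning and smoothing described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the use of per-bin fractions, and the plain long-minus-short difference are assumptions made for the example. Only the 100-kb bin size and the 1-Mb moving average (10 consecutive bins) come from the caption.

```python
from collections import Counter

BIN_SIZE = 100_000   # 100-kb bins, as stated in the caption
WINDOW_BINS = 10     # a 1-Mb moving average spans 10 consecutive 100-kb bins


def bin_fractions(positions, chrom_length, bin_size=BIN_SIZE):
    """Fraction of molecules whose position falls in each fixed-size bin."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = Counter(pos // bin_size for pos in positions)
    total = len(positions)
    return [counts.get(i, 0) / total if total else 0.0 for i in range(n_bins)]


def moving_average(values, window=WINDOW_BINS):
    """Moving average over `window` bins; windows are truncated at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i - half + window)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def long_minus_short(long_positions, short_positions, chrom_length):
    """Smoothed per-bin difference (assumed metric): positive values mean
    long molecules are overrepresented relative to short molecules."""
    long_frac = bin_fractions(long_positions, chrom_length)
    short_frac = bin_fractions(short_positions, chrom_length)
    diff = [l - s for l, s in zip(long_frac, short_frac)]
    return moving_average(diff)
```

Here `positions` is assumed to hold one coordinate per molecule (for example, its midpoint) on a single chromosome. The normalization actually used in the paper may differ.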
Figure 2.
The abundance of long cfDNA molecules shows a positive correlation with transcriptional activity. (A,B) The abundance of long and short molecules on gene bodies of unexpressed and housekeeping genes for nonpregnant controls and pregnant subjects. (A) Data from SMRT sequencing. (B) Data from ONT sequencing. (C,D) The correlation between gene expression and molecule abundance. The median expression level of a gene across tissues was log-transformed and scaled. The median molecule abundance was derived from scaled expression levels. P-values by Pearson's correlation test. (C) Pooled data of 15 nonpregnant controls from SMRT sequencing. (D) Pooled data of 31 pregnant samples from ONT sequencing. (E,F) Comparisons of DNA methylation between long and short molecules. Data points from one sample are connected with a black line. P-values by Wilcoxon rank-sum test. (E) Data from SMRT sequencing. (F) Data from ONT sequencing.
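As a rough illustration of the correlation analysis in panels C and D, the sketch below log-transforms expression values and computes Pearson's r against molecule abundance. The helper names, the log2 transform, and the pseudocount of 1 are assumptions for the example. The caption states only that expression was log-transformed and scaled and that Pearson's correlation was used.

```python
import math


def log_expression(values, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount (assumed) avoids log(0)."""
    return [math.log2(v + pseudocount) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

A call such as `pearson_r(log_expression(expression), abundance)` would give one coefficient per molecule class (long or short), where `expression` and `abundance` are hypothetical per-gene lists.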
Figure 3.
The abundance of long cfDNA molecules for HCC detection. (A) Comparison of the abundance of SMRT sequencing molecules among healthy individuals, HBV carriers, and patients with HCC. The abundance of long and short molecules was measured using the top 5000 expressed genes in HCC tumor tissues. P-value for differences among groups by Kruskal–Wallis test. Post hoc pairwise Wilcoxon rank-sum test P-values with Benjamini–Hochberg adjustment are shown above horizontal lines. (B) ROCs of long molecule abundance measured in A for distinguishing individuals with HCC from those without HCC (healthy subjects and HBV carriers). Samples were included for constructing ROCs at multiple thresholds on the total number of molecules per sample: 0.1 million (total > 0.1M), 0.3 million (total > 0.3M), 0.5 million (total > 0.5M), and 1 million (total > 1M).
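The thresholded ROC analysis in panel B can be illustrated with a small sketch. It assumes each sample is a hypothetical `(score, total_molecules, is_hcc)` tuple and that a higher score points toward HCC. Neither the data layout nor the direction of the score is specified in the caption. If the score is lower in cases, the AUC is one minus the value returned here.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a positive sample scores above a negative
    one, with ties counted as one half (equivalent to the Mann-Whitney U
    statistic divided by the number of pairs)."""
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one sample")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_with_min_molecules(samples, min_total):
    """Keep samples with more than `min_total` molecules, then compute the AUC.

    `samples` is a list of (score, total_molecules, is_hcc) tuples; this
    layout is an assumption made for the example.
    """
    kept = [s for s in samples if s[1] > min_total]
    pos = [score for score, _, is_hcc in kept if is_hcc]
    neg = [score for score, _, is_hcc in kept if not is_hcc]
    return roc_auc(pos, neg)
```

Calling `auc_with_min_molecules(samples, 100_000)` corresponds to the "total > 0.1M" inclusion rule; the other thresholds follow by changing `min_total`.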
Figure 4.
Normalized end frequencies of SMRT sequencing data. (A) Normalized end frequencies of SMRT long and short molecules pooled from plasma of healthy individuals at TSSs of the expression-stratified gene groups EXP1 to EXP5, corresponding to low to high expression. Transcription start positions are denoted as position 0. All transcription start sites were strand-adjusted so that positive positions are in the direction of transcription. (B,C) Normalized end frequencies of SMRT long and short molecules pooled from plasma of healthy individuals at DHSs (B) and CTCF binding sites (C). DHS or CTCF binding site peaks are denoted as position 0; 2000 bp upstream and downstream are shown.
Figure 5.
Cleavage profiles of long and short molecules surrounding CpGs. (A,B) Cleavage profiles surrounding all autosomal CpGs for short (A) and long (B) cfDNA molecules. Each line represents one sample. A cleavage window of 11 bases is shown. Positions 0 and 1 indicate cytosine and guanine, respectively. (C) Box plot of CGN/NCG motif ratios for long and short molecules. (D) Box plot of cleavage ratios between aggregating positions −4, −2, 1, and 4 and position −1. (E) AUCs for distinguishing patients with HCC from non-HCC subjects using the cleavage ratios in D. P-values of differences among groups by Kruskal–Wallis tests. Post hoc pairwise P-values by Wilcoxon rank-sum tests with Benjamini–Hochberg adjustment shown above horizontal lines (C,D).
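One plausible reading of the cleavage ratio in panel D is sketched below: the summed cleavage proportion at positions −4, −2, 1, and 4 divided by the proportion at position −1, within the 11-base window around the CpG cytosine (position 0). The function names, the input format (fragment-end offsets relative to the cytosine), and the exact form of the ratio are assumptions. The caption does not give the formula.

```python
AGGREGATED_POSITIONS = (-4, -2, 1, 4)  # positions named in the caption
REFERENCE_POSITION = -1                # position the aggregate is compared with


def cleavage_profile(end_offsets, window=range(-5, 6)):
    """Proportion of fragment ends at each offset in the 11-base window.

    `end_offsets` holds the offset of each fragment end relative to the CpG
    cytosine (position 0); offsets outside the window are ignored.
    """
    counts = {pos: 0 for pos in window}
    for offset in end_offsets:
        if offset in counts:
            counts[offset] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragment ends inside the window")
    return {pos: count / total for pos, count in counts.items()}


def cleavage_ratio(profile):
    """Aggregated cleavage at the listed positions over cleavage at -1."""
    numerator = sum(profile[pos] for pos in AGGREGATED_POSITIONS)
    denominator = profile[REFERENCE_POSITION]
    if denominator == 0:
        raise ValueError("no cleavage observed at the reference position")
    return numerator / denominator
```

Computing this ratio separately for long and short molecules in each sample would give two features per sample, which is one way the combined ratios mentioned in the abstract could be assembled.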
Figure 6.
Nuclease-mediated fragmentation in knockout mice. (A) Pooled molecule size distributions of wild-type and nuclease-deficient mice from SMRT sequencing. Sizes in the range of 0 to 5000 bp are shown with log10-transformed frequencies. (B) Zoom-in plot of A showing sizes in the range of 0 to 250 bp on the linear scale. (C) Box plot showing the proportions of long molecules originating from within 2000 bp upstream and downstream of transcription start sites. (D) Percentage change in the abundance of pooled long molecules (>500 bp) relative to wild-type mice in low-, medium-, and high-expression gene groups.
