Nat Biotechnol. 2023 Oct;41(10):1457-1464.
doi: 10.1038/s41587-022-01652-0. Epub 2023 Feb 6.

Simultaneous sequencing of genetic and epigenetic bases in DNA

Jens Füllgrabe et al. Nat Biotechnol. 2023 Oct.

Abstract

DNA comprises molecular information stored in genetic and epigenetic bases, both of which are vital to our understanding of biology. Most DNA sequencing approaches address either genetics or epigenetics and thus capture incomplete information. Methods widely used to detect epigenetic DNA bases fail to capture common C-to-T mutations or distinguish 5-methylcytosine from 5-hydroxymethylcytosine. We present a single base-resolution sequencing methodology that sequences complete genetics and the two most common cytosine modifications in a single workflow. DNA is copied and bases are enzymatically converted. Coupled decoding of bases across the original and copy strand provides a phased digital readout. Methods are demonstrated on human genomic DNA and cell-free DNA from a blood sample of a patient with cancer. The approach is accurate, requires low DNA input and has a simple workflow and analysis pipeline. Simultaneous, phased reading of genetic and epigenetic bases provides a more complete picture of the information stored in genomes and has applications throughout biomedicine.

Conflict of interest statement

Competing interests: S.B. is a founder, adviser and shareholder of Cambridge Epigenetix and of Inflex. All the other authors are current or former employees and hold share options. Patents covering this work and the methodologies described in this manuscript have been filed by Cambridge Epigenetix (patent applicant); the inventors are S.B., J.F., W.S.G., J.D.H., S.L., D.M., O.N., T.O., M.S., A.V., N.J.W., S.Y., H.R.B. and R.S.S.-B. The application numbers are WO2022023753A1 (published), US20220298551A1 (pending), US20220290215A1 (issued), EP4034676A1 (pending) and EP4083231A1 (pending).

Figures

Fig. 1. Five-letter seq.
a, Double-stranded DNA with base modifications. b, Traditional genetic sequencing only captures four states of information, which makes it impossible to determine both genetic and epigenetic information. Base conversions can alter the information output, but the approach is inherently limited by only having four output states. c, Two-base coding results in 4² = 16 possible states, enabling simultaneous determination of epigenetic and genetic states. d, Laboratory workflow. Hairpins are ligated to double-stranded DNA and the strands are separated. The 3′–5′ strand is omitted for clarity, but follows a similar procedure to the 5′–3′ strand. An additional copy strand is synthesized using Klenow exo-polymerase and short sequencing adapters are ligated. ModCs are protected through oxidation by TET2 and glycosylation by beta-glucosyltransferase (BGT). Treatment by APOBEC3A and UvrD helicase is used to simultaneously open up and deaminate the hairpin. Unprotected Cs are deaminated from C to U (read as T). e, Sequencing protocol. The deaminated DNA libraries are PCR amplified and indexes are added. Templates are paired-end sequenced. The two reads represent the same stretch of DNA and are locally aligned. Using a set of resolution rules, the pairs of bases across the two reads are resolved into one of five states: A, C, modC, G, T. The method is able to identify errors occurring during PCR and sequencing. f, Overview of the resolution rules and states under the five-letter decoding model. modC is denoted in pink in the diagram and is coded for by the pair CG.
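The two-base coding of panels c and f can be illustrated with a short sketch. It assumes each position is given as a pair (base read from the original strand, base read opposite it on the copy strand), and it derives the valid pairs from the chemistry in panel d: an unprotected C reads as T on either strand, while a protected modC stays C. Only the modC = CG pairing is stated in the caption; the other table entries and the names `RESOLUTION_RULES` and `resolve` are illustrative assumptions, not the authors' pipeline.

```python
from itertools import product

BASES = "ACGT"

# Assumed pair encoding: (original-strand read, copy-strand base opposite it).
# Only ("C", "G") -> modC is stated in the caption; the other entries are
# inferred from the described conversion chemistry and are an assumption.
RESOLUTION_RULES = {
    ("A", "T"): "A",
    ("T", "A"): "T",
    ("G", "T"): "G",     # copy-strand C opposite G is unprotected, reads as T
    ("T", "G"): "C",     # unprotected C on the original strand reads as T
    ("C", "G"): "modC",  # protected C survives deamination
}

def resolve(original_read: str, copy_read: str) -> list[str]:
    """Resolve aligned read pairs into five-letter calls; 'N' marks an invalid pair."""
    if len(original_read) != len(copy_read):
        raise ValueError("reads must be aligned to the same length")
    return [RESOLUTION_RULES.get(pair, "N") for pair in zip(original_read, copy_read)]

# All 4^2 = 16 two-base states, and those that match no rule.
all_pairs = list(product(BASES, repeat=2))
invalid_pairs = [p for p in all_pairs if p not in RESOLUTION_RULES]

if __name__ == "__main__":
    print(len(all_pairs), len(invalid_pairs))
    print(resolve("ATGTCA", "TATGGA"))
```

Under these assumptions, 11 of the 16 pairs match no rule, so a mismatch introduced during PCR or sequencing can be flagged instead of being silently miscalled.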
Fig. 2. Performance of five-letter seq on genomic DNA.
a, Top, five-letter seq (blue), WGBS (orange) and EM-seq (green) average modC levels across all autosomes in NA12878 at CpGs in two samples per technology (n = 6) (left) and non-CpG contexts, where four datapoints correspond to CHH in both samples and CHG in both samples per technology (n = 12) (right). Bottom, sensitivity and specificity of modC calling in five-letter seq, as computed on spike-in ground-truth control sequences for both samples (n = 6). b, Correlation heatmap showing high levels of agreement with WGBS (Pearson’s R = 0.94, P < 10⁻⁸). Counts were pooled across duplicate samples for both WGBS and five-letter seq and the comparison was limited to sites that were covered at least three times in both methods (26,067,695 sites or 94.24% of all CpGs). c, Bland–Altman plot, with the average of the modC levels between the two methods on the x axis and the difference on the y axis (median difference of −2.6% with 95% of CpGs differing by between −33% and 23%, indicated by solid and dashed red lines, respectively). Counts were pooled across duplicate samples for both WGBS and five-letter seq and the comparison was limited to sites that were covered at least three times in both methods (26,067,695 sites or 94.24% of all CpGs). d, Genetic accuracy as calculated on NA12878 high-confidence regions for five-letter seq (blue), WGBS (orange), EM-seq (green) and standard Illumina sequencing (red). e, Precision and sensitivity of variant calling (SNPs and indels) on the y axis, using different quantities of five-letter seq reads on the x axis, pooled across duplicates. f, Manhattan plot of allele-specific methylation in NA12878. The x axis is chromosomal location and the y axis is −log₁₀(P) from Fisher’s exact test of association between genotype and in cis modC levels. PLAGL1, a known imprinted gene, is highlighted in red. g, Integrative Genomics Viewer (IGV) plot of a 92-nt region of the PLAGL1 gene centered on a C/T heterozygous SNP. Reads are grouped by the base observed at the variant site and forward and reverse mapping reads are shown in gray and green, respectively. ModCs in CpG sites are highlighted in red, with the modification being associated with the G base for reverse reads. Reads exhibiting the (reference) C allele are entirely methylated at CpG sites, whereas reads harboring the T allele are entirely unmethylated.
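The comparison in panels b and c can be sketched as follows: pool counts across duplicates, keep sites covered at least three times by both methods, then summarise the per-site difference in modC level. The input layout, the function names and the sign convention (first method minus second) are hypothetical, and the percentile calculation via `statistics.quantiles` is one reasonable choice, not necessarily the one used for the figure.

```python
from statistics import median, quantiles

def pooled_level(replicates):
    """Pool (modified, total) counts across replicates; return (level, coverage)."""
    modified = sum(m for m, _ in replicates)
    total = sum(t for _, t in replicates)
    return (modified / total if total else None), total

def bland_altman(sites_a, sites_b, min_cov=3):
    """Per-site (mean, difference) of modC levels for sites covered in both methods.

    sites_a / sites_b map a site identifier to a list of (modified, total)
    counts, one tuple per replicate (an assumed layout).
    """
    points = []
    for site in sites_a.keys() & sites_b.keys():
        level_a, cov_a = pooled_level(sites_a[site])
        level_b, cov_b = pooled_level(sites_b[site])
        if cov_a >= min_cov and cov_b >= min_cov:
            points.append(((level_a + level_b) / 2, level_a - level_b))
    return points

def summarise(points):
    """Median difference and the central 95% range of the differences."""
    diffs = sorted(d for _, d in points)
    # Cut points every 2.5%; the first and last bound the central 95%.
    cuts = quantiles(diffs, n=40, method="inclusive")
    return median(diffs), cuts[0], cuts[-1]
```

`summarise` needs at least two retained sites, because `statistics.quantiles` raises an error on fewer data points.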
Fig. 3. Application of five-letter seq to cfDNA.
a, Proportion of reads that are PCR and cluster duplicates (y axis) at input of 2 ng or 10 ng of cfDNA or 80 ng of gDNA. b, Proportion of genome covered with at least one read (y axis) at input of 2 ng or 10 ng of cfDNA or 80 ng of gDNA. c, Sensitivity and specificity of modC detection are unaffected by input amount. Input of spike-in ground-truth control DNAs was 0.5 ng for the gDNA samples and 0.05 ng for the cfDNA samples; sensitivity on methylated lambda DNA is shown in blue and specificity on unmethylated pUC19 DNA in orange.
Fig. 4. Six-letter seq.
a, Schematic of six-letter epigenetic sequencing protocol. A similar protocol to that of five-letter seq, described in Fig. 1d, is followed with the addition of a methyl-copy step where DNMT5 copies the 5mC from the original to the copy strand. The 5hmC is protected by glycosylation and not copied. b, Overview of the resolution rules and states under the six-letter decoding model. A protected C on the original strand, signifying modC, is denoted by pink in the diagram and table; a G opposite a protected C on the copy strand is denoted by light blue. The 5mC is denoted by a protected (pink) C followed by a protected (blue) G and 5hmC is denoted by a protected (pink) C followed by an unprotected (black) G. c, Call-rate matrix, which contains the rate at which six-letter seq calls unmodC, 5mC and 5hmC when the true state is unmodC, 5mC and 5hmC. This is estimated from ground-truth control sequences for which the modification status of each CpG is known. We calculate the rate at which six-letter seq calls unmodC, 5mC and 5hmC at unmodCpGs on a fully unmethylated pUC19 (first column), the rate at which six-letter seq calls unmodC, 5mC and 5hmC at 5mCpGs on a fully methylated lambda genome (second column), and the rate at which six-letter seq calls unmodC, 5mC and 5hmC at 5hmCpGs on a synthetic oligonucleotide (third column).
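Panels b and c lend themselves to a small sketch: classify each CpG from the protection status of the C on the original strand and of the C opposite the following G on the copy strand, then tabulate call rates per ground-truth control. The function names and the boolean flags are hypothetical; they abstract away the read-level decoding, which the caption does not spell out.

```python
from collections import Counter

STATES = ("unmodC", "5mC", "5hmC")

def call_cpg(c_protected: bool, copy_protected: bool) -> str:
    """Six-letter call for one CpG.

    c_protected:    the C on the original strand resisted deamination.
    copy_protected: the following G sits opposite a protected C on the copy strand.
    Simplification: an unprotected C is always called unmodC, even if the
    copy-strand flag is set (an inconsistent state this sketch does not flag).
    """
    if not c_protected:
        return "unmodC"
    # The methyl-copy step transfers 5mC to the copy strand; 5hmC is not copied.
    return "5mC" if copy_protected else "5hmC"

def call_rate_matrix(observations):
    """observations: {true_state: [(c_protected, copy_protected), ...]}.

    Returns {true_state: {called_state: rate}}; the rates for each true
    state (one column of the matrix) sum to 1.
    """
    matrix = {}
    for true_state, flags in observations.items():
        counts = Counter(call_cpg(c, g) for c, g in flags)
        total = sum(counts.values())
        if total == 0:
            continue  # no observations for this control
        matrix[true_state] = {s: counts[s] / total for s in STATES}
    return matrix
```

Each key of `observations` would correspond to one control in the caption, for example unmethylated pUC19 for `"unmodC"`.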
