Nat Biotechnol. 2023 Oct;41(10):1457-1464.

doi: 10.1038/s41587-022-01652-0. Epub 2023 Feb 6.

Simultaneous sequencing of genetic and epigenetic bases in DNA

Jens Füllgrabe^#¹, Walraj S Gosal^#¹, Páidí Creed¹, Sidong Liu¹, Casper K Lumby¹, David J Morley¹, Tobias W B Ost¹, Albert J Vilella¹, Shirong Yu¹, Helen Bignell¹, Philippa Burns¹, Tom Charlesworth¹, Beiyuan Fu¹, Howerd Fordham¹, Nicolas J Harding¹, Olga Gandelman¹, Paula Golder¹, Christopher Hodson¹, Mengjie Li¹, Marjana Lila¹, Yang Liu¹, Joanne Mason¹, Jason Mellad¹, Jack M Monahan¹, Oliver Nentwich¹, Alexandra Palmer¹, Michael Steward¹, Minna Taipale¹, Audrey Vandomme¹, Rita Santo San-Bento¹, Ankita Singhal¹, Julia Vivian¹, Natalia Wójtowicz¹, Nathan Williams¹, Nicolas J Walker¹, Nicola C H Wong¹, Gary N Yalloway¹, Joanna D Holbrook², Shankar Balasubramanian³ ⁴

Affiliations

¹ Cambridge Epigenetix Ltd, The Trinity Building, Chesterford Research Park, Cambridge, UK.
² Cambridge Epigenetix Ltd, The Trinity Building, Chesterford Research Park, Cambridge, UK. Joanna.holbrook@cegx.co.uk.
³ Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK. sb10031@cam.ac.uk.
⁴ Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK. sb10031@cam.ac.uk.

^# Contributed equally.

PMID: 36747096
PMCID: PMC10567558
DOI: 10.1038/s41587-022-01652-0

Jens Füllgrabe et al. Nat Biotechnol. 2023 Oct.

Abstract

DNA comprises molecular information stored in genetic and epigenetic bases, both of which are vital to our understanding of biology. Most DNA sequencing approaches address either genetics or epigenetics and thus capture incomplete information. Methods widely used to detect epigenetic DNA bases fail to capture common C-to-T mutations or distinguish 5-methylcytosine from 5-hydroxymethylcytosine. We present a single base-resolution sequencing methodology that sequences complete genetics and the two most common cytosine modifications in a single workflow. DNA is copied and bases are enzymatically converted. Coupled decoding of bases across the original and copy strand provides a phased digital readout. Methods are demonstrated on human genomic DNA and cell-free DNA from a blood sample of a patient with cancer. The approach is accurate, requires low DNA input and has a simple workflow and analysis pipeline. Simultaneous, phased reading of genetic and epigenetic bases provides a more complete picture of the information stored in genomes and has applications throughout biomedicine.

Conflict of interest statement

Competing interests: S.B. is a founder, adviser and shareholder of Cambridge Epigenetix and of Inflex. All the other authors are current or former employees and hold share options. Patents covering this work and the methodologies described in this manuscript have been filed by Cambridge Epigenetix (patent applicant); the inventors are S.B., J.F., W.S.G., J.D.H., S.L., D.M., O.N., T.O., M.S., A.V., N.J.W., S.Y., H.R.B. and R.S.S.-B. The application numbers are WO2022023753A1 (published), US20220298551A1 (pending), US20220290215A1 (issued), EP4034676A1 (pending) and EP4083231A1 (pending).

Figures

**Fig. 1. Five-letter seq.**
a, Double-stranded DNA with base modifications. b, Traditional genetic sequencing only captures four states of information, which makes it impossible to determine genetic and epigenetic information. Base conversions can alter the information output, but the approach is inherently limited by only having four output states. c, Two-base coding results in 4² = 16 possible states enabling simultaneous determination of epigenetic and genetic states. d, Laboratory workflow. Hairpins are ligated to double-stranded DNA and the strands are separated. The 3′–5′ strand is omitted for clarity, but follows a similar procedure to the 5′–3′ strand. An additional copy strand is synthesized using Klenow exo⁻ polymerase and short sequencing adapters are ligated. ModCs are protected through oxidation by TET2 and glycosylation by beta-glucosyltransferase (BGT). Treatment by APOBEC3A and UvrD helicase is used to simultaneously open up and deaminate the hairpin. Unprotected Cs are deaminated from C to U (read as T). e, Sequencing protocol. The deaminated DNA libraries are PCR amplified and indexes are added. Templates are paired-end sequenced. The two reads represent the same stretch of DNA and are locally aligned. Using a set of resolution rules, the pairs of bases across the two reads are resolved into one of five states: A, C, modC, G, T. The method is able to identify errors occurring during PCR and sequencing. f, Overview of the resolution rules and states under the five-letter decoding model. modC is denoted in pink in the diagram and is coded for by the pair CG.
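
The two-base decoding in panels c, e and f can be illustrated with a minimal sketch. Only the pair CG → modC is stated in the caption; the other four pairs in `RESOLUTION_RULES` are an assumption, inferred from the chemistry in panel d (unprotected C on either strand is read as T, and the copy strand is taken to carry only unmodified bases). The function names and the convention of writing each pair as (original-strand base, copy-strand base) are hypothetical, not the authors' pipeline.

```python
# Assumed pair -> state table. Each key is
# (base read from the original strand, base read from the copy strand),
# both after deamination. Only ("C", "G") -> "modC" is given in the caption;
# the remaining entries are an illustrative reconstruction.
RESOLUTION_RULES = {
    ("A", "T"): "A",
    ("T", "A"): "T",
    ("G", "T"): "G",     # copy-strand C opposite G is unprotected, so it reads as T
    ("T", "G"): "C",     # unprotected C on the original strand reads as T
    ("C", "G"): "modC",  # protected C survives deamination and still reads as C
}

# Two reads with four possible bases each give 4 ** 2 = 16 pair states.
ALL_PAIRS = [(o, c) for o in "ACGT" for c in "ACGT"]


def resolve_pair(original_base, copy_base):
    """Resolve one aligned base pair; pairs outside the table are flagged."""
    return RESOLUTION_RULES.get((original_base, copy_base), "error")


def resolve_reads(original_read, copy_read):
    """Resolve two locally aligned reads of equal length, position by position."""
    if len(original_read) != len(copy_read):
        raise ValueError("reads must be aligned to the same length")
    return [resolve_pair(o, c) for o, c in zip(original_read, copy_read)]


if __name__ == "__main__":
    # Hypothetical template A, T, G, unmodified C, modC.
    print(resolve_reads("ATGTC", "TATGG"))
```

Under these assumed rules only 5 of the 16 pair states are valid, so any other pair can be treated as a PCR or sequencing error rather than silently miscalled.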

**Fig. 2. Performance of five-letter seq on genomic DNA.**
a, Top, five-letter seq (blue), WGBS (orange) and EM-seq (green) average modC levels across all autosomes in NA12878 at CpGs in two samples per technology (n = 6) (left) and non-CpG contexts; the four datapoints correspond to CHH in both samples and CHG in both samples per technology (n = 12) (right). Bottom, sensitivity and specificity of modC calling in five-letter seq, as computed on spike-in ground-truth control sequences for both samples (n = 6). b, Correlation heatmap showing high levels of agreement with WGBS (Pearson’s R 0.94, P < 10⁻⁸). Counts were pooled across duplicate samples for both WGBS and five-letter seq and the comparison was limited to sites that were covered at least three times in both methods (26,067,695 sites or 94.24% of all CpGs). c, Bland–Altman plot, with the average of the modC levels between the two methods on the x axis and the difference on the y axis (median difference of −2.6% with 95% of CpGs differing by between −33% and 23%, indicated by solid and dashed red lines, respectively). Counts were pooled across duplicate samples for both WGBS and five-letter seq and the comparison was limited to sites that were covered at least three times in both methods (26,067,695 sites or 94.24% of all CpGs). d, Genetic accuracy as calculated on NA12878 high-confidence regions for five-letter seq (blue), WGBS (orange), EM-seq (green) and standard Illumina sequencing (red). e, Precision and sensitivity of variant calling (SNPs and indels) on the y axis, using different quantities of five-letter seq reads on the x axis, pooled across duplicates. f, Manhattan plot of allele-specific methylation in NA12878. The x axis is chromosomal location and y axis is −log10(p) from Fisher’s exact test of association between genotype and in cis modC levels. *PLAGL1*, a known imprinted gene, is highlighted in red. g, Integrative Genomics Viewer (IGV) plot of a 92-nt region of the *PLAGL1* gene centered on a C/T heterozygous SNP. Reads are grouped by the base observed at the variant site and forward and reverse mapping reads are shown in gray and green respectively. ModCs in CpG sites are highlighted in red, with the modification being associated with the G base for reverse reads. Reads exhibiting the (reference) C allele are entirely methylated at CpG sites, whereas reads harboring the T allele are entirely unmethylated.
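
The allele-specific test behind panel f can be sketched as a Fisher's exact test on a 2 × 2 table of allele (C or T at the SNP) against modification state (modC or unmodified C at a nearby CpG on the same read). This is a plain standard-library implementation for illustration, not the authors' code; the function name and the read counts in the usage line are hypothetical.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two alleles, columns are modified / unmodified read counts.
    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x modified reads on the first allele.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts in the spirit of panel g: every C-allele read is
    # methylated and every T-allele read is unmethylated.
    p = fisher_exact_two_sided(10, 0, 0, 10)
    print(p, -log10(p))  # -log10(p) is the y-axis quantity in panel f
```

Because both the allele and the modification state come from the same read, the table can be filled directly from phased reads without a separate genotyping experiment.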

**Fig. 3. Application of five-letter seq to cfDNA.**
a, Proportion of reads that are PCR and cluster duplicates (y axis) at input of 2 ng or 10 ng of cfDNA or 80 ng of gDNA. b, Proportion of the genome covered with at least one read (y axis) at input of 2 ng or 10 ng of cfDNA or 80 ng of gDNA. c, Sensitivity and specificity of modC detection are unaffected by input amount. Spike-in ground-truth control DNAs were added at 0.5 ng for the gDNA samples and 0.05 ng for the cfDNA samples; sensitivity on methylated lambda DNA is shown in blue and specificity on unmethylated pUC19 DNA in orange.
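
As a rough illustration of how the two metrics in panel c relate to the spike-in controls, the sketch below computes sensitivity from calls on a fully methylated control and specificity from calls on a fully unmethylated control. The function names, the call labels `"modC"` and `"C"`, and the example counts are assumptions for illustration only.

```python
def sensitivity(calls_on_methylated_control):
    """Fraction of cytosines on a fully methylated control called modC."""
    calls = list(calls_on_methylated_control)
    if not calls:
        raise ValueError("no calls supplied")
    return sum(call == "modC" for call in calls) / len(calls)


def specificity(calls_on_unmethylated_control):
    """Fraction of cytosines on a fully unmethylated control called C."""
    calls = list(calls_on_unmethylated_control)
    if not calls:
        raise ValueError("no calls supplied")
    return sum(call == "C" for call in calls) / len(calls)


if __name__ == "__main__":
    # Hypothetical call lists, not data from the paper.
    lambda_calls = ["modC"] * 98 + ["C"] * 2   # methylated lambda control
    puc19_calls = ["C"] * 99 + ["modC"] * 1    # unmethylated pUC19 control
    print(sensitivity(lambda_calls), specificity(puc19_calls))
```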

**Fig. 4. Six-letter seq.**
a, Schematic of six-letter epigenetic sequencing protocol. A similar protocol to that of five-letter seq, described in Fig. 1d, is followed with the addition of a methyl-copy step where DNMT5 copies the 5mC from the original to the copy strand. The 5hmC is protected by glycosylation and not copied. b, Overview of the resolution rules and states under the six-letter decoding model. A protected C on the original strand, signifying modC, is denoted by pink in the diagram and table; a G opposite a protected C on the copy strand is denoted by light blue. The 5mC is denoted by a protected (pink) C followed by a protected (blue) G and 5hmC is denoted by a protected (pink) C followed by an unprotected (black) G. c, Call-rate matrix, which contains the rate at which six-letter seq calls unmodC, 5mC and 5hmC when the true state is unmodC, 5mC and 5hmC. This is estimated from ground-truth control sequences for which the modification status of each CpG is known. We calculate the rate at which six-letter seq calls unmodC, 5mC and 5hmC at unmodCpGs on a fully unmethylated pUC19 (first column), the rate at which six-letter seq calls unmodC, 5mC and 5hmC at 5mCpGs on a fully methylated lambda genome (second column), and the rate at which six-letter seq calls unmodC, 5mC and 5hmC at 5hmCpGs on a synthetic oligonucleotide (third column).
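
The CpG classification in panel b and the call-rate matrix in panel c can be sketched as follows. The boolean "protected" flags are a simplification of the resolution rules, and the function names, state labels and example counts are hypothetical; the matrix is normalised per true state, matching the column-wise description in the caption.

```python
from collections import Counter

STATES = ("unmodC", "5mC", "5hmC")


def classify_cpg(c_protected, g_protected):
    """Classify a CpG from the protection status of C and of the following G.

    Protected C + protected G -> 5mC; protected C + unprotected G -> 5hmC;
    an unprotected C is treated as unmodified regardless of the G.
    """
    if not c_protected:
        return "unmodC"
    return "5mC" if g_protected else "5hmC"


def call_rate_matrix(observations):
    """Build matrix[true_state][called_state] = rate from (true, called) pairs.

    Rates are normalised within each true state, so each true-state column
    of the matrix sums to 1 when that state was observed at least once.
    """
    counts = Counter(observations)
    matrix = {}
    for true_state in STATES:
        total = sum(counts[(true_state, called)] for called in STATES)
        matrix[true_state] = {
            called: counts[(true_state, called)] / total if total else float("nan")
            for called in STATES
        }
    return matrix


if __name__ == "__main__":
    # Hypothetical ground-truth observations, not data from the paper.
    obs = [("5mC", "5mC")] * 9 + [("5mC", "unmodC")] + [("5hmC", "5hmC")] * 4
    print(call_rate_matrix(obs)["5mC"])
```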

Comment in

Koch L. Simultaneous sequencing of genome and epigenome. Nat Rev Genet. 2023 Apr;24(4):208. doi: 10.1038/s41576-023-00589-7. PMID: 36829055.
Tang L. Combined sequencing of genomes and epigenomes. Nat Methods. 2023 Apr;20(4):482. doi: 10.1038/s41592-023-01856-5. PMID: 37046016.
George JM, Chinnaiyan AM. Speed reading the epigenome and genome. Nat Biotechnol. 2023 Oct;41(10):1392-1393. doi: 10.1038/s41587-023-01757-0. PMID: 37085619.
Esain-Garcia I. Deciphering the cancer genome and epigenome. Nat Rev Cancer. 2023 Aug;23(8):509. doi: 10.1038/s41568-023-00590-6. PMID: 37286894.
