Sci Rep. 2022 Nov 1;12(1):17257.

doi: 10.1038/s41598-022-21606-5.

DNA read count calibration for single-molecule, long-read sequencing

Luis M M Soares¹, Terrence Hanscom¹, Donald E Selby¹, Samuel Adjei¹, Wei Wang¹, Dariusz Przybylski¹, John F Thompson²

Affiliations

¹ Genomics and Computational Biology, Homology Medicines Inc, Bedford, MA, USA.
² Genomics and Computational Biology, Homology Medicines Inc, Bedford, MA, USA. Thompson.john.f@gmail.com.

PMID: 36319642
PMCID: PMC9626564
DOI: 10.1038/s41598-022-21606-5

Luis M M Soares et al. Sci Rep. 2022.

Abstract

There are many applications in which quantitative information about DNA mixtures with different molecular lengths is important. Gene therapy vectors are much longer than can be sequenced individually via short-read NGS. However, vector preparations may contain smaller DNAs that behave differently during sequencing. We have used two library preparations each for Pacific Biosciences (PacBio) and Oxford Nanopore Technologies NGS to determine their suitability for quantitative assessment of DNAs of varying size. Equimolar length standards were generated from E. coli genomic DNA. Both PacBio library preparations provided a consistent length dependence, though with a complex pattern. This method is sufficiently sensitive that differences in genomic copy number between DNA from E. coli grown in exponential and stationary phase conditions could be detected. The transposase-based Oxford Nanopore library preparation provided a predictable length dependence, but the random sequence starts caused the loss of original length information. The ligation-based approach retained length information, but read frequency was more variable. Modeling of E. coli versus lambda read frequency via cubic spline smoothing showed that the shorter genome could be used as a suitable internal spike-in for DNAs in the 200 bp to 10 kb range, allowing meaningful QC to be carried out with AAV preparations.

Conflict of interest statement

All authors are current or former employees of Homology Medicines Inc.

Figures

**Figure 1**
Observed sizes versus expected sizes for XmnI digests. Expected versus observed lengths of XmnI-cut *E. coli* stationary phase DNA. The expected length of XmnI fragments prepared using PacBio library method 2.1 derived from the *E. coli* genome sequence is plotted on the X-axis versus the mean (A) or median (B) size of observed fragments from stationary phase DNA. The mean and median for the same DNA prepared using the Oxford Nanopore ligation method (C and D) and the transposase method (E and F) are also shown.

**Figure 2**
Coverage of long fragments. The relative coverage of individual DNA fragments is shown for *E. coli* (blue) and lambda (orange). Each line represents an individual PvuII fragment from either *E. coli* (blue) or lambda (orange). With PvuII, there were no lambda fragments with a size of 10–15 kb while the 30 such *E. coli* fragments are each shown individually in blue. Coverage across the length of each fragment is normalized to the maximum for the given DNA and then plotted versus the normalized length. The dip in coverage in the middle of the fragments is most extreme for long *E. coli* fragments.

**Figure 3**
Read frequency as a function of DNA length. The number of reads for each DNA was normalized to the total number of reads for the whole library and then plotted as a function of predicted DNA length. For each enzyme, the short library (protocol 2.1) is shown with red squares and the long library (protocol 2.0) with blue circles. The expected read count for each library if fragment recovery and sequencing were perfect for all fragments is shown by the dashed line in each panel.

**Figure 4**
Impact of DNA size on CCS read frequency. The ratio of reads with CCS cycle set to 1 versus 3 is shown as a function of DNA size. At sizes of 1000 bp and below, there is little or no effect. The impact grows progressively larger with longer DNA sizes.

**Figure 5**
Ratio of normalized reads in exponential vs. stationary phase *E. coli* genomes. Reads from three digests were normalized within the exponential and stationary phase DNA samples. The ratio of exponential/stationary phase reads for each fragment as a function of genomic position is shown in panel (A). The results from all three digests were then combined and the ratios averaged over bins of 100,000 bp as shown in panel (B).
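
The binning step described for panel (B) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `binned_mean_ratio`, the tuple layout of the input, and the choice to skip fragments with no stationary phase reads are all assumptions made for the example.

```python
from collections import defaultdict


def binned_mean_ratio(fragments, bin_size=100_000):
    """Average exponential/stationary read ratios over fixed genomic bins.

    fragments: iterable of (genomic_position_bp, exp_norm, stat_norm), where
    the two read values are already normalized within their own sample.
    Returns {bin_start_bp: mean_ratio}. (Hypothetical helper for illustration.)
    """
    ratio_sums = defaultdict(float)
    ratio_counts = defaultdict(int)
    for position_bp, exp_norm, stat_norm in fragments:
        if stat_norm <= 0:
            # Assumption: fragments without stationary phase reads are skipped
            # because the ratio is undefined.
            continue
        bin_index = position_bp // bin_size
        ratio_sums[bin_index] += exp_norm / stat_norm
        ratio_counts[bin_index] += 1
    return {
        bin_index * bin_size: ratio_sums[bin_index] / ratio_counts[bin_index]
        for bin_index in sorted(ratio_sums)
    }


# Made-up values, pooled from all digests as the caption describes.
example_fragments = [
    (10_000, 3.0, 1.0),
    (90_000, 2.0, 2.0),
    (150_000, 1.0, 2.0),
    (250_000, 1.0, 0.0),  # skipped: no stationary phase reads
]
binned = binned_mean_ratio(example_fragments)
```

Averaging per-fragment ratios inside each bin is one reasonable reading of the caption; pooling reads per bin before taking the ratio would be an alternative.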

**Figure 6**
Oxford Nanopore reads using the transposon-based fast library preparation. XmnI-cut *E. coli* DNA was prepared for sequencing using the standard fast library preparation method. The number of RPM is plotted as a function of predicted DNA length (A). The same data set was converted to RKPM to adjust for length and plotted again (B) but on a log scale.
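
The two normalizations in this caption can be sketched as below, assuming "RPM" means reads per million library reads and "RKPM" means the usual reads per kilobase per million. The function names and the sample fragment data are hypothetical.

```python
def reads_per_million(read_counts):
    """Scale each fragment's read count to reads per million library reads."""
    total_reads = sum(read_counts.values())
    return {
        name: count * 1_000_000 / total_reads
        for name, count in read_counts.items()
    }


def reads_per_kilobase_per_million(read_counts, lengths_bp):
    """Divide RPM by fragment length in kilobases to adjust for length."""
    rpm = reads_per_million(read_counts)
    return {name: rpm[name] / (lengths_bp[name] / 1000) for name in read_counts}


# Made-up counts and predicted lengths for two fragments.
example_counts = {"frag_a": 200, "frag_b": 800}
example_lengths_bp = {"frag_a": 500, "frag_b": 4000}

example_rpm = reads_per_million(example_counts)
example_rkpm = reads_per_kilobase_per_million(example_counts, example_lengths_bp)
```

In this toy example the longer fragment has more reads per million but fewer reads per kilobase, which is the kind of length effect the second panel is meant to expose.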

**Figure 7**
Read frequency versus length with Oxford Nanopore ligation-based library preparation. The normalized frequencies of all DNAs cut with XmnI (A), PshI (B), AleI (C), and PvuII (D) are shown as a function of length when the DNA is prepared using the ligation-based method for Oxford Nanopore. The coefficient of variation (E) for frequency for each digest as a function of length using 500 bp bins is also shown for each digest with the PvuII digest shown with larger symbols to distinguish more easily from the digests with variable end sequences.
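
A per-bin coefficient of variation like the one in panel (E) might be computed as in this sketch. The bin edges (floor of length divided by 500), the use of the sample standard deviation, and the rule of skipping bins with fewer than two fragments are assumptions; the caption does not specify them.

```python
from collections import defaultdict
from statistics import mean, stdev


def cv_by_length_bin(fragments, bin_width=500):
    """Coefficient of variation of read frequency within fixed length bins.

    fragments: iterable of (length_bp, normalized_read_frequency).
    Returns {bin_start_bp: stdev / mean} for bins holding at least two
    fragments and a non-zero mean. (Hypothetical helper for illustration.)
    """
    bins = defaultdict(list)
    for length_bp, frequency in fragments:
        bins[(length_bp // bin_width) * bin_width].append(frequency)

    cv = {}
    for bin_start, frequencies in sorted(bins.items()):
        if len(frequencies) < 2:
            continue  # a sample standard deviation needs two or more values
        average = mean(frequencies)
        if average == 0:
            continue  # CV is undefined when the mean is zero
        cv[bin_start] = stdev(frequencies) / average
    return cv


# Made-up fragments: (length in bp, normalized read frequency).
example_digest = [(600, 1.0), (700, 3.0), (1200, 2.0), (1300, 2.0), (2100, 5.0)]
example_cv = cv_by_length_bin(example_digest)
```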

**Figure 8**
Read frequency as a function of length and 3’ terminal bases with Oxford Nanopore ligation-based library preparation. The average read count (A) and coefficient of variation (B) were calculated for all XmnI *E. coli* fragments from 400 to 10,000 bp. Length bins were selected to provide enough examples in each bin for all terminal base combinations. All points included > 5 fragments except for the 4001–5000 bp bin with CC termini which was omitted from the plots. All fragments with the same pair of 3’ ends were combined for analysis.

**Figure 9**
Fitting of DNA length versus read frequency. The read frequency of DNA fragments was fit using linear regression and cubic spline regression with *E. coli* DNA read counts as a data source (A). *E. coli* and lambda DNAs were mixed, cut with PvuII, and prepared using the 2.1 PacBio protocol. Because the linear regression methods did not model the *E. coli* DNA well, only cubic smoothing spline regression was used for lambda DNA. Lambda DNA was then used as the data source (B) and the spline regression generated with lambda data only was compared to the experimentally observed data for *E. coli* lengths and read counts. The darker blue line provides the best fit for both the modelled lambda and the experimental *E. coli* data.
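
The idea of using spike-in fragments to correct read counts for length can be sketched as follows. Note the substitution: this example uses simple piecewise-linear interpolation between spike-in points as a stand-in for the cubic smoothing spline regression described in the caption, so that it runs with the Python standard library alone. All names and numbers are hypothetical.

```python
from bisect import bisect_left


def make_interpolator(calibration):
    """Build a length -> expected read frequency function from spike-in data.

    calibration: list of (length_bp, normalized_read_frequency) pairs with
    distinct lengths, e.g. from lambda fragments. Piecewise-linear stand-in
    for a smoothing spline; it does not smooth noise in the calibration data.
    """
    points = sorted(calibration)
    lengths = [p[0] for p in points]
    frequencies = [p[1] for p in points]

    def expected_frequency(length_bp):
        if not lengths[0] <= length_bp <= lengths[-1]:
            # Refuse to extrapolate beyond the calibrated length range.
            raise ValueError("length outside calibrated range")
        i = bisect_left(lengths, length_bp)
        if lengths[i] == length_bp:
            return frequencies[i]
        x0, x1 = lengths[i - 1], lengths[i]
        y0, y1 = frequencies[i - 1], frequencies[i]
        return y0 + (y1 - y0) * (length_bp - x0) / (x1 - x0)

    return expected_frequency


def length_corrected_count(observed_reads, length_bp, expected_frequency):
    """Divide observed reads by the relative frequency expected at that length."""
    return observed_reads / expected_frequency(length_bp)


# Made-up spike-in calibration points (length in bp, relative read frequency).
spike_in = [(1000, 2.0), (3000, 1.0), (200, 1.0)]
expected = make_interpolator(spike_in)
corrected = length_corrected_count(300, 2000, expected)
```

Raising an error outside the calibrated range is a deliberate choice here: the abstract reports the spike-in as suitable only within a stated length range, so extrapolated corrections would not be supported.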
