Sci Rep. 2022 Nov 1;12(1):17257.
doi: 10.1038/s41598-022-21606-5.

DNA read count calibration for single-molecule, long-read sequencing

Luis M M Soares et al. Sci Rep.

Abstract

There are many applications in which quantitative information about DNA mixtures with different molecular lengths is important. Gene therapy vectors are much longer than can be sequenced individually via short-read NGS. However, vector preparations may contain smaller DNAs that behave differently during sequencing. We have used two library preparations each for Pacific Biosciences (PacBio) and Oxford Nanopore Technologies NGS to determine their suitability for quantitative assessment of DNAs of varying sizes. Equimolar length standards were generated from E. coli genomic DNA. Both PacBio library preparations provided a consistent length dependence, though with a complex pattern. This method is sufficiently sensitive that differences in genomic copy number between DNA from E. coli grown in exponential and stationary phase conditions could be detected. The transposase-based Oxford Nanopore library preparation provided a predictable length dependence, but the random sequence starts caused the loss of original length information. The ligation-based approach retained length information, but read frequency was more variable. Modeling of E. coli versus lambda read frequency via cubic spline smoothing showed that the shorter genome could be used as a suitable internal spike-in for DNAs in the 200 bp to 10 kb range, allowing meaningful QC to be carried out with AAV preparations.

Conflict of interest statement

All authors are current or former employees of Homology Medicines Inc.

Figures

Figure 1
Observed sizes versus expected sizes for XmnI digests. Expected versus observed lengths of XmnI-cut E. coli stationary phase DNA. The expected length of XmnI fragments prepared using PacBio library method 2.1 derived from the E. coli genome sequence is plotted on the X-axis versus the mean (A) or median (B) size of observed fragments from stationary phase DNA. The mean and median for the same DNA prepared using the Oxford Nanopore ligation method (C and D) and the transposase method (E and F) are also shown.
Figure 2
Coverage of long fragments. The relative coverage of individual DNA fragments is shown for E. coli (blue) and lambda (orange). Each line represents an individual PvuII fragment from either E. coli (blue) or lambda (orange). With PvuII, there were no lambda fragments with a size of 10–15 kb while the 30 such E. coli fragments are each shown individually in blue. Coverage across the length of each fragment is normalized to the maximum for the given DNA and then plotted versus the normalized length. The dip in coverage in the middle of the fragments is most extreme for long E. coli fragments.
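The normalization described in this caption can be illustrated with a minimal Python sketch. It assumes that "the maximum for the given DNA" means the peak coverage of each individual fragment, and that coverage is available as a list of per-position depths; the function name `normalize_coverage` is hypothetical and not taken from the paper.

```python
def normalize_coverage(coverage):
    """Return (normalized position, normalized coverage) pairs for one fragment.

    Assumption: coverage is a list of per-position read depths for a single
    fragment, and normalization is to that fragment's own maximum.
    """
    if len(coverage) < 2:
        raise ValueError("need at least two positions to normalize length")
    peak = max(coverage)
    if peak <= 0:
        raise ValueError("fragment has no coverage")
    last = len(coverage) - 1
    # Position runs from 0.0 (fragment start) to 1.0 (fragment end).
    return [(i / last, depth / peak) for i, depth in enumerate(coverage)]


# Made-up depths with a dip in the middle of the fragment.
profile = normalize_coverage([10, 5, 10])
```

Plotting such profiles for many fragments on shared 0 to 1 axes is what allows fragments of different lengths to be overlaid.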
Figure 3
Read frequency as a function of DNA length. The number of reads for each DNA was normalized to the total number of reads for the whole library and then plotted as a function of predicted DNA length. For each enzyme, the short library (protocol 2.1) is shown with red squares and the long library (protocol 2.0) with blue circles. The expected read count for each library if fragment recovery and sequencing were perfect for all fragments is shown by the dashed line in each panel.
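As a rough illustration of the normalization in this caption, the sketch below divides each fragment's read count by the library total. It also assumes that, for an equimolar digest of a single genome, the idealized expectation is the same fraction for every fragment (1/N for N fragments); that reading of the dashed line is an assumption, and the function names are hypothetical.

```python
def normalized_read_fractions(read_counts):
    """Map fragment id -> fraction of all reads in the library."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("library contains no reads")
    return {frag: count / total for frag, count in read_counts.items()}


def equimolar_expectation(n_fragments):
    """Expected fraction per fragment if every fragment were recovered equally.

    Assumption: fragments are equimolar, so length does not enter.
    """
    return 1.0 / n_fragments


# Made-up counts for three fragments.
counts = {"frag_a": 50, "frag_b": 30, "frag_c": 20}
fractions = normalized_read_fractions(counts)
expected = equimolar_expectation(len(counts))
```

Fragments whose fraction falls above or below `expected` are over- or under-represented relative to the idealized case.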
Figure 4
Impact of DNA size on CCS read frequency. The ratio of reads with CCS cycle set to 1 versus 3 is shown as a function of DNA size. At sizes of 1000 bp and below, there is little or no effect. The impact grows progressively larger with longer DNA sizes.
Figure 5
Ratio of normalized reads in exponential vs. stationary phase E. coli genomes. Reads from three digests were normalized within the exponential and stationary phase DNA samples. The ratio of exponential/stationary phase reads for each fragment as a function of genomic position is shown in panel (A). The results from all three digests were then combined and the ratios averaged over bins of 100,000 bp as shown in panel (B).
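The binning step in panel (B) can be sketched as follows. This is a minimal Python illustration, assuming each fragment is described by a genomic position plus its normalized read values in the two samples; the function name `binned_mean_ratio` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def binned_mean_ratio(fragments, bin_size=100_000):
    """Average exponential/stationary ratios over fixed-width genomic bins.

    fragments: iterable of (position_bp, exponential_norm, stationary_norm).
    Fragments with no stationary-phase reads are skipped to avoid dividing
    by zero (an assumption; the source does not say how these were handled).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for position, exp_norm, stat_norm in fragments:
        if stat_norm <= 0:
            continue
        bin_index = position // bin_size
        sums[bin_index] += exp_norm / stat_norm
        counts[bin_index] += 1
    # Key each bin by its start coordinate in bp.
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}


# Made-up values: two fragments in the first bin, one in the second.
example = [(10_000, 2.0, 1.0), (90_000, 4.0, 2.0), (150_000, 3.0, 3.0)]
ratios = binned_mean_ratio(example)
```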
Figure 6
Oxford Nanopore reads using the transposon-based fast library preparation. XmnI-cut E. coli DNA was prepared for sequencing using the standard fast library preparation method. The number of RPM is plotted as a function of predicted DNA length (A). The same data set was converted to RKPM to adjust for length and plotted again (B) but on a log scale.
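For readers unfamiliar with the two units, here is a small Python sketch. It assumes RPM means reads per million total reads and RKPM means reads per kilobase of fragment length per million total reads; the caption does not spell out either abbreviation, and the function names are hypothetical.

```python
def rpm(read_count, total_reads):
    """Reads per million total reads in the library."""
    return read_count * 1_000_000 / total_reads


def rkpm(read_count, total_reads, length_bp):
    """Reads per kilobase per million: RPM divided by fragment length in kb."""
    return rpm(read_count, total_reads) / (length_bp / 1_000)


# Made-up example: 500 reads on a 2,000 bp fragment in a 1,000,000-read library.
example_rpm = rpm(500, 1_000_000)
example_rkpm = rkpm(500, 1_000_000, 2_000)
```

Dividing by length is what compensates for a transposase-based preparation producing more read starts on longer fragments.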
Figure 7
Read frequency versus length with Oxford Nanopore ligation-based library preparation. The normalized frequencies of all DNAs cut with XmnI (A), PshI (B), AleI (C), and PvuII (D) are shown as a function of length when the DNA is prepared using the ligation-based method for Oxford Nanopore. The coefficient of variation (E) of frequency for each digest, calculated in 500 bp bins as a function of length, is also shown; the PvuII digest is plotted with larger symbols to distinguish it more easily from the digests with variable end sequences.
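The binned coefficient of variation in panel (E) might be computed along these lines. This is a sketch only: it assumes the sample standard deviation (the source does not state whether sample or population was used), skips bins with fewer than two fragments, and uses a hypothetical function name.

```python
from collections import defaultdict
from statistics import mean, stdev


def cv_by_length_bin(fragments, bin_size=500):
    """Coefficient of variation of read frequency per length bin.

    fragments: iterable of (length_bp, normalized_frequency).
    Returns {bin_start_bp: stdev / mean} for bins with at least two fragments.
    """
    bins = defaultdict(list)
    for length_bp, frequency in fragments:
        bin_start = (length_bp // bin_size) * bin_size
        bins[bin_start].append(frequency)

    result = {}
    for bin_start in sorted(bins):
        values = bins[bin_start]
        if len(values) < 2:
            continue  # stdev is undefined for a single value
        average = mean(values)
        if average == 0:
            continue  # CV is undefined when the mean is zero
        result[bin_start] = stdev(values) / average
    return result


# Made-up values: two fragments in the 500-999 bp bin, one in 1000-1499 bp.
cv = cv_by_length_bin([(600, 1.0), (700, 3.0), (1200, 5.0)])
```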
Figure 8
Read frequency as a function of length and 3’ terminal bases with Oxford Nanopore ligation-based library preparation. The average read count (A) and coefficient of variation (B) were calculated for all XmnI E. coli fragments from 400 to 10,000 bp. Length bins were selected to provide enough examples in each bin for all terminal base combinations. All points included > 5 fragments except for the 4001–5000 bp bin with CC termini which was omitted from the plots. All fragments with the same pair of 3’ ends were combined for analysis.
Figure 9
Fitting of DNA length versus read frequency. The read frequency of DNA fragments was fit using linear regression and cubic spline regression with E. coli DNA read counts as a data source (A). E. coli and lambda DNAs were mixed, cut with PvuII, and prepared using the 2.1 PacBio protocol. Because the linear regression methods did not model the E. coli DNA well, only cubic smoothing spline regression was used for lambda DNA. Lambda DNA was then used as the data source (B) and the spline regression generated with lambda data only was compared to the experimentally observed data for E. coli lengths and read counts. The darker blue line provides the best fit for both the modelled lambda and the experimental E. coli data.
