[Preprint]. 2023 Apr 5:2023.01.12.523790.
doi: 10.1101/2023.01.12.523790.

Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation


Mikhail Kolmogorov et al. bioRxiv.


Abstract

Long-read sequencing technologies substantially overcome the limitations of short reads, but to date they have not been considered a feasible replacement at scale because they have been too expensive, not scalable enough, or too error-prone. Here, we develop an efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short reads for large-scale genomics projects. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the NIH Center for Alzheimer's and Related Dementias (CARD). Using a single PromethION flow cell, we can detect SNPs with an F1-score better than that of Illumina short-read sequencing. Small indel calling remains difficult inside homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Further, we can discover structural variants with an F1-score comparable to state-of-the-art methods involving Pacific Biosciences HiFi sequencing and trio information (but at a lower cost and greater throughput). Using ONT-based phasing, we can then combine and phase small and structural variants at megabase scales. Our protocol also produces highly accurate, haplotype-specific methylation calls. Overall, this makes large-scale long-read sequencing projects feasible; the protocol is currently being used to sequence thousands of brain-based genomes as part of the NIH CARD initiative. We provide the protocol and software as open-source integrated pipelines for generating phased variant calls and assemblies.


Conflict of interest statement

Competing interests. K.S. is an employee of Google LLC and owns Alphabet stock as part of the standard compensation package; authors from Google LLC did not have access to the cell line and brain tissue sample data. W.T. has two patents (8,748,091 and 8,394,584) licensed to Oxford Nanopore Technologies. F.J.S. received research support from Illumina, Pacific Biosciences and Oxford Nanopore Technologies. S.W.S. serves on the Scientific Advisory Council of the Lewy Body Dementia Association and the Multiple System Atrophy Coalition. S.W.S. and B.J.T. receive research support from Cerevel Therapeutics. B.J.T. holds patents on the clinical testing and therapeutic implications of the C9orf72 repeat expansion.

Figures

Figure 1. Single flow cell Oxford Nanopore Technologies (ONT) sequencing protocol.
Left: Overview of the sequencing protocol, indicating all processes from DNA extraction to sequencing. In brief, the DNA is extracted on a Kingfisher Apex instrument with the Nanobind Tissue Big DNA kit. The DNA is then sheared on a Megaruptor3 instrument, and libraries are constructed using an SQK-LSK 110 kit and sequenced on a PromethION for 72 hours. Right (panels from left to right): Total sequenced bases and haploid human genome coverage (assuming a 3.1 Gb genome) from PASS reads (with estimated QV>=10) for each sample; the vertical dotted line marks the average yield across samples. Read length N50 of PASS reads, i.e., the read length such that reads of this length or longer represent 50% of the total sequence; the vertical dotted line marks the average N50 across samples. Distribution of PASS read identities when aligned to T2T-CHM13 v2.0; the dots mark the median identity in each sample, and the vertical dotted line is the average across samples.
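The two per-sample summaries defined in this caption, read length N50 and haploid coverage, can be illustrated with a minimal Python sketch. The function names and the toy read lengths are hypothetical and are not taken from the authors' pipeline; only the 3.1 Gb genome size comes from the caption.

```python
def n50(read_lengths):
    """Return the length L such that reads of length >= L hold at least 50% of all bases."""
    total = sum(read_lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until half the total is reached.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def haploid_coverage(total_bases, genome_size=3_100_000_000):
    """Total sequenced bases divided by the assumed haploid genome size (3.1 Gb)."""
    return total_bases / genome_size


toy_reads = [2, 3, 4, 5, 6]  # hypothetical read lengths, for illustration only
print(n50(toy_reads))
print(haploid_coverage(93_000_000_000))
```

In practice the input would be the lengths of PASS reads only, since the caption restricts both statistics to those reads.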
Figure 2. Small variant calling performance evaluation.
(Left panels) Number of false positive (light bars) and false negative (dark bars) calls made by PEPPER-Margin-DeepVariant (PMDV) using ONT reads and by DeepVariant using Illumina reads. Statistics are computed against the Genome in a Bottle v4.2.1 benchmark for HG002; for the other cell lines (HG00733, HG02723), calls generated by DeepVariant with HiFi reads are used as the reference. Whole-genome SNP counts are stratified by mappability in (A) and by local context in (B); INDEL counts are stratified by local context in (C). (Right panels) Number of true positive variant calls stratified by different genomic intervals. The F1-score is reported on top of each bar.
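The F1-scores reported in the figures follow from the true positive, false positive and false negative counts via the standard definitions of precision and recall. The sketch below uses a hypothetical helper name and made-up counts; it is not the benchmarking tool used in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from call counts; returns 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one stratum (not values from the paper).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```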
Figure 3. Assemblies of 14 brain tissues and 3 cell lines generated by Shasta+Hapdup.
(A) NG50 and NGA50 contiguity measured using QUAST. Sample 06_66 had the lowest contiguity due to its decreased sequencing yield. (B) Assembly length. (C) Mean assembly QV computed using yak. (D) Contiguity of phased blocks, broken at phase switches. An increased value for HG02723 suggests an increased heterozygosity rate. (A-D) Cell lines are marked with asterisks. (E) Structural variation call concordance with HiFi-based assemblies for various regions of the genome.
Figure 4. Structural variant evaluations using the GIAB HG002 benchmark.
(Top) Recall, precision and F1-scores computed for various tools and sequencing technologies with the Genome in a Bottle Tier1 v0.6 benchmark as reference (defined on HG002). (Bottom) F1-score computed for various SV size bins. The gray histogram shows the distribution of SV sizes in the reference set.
Figure 5. Combined, phased small and structural variants improve the profiling of complex genomic regions.
(Top) Variant phasing evaluation. The left plot shows the phased block NGx, reported by Margin. HG02723 has an increased phase block length due to higher heterozygosity. The right plots show SNP Hamming and switch error computed against the small variants in HiFi-based assemblies. Evaluations are also shown for a subset of SNPs that are within 100 bp of structural variants. (Bottom) An example of Hapdup and hifiasm representations of complex clusters with small and structural variants at chr1:55,544,500–55,551,000 (in the CHM13 reference), visualized using IGV. Top tracks show phased SNPs and SVs produced by our pipeline and derived from HPRC assemblies (using dipcall). A few inconsistencies between SNP positions are explained by ambiguities between read and contig alignments around SV sites.
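The two phasing error measures named in this caption can be sketched for a single phase block as follows. This uses the common textbook definitions and is an assumption about the general approach, not the authors' evaluation code; published tools differ in details such as how isolated flips are counted. Each haplotype is represented as a list of 0/1 alleles at the same heterozygous SNP sites.

```python
def switch_and_hamming(truth, pred):
    """Return (switch error rate, Hamming error rate) for one phase block.

    truth, pred: equal-length lists of 0/1 alleles for haplotype 1 at the
    heterozygous sites of the block (hypothetical input format).
    """
    if len(truth) != len(pred):
        raise ValueError("haplotypes must cover the same sites")
    n = len(truth)
    if n < 2:
        return 0.0, 0.0
    agree = [t == p for t, p in zip(truth, pred)]
    # A switch occurs wherever agreement with the truth flips between adjacent sites.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    # Hamming error: mismatches under the better of the two haplotype labelings.
    mismatches = agree.count(False)
    hamming = min(mismatches, n - mismatches) / n
    return switches / (n - 1), hamming


# One switch in the middle of a six-site block: low switch error, high Hamming error.
print(switch_and_hamming([0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1]))
```

The example shows why both measures are usually reported together: a single switch in the middle of a long block barely affects the switch rate but leaves half of the block on the wrong haplotype.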
Figure 6. Structural variant landscape summary.
(A) The number of structural variants across samples. In the left panel, structural variants were annotated with three SV catalogs (the gnomAD-SV database, a long-read-based SV catalog, and the HPRC v1.0 SV catalog). SVs are matched if they have at least 10% genomic overlap. SVs close to centromeres, telomeres, or within segmental duplications were removed. The colors highlight the maximum frequency across these catalogs, with the lighter blue showing SVs that are "rare" in the catalogs (allele frequency below 1%) or unmatched. SVs may be unmatched either because they are novel or because of difficulties in the database comparison. The right panel shows the number of rare structural variants in protein-coding genes, grouped by their impact on the gene structure. (B) MHC pangenome built from 28 brain and 6 cell line haplotypes, containing 640 nodes; SVs over 100 bp are shown. (C) IGH pangenome built from 28 brain haplotypes, containing 268 nodes. Cell lines were not included because they are typically derived from B-cell lymphocytes and contain extensive somatic rearrangements in this locus.
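The catalog annotation described in panel (A) can be illustrated with a minimal sketch. The caption does not say how the 10% overlap is normalized or whether SV type must agree; the code below assumes reciprocal overlap (overlap divided by the longer interval) on half-open intervals of positive length and ignores SV type. All names and the toy catalog are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Overlap length divided by the longer interval (assumed normalization)."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a_end - a_start, b_end - b_start)


def max_catalog_frequency(sv, catalog, min_overlap=0.10):
    """Highest allele frequency among matching catalog entries, or None if unmatched.

    sv: (chrom, start, end); catalog: iterable of (chrom, start, end, allele_frequency).
    """
    chrom, start, end = sv
    freqs = [
        af
        for c_chrom, c_start, c_end, af in catalog
        if c_chrom == chrom
        and reciprocal_overlap(start, end, c_start, c_end) >= min_overlap
    ]
    return max(freqs) if freqs else None


def is_rare(sv, catalog, af_threshold=0.01):
    """Rare = unmatched, or maximum catalog allele frequency below 1%."""
    af = max_catalog_frequency(sv, catalog)
    return af is None or af < af_threshold


# Toy catalog entries (hypothetical values, not from gnomAD-SV or HPRC).
catalog = [("chr1", 100, 200, 0.005), ("chr1", 150, 400, 0.2), ("chr2", 100, 200, 0.5)]
print(is_rare(("chr1", 100, 200), catalog))    # matches a common entry
print(is_rare(("chr1", 1000, 1100), catalog))  # unmatched
```

Insertions, which have no length on the reference, would need a separate rule (for example, comparing breakpoint distance and inserted length); that case is left out here.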
Figure 7. Haplotype-specific methylation profiling.
(A) Heatmap of concordance between bisulfite whole-genome sequencing and ONT Remora methylation calls in HG002 at sites shared by both technologies and covered by at least 5 reads. The lower coverage of the ONT data causes striping in the heatmap at specific frequencies. (B) Read depth of the bisulfite and ONT samples. The plot shows that ONT obtains the same methylation information as bisulfite sequencing, even though the bisulfite data has more than twice the coverage. CpG sites are one position apart in the sense and antisense DNA strands due to C-G base pairing. Since this read coverage is counted per CpG location, the actual coverage was doubled to account for the neighboring strand locations and to estimate the actual genome-wide coverage. (C) A positive control plot showing the expected differential methylation pattern at the SNRPN (Small Nuclear Ribonucleoprotein Polypeptide N) gene in phased ONT reads for brain sample SH-04–08. Red CpG sites are methylated and blue sites are unmethylated. Above the reads is a plot of methylation frequency and gene locations, visualized using modbamtools. (D) IGV visualization of phased methylated ONT reads and the phased assemblies of brain sample SH-04–08 at the DLGAP2 gene locus, showing a 1,379 base pair insertion that is differentially methylated across haplotypes.
