medRxiv [Preprint]. 2024 Mar 7:2024.03.05.24303792.
doi: 10.1101/2024.03.05.24303792.

Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation

Jonas A Gustafson et al. medRxiv.

Update in

  • High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation.
    Gustafson JA, Gibson SB, Damaraju N, Zalusky MPG, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson ZB, Storz SHR, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E; 1000 Genomes ONT Sequencing Consortium; University of Washington Center for Rare Disease Research (UW-CRDR); Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium; McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN Jr, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. Genome Res. 2024 Nov 20;34(11):2061-2073. doi: 10.1101/gr.279273.124. PMID: 39358015. Free PMC article.

Abstract

Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.

Keywords: 1000 Genomes Project; Nanopore sequencing; long-read sequencing; methylation; repeat expansions; structural variation.

Conflict of interest statement

WDC, ML, FS, and DEM have received research support and/or consumables from ONT. WDC, JG, FS, and DEM have received travel funding to speak on behalf of ONT. DEM is on a scientific advisory board at ONT. FS has received research support from Illumina, Genetech, and PacBio. SBM is an advisor to BioMarin, MyOme, and Tenaya Therapeutics. EEE is a scientific advisory board (SAB) member of Variant Bio, Inc. DEM holds stock options in MyOme.

Figures

Figure 1. Summary statistics of samples, sequencing and small variant detection.
A: Samples selected for sequencing are shown by superpopulation and sex. B: Violin plots showing average read length, read N50, and average depth of coverage for all 100 samples. C: DNA was extracted from cells grown from aliquots received from Coriell and sequenced using the R9.4.1 pore. Data was analyzed using both alignment- and assembly-based approaches. D: Comparison of precision, recall, and F1 scores for SNVs and indels called from ONT data (PMDV) or Illumina data (GATK) compared to GIAB or HPRC calls for 5 high-confidence samples genome-wide in GIAB high-confidence regions only (GIAB.HG002.mask.incl.HP) and when excluding homopolymers in the GIAB high-confidence regions (GIAB.HG002.mask.excl.HP). Homopolymers were defined as any sequence of four identical nucleotides or more, including one bp flanking each side of the sequence. E: Precision, recall, and F1 scores for SNVs and indels from chromosomes 1–22 called with PMDV in GIAB high-confidence regions (including homopolymers) and GIAB high-confidence regions when excluding homopolymers.
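
The homopolymer definition used in panel D (a run of four or more identical nucleotides, plus one flanking bp on each side) can be sketched as a small interval finder. This is only an illustration of the stated definition, not the authors' masking code; the function name `homopolymer_intervals` and its parameters are hypothetical.

```python
import re

def homopolymer_intervals(seq, min_run=4, flank=1):
    """Return 0-based, half-open (start, end) intervals covering each
    homopolymer run of at least `min_run` bases, padded by `flank` bp.

    Intervals from adjacent runs may overlap; they are not merged here.
    """
    # The back-reference \1 repeats the captured base at least min_run - 1 more times.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    intervals = []
    for match in pattern.finditer(seq.upper()):
        start = max(0, match.start() - flank)       # clip at sequence start
        end = min(len(seq), match.end() + flank)    # clip at sequence end
        intervals.append((start, end))
    return intervals

# The run TTTTT occupies positions 3-7; one flanking bp on each side gives (2, 9).
print(homopolymer_intervals("ACGTTTTTGCA"))  # [(2, 9)]
```

Intervals of this kind could then be subtracted from a high-confidence region set to obtain a homopolymer-excluded stratification similar in spirit to GIAB.HG002.mask.excl.HP.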
Figure 2. Summary of de novo assembly results.
A: Contig NG50 compared to total number of contigs for both assembly methods shows that the haploid assemblies generated by Flye are longer and have fewer contigs than Shasta-Hapdup, but no contigs generated by Flye exceed 40 Mbp. Assemblies for each benchmarking sample show similar statistics. B: Read N50 compared to assembly NG50 shows that assembly NG50 does not significantly improve with higher read N50. C: QV scores for both Flye and Shasta-Hapdup assemblies show slightly higher assembly QV scores for the haploid Flye assemblies. Values for the five benchmarking genomes are shown. D: Counts of contig breaks for all 100 samples on chromosome 7 demonstrate that assembly breaks cluster in similar locations when using both assembly approaches and that there are a large number of single breaks spread across the chromosome. The 1.5–1.8 Mbp Williams-Beuren syndrome critical region is indicated with a dashed box and is flanked by clusters of assembly breaks within segdups (Morris 1993). The positions of assembly breaks were categorized as “Satellite” (only satellite repeats), “SegDup+Satellite” (segdups and satellite repeats), “SegDup” (only segdups) or “Neither” (outside segdups and satellite repeat regions). E: Contig sizes filtered for contigs longer than 1 Mbp for each superpopulation. F: OMIM genes incompletely assembled in 50 or more samples using either Flye (orange) or Shasta-Hapdup (blue). For Shasta-Hapdup, if one haplotype was completely assembled in a sample but the other was incomplete, the gene is counted as incompletely assembled. Assembly of 5 genes (FAM20C, HYDIN, NOTCH2NLC, PRKAR1B, and SHANK2) was incomplete for all 100 samples using both assemblers. Genes that are not in or do not contain a segdup are in bold with an asterisk.
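
For readers unfamiliar with the N50 and NG50 statistics used in panels A and B, the following sketch shows how they are conventionally computed: N50 is the contig (or read) length at which the cumulative length of the longest items reaches half of the total, while NG50 uses half of an assumed genome size instead. The function name `nx50` and the toy lengths are hypothetical, and the genome size used by the authors is not stated here.

```python
def nx50(lengths, genome_size=None):
    """Return N50 of `lengths`, or NG50 when `genome_size` is given.

    Returns 0 if the items never cover half of the target size.
    """
    target = (genome_size if genome_size is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):  # longest first
        running += length
        if running >= target:
            return length
    return 0

contigs = [80, 70, 50, 40, 30, 20]        # toy contig lengths, total 290
print(nx50(contigs))                      # N50: 80 + 70 = 150 >= 145, so 70
print(nx50(contigs, genome_size=400))     # NG50: 80 + 70 + 50 = 200 >= 200, so 50
```

Because NG50 is anchored to a fixed genome size, it allows assemblies of different total length to be compared on the same scale.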
Figure 3. SV call set.
A: SV calls were benchmarked against HPRC Sniffles2 SV calls within the GIAB HG002 SV Tier1 benchmarking regions. B: A similar number of genome-wide SVs were identified by all five callers used in this study, with the highest number of SVs per sample identified by hapdiff. The confident call set is defined as variants called by hapdiff and at least 2 unique alignment-based callers. For each call set the average number of deletions (DEL), insertions (INS) and total SVs (including INV, DUP and BND events) per sample is shown below the plot. C: Histogram of insertion and deletion counts stratified by size using Sniffles2 from the Napu pipeline. The peak around 300 bp represents Alu insertions or deletions, and the peak around 6 kbp represents LINE insertions or deletions. D: Cumulative novel SVs per sample. The frequency of new SVs observed increases when samples from individuals of African ancestry are included. E: Upset plot of overlap among SV callers after merging with Jasmine. For each sample, 5 VCF files were merged, demonstrating that the majority of calls in each sample were called by all 5 callers. The next-highest violin plots are calls made by all callers except for hapdiff (the only assembly-based caller) and calls made only by one caller. F: Among 113,696 SVs from the Jasmine-merged confident call set, 12,432 were found in exactly 2 samples, with 6,181 (50%) of those calls in pairs in which both samples are from the African superpopulation.
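
The confident call set rule in panel B (called by hapdiff and by at least 2 unique alignment-based callers) can be expressed as a simple predicate over the set of callers supporting a merged SV. This is a sketch of the stated rule only, not the authors' filtering code; the function name and the placeholder caller names `aligner_b` and `aligner_c` are hypothetical.

```python
ASSEMBLY_CALLER = "hapdiff"  # described in the caption as the only assembly-based caller

def is_confident(supporting_callers, min_alignment_callers=2):
    """True if an SV is supported by hapdiff plus at least
    `min_alignment_callers` distinct alignment-based callers."""
    callers = set(supporting_callers)          # de-duplicate repeated caller names
    alignment_based = callers - {ASSEMBLY_CALLER}
    return ASSEMBLY_CALLER in callers and len(alignment_based) >= min_alignment_callers

print(is_confident(["hapdiff", "sniffles2", "aligner_b"]))    # True
print(is_confident(["hapdiff", "sniffles2"]))                 # False: one alignment-based caller
print(is_confident(["sniffles2", "aligner_b", "aligner_c"]))  # False: no hapdiff support
```

In practice the supporting callers would be read from the merged VCF produced by the merging step; how that support is encoded depends on the merging tool and is not assumed here.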
Figure 4. SVs, including multi-exon deletions, are found in medically relevant genes.
A. Phased IGV view of a 22,791-bp deletion in GM19035 that includes all or part of HBB, HBD, HBBP1, BGLT3, and HBG1. Variants in this region are associated with beta thalassemia (MIM: 613985); this specific deletion is known as Hemoglobin Kenya, and this individual is likely an asymptomatic carrier (Huisman et al. 1972). GM19035 is from an individual from the Luhya population within the African superpopulation. B. Phased IGV view from LRS data showing a CYP2D6 full gene deletion on one haplotype (HP1) and a hybrid tandem arrangement (*36+*10) represented by an insertion on the second haplotype (HP2) in HG02396, compared to short-read whole genome sequence data from the same sample in which the complex nature of this event cannot be resolved.
Figure 5. Evaluation of repeat expansions known to be associated with Mendelian conditions.
A: Haplotype-resolved repeat expansions of selected repeat loci for simple and complex repeat units. Pathogenic repeat size is shown to the right of each plot (*), the associated condition is in parentheses, and the full name of each condition can be found in Table S10. The pathogenic repeat size for FMR1 is listed as 200 repeats, but a dashed vertical line represents the 55-repeat threshold that puts 46,XX and 46,XY individuals at risk for fragile X-associated tremor/ataxia syndrome (FXTAS, MIM #300623) and 46,XX individuals at risk of fragile X-associated primary ovarian insufficiency (POF1/FXPOI, MIM #311360). (AD, autosomal dominant; AD/AR, autosomal dominant/recessive; AR, autosomal recessive; XR, X-linked recessive; XD, X-linked dominant.) B: Among 200 haplotypes (y-axis), an expansion in RFC1 near or over 400 repeat units was seen in 5 haplotypes. The fraction of each motif within a single haplotype is shown. AAGGG is the most common pathogenic repeat expansion; additional pathogenic expansions include ACAGG (not shown) and a mixed AAAGG/AAGGG expansion (Cortese et al. 1993). C: Haplotype (HP)-resolved detail of RFC1 repeat expansions in five samples with an expansion of one allele. Haplotypes are assigned arbitrarily. The dotted line represents the position of full penetrance alleles typically seen at 400 repeat units. D: Three samples with expansions in ATXN10 larger than 280 ATTCT repeats were observed, one of which carries one allele larger than 800 repeat units and one allele around 500 repeat units in size. The dotted line at 800 repeat units represents the position of the lower end of the full penetrance range. ExpansionHunter (EH) estimates are overlaid atop the bar plots in (C) and (D), placed on HP1 or HP2 based on their length.
Figure 6. Patterns of methylation among the 1000 Genomes samples.
A. Among 69 46,XX samples, 42 had mixed X-chromosome inactivation (top, example from HG01414), while 27 were skewed (bottom, example from HG01801). The color differences are related to breaks in phasing and do not suggest methylation is mixed along a single haplotype. B. Haplotype-resolved methylation fraction is shown for three imprinted loci associated with four imprinting disorders. Methylated (>75%) or unmethylated (<25%) fraction is shown at IC1 in H19 and IC2 in KCNQ1OT1, which are associated with BWS and SRS on 11p15.5. Haplotype-resolved methylation fraction is also shown for the CpG island within SNURF-SNRPN that is evaluated when testing for PWS or AS. Two samples have either gain (GM19473) or loss (HG00525) of methylation at this locus. C. Distinct methylation differences within defined CpG islands were identified in individual samples. An example from HG02389 shows three CpG sites with increased methylation (red boxes) compared to controls (gray). D. As an example, we identified one haplotype in HG02389 that has increased methylation at an internal CpG site (3007) within an intron of SLC29A3. Methylation frequency by haplotype is shown for HG02389 and one control (HG03022). Methylation status is shown for individual reads for each haplotype from HG02389 only (red indicates a methylated CpG; blue indicates an unmethylated CpG).
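
The methylated (>75%) and unmethylated (<25%) cutoffs quoted in panel B can be written as a per-haplotype classifier. This is a minimal sketch of those two thresholds only; the function name and the "intermediate" label for values between the cutoffs are assumptions, not terms from the source.

```python
def classify_methylation(fraction, high=0.75, low=0.25):
    """Label a haplotype-level methylation fraction (0.0 to 1.0)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("methylation fraction must be between 0 and 1")
    if fraction > high:
        return "methylated"
    if fraction < low:
        return "unmethylated"
    return "intermediate"  # assumed label for values between the two cutoffs

# At an imprinted locus, one haplotype is expected to be methylated and the other not.
hp1, hp2 = 0.92, 0.06  # toy fractions, not data from the study
print(classify_methylation(hp1), classify_methylation(hp2))  # methylated unmethylated
```

Under this scheme, a gain or loss of methylation at an imprinted locus would appear as both haplotypes receiving the same label.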
