Genome Res. 2024 Nov 20;34(11):2061-2073.
doi: 10.1101/gr.279273.124.

High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation

Jonas A Gustafson et al. Genome Res. 2024.

Abstract

Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.

Figures

Figure 1.
Summary statistics of samples, sequencing, and small variant detection. (A) Samples selected for sequencing are shown by superpopulation and sex. (B) Violin plots showing average read length, read N50, and average depth of coverage for all 100 samples. (C) DNA was extracted from cells grown from aliquots received from Coriell and sequenced using the R9.4.1 pore. Data were analyzed using both alignment- and assembly-based approaches.
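Read N50, one of the statistics in panel B, is conventionally the length L such that reads of length L or longer contain at least half of all sequenced bases. The following is a minimal Python sketch of that standard definition; the function name and the toy read lengths are illustrative assumptions and are not taken from the study's pipeline.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the largest length L such that reads of length >= L
    together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    total_bases = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length


# Hypothetical read lengths (arbitrary units), for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
print(read_n50(toy_lengths))
```

Because long reads dominate the base count, N50 is usually well above the average read length, which is why the two are reported separately.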
Figure 2.
Summary of de novo assembly results. (A) Contig NG50 compared to the total number of contigs shows that haploid assemblies generated by Flye are longer and have fewer contigs than those generated by Shasta–Hapdup. No contig NG50 generated by Flye exceeds 40 Mbp. Assemblies for each benchmarking sample show similar statistics. (B) Assembly NG50 does not significantly improve with higher read N50. (C) QV scores for both Flye and Shasta–Hapdup assemblies, and the five benchmarking genomes. (D) Count of contig breaks for all 100 samples on Chromosome 7 shows that, although assembly breaks cluster, a large number of single breaks are spread across the chromosome. The 1.5–1.8 Mbp Williams–Beuren syndrome critical region is indicated with a dashed box and is flanked by clusters of assembly breaks within segdups (Morris 1993). (E) Contig sizes filtered for contigs longer than 1 Mbp for each superpopulation. (F) OMIM genes incompletely assembled in 50 or more samples using Flye or Shasta–Hapdup. For Shasta–Hapdup, if one haplotype was completely assembled in a sample but the other was incomplete, the gene is counted as incompletely assembled. Assembly of five genes (FAM20C, HYDIN, NOTCH2NLC, PRKAR1B, and SHANK2) was incomplete for all 100 samples using both assemblers. Genes that are not in or do not contain a segdup are in bold with an asterisk.
Figure 3.
SV call set. (A) SV calls were benchmarked against HPRC Sniffles2 SV calls within the GIAB HG002 SV Tier1 benchmarking regions. (B) A similar number of genome-wide SVs was identified by each of the five callers used in this study. The confident call set is defined as variants called by hapdiff and at least two unique alignment-based callers. For each call set, the average number of deletions (DEL), insertions (INS), and total SVs (including INV, DUP, and BND events) per sample is shown. (C) Histogram of insertion and deletion counts stratified by size. The peak at ∼300 bp represents Alu insertions or deletions, and the peak at ∼6 kbp represents LINE insertions or deletions. (D) Cumulative novel SVs per sample. The frequency of new SVs observed increases when samples from individuals of African ancestry are included. (E) Upset plot of overlap among SV callers after merging with Jasmine. For each sample, five VCF files were merged, demonstrating that the majority of calls in each sample were called by all five callers. (F) Among 113,696 SVs from the Jasmine-merged confident call set, 12,432 were found in exactly two samples, with 6181 (50%) of those calls in pairs in which both samples are from the African superpopulation.
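The confident call set rule in panel B (called by hapdiff and by at least two unique alignment-based callers) can be expressed as a simple set test. The sketch below is an assumption-laden illustration, not the study's code: it treats every caller other than hapdiff as alignment-based, and the caller names `caller_a` through `caller_d` are hypothetical placeholders.

```python
ASSEMBLY_CALLER = "hapdiff"  # the assembly-based caller named in the caption


def is_confident(supporting_callers):
    """Return True if an SV meets the confident call set rule.

    Assumption: any supporting caller other than hapdiff counts as
    alignment-based. Duplicated names are collapsed, so only unique
    callers are counted.
    """
    callers = set(supporting_callers)
    alignment_support = callers - {ASSEMBLY_CALLER}
    return ASSEMBLY_CALLER in callers and len(alignment_support) >= 2


# Hypothetical merged calls: SV identifier -> callers that reported it.
merged_calls = {
    "sv1": ["hapdiff", "caller_a", "caller_b"],
    "sv2": ["caller_a", "caller_b", "caller_c"],  # no hapdiff support
    "sv3": ["hapdiff", "caller_a"],               # only one alignment caller
}
confident = [sv for sv, callers in merged_calls.items() if is_confident(callers)]
print(confident)
```

Requiring support from both the assembly-based and alignment-based approaches trades some sensitivity for specificity, which suits a control data set intended for variant filtering.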
Figure 4.
Evaluation of repeat expansions known to be associated with Mendelian conditions. (A) Haplotype-resolved repeat expansions of selected repeat loci for simple and complex repeat units. Pathogenic repeat size is shown to the right of each plot (*), the associated condition is in parentheses, and the full name of each condition can be found in Supplemental Table S11. The pathogenic repeat size for FMR1 is listed as 200 repeats, but a dashed vertical line represents the 55-repeat threshold that puts 46,XX and 46,XY individuals at risk for fragile X-associated tremor/ataxia syndrome (FXTAS, MIM #300623) and 46,XX individuals at risk of fragile X-associated primary ovarian insufficiency (POF1/FXPOI, MIM #311360). (AD) autosomal dominant, (AD/AR) autosomal dominant/recessive, (AR) autosomal recessive, (XR) X-linked recessive, (XD) X-linked dominant. (B) Among 200 haplotypes (y-axis), an expansion in RFC1 near or over 400 repeat units was seen in five haplotypes. AAGGG is the most common pathogenic repeat expansion; additional pathogenic expansions include ACAGG (not shown) and a mixed AAAGG/AAGGG expansion (Cortese et al. 1993). (C) Haplotype (HP)-resolved detail of RFC1 repeat expansions in five samples with an expansion of one allele. Haplotypes are assigned arbitrarily. The dotted line represents the position of full penetrance alleles typically seen at 400 repeat units. (D) Three samples with expansions in ATXN10 larger than 280 ATTCT repeats were observed. The dotted line at 800 repeat units represents the position of the lower end of the full penetrance range. ExpansionHunter (EH) estimates are overlaid on the bar plots in (C) and (D), placed on HP1 or HP2 based on their length.
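The two FMR1 thresholds described in panel A (55 repeats for the at-risk range and 200 repeats for the listed pathogenic size) can be applied to a per-haplotype repeat count as sketched below. This is an illustration only and not a clinical classifier: the function name, the labels, and the choice to treat each threshold as inclusive are assumptions, since the caption does not state how boundary values are handled.

```python
FMR1_AT_RISK_REPEATS = 55      # dashed-line threshold described in the caption
FMR1_PATHOGENIC_REPEATS = 200  # pathogenic repeat size listed in the caption


def label_fmr1_allele(repeat_count):
    """Label one FMR1 allele by repeat count.

    Assumption: thresholds are inclusive (>=). Labels are hypothetical
    names chosen for this sketch.
    """
    if repeat_count < 0:
        raise ValueError("repeat count cannot be negative")
    if repeat_count >= FMR1_PATHOGENIC_REPEATS:
        return "at_or_above_pathogenic_size"
    if repeat_count >= FMR1_AT_RISK_REPEATS:
        return "at_or_above_at_risk_threshold"
    return "below_thresholds"


# Hypothetical repeat counts for the two haplotypes of one sample.
for haplotype, count in {"HP1": 30, "HP2": 72}.items():
    print(haplotype, count, label_fmr1_allele(count))
```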
Figure 5.
Patterns of methylation among the 1000 Genomes samples. (A) Among 69 46,XX samples, 42 had mixed X-Chromosome inactivation (top, example from HG01414), while 27 were skewed (bottom, example from HG01801). The color differences are related to breaks in phasing and do not suggest methylation is mixed along a single haplotype. (B) Haplotype-resolved methylation fraction is shown for three imprinted loci associated with four imprinting disorders. Fractions are shown as methylated (>75%) or unmethylated (<25%) at IC1 in H19 and IC2 in KCNQ1OT1. Haplotype-resolved methylation fraction is also shown for the CpG island within SNURF-SNRPN that is evaluated when testing for PWS or AS. Two samples have either gain (GM19473) or loss (HG00525) of methylation at this locus. (C) Unique methylation differences within defined CpG islands were identified in individual samples. An example from HG02389 shows three CpG sites with increased methylation (red boxes) compared to controls (gray).
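The methylated (>75%) and unmethylated (<25%) cutoffs in panel B suggest a simple per-haplotype classification. The sketch below applies those cutoffs to a methylation fraction between 0 and 1. The "intermediate" label, the function names, and the check that one haplotype is methylated while the other is unmethylated (the pattern generally expected at an imprinted locus) are assumptions made for illustration, not the authors' method.

```python
def classify_methylation(fraction):
    """Classify a methylation fraction (0.0 to 1.0) using the caption's cutoffs."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction > 0.75:
        return "methylated"
    if fraction < 0.25:
        return "unmethylated"
    return "intermediate"  # hypothetical label for values between the cutoffs


def has_expected_imprinting(hp1_fraction, hp2_fraction):
    """Return True if one haplotype is methylated and the other unmethylated."""
    labels = {classify_methylation(hp1_fraction), classify_methylation(hp2_fraction)}
    return labels == {"methylated", "unmethylated"}


# Hypothetical haplotype-resolved fractions at one imprinted locus.
print(has_expected_imprinting(0.92, 0.08))  # one allele methylated, one not
print(has_expected_imprinting(0.90, 0.88))  # both alleles methylated
```

Under these assumptions, a sample with both haplotypes methylated or both unmethylated would be flagged as departing from the expected pattern, analogous to the gain or loss of methylation described for SNURF-SNRPN.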

Update of

  • Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation.
    Gustafson JA, Gibson SB, Damaraju N, Zalusky MP, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson Z, Storz SH, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E; 1000 Genomes ONT Sequencing Consortium; University of Washington Center for Rare Disease Research (UW-CRDR); Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium; McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN Jr, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. medRxiv [Preprint]. 2024 Mar 7:2024.03.05.24303792. doi: 10.1101/2024.03.05.24303792. Update in: Genome Res. 2024 Nov 20;34(11):2061-2073. doi: 10.1101/gr.279273.124. PMID: 38496498. Free PMC article.

References

    The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74. doi:10.1038/nature15393
    Akçimen F, Ross JP, Bourassa CV, Liao C, Rochefort D, Gama MTD, Dicaire M-J, Barsottini OG, Brais B, Pedroso JL, et al. 2019. Investigation of the RFC1 repeat expansion in a Canadian and a Brazilian ataxia cohort: identification of novel conformations. Front Genet 10: 1219. doi:10.3389/fgene.2019.01219
    AlAbdi L, Shamseldin HE, Khouj E, Helaby R, Aljamal B, Alqahtani M, Almulhim A, Hamid H, Hashem MO, Abdulwahab F, et al. 2023. Beyond the exome: utility of long-read whole genome sequencing in exome-negative autosomal recessive diseases. Genome Med 15: 114. doi:10.1186/s13073-023-01270-8
    Alonso I, Jardim LB, Artigalas O, Saraiva-Pereira ML, Matsuura T, Ashizawa T, Sequeiros J, Silveira I. 2006. Reduced penetrance of intermediate size alleles in spinocerebellar ataxia type 10. Neurology 66: 1602–1604. doi:10.1212/01.wnl.0000216266.30177.bb
    Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, et al. 2019. Characterizing the major structural variant alleles of the human genome. Cell 176: 663–675.e19. doi:10.1016/j.cell.2018.12.019