Genome Res. 2017 May;27(5):677-685.
doi: 10.1101/gr.214007.116. Epub 2016 Nov 28.

Discovery and genotyping of structural variation from long-read haploid genome sequence data

John Huddleston et al. Genome Res. 2017 May.

Erratum in

Abstract

To more fully understand the spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels, resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project, even after adjusting for more common variants (MAF > 1%). We estimate that a theoretical human diploid built from these two haploid genomes differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.

Figures

Figure 1.
Structural variant (SV) discovery. (A) SV deletions (red) and insertions (black) identified by SMRT-SV in a theoretical diploid human (CHM1 and CHM13) are classified as either novel (83%) or previously reported (17%) based on their presence in previously published SV call sets (Conrad et al. 2010; Kidd et al. 2010a; Mills et al. 2011; Sudmant et al. 2015a,b). (B) Compared specifically against insertions and deletions from Phase 3 of the 1000 Genomes Project (Sudmant et al. 2015b). Counts per call set are shown with mean and median SV size (base pair) shown in parentheses. The Venn diagram compares one theoretical diploid genome sequenced and analyzed using SMRT sequence data versus 2504 diploid genomes lightly sequenced (approximately sixfold coverage) with Illumina sequence.
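The novel versus previously reported split in panel A depends on a rule for deciding when two calls describe the same variant. The caption does not state that rule, so the following is only a minimal sketch under stated assumptions: calls are treated as half-open intervals, and two calls match when they share at least 50% reciprocal overlap. The `Call` type, the `classify` function, and the 0.5 threshold are all hypothetical and are not taken from SMRT-SV.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    kind: str   # "DEL" or "INS"


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def classify(calls, published, min_overlap=0.5):
    """Label each call 'reported' if any published call matches it, else 'novel'."""
    labels = {}
    for call in calls:
        hit = any(reciprocal_overlap(call, p) >= min_overlap for p in published)
        labels[call] = "reported" if hit else "novel"
    return labels


# Toy data, invented for illustration only.
calls = [Call("chr1", 1000, 2000, "DEL"), Call("chr1", 5000, 5300, "DEL")]
published = [Call("chr1", 1100, 2100, "DEL")]
labels = classify(calls, published)
```

Insertions occupy no span on the reference, so a real comparison would need a breakpoint window or inserted-length criterion for them; this sketch simply reports no match for zero-length intervals.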
Figure 2.
Indel discovery. Small indels (2–49 bp) identified by SMRT-SV in a theoretical diploid human (CHM1 and CHM13) from SMRT WGS data are compared with merged FreeBayes and GATK HaplotypeCaller indel calls from CHM1 and CHM13 Illumina WGS. All call sets were filtered to exclude previously defined low-complexity regions (Li 2014) and 1-bp indels that cannot be reliably detected by SMRT sequence data (Gordon et al. 2016). (A) The proportion of SMRT-SV calls that are not observed in Illumina call sets increases linearly with indel size. (B) The total number of calls shared between or distinct to SMRT and Illumina WGS call sets (with mean and median call size in parentheses) highlights that 43% of SMRT-SV indels were not detected by FreeBayes or GATK, while 22% of indels in Illumina-based call sets were not detected by SMRT-SV.
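The trend in panel A can be illustrated by binning calls by size and computing, per bin, the share that has no counterpart in the short-read call set. This is a rough sketch, not the authors' analysis: matching calls by an exact key, the 10-bp bin width, and the function name `missed_fraction_by_size` are all assumptions made for illustration. Only the 2–49 bp size range comes from the caption.

```python
from collections import defaultdict


def missed_fraction_by_size(smrt_indels, illumina_keys, bin_width=10):
    """Return {bin_start: fraction of SMRT indels absent from the Illumina set}.

    smrt_indels: iterable of (key, size_bp) pairs; key is any hashable variant ID.
    illumina_keys: set of keys seen in the short-read call set.
    """
    totals = defaultdict(int)
    missed = defaultdict(int)
    for key, size in smrt_indels:
        if not 2 <= size <= 49:  # keep only the small-indel range from the caption
            continue
        bin_start = (size // bin_width) * bin_width
        totals[bin_start] += 1
        if key not in illumina_keys:
            missed[bin_start] += 1
    return {b: missed[b] / totals[b] for b in sorted(totals)}


# Toy data, invented for illustration only.
smrt = [("a", 3), ("b", 5), ("c", 12), ("d", 18),
        ("e", 40), ("f", 45), ("g", 1), ("h", 60)]
illumina = {"a", "b", "c"}
fractions = missed_fraction_by_size(smrt, illumina)
```

In this toy input the 1-bp and 60-bp calls are dropped by the size filter, and the missed fraction rises from the smallest bin to the largest.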
Figure 3.
SMRT-SV genotyping with Illumina sequence data. (A) The heatmap depicts genotypes for 18,211 of 29,992 (61%) nonredundant CHM1 and CHM13 SVs that could be concordantly genotyped in both moles by their respective Illumina WGS. Each row is a sample (two moles and 30 PCR-free samples from the 1000 Genomes Project), each column is an SV, and each cell is colored by genotype: homozygous alternate (dark blue), heterozygous (light blue), and homozygous reference (white). The number of heterozygous and homozygous alternate genotypes for each sample is indicated (parentheses). Columns are ordered by presence/absence of the SV in CHM1, CHM1/CHM13, and CHM13 and then by allele count and genomic coordinate. Specifically highlighted are 1161 SVs present in both CHM1/CHM13 and fixed (homozygous alternate) in all 30 diploid human genomes, suggesting minor alleles or sequencing errors in GRCh38. (B) The density plot compares the GC composition (x-axis) of CHM1 and CHM13 SVs that could be successfully genotyped by their respective PCR-free Illumina WGS data (77%) versus those that could not. Density plots do not represent relative proportion between the two SV categories. SVs that failed to genotype were particularly biased for GC-rich regions of the genome.
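Two quantities in this caption are simple to reproduce: the 61% genotyping rate, and the notion of an SV that is fixed (homozygous alternate) in every diploid sample. The sketch below assumes genotypes are encoded as alternate-allele counts (0, 1, or 2) in a samples-by-SVs matrix. That encoding and the function name `fixed_alternate_svs` are illustrative assumptions, not the paper's data format.

```python
def fixed_alternate_svs(genotypes, sv_ids):
    """Return the IDs of SVs that are homozygous alternate (2) in every sample.

    genotypes: list of rows, one per sample; each row holds one alt-allele
               count (0, 1, or 2) per SV, in the same order as sv_ids.
    """
    fixed = []
    for col, sv_id in enumerate(sv_ids):
        if all(row[col] == 2 for row in genotypes):
            fixed.append(sv_id)
    return fixed


# Counts quoted in the caption: 18,211 of 29,992 nonredundant SVs.
genotyped_percent = round(18211 / 29992 * 100)

# Toy matrix, invented for illustration: 3 samples x 3 SVs.
sv_ids = ["sv1", "sv2", "sv3"]
genotypes = [
    [2, 1, 2],
    [2, 0, 2],
    [2, 2, 1],
]
fixed = fixed_alternate_svs(genotypes, sv_ids)
```

Here only `sv1` is homozygous alternate in all three samples, which is the pattern the caption interprets as a possible minor allele or error in the reference.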

References

    1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.
    2. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, et al. 2009. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 41: 1061–1067.
    3. Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 33: 623–630.
    4. Bhangale TR, Rieder MJ, Livingston RJ, Nickerson DA. 2005. Comprehensive identification and characterization of diallelic insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet 14: 59–69.
    5. Browning BL, Browning SR. 2016. Genotype imputation with millions of reference samples. Am J Hum Genet 98: 116–126.
