Discovery and genotyping of structural variation from long-read haploid genome sequence data

John Huddleston^{1

2}, Mark J P Chaisson¹, Karyn Meltz Steinberg³, Wes Warren³, Kendra Hoekzema¹, David Gordon^{1

2}, Tina A Graves-Lindsay³, Katherine M Munson¹, Zev N Kronenberg¹, Laura Vives¹, Paul Peluso⁴, Matthew Boitano⁴, Chen-Shin Chin⁴, Jonas Korlach⁴, Richard K Wilson⁵, Evan E Eichler^{1

2}

Affiliations

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA.
² Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA.
³ McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA.
⁴ Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA.
⁵ Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA.

PMID: 27895111
PMCID: PMC5411763
DOI: 10.1101/gr.214007.116

Discovery and genotyping of structural variation from long-read haploid genome sequence data

John Huddleston et al. Genome Res. 2017 May.

. 2017 May;27(5):677-685.

doi: 10.1101/gr.214007.116. Epub 2016 Nov 28.

Authors

Affiliations

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA.
² Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA.
³ McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA.
⁴ Pacific Biosciences of California, Incorporated, Menlo Park, California 94025, USA.
⁵ Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA.

PMID: 27895111
PMCID: PMC5411763
DOI: 10.1101/gr.214007.116

Erratum in

Corrigendum: Discovery and genotyping of structural variation from long-read haploid genome sequence data.
Huddleston J, Chaisson MJP, Steinberg KM, Warren W, Hoekzema K, Gordon D, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L, Peluso P, Boitano M, Chin CS, Korlach J, Wilson RK, Eichler EE. Huddleston J, et al. Genome Res. 2018 Jan;28(1):144. doi: 10.1101/gr.233007.117. Genome Res. 2018. PMID: 29295848 Free PMC article. No abstract available.

Abstract

In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.

PubMed Disclaimer

Figures

**Figure 1.**
Structural variant (SV) discovery. (A) SV deletions (red) and insertions (black) identified by SMRT-SV in a theoretical diploid human (CHM1 and CHM13) are classified as either novel (83%) or previously reported (17%) based on their presence in previously published SV call sets (Conrad et al. 2010; Kidd et al. 2010a; Mills et al. 2011; Sudmant et al. 2015a,b). (B) Compared specifically against insertions and deletions from Phase 3 of the 1000 Genomes Project (Sudmant et al. 2015b). Counts per call set are shown with mean and median SV size (base pair) shown in parentheses. The Venn diagram compares one theoretical diploid genome sequenced and analyzed using SMRT sequence data versus 2504 diploid genomes lightly sequenced (approximately sixfold coverage) with Illumina sequence.

**Figure 2.**
Indel discovery. Small indels (2–49 bp) identified by SMRT-SV in a theoretical diploid human (CHM1 and CHM13) from SMRT WGS data are compared with merged FreeBayes and GATK HaplotypeCaller indel calls from CHM1 and CHM13 Illumina WGS. All call sets were filtered to exclude previously defined low-complexity regions (Li 2014) and 1-bp indels that cannot be reliably detected by SMRT sequence data (Gordon et al. 2016). (A) The proportion of SMRT-SV calls that are not observed in Illumina call sets increases linearly with indel size. (B) The total number of calls shared between or distinct to SMRT and Illumina WGS call sets (with mean and median call size in parentheses) highlights that 43% of SMRT-SV indels were not detected by FreeBayes or GATK, while 22% of indels in Illumina-based call sets were not detected by SMRT-SV.

**Figure 3.**
SMRT-SV genotyping with Illumina sequence data. (A) The heatmap depicts genotypes for 18,211 of 29,992 (61%) nonredundant CHM1 and CHM13 SVs that could be concordantly genotyped in both moles by their respective Illumina WGS. Each row is a sample (two moles and 30 PCR-free samples from the 1000 Genomes Project), each column is an SV, and each cell is colored by genotype: homozygous alternate (dark blue), heterozygous (light blue), and homozygous reference (white). The number of heterozygous and homozygous alternate genotypes for each sample is indicated (parentheses). Columns are ordered by presence/absence of the SV in CHM1, CHM1/CHM13, and CHM13 and then by allele count and genomic coordinate. Specifically highlighted are 1161 SVs present in both CHM1/CHM13 and fixed (homozygous alternate) in all 30 diploid human genomes, suggesting minor alleles or sequencing errors in GRCh38. (B) The density plot compares the GC composition (x-axis) of CHM1 and CHM13 SVs that could be successfully genotyped by their respective PCR-free Illumina WGS data (77%) versus those that could not. Density plots do not represent relative proportion between the two SV categories. SVs that failed to genotype were particularly biased for GC-rich regions of the genome.

See this image and copyright information in PMC

References

1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74. - PMC - PubMed
1. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, et al. 2009. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 41: 1061–1067. - PMC - PubMed
1. Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 33: 623–630. - PubMed
1. Bhangale TR, Rieder MJ, Livingston RJ, Nickerson DA. 2005. Comprehensive identification and characterization of diallelic insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet 14: 59–69. - PubMed
1. Browning BL, Browning SR. 2016. Genotype imputation with millions of reference samples. Am J Hum Genet 98: 116–126. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations
Research Materials
- Cellosaurus - a cell line knowledge resource
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Discovery and genotyping of structural variation from long-read haploid genome sequence data

Affiliations

Discovery and genotyping of structural variation from long-read haploid genome sequence data

Authors

Affiliations

Erratum in

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Research Materials

Miscellaneous