Am J Hum Genet. 2021 May 6;108(5):919-928.

doi: 10.1016/j.ajhg.2021.03.014. Epub 2021 Mar 30.

Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies

Xuefang Zhao¹, Ryan L Collins², Wan-Ping Lee³, Alexandra M Weber⁴, Yukyung Jun³, Qihui Zhu³, Ben Weisburd⁵, Yongqing Huang⁶, Peter A Audano⁷, Harold Wang⁸, Mark Walker⁹, Chelsea Lowther¹, Jack Fu¹; Human Genome Structural Variation Consortium; Mark B Gerstein¹⁰, Scott E Devine¹¹, Tobias Marschall¹², Jan O Korbel¹³, Evan E Eichler¹⁴, Mark J P Chaisson¹⁵, Charles Lee¹⁶, Ryan E Mills⁴, Harrison Brand¹, Michael E Talkowski¹⁷

Affiliations

¹ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA.
² Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Division of Medical Sciences, Harvard Medical School, Boston, MA 02115, USA.
³ The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.
⁴ Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA.
⁵ Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
⁶ Data Sciences Platform, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
⁷ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA.
⁸ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
⁹ Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA.
¹⁰ Yale University Medical School, Computational Biology and Bioinformatics Program, New Haven, CT 06520, USA.
¹¹ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA.
¹² Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany.
¹³ European Molecular Biology Laboratory, Genome Biology Unit, 69117 Heidelberg, Germany; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
¹⁴ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA.
¹⁵ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.
¹⁶ The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA; Department of Graduate Studies - Life Sciences, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul 03760, South Korea; Precision Medicine Center, The First Affiliated Hospital of Xi'an Jiaotong University, 277 West Yanta Road, Xi'an 710061, Shaanxi, People's Republic of China.
¹⁷ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Division of Medical Sciences, Harvard Medical School, Boston, MA 02115, USA. Electronic address: talkowsk@broadinstitute.org.

PMID: 33789087
PMCID: PMC8206509
DOI: 10.1016/j.ajhg.2021.03.014

Xuefang Zhao et al. Am J Hum Genet. 2021.

Abstract

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.
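As a back-of-the-envelope illustration of the deletion figures quoted above (not a calculation reported by the authors), the sketch below converts the two percentages into a relative per-base density of lrWGS-specific deletions inside versus outside SD/SR sequence. The function name `fold_enrichment` is hypothetical, and the calculation assumes that the lrWGS-specific deletions not in SD/SR regions (100% - 91.4%) all fall in the remaining 90.3% of the reference.

```python
def fold_enrichment(frac_variants_in_region: float,
                    frac_genome_in_region: float) -> float:
    """Relative per-base variant density inside a region versus outside it.

    Both arguments are fractions between 0 and 1 (exclusive).
    """
    density_inside = frac_variants_in_region / frac_genome_in_region
    density_outside = (1.0 - frac_variants_in_region) / (1.0 - frac_genome_in_region)
    return density_inside / density_outside


# Figures quoted in the abstract.
SD_SR_GENOME_FRACTION = 0.097      # share of GRCh38 annotated as SD or SR
LR_SPECIFIC_DEL_IN_SD_SR = 0.914   # share of lrWGS-specific deletions in SD/SR

enrichment = fold_enrichment(LR_SPECIFIC_DEL_IN_SD_SR, SD_SR_GENOME_FRACTION)
print(f"lrWGS-specific deletions are ~{enrichment:.0f}-fold denser in SD/SR sequence")
```

Under these assumptions the density ratio comes out close to two orders of magnitude, which is one way to read the abstract's point that the deletions missed by srWGS are concentrated in a small, repeat-rich part of the genome.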

Keywords: copy number variation; genome assembly; long-read technology; segmental duplication; simple repeats; structural variation; whole-genome sequencing.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Comparison of SV callsets from srWGS and lrWGS. (A) The substantially increased SV yield of lrWGS, shown for the HGSVC and the largest Pacific Biosciences (PacBio) lrWGS study published to date, in comparison with contemporary srWGS studies. SV yield varies widely across srWGS studies to date that report SVs detected per individual in more than 100 genomes. Parenthetical numbers next to each study label indicate the number of genomes analyzed, and bold numbers next to each bar represent the number of SVs per genome reported by each study. (B) Overlap of SVs from the HGSVC srWGS and lrWGS callsets across children of the three trio families, partitioned by SV class. (C) Distribution of repetitive sequences across the genome, genes, and exons. “Constrained” refers to genes and exons with pLI > 0.9, and “OMIM genes” includes a curated list of autosomal dominant genes that were defined in both Berg et al. and Blekhman et al. Gb, gigabase; Mb, megabase. The percentage listed within each bar is the fraction of each group composed of "unique + RM" sequences. (D) Distribution of SVs from srWGS and lrWGS split by repetitive sequence context. Formatting conventions are the same as in (C). (E and F) Concordance of deletions (E) and insertions and duplications (F) between srWGS and lrWGS split by repetitive sequence context.

**Figure 2**
Methods to recalibrate SVs in "unique + RM" sequences based on read-level alignment signatures. (A) *In silico* evaluation results from VaPoR on deletions (pink background), insertions (purple background), and duplications (blue background). Duplications and insertions reported by srWGS were both compared against insertions from lrWGS. “Concordant” represents SVs discovered by both lrWGS and srWGS, and “technology-specific” represents SVs specifically discovered from one technology. (B) Distribution of normalized read depth of srWGS across deletions (pink background), insertions (purple background), and duplications (blue background) that were supported by VaPoR (red) and the 1 kb genomic regions that flank these SVs (gray). (C and D) Distribution of aberrant srWGS read pairs (C) and split reads (D) around deletions (pink background), insertions (purple background), and duplications (blue background) that were either homozygous (red), heterozygous (green), or false positives (blue). The homozygous, heterozygous, and likely false positive SV sets were selected with the criteria described in the supplemental material and methods. (E and F) Concordance of deletions (E) and insertions and duplications (F) in "unique + RM" sequences that were supported by the *in silico* SV refinement procedure. Percentages represent the fraction of total variants shared between srWGS and lrWGS.

**Figure 3**
Alignment of assembled lrWGS insertion sequences against known repeat elements. (A) Count of lrWGS insertions in "unique + RM" sequences per genome by alignment of inserted sequences to known repeat elements. The number on top of the bar represents the averaged count of high-confidence insertions in "unique + RM" sequences per genome. (B) Count of lrWGS insertions that are specifically discovered by lrWGS and of those shared with srWGS, by alignment of inserted sequences to known repeat elements. Formatting conventions are the same as in (A). (C) An example of an insertion SV assembled by lrWGS, annotated with sequences that align to known repeat element classes. White shading represents sequences not annotated as a known repeat element. (D) Counts of lrWGS insertions in "unique + RM" sequences per genome by the class of inserted sequence and the proportion that was overlapped by srWGS. “OTH*” represents insertions aligned to multiple known repeat elements, such as the example shown in (C). “OTH#” stands for insertions that were not aligned to any repeat elements. Numbers in parentheses represent the proportion of insertions that were overlapped by srWGS. (E) Histogram of split-read counts around the high-confidence lrWGS insertions.
