Read clouds uncover variation in complex regions of the human genome

Alex Bishara¹, Yuling Liu², Ziming Weng³, Dorna Kashef-Haghighi¹, Daniel E Newburger⁴, Robert West³, Arend Sidow⁵, Serafim Batzoglou¹

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, California 94305, USA;
² Department of Computer Science, Stanford University, Stanford, California 94305, USA; Department of Chemistry, Stanford University, Stanford, California 94305, USA;
³ Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA;
⁴ Biomedical Informatics Training Program, Stanford, California 94305, USA;
⁵ Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA; Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA.

PMID: 26286554
PMCID: PMC4579342
DOI: 10.1101/gr.191189.115

Alex Bishara et al. Genome Res. 2015 Oct;25(10):1570-80. doi: 10.1101/gr.191189.115. Epub 2015 Aug 18.

Abstract

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.

Figures

**Figure 1.**
Read clouds (RC) and synthetic long reads (SLR) obtained by Illumina TruSeq Synthetic Long-Read sequencing. Each well initially contains long molecules that represent a small fraction of the target genome; reads from each long molecule are separated in genomic coordinates within the target genome, and therefore, clusters of such reads (read clouds) are formed with each cluster originating from one source fragment. Blue reads denote end-markers of the source fragments and may not always be present as sequenced short reads. (A) In the RC approach, long fragments from several wells w_n are sequenced to a shallow depth and aligned to the reference to obtain read clouds. Pooling of reads across several read clouds allows inference of the variation in the underlying long fragments. (B) In the SLR approach, long fragments are sequenced to a much higher depth to enable de novo assembly of synthetic long reads. For the same total sequencing budget C, the RC approach covers proportionally more target genome space than the SLR approach.
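The budget argument in this caption can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes a simplified model in which the long-fragment sequence a budget can cover equals the budget divided by the per-fragment sequencing depth. The function name `covered_bases` and all numeric values (budget and depths) are hypothetical and chosen only for illustration.

```python
def covered_bases(budget_bases: float, per_fragment_depth: float) -> float:
    """Long-fragment sequence (in bases) that a sequencing budget can cover.

    Simplified model (an assumption): every long fragment is sequenced to the
    same average depth, so covered sequence = budget / depth.
    """
    if per_fragment_depth <= 0:
        raise ValueError("per_fragment_depth must be positive")
    return budget_bases / per_fragment_depth


# Hypothetical numbers, not values reported in the paper.
C = 3e9              # total sequencing budget in bases
RC_DEPTH = 1.5       # shallow depth per long fragment (read cloud approach)
SLR_DEPTH = 50.0     # deep depth per long fragment (synthetic long reads)

rc_covered = covered_bases(C, RC_DEPTH)
slr_covered = covered_bases(C, SLR_DEPTH)

# Under this model the advantage is simply the ratio of the two depths.
advantage = rc_covered / slr_covered
print(f"RC covers about {advantage:.1f}x more fragment sequence than SLR")
```

Under these assumptions the gain of the RC approach depends only on the ratio of the two depths, which is one way to read "proportionally more" in the caption.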

**Figure 2.**
RFA overview. (A) Wells w_n from the sample are first aligned to the reference using an existing short-read aligner, and uniquely mapped read clouds are used to learn a prior P(M), which captures protocol properties such as the long fragment size distribution. (B) Each well is aligned separately with the aid of a short-read aligner to determine candidate source long fragment locations as well as multiple candidate short-read alignments to the long fragments. Finally, MAP inference is performed to converge on optimal alignments. In this example, RFA successfully determines the correct repeat copy R that overlaps with a source long fragment.
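To make the idea of read clouds and repeat-copy disambiguation concrete, here is a minimal sketch. It is a deliberately simple gap-based heuristic and not RFA's Markov Random Field or its MAP inference. It assumes reads from one well on a single chromosome, and the names `cluster_read_clouds`, `copies_inside_clouds`, and the `max_gap` threshold are hypothetical.

```python
from typing import List, Tuple

Cloud = Tuple[int, int, int]  # (first read position, last read position, read count)


def cluster_read_clouds(positions: List[int], max_gap: int) -> List[Cloud]:
    """Group uniquely mapped read positions from one well into read clouds.

    A new cloud starts whenever two consecutive reads are more than
    max_gap bases apart (an assumed heuristic threshold).
    """
    clouds: List[Cloud] = []
    if not positions:
        return clouds
    ordered = sorted(positions)
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev > max_gap:
            clouds.append((start, prev, count))
            start = pos
            count = 0
        prev = pos
        count += 1
    clouds.append((start, prev, count))
    return clouds


def copies_inside_clouds(candidates: List[int], clouds: List[Cloud]) -> List[int]:
    """Keep candidate alignment positions that fall inside some cloud span."""
    return [c for c in candidates if any(s <= c <= e for s, e, _ in clouds)]


# Toy data: two clouds in one well, and a read with two candidate repeat copies.
unique_positions = [100, 900, 2500, 50000, 50400, 51000]
clouds = cluster_read_clouds(unique_positions, max_gap=5000)
spans = [end - start for start, end, _ in clouds]   # crude fragment-length samples
supported = copies_inside_clouds([1800, 200000], clouds)
print(clouds, spans, supported)
```

The cloud spans collected this way are the kind of observation from which a fragment size distribution could be estimated, and the second function shows why a nearby cloud favors one repeat copy over another. The actual method weighs such evidence jointly rather than applying a hard cutoff.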

**Figure 3.**
Histograms of simulation results across 1000 wells. Each point in the histogram represents the result of a single simulated well. (A) All reads. (B) Only reads that were multimapped in the abbreviated reference. RFA confidently maps an additional 2.9% (out of 3.2% from Oracle) of the total reads over the Baseline approach, and achieves 92% of the Oracle performance.

**Figure 4.**
Whole-genome SNV calling on the IDC sample. (A) Comparison of the initial baseline short-read alignments of all the wells merged together with four wells aligned with RFA (from two distinct haplotypes), in a region overlapping the *FCGR1C* gene. (B) Placement of recovered SNVs within the surrounding 300-kbp region. (C) Density of recovered SNVs throughout the whole genome (*bottom* track), by chromosome, compared to density of segmental duplications (*top* track). Long clustered regions of recovered SNVs coincide with dense regions of annotated segmental duplications.

**Figure 5.**
Abbreviated reference framework. The framework for generating putative long reads and associated short-read alignments to these segments: (1) Reads are aligned to hg19 at most once in pass1; (2) putative long read segments are identified and spliced together to create an abbreviated reference; and (3) reads are aligned again in pass2 to this abbreviated reference allowing multiple mappings.
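Step (2) of the framework implies a coordinate translation between the spliced reference and the original genome. The following sketch shows one way such a mapping could work. It assumes segments are half-open intervals concatenated without padding, which the source does not specify, and the names `build_offsets` and `to_original` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def build_offsets(segments: List[Segment]) -> Tuple[List[int], int]:
    """Return each segment's start offset in the spliced reference and its total length."""
    offsets: List[int] = []
    total = 0
    for _, start, end in segments:
        offsets.append(total)
        total += end - start
    return offsets, total


def to_original(pos: int, segments: List[Segment], offsets: List[int]) -> Tuple[str, int]:
    """Translate a 0-based position in the spliced reference back to (chromosome, position)."""
    if pos < 0:
        raise ValueError("position must be non-negative")
    i = bisect_right(offsets, pos) - 1      # last segment starting at or before pos
    chrom, start, end = segments[i]
    within = pos - offsets[i]
    if within >= end - start:
        raise ValueError("position lies beyond the spliced reference")
    return chrom, start + within


# Toy putative long-fragment segments (hypothetical coordinates).
segments = [("chr1", 1000, 1500), ("chr1", 9000, 9200), ("chr2", 50, 150)]
offsets, total_length = build_offsets(segments)
print(to_original(750, segments, offsets))
```

A real pipeline would probably need separators between segments so that reads cannot align across a splice junction; that detail is omitted here.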
