Cell. 2010 Nov 24;143(5):837-47.

doi: 10.1016/j.cell.2010.10.027.

A human genome structural variation sequencing resource reveals insights into mutational mechanisms

Jeffrey M Kidd¹, Tina Graves, Tera L Newman, Robert Fulton, Hillary S Hayden, Maika Malig, Joelle Kallicki, Rajinder Kaul, Richard K Wilson, Evan E Eichler

PMID: 21111241
PMCID: PMC3026629
DOI: 10.1016/j.cell.2010.10.027

Jeffrey M Kidd et al. Cell. 2010.

Affiliation

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, 98195, USA.

Abstract

Understanding the prevailing mutational mechanisms responsible for human genome structural variation requires uniformity in the discovery of allelic variants and precision in terms of breakpoint delineation. We develop a resource based on capillary end sequencing of 13.8 million fosmid clones from 17 human genomes and characterize the complete sequence of 1054 large structural variants corresponding to 589 deletions, 384 insertions, and 81 inversions. We analyze the 2081 breakpoint junctions and infer potential mechanism of origin. Three mechanisms account for the bulk of germline structural variation: microhomology-mediated processes involving short (2-20 bp) stretches of sequence (28%), nonallelic homologous recombination (22%), and L1 retrotransposition (19%). The high quality and long-range continuity of the sequence reveals more complex mutational mechanisms, including repeat-mediated inversions and gene conversion, that are most often missed by other methods, such as comparative genomic hybridization, single nucleotide polymorphism microarrays, and next-generation sequencing.

Conflict of interest statement

E.E.E. is on the scientific advisory board for Pacific Biosciences. T.L.N. is an employee and founder of iGenix Inc.

Figures

**Figure 1. Sequence and breakpoint analyses**
Variant breakpoints were defined based on alignments of sequences from the sequenced insertion and deletion alleles. For example, (A) the sequence of fosmid clone AC207429 is compared with sequence from the corresponding region on chr2. A 10-kb deletion, relative to the reference sequence, is readily apparent (indicated by the red bracket). The position of segmental duplications, common repeats (LINEs are green, SINEs are purple, and LTR elements are orange), and RefSeq exons are shown. Sequence segments corresponding to three different breakpoint regions (red, green and purple bars) are extracted for further analysis. (B) The sequence across the variant junction is aligned against each of the other two sequences and the resulting pairwise alignments are merged. The pattern of sequence identity is assessed to identify the positions where the junction sequence switches from being a better match to the first breakpoint to being a better match to sequence from the second breakpoint. The breakpoint coordinates correspond to the innermost positions that can be confidently assigned to be before and after the variant boundary. (C) The result of aligning the three segments depicted in (A). Alignment columns where the junction sequence matches the sequence from the first (left-most) breakpoint are indicated by a ‘1’ while alignment columns where the junction sequence matches the second (right-most) breakpoint are indicated by a ‘2’. Positions where all three sequences are the same are indicated by an asterisk ‘*’. The red square highlights the position of the breakpoint coordinates (highlighted in red and green text). The two breakpoints are separated by seven nucleotides found at both breakpoints with perfect identity (blue text). Highlighted in gray is a 293-bp segment present at both breakpoints with a sequence identity of 91%. See also Tables S2 and S7.
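
The column labeling described in panels (B) and (C) can be illustrated with a minimal Python sketch. It assumes the three sequences are already aligned, gap-free, and of equal length, and it uses a simplified rule (last '1' column, then the first '2' column after it) rather than the authors' actual procedure. The function names, the '.' label for columns matching neither breakpoint, and the toy sequences are all hypothetical.

```python
def label_columns(first_bp, junction, second_bp):
    """Label each alignment column: '*' all three agree, '1' junction matches
    only the first breakpoint, '2' only the second, '.' neither (assumed label)."""
    if not (len(first_bp) == len(junction) == len(second_bp)):
        raise ValueError("sequences must be pre-aligned to equal length")
    labels = []
    for a, j, b in zip(first_bp, junction, second_bp):
        if j == a and j == b:
            labels.append("*")
        elif j == a:
            labels.append("1")
        elif j == b:
            labels.append("2")
        else:
            labels.append(".")
    return "".join(labels)


def innermost_breakpoints(labels):
    """Return (left, right, shared): the last column assigned to the first
    breakpoint, the first later column assigned to the second breakpoint,
    and the number of identical ('*') columns between them."""
    left = labels.rfind("1")
    if left == -1:
        raise ValueError("no column is specific to the first breakpoint")
    right = labels.find("2", left + 1)
    if right == -1:
        raise ValueError("no column specific to the second breakpoint follows")
    shared = labels[left + 1:right].count("*")
    return left, right, shared


# Toy example (made-up sequences, 0-based columns).
first_bp = "ACGTAGGCTATTCAGA"
second_bp = "TTCAAGGCTAGCATCG"
junction = "ACGTAGGCTAGCATCG"
labels = label_columns(first_bp, junction, second_bp)
print(labels)                         # 1111******222222
print(innermost_breakpoints(labels))  # (3, 10, 6)
```

In this toy case the six '*' columns between the two coordinates play the role of the short stretch of identical sequence found at both breakpoints. Real alignments contain gaps and mismatches, so a robust version would need a scoring window rather than single-column decisions.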

**Figure 2. Sequenced structural variant alleles**
(A) Size distribution for 1,054 sequenced structural variants. Insertions, deletions, and inversions relative to the genome reference assembly are depicted separately. Note that the bins are not of equal sizes. The mean size of the sequenced variants is 14.9 kb for deletions, 6.1 kb for insertions, and 196 kb for inversions. Our variant selection methodology can only identify deletions larger than ~5 kb and insertions from ~5 kb to ~40 kb in size and is biased against inversions smaller than ~40 kb. (B) The relationship between the donor site of transduced sequences and the LINE insertion position is given for 30 events with a match to hg18 using BLAT. Relationships are shown for 20 LINE insertions in library source individuals relative to the reference (blue lines) and for 10 insertions in the genome reference (red lines). The blue circles represent three different loci associated with multiple distinct LINE insertions. See also Figure S1 and Table S1.

**Figure 3. Examples of sequenced variants**
Examples of the complete sequence of structural variant alleles that have been associated with disease risk, including (A) a 45.5-kb deletion upstream of *NEGR1*, (B) a 72-kb deletion of *RHD*, (C) a 3.9-kb and a 20.1-kb deletion upstream of *IRGM*, and (D) a 32-kb deletion of *LCE3C*. See also Table S3.

**Figure 4. Variant breakpoint analyses**
Class-I variants (A–D) are defined as those without additional nucleotides at the breakpoint. (A) A histogram of the extent of matching breakpoint sequence (black) and extended breakpoint homology (gray) is shown for 590 class-I insertion/deletion events. The red line corresponds to the expected distribution of breakpoint match lengths found from 100 random permutations. Note that bin sizes are not equal. The increase in extended homology segments 250–299 bp in length corresponds to variants having Alus at their breakpoints. (B) As in (A), zoomed in to show variants having 20 bp or less of matching sequence. (C) Box plot of variant size partitioned by length of extended breakpoint homology for 590 class-I insertion/deletion variants (red line: median; blue box: interquartile range; whiskers: within 1.5× the interquartile range). (D) Breakpoint density map within a consensus Alu repeat sequence based on 269 copy-number variant events (blue box: RNA pol III promoter; black boxes: AT-rich segment between the two monomers that make up the Alu element and the poly A tail; purple box: position of motif (CCNCCNTNNCCNC) found in some Alus and associated with recombination hotspots (Myers et al., 2008)). Class-II variants (E–G) contain additional sequence across the breakpoint junction. (E) A class-II variant containing a 55-nucleotide stretch of additional sequence (in blue) that is not found at either breakpoint. (F) Histogram of the length of additional sequence found at variant breakpoints (black) and the length of detected extended homology between breakpoint sequences (gray) for 153 class-II insertion/deletion variants. (G) Genomic location for class-II unmatched sequences (>20 bp) associated with deletions. The black lines connect the positions of a class-II deletion variant (relative to the genome assembly) and the corresponding location where the additional sequence across the variant breakpoint can be found. The relationship for 31 deletion variants is depicted. One event involves a match to unlocalized sequence on chromosome 1 (chr1_rand). See also Figure S2, Table S4, and Table S5.

**Figure 5. Breakpoint assessment using paralogous sequence variants**
(A) Schematic comparison of the structures of the insertion and deletion haplotypes of a putative NAHR variant. The blue and red boxes represent homologous sequences present at the breakpoints, which mediate the rearrangement. The blue and red vertical lines identify paralogous sequence variants that distinguish the 5′ and 3′ copy of the matching sequence. Scanning along the deletion allele, which is missing the intervening sequence, one observes single nucleotides specific to the 5′ breakpoint, followed by a stretch of sequence that matches both, then sequences that match the 3′ breakpoint. (B) Representation for three variants showing a ‘classic’ NAHR pattern. Each line represents the deletion allele corresponding to the indicated variant. We note a single unexpected paralogous sequence variant mismatch located 145 bp past the 3′ breakpoint, which could correspond to a SNP, short gene conversion, or alignment artifact due to the placement of indels between 5′ and 3′ segments. (C) Representation of four variants having breakpoints that show a pattern of alternating sequences that match the 3′ then 5′ breakpoints. (D) An extreme pattern of alternating matches that contains 182 switches spanning over a 7.9-kb interval. (E) Rearrangements associated with gene conversion. See also Figure S3.
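
The distinction between a 'classic' pattern and an alternating pattern can be sketched as a count of switches along the deletion allele. The sketch below is illustrative only: the string encoding ('5' for a paralogous sequence variant matching the 5′ copy, '3' for the 3′ copy, any other character for an uninformative position) and the function names are assumptions, not the authors' representation.

```python
def count_switches(psv_calls):
    """Count changes between 5'-specific and 3'-specific PSV calls,
    ignoring positions that are not informative."""
    informative = [c for c in psv_calls if c in ("5", "3")]
    return sum(1 for prev, cur in zip(informative, informative[1:]) if prev != cur)


def is_classic_pattern(psv_calls):
    """A single 5'-to-3' transition, as expected for a simple crossover."""
    informative = [c for c in psv_calls if c in ("5", "3")]
    return bool(informative) and informative[0] == "5" and count_switches(psv_calls) == 1


# Hypothetical call strings; '-' marks positions matching both copies.
print(count_switches("5555-3333"))      # 1
print(is_classic_pattern("5555-3333"))  # True
print(count_switches("553-5-33553"))    # 5
```

Under this encoding, the event in panel (D) would correspond to a call string with 182 such switches.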

**Figure 6. Comparison of events detected from three studies**
Only variants estimated to be >5 kb are included. The Kidd et al. set includes sites of insertion or deletion in one of the five samples relative to the genome assembly; the Conrad et al. set includes gains and losses in at least one of the five samples relative to a reference arrayCGH sample; and the McCarroll et al. set includes CNVs that were successfully genotyped on the Affymetrix 6.0 platform and are variable among the five included samples. Prior to comparison, the variant sets within each study were merged into a single, nonredundant interval set, and any overlap among regions between studies was sufficient regardless of which sample a variant was detected in.
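
The merge-then-overlap comparison described above can be sketched as follows. This is a minimal illustration, assuming half-open `(chrom, start, end)` tuples and treating any shared base as sufficient overlap; the function names and example coordinates are hypothetical and do not come from the study.

```python
def merge_intervals(intervals):
    """Collapse overlapping or touching intervals on the same chromosome
    into a single nonredundant set."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def regions_with_overlap(study_a, study_b):
    """Return regions of study_a that share at least one base with any
    region of study_b, regardless of which sample carried the variant."""
    return [
        a for a in study_a
        if any(a[0] == b[0] and a[1] < b[2] and b[1] < a[2] for b in study_b)
    ]


# Made-up coordinates for illustration.
set_a = merge_intervals([("chr1", 100, 200), ("chr1", 150, 300),
                         ("chr2", 10, 20), ("chr1", 500, 600)])
set_b = merge_intervals([("chr1", 250, 260), ("chr2", 20, 30)])
print(set_a)                               # [('chr1', 100, 300), ('chr1', 500, 600), ('chr2', 10, 20)]
print(regions_with_overlap(set_a, set_b))  # [('chr1', 100, 300)]
```

The nested loop is quadratic and is used here only for clarity; sorted sweeps or interval trees are the usual choice for genome-scale sets.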
