Genome Res. 2010 May;20(5):623-35.
doi: 10.1101/gr.102970.109. Epub 2010 Mar 22.

Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome

Aaron R Quinlan et al. Genome Res. 2010 May.

Abstract

Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding their genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that approximately 16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.

Figures

Figure 1.
Overview of structural variation discovery pipeline. (A) Paired-end mapping signatures are shown for five different classes of structural variation. Notably, each end of a given discordant matepair will map to each copy of a segmental duplication (blue, bottom left panel) when a mutation arises in one copy. In the case of a new transposon insertion (gray rectangle with red arrowhead, bottom right panel), ends of discordant matepairs that originated from the newly inserted sequence will map to all other similar elements in the genome. (Exp.) Experimental genome; (Disc.) discordant pairs from experimental genome; (Ref.) reference genome; (SD) segmental duplication. (B) Matepairs from the DBA strain are aligned to the mouse reference genome. (C) Clusters of discordant matepairs (often with multiple possible mapping combinations) are identified by HYDRA as putative variants (a deletion is shown). (D) Discordant WGS long-reads that corroborate the HYDRA call are assembled into breakpoint contigs (“breaktigs”) with phrap. The red asterisk indicates the nucleotide at which the SV breakpoint occurred. (E) Breaktigs are then aligned to the reference genome with MEGABLAST using very sensitive settings. The observed sequence homology (evident as alignment “overlap” at the breakpoint) in the resulting alignments is a hallmark of the causal SV mechanism, where negative overlap indicates the presence of an insertion or small-scale rearrangements directly at the breakpoint. (NAHR) Non-allelic homologous recombination.
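As a rough illustration of the paired-end signatures in panel A, the sketch below labels a single matepair from the orientation and spacing of its two ends. It is not HYDRA's implementation: the `MatePair` type, the `classify` function, the span threshold, and the convention that a concordant pair maps as plus then minus on one chromosome are all assumptions made for this example. Only the 1-Mb cutoff for "distant" events echoes the figure legends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    """One matepair alignment (hypothetical record layout)."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str  # "+" or "-"


def classify(pair: MatePair, max_span: int = 600, distant: int = 1_000_000) -> str:
    """Label a matepair by its mapping signature.

    Assumes concordant pairs map +/- on the same chromosome with a span
    of at most max_span (an illustrative value, not from the paper).
    """
    if pair.chrom1 != pair.chrom2:
        return "distant"
    # Order the two ends by reference position so orientation is comparable.
    (p1, s1), (p2, s2) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = p2 - p1
    if span > distant:
        return "distant"
    if s1 == s2:
        return "inversion"           # both ends on the same strand
    if s1 == "-" and s2 == "+":
        return "tandem_duplication"  # ends point away from each other
    if span > max_span:
        return "deletion"            # correct orientation, ends too far apart
    return "concordant"


print(classify(MatePair("chr1", 1000, "+", "chr1", 5000, "-")))
```

A real caller would cluster many such pairs and weigh multiple mapping combinations per pair, as panel C describes; this sketch only covers the per-pair signature.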
Figure 2.
Validation of HYDRA calls. (A) HYDRA breakpoint calls in DBA were compared to split-read (S/R) alignments of WGS long-reads from both DBA and B6. Calls in DBA are corroborated by split-read mapping(s) from DBA that map within the predicted breakpoint interval in an orientation that is consistent with the HYDRA call. However, if one or more split-reads from B6 supports the call, it is refuted under the assumption that it originated from a mapping artifact or an error in the reference genome assembly. Cases in which split-reads were observed in neither strain were deemed inconclusive. Due to the relatively low coverage of DBA WGS long-reads, many HYDRA calls were inconclusive. (B) The number and validation rate for high-confidence and low-confidence HYDRA calls are shown for DBA. The 7784 final HYDRA calls represent the high-confidence calls that were not refuted by split-reads plus the low-confidence calls that were confirmed. (LSV) Local SVs, such as duplications, deletions, and inversions; (TEV) transposable element variants; (DI) “distant” insertions of non-transposon DNA from >1 Mb away (including retrogenes). For a detailed table describing the different SV classes and their validation rates, see Supplemental Table S1. (C) The validation rate (blue) and number (gray) of HYDRA SV calls is shown as a function of the mean number of mapping combinations observed among the supporting matepairs. (D) A comparison of the validation rate of HYDRA SV calls by the type of variant.
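The decision rule in panels A and B can be written out as a short sketch. The function names and the use of plain counts are assumptions for illustration. Each count is taken to mean the number of split-reads that map within the predicted breakpoint interval in an orientation consistent with the call.

```python
def validate_call(dba_split_reads: int, b6_split_reads: int) -> str:
    """Classify one HYDRA call made in DBA from supporting split-read counts."""
    if b6_split_reads >= 1:
        # Support from the reference strain suggests a mapping artifact
        # or a reference assembly error, so the call is refuted.
        return "refuted"
    if dba_split_reads >= 1:
        return "corroborated"
    # No split-reads in either strain.
    return "inconclusive"


def in_final_set(confidence: str, status: str) -> bool:
    """Final calls: high-confidence calls that were not refuted,
    plus low-confidence calls that were confirmed."""
    if confidence == "high":
        return status != "refuted"
    return status == "corroborated"


status = validate_call(dba_split_reads=2, b6_split_reads=0)
print(status, in_final_set("low", status))
```

Note that an inconclusive high-confidence call is retained under this rule, while an inconclusive low-confidence call is dropped.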
Figure 3.
Characterization of 3316 breakpoint sequences. (A) Histogram of the alignment overlap at all 3316 assembled breakpoint sequences (breaktigs). Positive overlap indicates homology at the breakpoint, while negative overlap indicates the presence of an unaligned segment at the breakpoint, suggesting an insertion or small-scale rearrangement. (B) Histogram of the subset (2145 of 3316) of breakpoints that were determined to be transposon insertions (TEVs) based on TE annotations. Note that the majority of the breakpoints in A showing 3–10-bp and 10–20-bp overlap are explained by target-site duplications from LTR and LINE insertions, respectively. (C) Histogram of the 1171 duplication, deletion, and inversion (LSV) breakpoints. (D) For each of four different ranges (dashed lines) of observed homology at LSV breakpoints, the fraction of breakpoints that overlapped with six different repeat annotations is shown. In all four observed homology ranges, the observed overlap with segmental duplications is higher than the ∼5% null expectation. Whereas breakpoints having little or no homology (two left pie charts) typically only overlapped with SDs, breakpoints having >10 bp of homology overlapped more frequently with SDs and dispersed repeats. (Seg. dup.) Segmental duplications; (LINE) long interspersed nuclear elements; (LTR) long terminal repeats; (SINE) short interspersed nuclear elements; (DNA trans.) DNA transposons; (SSR) simple sequence repeats; (NHEJ) nonhomologous end-joining; (NAHR) non-allelic homologous recombination. (E) Detailed histograms of C reflecting simple and complex LSV breakpoints, respectively, as defined in the text. (F) The distribution of observed combinations of breakpoints (at least one breakpoint of each type at a given complex locus) at complex loci. (del) Deletion; (dup) duplication; (ins) insertion; (inv) inversion.
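The alignment overlap plotted in panel A can be illustrated with a small sketch. It assumes each breaktig yields two alignments flanking the breakpoint and that their extents are given in 1-based, inclusive breaktig coordinates. The function names and the example coordinates are hypothetical, and no homology thresholds from panel D are implied.

```python
def breakpoint_overlap(left_end: int, right_start: int) -> int:
    """Overlap, in bp, between the two alignments flanking a breakpoint.

    left_end:    last breaktig base covered by the left alignment
    right_start: first breaktig base covered by the right alignment
    """
    return left_end - right_start + 1


def interpret(overlap: int) -> str:
    """Read the sign of the overlap as described in the caption."""
    if overlap > 0:
        return "homology at breakpoint"
    if overlap < 0:
        return "unaligned segment at breakpoint"
    return "no homology"


# Both alignments cover breaktig bases 346..350: 5 bp of shared homology.
print(breakpoint_overlap(350, 346), interpret(breakpoint_overlap(350, 346)))
# Bases 351..360 are covered by neither alignment: a 10-bp unaligned segment.
print(breakpoint_overlap(350, 361), interpret(breakpoint_overlap(350, 361)))
```

The panel counts are also internally consistent: the 2145 TEV breakpoints in B and the 1171 LSV breakpoints in C sum to the 3316 breaktigs in A.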
Figure 4.
Visualizing a complex SV in a promoter region. (A) A snapshot of aligned sequence data at a validated SV locus from our local mirror of the UCSC Genome Browser (chr9: 98,880,333–98,889,602). At this locus, HYDRA detected one deletion and two inversion breakpoints in the DBA strain from the aligned discordant matepairs (red, those suggesting a deletion; blue, those suggesting an inversion), where F denotes a read mapping to the forward, or plus strand, and R the reverse strand. The dearth of uniquely aligned concordant matepairs (dark green) corroborates the deletion call. Note that a single concordant matepair is aligned within the span of the putative deletion where the two inversion breakpoints overlap, indicating that this segment is not deleted. Three WGS split-reads (gray) from the DBA strain also confirm the HYDRA calls and the observed complexity. (B) The three WGS split-reads were assembled into a 712-bp breakpoint sequence (breaktig) that was then aligned to the reference genome. The image displayed (using PARASIGHT) is representative of the 3316 such images we used to inspect assembled breakpoints. Aligned sections in black are in the same orientation in the breaktig and the reference genome, and the alignments in orange are in the opposite orientation. The complex variant involves two adjacent deletions of 2.5 kb and 0.9 kb, which are separated by an intervening ∼300-bp segment that was not deleted but, rather, inverted. An additional 15-bp deletion is present between the two rightmost alignments to the reference, but is difficult to see at this scale.
