Comparative Study
Proc Natl Acad Sci U S A. 2007 Jun 12;104(24):10110-5.
doi: 10.1073/pnas.0703834104. Epub 2007 Jun 5.

Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome

Jan O Korbel et al. Proc Natl Acad Sci U S A. 2007.

Abstract

Copy-number variants (CNVs) are an abundant form of genetic variation in humans. However, approaches for determining exact CNV breakpoint sequences (physical deletion or duplication boundaries) across individuals, crucial for associating genotype to phenotype, have been lacking so far, and the vast majority of CNVs have been reported with approximate genomic coordinates only. Here, we report an approach, called BreakPtr, for fine-mapping CNVs (available from http://breakptr.gersteinlab.org). We statistically integrate both sequence characteristics and data from high-resolution comparative genome hybridization experiments in a discrete-valued, bivariate hidden Markov model. Incorporation of nucleotide-sequence information allows us to take into account the fact that recently duplicated sequences (e.g., segmental duplications) often coincide with breakpoints. In anticipation of an upcoming increase in CNV data, we developed an iterative, "active" approach to initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring, and a flexible parameterization system that intuitively collapses from a full model of 2,503 parameters to a core one of only 10. Using our approach, we accurately mapped >400 breakpoints on chromosome 22 and a region of chromosome 11, refining the boundaries of many previously approximately mapped CNVs. Four predicted breakpoints flanked known disease-associated deletions. We validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of approximately 300 bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
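
The core idea of decoding copy-number states from paired discrete observations can be illustrated with a minimal sketch. It is not BreakPtr's architecture or parameterization: the three states, the bin definitions, and every probability below are hypothetical values chosen for illustration. Each probe contributes a pair (log2-ratio bin, sequence-signature bin), Viterbi decoding recovers the most likely state path, and breakpoints are read off where the state changes. The toy emission table factorizes into a ratio term and a sequence term purely for brevity; a genuinely bivariate table need not treat the two channels as independent.

```python
import math

STATES = ("normal", "deletion", "duplication")

# Hypothetical toy parameters (assumptions, not taken from BreakPtr).
START = {s: 1.0 / len(STATES) for s in STATES}
TRANS = {p: {s: 0.9 if p == s else 0.05 for s in STATES} for p in STATES}
# Ratio bins: 0 = low, 1 = near zero, 2 = high.
# Sequence bins: 0 = unique sequence, 1 = SD-like sequence.
P_RATIO = {
    "normal": (0.05, 0.90, 0.05),
    "deletion": (0.80, 0.15, 0.05),
    "duplication": (0.05, 0.15, 0.80),
}
P_SEQ = (0.8, 0.2)
# Joint emission table indexed by the pair (ratio_bin, seq_bin).
EMIT = {
    s: {(r, q): P_RATIO[s][r] * P_SEQ[q] for r in range(3) for q in range(2)}
    for s in STATES
}


def viterbi(observations, start, trans, emit):
    """Return the most likely state path for (ratio_bin, seq_bin) pairs."""
    # Log-space scores avoid underflow on long probe sequences.
    first = observations[0]
    scores = {s: math.log(start[s]) + math.log(emit[s][first]) for s in STATES}
    back = []
    for obs in observations[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: scores[p] + math.log(trans[p][s]))
            row[s] = scores[best] + math.log(trans[best][s]) + math.log(emit[s][obs])
            ptr[s] = best
        scores = row
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def breakpoints(path):
    """Indices of probes at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


obs = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
path = viterbi(obs, START, TRANS, EMIT)
print(path)
print(breakpoints(path))  # [3, 6]
```

In this toy run the three low-ratio probes are decoded as a deletion, so the predicted breakpoints fall at probe indices 3 and 6.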

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Association of breakpoints and SDs. Genomic locations of SDs are indicated by BlastZ (32) self-chain matches to the human reference sequence (black vertical bars). SDs coinciding with deletion/duplication breakpoints are highlighted by a red dashed line. The association of breakpoints and SDs [consistent with earlier observations (1, 2, 9, 16, 22)] indicates that nucleotide sequence signatures can facilitate breakpoint mapping.
Fig. 2.
Overview of BreakPtr and its parameter optimization procedure. (A) Data from HighRes-CGH experiments are statistically integrated with nucleotide sequence signatures. Finder fine-maps CNV breakpoints. The subsequently implemented Annotator provides information in terms of copy number ratios, and Flagger identifies putative cross-hybridization for regions for which Finder has predicted CNVs (i.e., regions colored in light gray are disregarded). (HighRes-CGH signals shown in the figure do not correspond to original data but were generated for visualization purposes.) (B) Parameter optimization. Training data and gold standards are used to estimate initial parameters. Parameters are then optimized by using an EM-based algorithm (25). Finally, CNV breakpoints are predicted, and sequenced. A new round of parameter estimation is initiated subsequently by using further knowledge from validated breakpoints.
Fig. 3.
Hidden Markov models (HMMs): architecture and parameters. (A) HMMs: arrows indicate transitions used by the dbHMM (gray and black arrows) and by the univariate HMM (black arrows only), e.g., for the core parameterization. (B) Emission distributions for the dbHMM shown as heat maps, here exemplified by a 5 × 25-bin-model (x and y axes refer to each individual heat map). (C) Scheme illustrating the incorporation of discretized signals into bins: (1) scores quantifying DNA sequence characteristics, i.e., SD-like repeats (horizontal axis; schematically depicted distributions (in gray) are drawn for visualization purposes only); (2) normalized microarray fluorescent intensity log2-ratios (vertical axis).
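
The binning scheme described in panel (C) can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the equal-width bins, the value ranges, the assignment of 5 bins to the sequence score and 25 bins to the log2-ratio, and the helper names `make_edges`, `to_bin`, and `joint_symbol` are all hypothetical. Each probe's two continuous signals are discretized separately and then combined into a single symbol that indexes one cell of a 5 × 25 emission table.

```python
from bisect import bisect_right


def make_edges(lo, hi, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def to_bin(value, edges):
    """Index of the bin holding value; out-of-range values land in the end bins."""
    return bisect_right(edges, value)


# Assumed layout: 5 sequence-score bins and 25 log2-ratio bins.
N_SEQ_BINS, N_RATIO_BINS = 5, 25
SEQ_EDGES = make_edges(0.0, 1.0, N_SEQ_BINS)       # hypothetical score range
RATIO_EDGES = make_edges(-2.0, 2.0, N_RATIO_BINS)  # hypothetical log2-ratio range


def joint_symbol(seq_score, log2_ratio):
    """Map one probe's two signals to a single discrete symbol in 0..124."""
    seq_bin = to_bin(seq_score, SEQ_EDGES)
    ratio_bin = to_bin(log2_ratio, RATIO_EDGES)
    return seq_bin * N_RATIO_BINS + ratio_bin


print(joint_symbol(0.5, 0.0))  # 62: middle sequence bin, middle ratio bin
```

Flattening the pair into one index keeps the emission table a plain array of 125 probabilities per state, which is one simple way to store a discrete bivariate distribution.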

References

    1. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al. Science. 2004;305:525–528.
    2. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Nat Genet. 2004;36:949–951.
    3. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, et al. Nat Genet. 2005;37:727–732.
    4. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. Nature. 2006;444:444–454.
    5. Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, et al. Science. 2005;307:1434–1440.
