PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data

Kai Wang¹, Mingyao Li, Dexter Hadley, Rui Liu, Joseph Glessner, Struan F A Grant, Hakon Hakonarson, Maja Bucan

PMID: 17921354
PMCID: PMC2045149
DOI: 10.1101/gr.6861907

Kai Wang et al. Genome Res. 2007 Nov;17(11):1665-74.

doi: 10.1101/gr.6861907. Epub 2007 Oct 5.

Affiliation

¹ Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.

Abstract

Comprehensive identification and cataloging of copy number variations (CNVs) is required to provide a complete view of human genetic variation. The resolution of CNV detection in previous experimental designs has been limited to tens or hundreds of kilobases. Here we present PennCNV, a hidden Markov model (HMM) based approach, for kilobase-resolution detection of CNVs from Illumina high-density SNP genotyping data. This algorithm incorporates multiple sources of information, including total signal intensity and allelic intensity ratio at each SNP marker, the distance between neighboring SNPs, the allele frequency of SNPs, and the pedigree information where available. We applied PennCNV to genotyping data generated for 112 HapMap individuals; on average, we detected approximately 27 CNVs for each individual with a median size of approximately 12 kb. Excluding common rearrangements in lymphoblastoid cell lines, the fraction of CNVs in offspring not detected in parents (CNV-NDPs) was 3.3%. Our results demonstrate the feasibility of whole-genome fine-mapping of CNVs via high-density SNP genotyping.
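To make the idea of an HMM that combines signal intensity with inter-SNP distance concrete, the following is a minimal sketch, not the PennCNV implementation. It assumes three hypothetical states, Gaussian emissions on LRR only, and a state-change probability that grows with the distance between neighboring SNPs. The names `LRR_MEAN`, `LRR_SD`, `PRIOR`, `scale_bp`, and `max_change`, and all numeric values, are illustrative assumptions rather than parameters taken from the paper; BAF, allele frequency, and pedigree information are left out.

```python
import math

STATES = ["deletion", "normal", "duplication"]
# Hypothetical emission parameters (mean and SD of LRR per state).
LRR_MEAN = {"deletion": -0.6, "normal": 0.0, "duplication": 0.4}
LRR_SD = {"deletion": 0.25, "normal": 0.2, "duplication": 0.2}
# Hypothetical prior: most of the genome is assumed to be copy-normal.
PRIOR = {"deletion": 0.01, "normal": 0.98, "duplication": 0.01}


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_transition(prev, cur, distance_bp, scale_bp=100_000.0, max_change=0.01):
    """Log transition probability; leaving a state is likelier across larger gaps."""
    p_leave = max_change * (1.0 - math.exp(-distance_bp / scale_bp))
    p_leave = max(p_leave, 1e-12)  # avoid log(0) when two markers share a position
    if prev == cur:
        return math.log(1.0 - p_leave)
    return math.log(p_leave / (len(STATES) - 1))


def viterbi(positions, lrr):
    """Return the most likely state for each SNP given positions (bp) and LRR."""
    score = [{s: math.log(PRIOR[s]) + log_gauss(lrr[0], LRR_MEAN[s], LRR_SD[s])
              for s in STATES}]
    back = []
    for i in range(1, len(lrr)):
        gap = positions[i] - positions[i - 1]
        row, ptr = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda p: score[-1][p] + log_transition(p, cur, gap))
            row[cur] = (score[-1][best] + log_transition(best, cur, gap)
                        + log_gauss(lrr[i], LRR_MEAN[cur], LRR_SD[cur]))
            ptr[cur] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    pos = [i * 5_000 for i in range(15)]
    values = [0.05, -0.1, 0.0, 0.1, -0.05,
              -0.7, -0.6, -0.8, -0.65, -0.7,
              0.0, 0.05, -0.05, 0.1, 0.0]
    for p, v, s in zip(pos, values, viterbi(pos, values)):
        print(p, v, s)
```

Working in log space keeps the products of many small probabilities numerically stable. A fuller model in the spirit of the abstract would multiply in a BAF emission term that depends on each SNP's population allele frequency.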

Figures

**Figure 1.**
An illustration of log R Ratio (LRR) and B Allele Freq (BAF) values for the chromosome 15 q-arm of an individual. A normal chromosome region has three BAF genotype clusters, shown in boxes as the AA, AB, and BB genotypes, with LRR values centered around zero. The copy-neutral LOH region has normal LRR values, but without the AB genotype cluster. The increased copy number for a CNV region can be detected based on an increased number of peaks in the BAF distribution, as well as increased LRR values. The patterns of LRR and BAF for different CNV regions, normal regions, and copy-neutral LOH regions are distinct from each other, thus the combination of LRR and BAF can be used to generate CNV calls.
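The patterns described in this caption can be turned into a rough rule of thumb, sketched below. This is only an illustration of how LRR and BAF complement each other, not how PennCNV makes calls. The function name `classify_region` and the thresholds `lrr_cut`, `het_band`, and `het_min_frac` are hypothetical. The sketch uses mean LRR to decide the direction of a copy number change and the presence of an AB cluster near BAF 0.5 to separate normal regions from copy-neutral LOH. It does not check for the extra BAF peaks expected in a copy gain.

```python
from statistics import mean


def classify_region(lrr, baf, lrr_cut=0.15, het_band=(0.4, 0.6), het_min_frac=0.1):
    """Label a region from per-SNP LRR and BAF values using crude thresholds."""
    avg_lrr = mean(lrr)
    # Fraction of SNPs whose BAF falls in the heterozygous (AB) band.
    het_frac = sum(het_band[0] <= b <= het_band[1] for b in baf) / len(baf)
    has_ab_cluster = het_frac >= het_min_frac

    if avg_lrr > lrr_cut:
        return "copy gain"
    if avg_lrr < -lrr_cut:
        return "copy loss"
    # LRR looks normal, so BAF decides between normal and copy-neutral LOH.
    return "normal" if has_ab_cluster else "copy-neutral LOH"


if __name__ == "__main__":
    print(classify_region([0.02, -0.03, 0.01, 0.0], [0.0, 0.5, 1.0, 0.48]))
    print(classify_region([0.01, -0.02, 0.03, 0.0], [0.0, 1.0, 0.02, 0.98]))
```

Fixed thresholds like these are fragile on noisy data, which is one motivation for a probabilistic model that weighs both signals at every SNP.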

**Figure 2.**
A flowchart outlining the procedure for CNV calling from genotyping data. The first step for LRR and BAF calculation can be alternatively performed by the BeadStudio software, given a clustering file containing canonical genotype cluster positions. The HMM integrates several sources of information to give CNV calls. When genotype data are available for family members, the pedigree information can be incorporated to model CNV events more accurately.
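As a sketch of the first step in this flowchart, the code below derives LRR and BAF for one SNP from normalized allele intensities and canonical cluster positions. It assumes the commonly described Illumina-style definitions: a polar transform of the X and Y intensities, BAF by linear interpolation of theta between the AA, AB, and BB cluster centers, and LRR as log2 of observed over expected R. The function names and the `clusters` layout are hypothetical, and the cluster values in the example are made up for illustration.

```python
import math


def polar(x, y):
    """Convert normalized A (x) and B (y) intensities to (theta, R)."""
    theta = (2.0 / math.pi) * math.atan2(y, x)  # 0 = pure A signal, 1 = pure B signal
    return theta, x + y


def _bracket(theta, clusters):
    """Return the two neighboring cluster centers around theta, or None if outside."""
    aa, ab, bb = clusters
    if theta <= aa[0] or theta >= bb[0]:
        return None
    return (aa, ab) if theta < ab[0] else (ab, bb)


def b_allele_freq(theta, clusters):
    """Interpolate BAF between cluster centers; clusters = [(theta, R)] for AA, AB, BB."""
    aa, ab, bb = clusters
    if theta <= aa[0]:
        return 0.0
    if theta >= bb[0]:
        return 1.0
    lo, hi = _bracket(theta, clusters)
    base = 0.0 if lo is aa else 0.5
    return base + 0.5 * (theta - lo[0]) / (hi[0] - lo[0])


def expected_r(theta, clusters):
    """Interpolate the expected total intensity R at this theta."""
    aa, _, bb = clusters
    if theta <= aa[0]:
        return aa[1]
    if theta >= bb[0]:
        return bb[1]
    lo, hi = _bracket(theta, clusters)
    weight = (theta - lo[0]) / (hi[0] - lo[0])
    return lo[1] + weight * (hi[1] - lo[1])


def log_r_ratio(r_observed, r_expected):
    """LRR: log2 of observed over expected total intensity."""
    return math.log2(r_observed / r_expected)


if __name__ == "__main__":
    clusters = [(0.1, 1.0), (0.5, 1.2), (0.9, 1.0)]  # made-up canonical positions
    theta, r = polar(0.6, 0.6)
    print(b_allele_freq(theta, clusters), log_r_ratio(r, expected_r(theta, clusters)))
```

Under these definitions a sample sitting exactly on a canonical cluster center has an LRR of zero and a BAF of 0, 0.5, or 1, which matches the normal-region pattern described for Figure 1.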

**Figure 3.**
(A) A predicted ∼700-bp CNV within an intronic region of the *FBXL7* gene; (B) a predicted ∼1-kb CNV within an intronic region of the *EYA1* gene; and (C) a predicted ∼4-kb CNV within an intronic region of the *CTDSPL* gene are inherited from parent to offspring. The scatterplots for log R Ratio and B Allele Frequency are shown for the father, mother, and offspring; (red dots) the SNPs within the CNVs. The presence of CNVs and their copy numbers are validated by PCR amplification of the region encompassing breakpoints for *FBXL7* and *EYA1*, or by PCR primer walking for *CTDSPL* (see Fig. 4 for more detail on primer locations).

**Figure 4.**
UCSC Genome Browser (Kuhn et al. 2007) shots of the CNVs within the *FBXL7* (A), *EYA1* (B), and *CTDSPL* (C) genes, as well as the location of SNPs and PCR primers. The predicted CNV regions with (gray solid boxes) deletion of one copy or (black solid boxes) deletion of two copies on the “CNV calls” track; the actual CNV breakpoints identified by resequencing are shown in the “BLAT Search” track. For the CNV within *FBXL7*, a pair of PCR primers (P1 and P2) is able to generate two PCR products, thus resequencing of shorter PCR products identifies the CNV breakpoint. For the CNV within *EYA1*, the primer pair P1–P2, but not P1–P3, generates two PCR products, indicating that the breakpoint is between P2 and P3; thus resequencing by P2 identifies the exact breakpoint. For the CNV within *CTDSPL*, the primer pairs P1–P2, P1–P3, and P1–P4 all generate two PCR products, indicating that the breakpoint is between P1 and P4; thus resequencing of the shortest PCR product in Figure 3C by P1 and P4 from both ends identifies the breakpoint. These examples illustrate that the combined PCR-resequencing approach can pinpoint the exact location of predicted CNVs in the human genome.

References

1. Aardema M.J., Crosby L.L., Gibson D.P., Kerckaert G.A., LeBoeuf R.A. Aneuploidy and consistent structural chromosome changes associated with transformation of Syrian hamster embryo cells. Cancer Genet. Cytogenet. 1997;96:140–150.
2. Bailey J.A., Yavor A.M., Massa H.F., Trask B.J., Eichler E.E. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017.
3. Baum L.E., Petrie T., Soules G., Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 1970;41:164–171.
4. Carter N. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet. 2007;39:S16–S21.
5. Colella S., Yau C., Taylor J.M., Mirza G., Butler H., Clouston P., Bassett A.S., Seller A., Holmes C.C., Ragoussis J. QuantiSNP: An objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–2025.
