Human copy number polymorphic genes

J A Bailey¹, J M Kidd, E E Eichler

PMID: 19287160
PMCID: PMC2920189
DOI: 10.1159/000184713

J A Bailey et al. Cytogenet Genome Res. 2008;123(1-4):234-43. doi: 10.1159/000184713. Epub 2009 Mar 11.

Affiliation

¹ Department of Pathology, Case Western University School of Medicine and University Hospitals of Cleveland, Cleveland, OH, USA. jab@case.edu

Abstract

Recent large-scale genomic studies within human populations have identified numerous genomic regions as copy number variant (CNV). As these CNV regions often overlap coding regions of the genome, large lists of potentially copy number polymorphic genes have been produced that are candidates for disease association. Most of the current data regarding normal genic variation, however, has been generated using BAC or SNP microarrays, which lack precision especially with respect to exons. To address this, we assessed 2,790 candidate CNV genes defined from available studies in nine well-characterized HapMap individuals by designing a customized oligonucleotide microarray targeted specifically to exons. Using exon array comparative genomic hybridization (aCGH), we detected 255 (9%) of the candidates as true CNVs including 134 with evidence of variation over the entire gene. Individuals differed in copy number from the control by an average of 100 gene loci. Both partial- and whole-gene CNVs were strongly associated with segmental duplications (55 and 71%, respectively) as well as regions of positive selection. We confirmed 37% of the whole-gene CNVs using the fosmid end sequence pair (ESP) structural variation map for these same individuals. If we modify the end sequence pair mapping strategy to include low-sequence identity ESPs (98-99.5%) and ESPs with an everted orientation, we can capture 82% of the missed genes leading to more complete ascertainment of structural variation within duplicated genes. Our results indicate that segmental duplications are the source of the majority of full-length copy number polymorphic genes, most of the variant genes are organized as tandem duplications, and a significant fraction of these genes will represent paralogs with levels of sequence diversity beyond thresholds of allelic variation. 
In addition, these data provide a targeted set of CNV genes enriched for regions likely to be associated with human phenotypic differences due to copy number changes and present a source of copy number responsive oligonucleotide probes for future association studies.

Figures

**Fig. 1.**
Exon-targeted oligonucleotide array CGH design. From our identified list of candidate CNV genes and controls, we targeted an equal number of probes to each exon by including nearly equivalent amounts of sequence for probe design. For each exon, we identified two regions for probe design: 200 bp centered at the beginning and 200 bp centered at the end of the exon. For small exons (<200 bp) this amounted to 100 bp flanking either side plus the length of the exon, since these regions overlapped. For medium-size exons (200–999 bp) this amounted to 400 bp with equivalent amounts of flanking and exonic sequence. For large exons (≥1 kb), we added an additional 200 bp directly in the center of the exon to provide a measure of continuity in these larger regions. This scheme essentially increased the weight of small exons through the inclusion of flanking sequence and decreased the weight of large exons by sampling only a limited portion. The inclusion of flanking non-transcribed sequence also limited the detection of processed pseudogenes. Overall, each exon of the candidate genes was represented by 203–600 bases of sequence. These probe design regions were merged into a non-overlapping set of sequences from which NimbleGen algorithms chose appropriate oligonucleotide sequences for array synthesis.
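
The window scheme described in this caption can be illustrated with a short sketch. This is not the authors' code: the function names (`design_regions`, `merge_intervals`, `total_bases`), the half-open coordinate convention, and the choice to centre the extra window on the integer midpoint of the exon are all assumptions made for illustration. Chromosome boundaries are ignored.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention

HALF_WINDOW = 100  # each design window is 200 bp, i.e. 100 bp either side of its centre


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a non-overlapping set."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def design_regions(exon_start: int, exon_end: int) -> List[Interval]:
    """Return probe design regions for one exon, following the caption's scheme."""
    length = exon_end - exon_start
    windows = [
        (exon_start - HALF_WINDOW, exon_start + HALF_WINDOW),  # centred on exon start
        (exon_end - HALF_WINDOW, exon_end + HALF_WINDOW),      # centred on exon end
    ]
    if length >= 1000:
        # Large exons get one extra 200 bp window in the middle (assumed midpoint).
        mid = exon_start + length // 2
        windows.append((mid - HALF_WINDOW, mid + HALF_WINDOW))
    return merge_intervals(windows)


def total_bases(regions: List[Interval]) -> int:
    """Total sequence length covered by a set of non-overlapping regions."""
    return sum(end - start for start, end in regions)
```

Under these assumptions a 150 bp exon yields one merged 350 bp region (the exon plus 100 bp on each side), a 500 bp exon yields two windows totalling 400 bp, and a 2 kb exon yields three windows totalling 600 bp. A 3 bp exon gives 203 bp, which matches the lower bound quoted in the caption.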

**Fig. 2.**
Examples of detected CNV transcripts. The observed relative signal intensities and results of the chaining algorithm are depicted for (a) the complete deletion of the RhD blood group antigen gene *(RHD)* and (b) the partial-gene CNV of the lipoprotein Lp(a) precursor *(LPA)*. Each panel depicts the gene (blue), the regions used for probe selection, and the relative signal intensities of the probes for each individual assayed. Individual probe signals with absolute relative deviations >1.0 SD are colored green for gain and red for loss. For *RHD*, an expanded area shows the probes for exon 6 in detail. The results of our detection algorithm are depicted by a red or green line indicating a region of loss or gain. In the case of *RHD*, these represent detection of gains and losses of the entire transcript. For *LPA*, the detected regions demonstrate partial transcript loss relative to the control. The region identified in *LPA* represents a series of variously sized tandem deletions and duplications based on a 2-exon module containing Kringle domains. Vertical scales represent the natural log of the normalized relative hybridization intensities.
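
The per-probe coloring rule in this caption (natural log of relative intensity, flagged beyond 1.0 SD) might be sketched as follows. The caption does not say which mean or standard deviation the deviations are measured against, so this sketch assumes the mean and population SD of the log ratios across the supplied probes. The function name `classify_probes` and its inputs are hypothetical, and the chaining algorithm that joins flagged probes into regions is not reproduced because the page does not describe it.

```python
import math
import statistics
from typing import List


def classify_probes(test: List[float], reference: List[float],
                    threshold_sd: float = 1.0) -> List[str]:
    """Label each probe 'gain', 'loss' or 'neutral' from test/reference intensities.

    Assumption: deviations are measured from the mean log ratio, in units of
    the population SD of the log ratios across the supplied probes.
    """
    log_ratios = [math.log(t / r) for t, r in zip(test, reference)]
    center = statistics.mean(log_ratios)
    sd = statistics.pstdev(log_ratios)

    labels = []
    for value in log_ratios:
        if sd == 0:
            labels.append("neutral")  # no spread, so nothing can deviate
            continue
        z = (value - center) / sd
        if z > threshold_sd:
            labels.append("gain")     # drawn green in the figure
        elif z < -threshold_sd:
            labels.append("loss")     # drawn red in the figure
        else:
            labels.append("neutral")
    return labels
```

With five probes where only the last has a fourfold higher test signal, that probe sits 2 SD above the mean and is labeled a gain, while the other four sit 0.5 SD below the mean and stay neutral. A real analysis would presumably estimate the spread from array-wide or control probes rather than from a handful of probes.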

**Fig. 3.**
Whole-gene CNVs compared to fosmid ESP analysis. (a) Validation rates categorized by the percent identity of the most similar duplicon within each whole-gene CNV region. There is a significant decrease in the validated fraction for regions containing duplicons with >99% identity. (b) Venn diagram showing the association of the 83 whole-gene CNVs with best-placed fosmid ESPs of low similarity (98–99.5%), suggesting more divergent unrepresented CNV paralogs, and/or with best-placed everted ESPs (>99.5%), suggesting highly similar tandem duplications. Interestingly, the regions with everted ESPs and those with low-similarity ESPs overlap, suggesting a more complex nature for these CNV genes. The inset depicts the basis for the formation of everted fosmid ESPs, where a clone that traverses the boundary of a tandem duplication can only map to the single copy contained within the reference genome (Cooper et al., 2008).
