PLoS One. 2011 Mar 29;6(3):e17539.
doi: 10.1371/journal.pone.0017539.

A computational framework discovers new copy number variants with functional importance

Samprit Banerjee et al. PLoS One. 2011.

Abstract

Structural variants that change copy number constitute an important component of genomic variability. They account for 0.7% of genomic differences between two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially the small ones (<500 bp).

We propose a three-step computational framework (Identification of germline Changes in Copy Number, or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of coverage markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals determined by RNA-sequencing as further validation and to assess the functional role of the new variants. We investigated the possible enrichment for the variants' regulatory effect and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08).

Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic of the approach used for the Identification of germline Changes in Copy Numbers (IgC2N).
IgC2N is a multistep approach that includes the identification of potential CNV loci along the genome (A–F), a bias correction step (H) and CN genotyping (I), leveraging the experimental data from many samples. (A–I) The log2 intensity ratio signal or the segmented signal (A) is dichotomized on a marker and sample basis (B). A genome-wide score vector S is obtained by summing the transformed signal across all samples on a marker basis (C). The null distribution of the score is obtained by permutations in order to identify the level of significant deviation of the score S, S_sig, from the baseline signal (D). The S_sig value corresponding to a pre-specified FDR threshold is applied to the data vector S (E). The intermediate output is a collection of putative polymorphic loci across the genome; no restriction on size or coverage is applied (F). A Gaussian Mixture Model (GMM) is applied to predict the CN classes (genotypes not assigned) (G). The distance between the medians of consecutive CN classes (1 CN class difference) is compared to the 1 CN class difference of all CNVs, and relative classes are inferred (H). Along with the 1 CN class differences, the presence of a “0” class and the expected direction of bias are also considered to infer the genotypes of these CN classes, and the reference model bias is estimated (I).
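Steps (B) through (F) of the schematic can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dichotomization cutoff of 0.3, the within-sample shuffling used for permutations, and the use of a plain empirical quantile in place of the FDR-based choice of S_sig are all assumptions made for illustration, and every function name is hypothetical.

```python
import random


def dichotomize(log_ratios, cutoff=0.3):
    """Step B (assumed rule): 1 if |log2 ratio| exceeds the cutoff, else 0.

    log_ratios is a list of samples, each a list of per-marker values.
    """
    return [[1 if abs(v) > cutoff else 0 for v in sample] for sample in log_ratios]


def score_vector(binary):
    """Step C: sum the dichotomized signal across samples, marker by marker."""
    return [sum(column) for column in zip(*binary)]


def permutation_threshold(binary, n_perm=200, quantile=0.99, seed=0):
    """Step D (assumed scheme): shuffle marker order within each sample,
    pool the permuted scores, and return their empirical quantile.
    The quantile stands in for the FDR-based S_sig described in the caption.
    """
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in binary]
        null_scores.extend(score_vector(shuffled))
    null_scores.sort()
    index = min(len(null_scores) - 1, int(quantile * len(null_scores)))
    return null_scores[index]


def candidate_loci(scores, s_sig):
    """Steps E-F: return inclusive (start, end) marker indices of every run
    of consecutive markers whose score is strictly above s_sig.
    No minimum size or coverage is imposed.
    """
    loci, start = [], None
    for i, s in enumerate(scores):
        if s > s_sig and start is None:
            start = i
        elif s <= s_sig and start is not None:
            loci.append((start, i - 1))
            start = None
    if start is not None:
        loci.append((start, len(scores) - 1))
    return loci
```

A typical call chain would be `binary = dichotomize(data)`, then `candidate_loci(score_vector(binary), permutation_threshold(binary))`. Pooling evidence across samples before thresholding is what lets a locus covered by very few markers still stand out, provided enough samples carry it.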
Figure 2. Results of Power Simulation Study as function of Size and Coverage.
In silico power computations for IgC2N. The panel of 8 plots is organized in rows by sample size of the datasets used for simulations and in columns by the number of markers covering a CNV and size (in kb) of a CNV. Each plot shows the average power to detect CNVs with three different frequencies, i.e. 1%, 5% and 15% for the dotted, dashed and solid lines respectively.
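The kind of power computation summarized in this figure can be sketched as a small Monte Carlo simulation. The model below is an assumption, not the simulation design of the paper: it considers losses only, draws Gaussian marker noise, flags a sample when its mean log2 ratio over the covering markers falls below a cutoff, and declares the locus detected when the number of flagged samples reaches a binomial critical value. All parameter values and function names are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def critical_count(n_samples, p_null, alpha):
    """Smallest k such that P(Binomial(n_samples, p_null) >= k) <= alpha."""
    tail = 1.0  # P(X >= 0)
    for k in range(n_samples + 1):
        if tail <= alpha:
            return k
        tail -= math.comb(n_samples, k) * p_null ** k * (1 - p_null) ** (n_samples - k)
    return n_samples + 1


def estimate_power(n_samples, n_markers, carrier_freq, shift=-0.5, sigma=0.25,
                   cutoff=0.25, alpha=0.01, n_rep=200, seed=0):
    """Fraction of simulated datasets in which a loss is detected.

    shift, sigma, cutoff and alpha are illustrative values only.
    """
    rng = random.Random(seed)
    # Probability that a non-carrier is flagged by noise alone.
    p_null = normal_cdf(-cutoff / (sigma / math.sqrt(n_markers)))
    k_crit = critical_count(n_samples, p_null, alpha)
    detected = 0
    for _rep in range(n_rep):
        flagged = 0
        for _sample in range(n_samples):
            mean = shift if rng.random() < carrier_freq else 0.0
            avg = sum(rng.gauss(mean, sigma) for _ in range(n_markers)) / n_markers
            if avg < -cutoff:
                flagged += 1
        if flagged >= k_crit:
            detected += 1
    return detected / n_rep
```

Calling `estimate_power` over a grid of sample sizes, marker counts and carrier frequencies (for example 0.01, 0.05 and 0.15) would give curves analogous in layout to the panels described above, though not their actual values.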
Figure 3. Validation study summary.
(A) Barplots of the validation rate (percentage) categorized by marker coverage, variant size and minor allele frequency. (B) The frequency distribution of CN genotypes of validated, not validated and all CNVs. (C) An example of a new variant validated by NimbleGen data: line plots of the smoothed intensity signal from the 42M NimbleGen platform for each of 40 HapMap samples, showing polymorphism at the locus IgH3.965 on Chromosome 6. The inset scatter plot shows the discovery signal (x-axis) against the validation signal (y-axis), color coded by the IgC2N CN call on the discovery samples.
Figure 4. Functional impact of CNVs on human transcriptome.
(A) Proportion of functional variants with respect to variant size and type of polymorphism. Percentages are evaluated with respect to each subclass. (B) Significance of associations with respect to gene-variant distance. The cis analysis included 2 Mb windows. Minus log 10 of the q-values is plotted against the distance between the midpoints of genes and variants. Up and down arrows depict the direction of the association. Red symbols identify data points corresponding to the new CNVs. (C) List of top-ranked associations involving new variants residing within protein coding regions. (D) Examples of new variants showing a significant effect on gene transcripts. mRNA levels are plotted against the copy number states of new variants identified by IgC2N (box plots) and against the copy number intensity ratios (scatter plots). P-values from the regression analysis against copy number states are reported.
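The quantities plotted in panels (B) and (D) can be illustrated with a short Python sketch: a least-squares slope of mRNA level on copy number state, whose sign gives the direction of the association, and a conversion of p-values to q-values. The caption does not say how the q-values were computed, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names are hypothetical.

```python
import math


def ols_slope(copy_numbers, expression):
    """Least-squares slope of expression on copy number state.

    A positive slope corresponds to an upward association, a negative
    slope to a downward one.
    """
    n = len(copy_numbers)
    mean_x = sum(copy_numbers) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in copy_numbers)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(copy_numbers, expression))
    return sxy / sxx


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed q-value method),
    returned in the same order as the input.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def minus_log10(values):
    """Transform q-values for plotting, as on the y-axis of panel (B)."""
    return [-math.log10(v) for v in values]
```

In a cis analysis, one p-value would be produced per gene-variant pair within the chosen window, and `bh_qvalues` would be applied to the full set of pairs before plotting.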
