Curr Protoc Hum Genet. 2014 Apr 24:81:7.23.1-7.23.21.
doi: 10.1002/0471142905.hg0723s81.

Using XHMM Software to Detect Copy Number Variation in Whole-Exome Sequencing Data

Menachem Fromer et al. Curr Protoc Hum Genet.

Abstract

Copy number variation (CNV) has emerged as an important genetic component in human diseases, which are increasingly being studied for large numbers of samples by sequencing the coding regions of the genome, i.e., exome sequencing. Nonetheless, detecting this variation from such targeted sequencing data is a difficult task, involving sorting out signal from noise, for which we have recently developed a set of statistical and computational tools called XHMM. In this unit, we give detailed instructions on how to run XHMM and how to use the resulting CNV calls in biological analyses.

Keywords: Hidden Markov Model (HMM); copy number variation (CNV); data normalization; next-generation sequencing (NGS); principal component analysis (PCA).

Figures

Figure 1. How genomic copy number affects depth of sequencing
Depicted are a reference individual with two copies of a gene, an individual with only a single copy (deletion of gene on one chromosome), and an individual with an extra copy of a gene in their genome (duplication). In an idealized setting where each individual's genome is targeted at an average of 10× coverage, 5 of those reads will come from the maternal chromosome and 5 from the paternal. If one of those chromosomes is missing that gene (deletion), then only 5 reads for that gene will be observed; on the other hand, if an individual has a duplication of that gene, then a total of 15 reads will be found. In reality, noise and biases in the data make it difficult to easily and directly read off genomic copy number from such coverage information. The purpose of XHMM is to automate a procedure for performing such inference in a robust manner.
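The idealized arithmetic in this caption can be written out as a minimal sketch. It assumes that read count scales linearly with copy number and that there is no noise or bias, which the caption itself notes does not hold for real data. The function name `expected_reads` is hypothetical and is not part of XHMM.

```python
def expected_reads(mean_coverage, copy_number, reference_copies=2):
    """Idealized read count for a gene, assuming reads scale linearly with copy number."""
    # Each chromosomal copy contributes an equal share of the reference coverage.
    reads_per_copy = mean_coverage / reference_copies
    return reads_per_copy * copy_number


# The three individuals from Figure 1, each targeted at an average of 10x coverage.
for label, copies in [("deletion", 1), ("reference", 2), ("duplication", 3)]:
    print(label, expected_reads(10, copies))
```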
Figure 2. Flowchart of calling CNV from exome sequence data using XHMM
Each step in the CNV discovery and genotyping XHMM pipeline is listed, with corresponding step numbers from Basic Protocol 1 listed in parentheses. Key steps are depicted graphically on the right.
Figure 3. Normalization by removal of top principal components
In this “toy” example (of 4 samples targeted at 10 exons), the 4×10 read depth matrix is first decomposed into the principal components (the key axes in which the read depth varies), the variance of the data in each such component, and the sample loadings (the coefficients of each sample in each component). Then, in this example, it is estimated (not shown) that the two largest principal components correspond to non-CNV read depth effects, based on their large relative contribution to the variance of the read depth data. These components are thus removed by zeroing them out, and the reconstructed read depth matrix will be used for CNV calling.
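The decompose, zero out, and reconstruct sequence described in this caption can be sketched with a singular value decomposition in NumPy. This is an illustration under stated assumptions, not XHMM's own code: the function name `remove_top_components` is hypothetical, each target is assumed to be mean-centered first, and the number of components to remove is passed in directly because the caption does not show how it is estimated.

```python
import numpy as np


def remove_top_components(depth, num_remove):
    """Zero out the largest principal components of a samples-by-targets depth matrix."""
    depth = np.asarray(depth, dtype=float)
    # Assumption: center each target (column) across samples before decomposing.
    centered = depth - depth.mean(axis=0)
    # Columns of u: sample loadings; s: singular values (largest first, squares
    # proportional to variance); rows of vt: principal components across targets.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:num_remove] = 0.0  # drop the components assumed to be non-CNV effects
    # Reconstruct the read depth matrix from the remaining components.
    return (u * s_kept) @ vt


# Toy input with the same shape as Figure 3: 4 samples by 10 targets (made-up values).
rng = np.random.default_rng(0)
toy_depth = rng.normal(loc=100.0, scale=10.0, size=(4, 10))
normalized = remove_top_components(toy_depth, num_remove=2)
print(normalized.shape)
```

The reconstructed matrix has the same shape as the input, so downstream steps can treat it as a drop-in replacement for the raw depths.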
Figure 4. Sample coverage plot
This shows the sample-wide distribution of exome-wide sequencing coverage, where each per-sample coverage value is the mean of the coverage values calculated for each exome target (which itself is the mean coverage at all of its bases in that particular sample). In this experiment, we sequenced each sample to a mean coverage of 150×, so that we expect a typical sample to indeed have 150 reads covering an average base in an average exome target.
Figure 5. Exome target coverage plot
Analogous to Figure 4, this plot gives the target-wide distribution of coverage (over all samples). That is, each per-target coverage value is the mean of the per-sample coverage values at that target (where again, each of these is the mean coverage at all of the target's bases in that sample). As above, since our goal was to have 150× coverage exome-wide, we'd expect each target to have around 150× coverage, but we see here that there is high variability in target coverage. For example, some targets have as much as 400× coverage (averaged over all samples), and we also see a non-trivial number of targets that have 0 coverage for all samples (e.g., targets where capture has presumably failed).
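The two summaries plotted in Figures 4 and 5 are row and column means of the same samples-by-targets matrix. The sketch below assumes a hypothetical matrix `depth[sample][target]` that already holds the mean coverage over each target's bases; the function names and the example values are made up for illustration.

```python
from statistics import mean


def per_sample_coverage(depth):
    """Mean coverage of each sample over all targets (the quantity in Figure 4)."""
    return [mean(row) for row in depth]


def per_target_coverage(depth):
    """Mean coverage of each target over all samples (the quantity in Figure 5)."""
    return [mean(column) for column in zip(*depth)]


def zero_coverage_targets(depth):
    """Indices of targets with 0 coverage in every sample."""
    return [j for j, column in enumerate(zip(*depth)) if all(v == 0 for v in column)]


# Made-up matrix: 2 samples by 3 targets; the last target has no coverage at all.
depth = [
    [150.0, 400.0, 0.0],
    [100.0, 300.0, 0.0],
]
print(per_sample_coverage(depth))
print(per_target_coverage(depth))
print(zero_coverage_targets(depth))
```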
Figure 6. Principal component analysis (PCA) normalization
This plot compares each of the principal components (PC) to known sample and target features (sample features can be added in step 2E of Basic Protocol 2). The dotted line (at PC = 15) indicates that XHMM automatically removed the first 15 components based on their significant relative variance. The known sample and target features considered in this plot were not used by XHMM in its decision to remove those components. We see that the first 15 PCs tend to correlate with various target features (colored circles), such as GC content and the mean depth of sequencing coverage at that target, and also with various sample features (colored diamonds), such as gender and mean depth of sequencing for that sample. On the other hand, there is a marked change in the quality of the PCs after the first 15 or so, with a sudden drop-off in the levels of correlation with the genome-wide and batch effects expected to strongly bias the read depth of coverage.
Figure 7. “Scree” plot for the PCA
This plot shows the standard deviation of the depth data independently ascribed to each of the principal components. This case is typical, where we see that the cut-off automatically detected by XHMM corresponds to a significant drop in the variance (an “elbow” in the curve). Note the log scale of the y axis.
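One simple way to locate such a cut-off from per-component variances is to count the leading components whose variance exceeds some multiple of the mean variance. The sketch below is only an illustration of that idea: the caption does not state the exact rule XHMM applies, so the rule itself, the function name `count_high_variance_components`, the `factor` parameter, and the example variances are all assumptions.

```python
from statistics import mean


def count_high_variance_components(variances, factor):
    """Count leading components whose variance exceeds factor * mean variance.

    Assumes `variances` is sorted from largest to smallest, as in a scree plot.
    """
    threshold = factor * mean(variances)
    count = 0
    for variance in variances:
        if variance <= threshold:
            break  # first component at or below the threshold marks the "elbow"
        count += 1
    return count


# Made-up variances with a sharp drop after the second component.
variances = [100.0, 50.0, 2.0, 1.0, 1.0]
print(count_high_variance_components(variances, factor=1.0))
```

A smaller `factor` removes more components, so the value chosen trades off removing artifacts against removing genuine CNV signal.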
Figure 8. Read depth projected in a principal component
This principal component (the 3rd one, in this instance) has found the variance in read depth due to gender differences, with males having lower coverage on the X chromosome and higher coverage on the Y. Therefore, the loadings for this component have a correlation of 0.99 with the gender of the samples. Note that the R script creates the ‘PC’ sub-directory, which contains plots of the read depth data projected into each of the principal components: PC/PC.*.png.
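The correlation between a component's sample loadings and a sample feature, such as the 0.99 reported here, is an ordinary Pearson correlation once the feature is coded numerically. The following sketch assumes gender is coded as 0 and 1; the function name `pearson` and all values are made up for illustration and are not taken from the XHMM scripts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Made-up loadings for six samples and a made-up 0/1 gender coding.
loadings = [0.41, 0.39, 0.40, -0.38, -0.42, -0.40]
gender = [1, 1, 1, 0, 0, 0]
print(round(pearson(loadings, gender), 3))
```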
Figure 9. Distribution of post-normalization target variance
Before XHMM calculates z-scores and the HMM is run to call CNV for each sample (CNV “discovery”), we perform a final filtering step. Specifically, we remove any targets that have “very scattered” read depth distributions post-normalization. These can be thought of as targets for which the normalization may have failed, and it is better to remove such strong effects (still likely to be artifacts) to prevent them from drowning out other, more subtle signals. In detail, we remove any targets with large standard deviations of their post-normalization read depths across all samples. As a prototypical example, we see here that the small fraction of targets with standard deviations larger than roughly 30 to 50 were removed (in this case, for a scenario of ~100× mean sequencing coverage).
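The filtering step described in this caption amounts to dropping columns whose standard deviation across samples exceeds a threshold. The sketch below makes several assumptions: the function name `filter_scattered_targets` and the `max_sd` parameter are hypothetical, the sample standard deviation is used (the caption does not say which estimator applies), and the example threshold of 30 is simply taken from the low end of the range mentioned above.

```python
from statistics import stdev


def filter_scattered_targets(normalized, max_sd):
    """Drop targets whose post-normalization depth varies too much across samples.

    `normalized` is a samples-by-targets list of lists. Returns the indices of
    the retained targets together with the filtered matrix.
    """
    columns = list(zip(*normalized))
    # Assumption: sample standard deviation across all samples at each target.
    kept = [j for j, column in enumerate(columns) if stdev(column) <= max_sd]
    filtered = [[row[j] for j in kept] for row in normalized]
    return kept, filtered


# Made-up normalized depths: 3 samples by 2 targets; the second is very scattered.
normalized = [
    [1.0, -60.0],
    [-1.0, 60.0],
    [0.0, 0.0],
]
kept, filtered = filter_scattered_targets(normalized, max_sd=30.0)
print(kept)
print(filtered)
```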
Figure 10. XHMM copy number variation region plot
This plot shows each sample's original and normalized read depths at each of the targets in focus, which are connected by gray lines. Samples with a called deletion are colored red, and samples with a called duplication are colored green. Gene names are added below to annotate the genomic region, and black dots and bars mark the location of the exome targets. In this example from the XHMM paper (Fromer et al. 2012), also shown are the overlaps between the XHMM call (marked in red), the Affymetrix chip-based call and custom validated region (Kirov et al. 2012), and the exome-targeted region of DLGAP1 (delineated by square brackets). By following Basic Protocol 2, a regional plot is produced for each CNV called in each individual (as found in the .xcnv file): plot_CNV/sample_*.png. Alternatively, a PDF with all stages of read depth adjustment (from unnormalized [top panel here] to final normalized values used for CNV calling [bottom panel here]) can be generated by passing the PLOT_ONLY_PNG=FALSE argument to the XHMM_plots() function in the example_make_XHMM_plots.R script file.

References

    1. Cooper Gregory M., Coe Bradley P., Girirajan Santhosh, Rosenfeld Jill A., Vu Tiffany H., Baker Carl, Williams Charles, et al. A Copy Number Variation Morbidity Map of Developmental Delay. Nature Genetics. 2011 Sep;43(9):838–846. doi:10.1038/ng.909.
    2. DePristo Mark A., Banks Eric, Poplin Ryan, Garimella Kiran V., Maguire Jared R., Hartl Christopher, Philippakis Anthony A., et al. A Framework for Variation Discovery and Genotyping Using Next-generation DNA Sequencing Data. Nature Genetics. 2011 May;43(5):491–498. doi:10.1038/ng.806.
    3. Fromer Menachem, Moran Jennifer L., Chambert Kimberly, Banks Eric, Bergen Sarah E., Ruderfer Douglas M., Handsaker Robert E., et al. Discovery and Statistical Genotyping of Copy-Number Variation from Whole-Exome Sequencing Depth. The American Journal of Human Genetics. 2012 Oct 5;91(4):597–607. doi:10.1016/j.ajhg.2012.08.005.
    4. International Schizophrenia Consortium. Rare Chromosomal Deletions and Duplications Increase Risk of Schizophrenia. Nature. 2008 Sep 11;455(7210):237–241. doi:10.1038/nature07239.
    5. Kirov G, Pocklington AJ, Holmans P, Ivanov D, Ikeda M, Ruderfer D, Moran J, et al. De Novo CNV Analysis Implicates Specific Abnormalities of Postsynaptic Signalling Complexes in the Pathogenesis of Schizophrenia. Molecular Psychiatry. 2012 Feb;17(2):142–153. doi:10.1038/mp.2011.154.
