Nucleic Acids Res. 2018 Aug 21;46(14):7236-7249.
doi: 10.1093/nar/gky538.

Human copy number variants are enriched in regions of low mappability

Jean Monlong et al. Nucleic Acids Res.

Abstract

Copy number variants (CNVs) are known to affect a large portion of the human genome and have been implicated in many diseases. Although whole-genome sequencing (WGS) can help identify CNVs, most analytical methods suffer from limited sensitivity and specificity, especially in regions of low mappability. To address this, we use PopSV, a CNV caller that relies on multiple samples to control for technical variation. We demonstrate that our calls are stable across different types of repeat-rich regions and validate the accuracy of our predictions using orthogonal approaches. Applying PopSV to 640 human genomes, we find that low-mappability regions are approximately 5 times more likely to harbor germline CNVs, in stark contrast to the nearly uniform distribution observed for somatic CNVs in 95 cancer genomes. In addition to known enrichments in segmental duplication and near centromeres and telomeres, we also report that CNVs are enriched in specific types of satellite and in some of the most recent families of transposable elements. Finally, using this comprehensive approach, we identify 3455 regions with recurrent CNVs that were missing from existing catalogs. In particular, we identify 347 genes with a novel exonic CNV in low-mappability regions, including 29 genes previously associated with disease.

Figures

Figure 1.
Mappability and population-based read depth (RD) estimates. (A) Inter-sample mean RD and average mappability in 5 kb bins. Regions with the same mappability estimate can have different RD levels. (B) Z-score distribution. For the mappability approach, Z-scores were computed from the mappability-predicted RD and the global standard deviation; for the population-based approach, from the inter-sample mean and standard deviation. (C) Z-score distribution across the mappability spectrum. (D) Average RD in the Twin study. The right tail of the histogram was winsorized using the IQR, and the different coverage classes are shown with colors.
Figure 2.
PopSV’s performance in low-mappability regions. (A) Cluster using PopSV calls in extremely low coverage regions (below 100 reads). (B) Proportion and number of calls replicated in the monozygotic twin. The point shows the median value per sample, the error bars the 95% confidence interval. (C) Proportion and number of regions with reliable calls, computed from call replication in twins.
Figure 3.
Comparison with CNV catalogs from the 1000 Genomes Project (34) (1000GP) and a long-read sequencing study (59). (A) The x-axis represents the proportion of individuals with a CNV overlapping a region. The y-axis represents the cumulative proportion of the affected genome. (B) Overlap with the SV catalog from Chaisson et al. (59). In each cohort (color), the proportion of collapsed calls overlapping calls from Chaisson et al. (59) or control regions with a similar size distribution was modeled using logistic regression. Boxplots show the variation across 50 samplings of control regions. low-map: calls in low-mappability regions; ext. low-map: calls in extremely low-mappability regions.
Figure 4.
CNVs in normal genomes. (A) Enrichment of CNVs in different genomic classes (x-axis) across different cohorts (colors) and controlling for the distance to centromere/telomere/gap. Bars show the median fold enrichment compared to control regions. The error bar represents 90% of the samples in the cohort. (B) Enrichment of CNVs in repeat families (x-axis) controlling for the overlap with segmental duplication and distance to centromere/telomere/gap. The error bars were winsorized at 7 for clarity. STR: Short Tandem Repeat; TE: Transposable Element.
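
The population-based Z-score described in Figure 1B (inter-sample mean and standard deviation per bin) can be illustrated with a small sketch. This is not PopSV's implementation; the function name, the toy read counts, and the absence of any coverage normalization are assumptions made for illustration only.

```python
from statistics import mean, stdev


def population_z_scores(reference_rd, test_rd):
    """Per-bin Z-score of a test sample against reference samples.

    reference_rd: one list per bin, holding the RD of each reference sample.
    test_rd: one RD value per bin for the sample being tested.
    (Hypothetical helper; real pipelines normalize coverage first.)
    """
    scores = []
    for ref_values, value in zip(reference_rd, test_rd):
        mu = mean(ref_values)      # inter-sample mean for this bin
        sigma = stdev(ref_values)  # inter-sample standard deviation
        # A bin with no variation across samples gives an undefined Z-score.
        scores.append((value - mu) / sigma if sigma > 0 else float("nan"))
    return scores


# Toy data (made up): two 5 kb bins, five reference samples each.
reference = [
    [100, 102, 98, 101, 99],  # well-covered bin
    [40, 42, 38, 41, 39],     # low-mappability bin with consistently low RD
]
test = [100, 20]  # the test sample has half the expected RD in the second bin

z = population_z_scores(reference, test)
print(z)
```

Because each bin is compared with its own inter-sample distribution, a consistently low-coverage bin can still yield an informative Z-score, which is the point panel B makes about regions that share a mappability estimate but differ in RD.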

References

    1. Hall I.M., Quinlan A.R. Detection and interpretation of genomic structural variation in mammals. Methods in Molecular Biology. 2012; 838:225–248. Clifton: Springer Science.
    2. Sharp A.J., Cheng Z., Eichler E.E. Structural variation of the human genome. Annu. Rev. Genomics Hum. Genet. 2006; 7:407–442.
    3. Mills R.E., Walter K., Stewart C., Handsaker R.E., Chen K., Alkan C., Abyzov A., Yoon S.C., Ye K., Cheetham R.K. et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011; 470:59–65.
    4. Pang A.W., MacDonald J.R., Pinto D., Wei J., Rafiq M.A., Conrad D.F., Park H., Hurles M.E., Lee C., Venter J.C. et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 2010; 11:R52.
    5. McCarroll S.A., Huett A., Kuballa P., Chilewski S.D., Landry A., Goyette P., Zody M.C., Hall J.L., Brant S.R., Cho J.H. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn’s disease. Nat. Genet. 2008; 40:1107–1112.