Nat Genet. 2009 Oct;41(10):1061-7.
doi: 10.1038/ng.437. Epub 2009 Aug 30.

Personalized copy number and segmental duplication maps using next-generation sequencing

Can Alkan et al. Nat Genet. 2009 Oct.

Abstract

Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable owing to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads, which allows for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy number differences. We estimate that, on average, 73–87 genes vary in copy number between any two individuals and find that these genic differences overwhelmingly correspond to segmental duplications (odds ratio = 135; P < 2.2 × 10⁻¹⁶). Our method can distinguish between different copies of highly identical genes, providing a more accurate assessment of gene content and insight into functional constraint without the limitations of array-based technology.
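
The idea of predicting absolute copy number from read depth can be illustrated with a minimal sketch. It is not the mrFAST pipeline: it only assumes that read depth scales linearly with copy number and that a set of control windows with a known copy number (2 for diploid autosomal sequence) is available. The function name `estimate_copy_number` and its parameters are hypothetical.

```python
from statistics import mean


def estimate_copy_number(window_depths, control_depths, control_copies=2):
    """Estimate absolute copy number per window from read depth.

    Assumption (not taken from the paper): depth is proportional to copy
    number, and control_depths come from windows known to be present in
    control_copies copies.
    """
    baseline = mean(control_depths)
    if baseline <= 0:
        raise ValueError("control windows must have positive mean depth")
    # Depth per single copy is baseline / control_copies.
    return [control_copies * depth / baseline for depth in window_depths]


# Illustrative numbers only: control windows average 30 reads.
copies = estimate_copy_number([30, 45, 75], control_depths=[28, 30, 32, 30])
print(copies)
```

In practice a read-depth method would also need to correct for sequencing biases and repeat content, which this sketch leaves out.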

Figures

Figure 1. Correlation of predicted and known segmental duplications (NA18507)
a) mrFAST sequence read-depth per 5-kbp window along the human genome correlates well (R² = 0.87) with the known copy number of duplicated sequences. b) Predicted duplication interval length versus the assembly-based interval length of known duplications (Whole Genome Assembly Comparison; WGAC, ≥94% sequence identity) shows that boundaries of duplications can be accurately predicted. A few intervals show a discrepancy in boundary prediction; however, this is largely due to deletion polymorphism within duplications in the NA18507 genome (supported by arrayCGH). c) A cumulative plot of the fraction of duplication intervals detected as a function of read-depth sequence coverage. The segmental duplication (SD) size is given in cumulative intervals (≥5 kbp, ≥10 kbp, etc.) and represents the set of intervals identified both within the public reference assembly (build35) and the Celera whole-genome shotgun sequence reads. As expected, the sensitivity of our method increases with more genome coverage; the most dramatic difference in detection is observed between 3- and 4-fold coverage.
Figure 2. Computational prediction and arrayCGH validation of segmental duplication copy-number differences for three human genomes
Regions of excess read-depth (average + 3 std) are shown in red, in contrast to regions of intermediate read-depth (gray; average + 2 std to average + 3 std) or normal read-depth (green; average ± 2 std). The absolute copy number and arrayCGH results for specific individual genome comparisons are shown in the context of RefSeq annotated genes. Oligonucleotide relative log2 ratios are depicted as red/green histograms and correspond to an increase and decrease in signal intensity when test/reference is reverse labeled. a) A known copy-number polymorphism on 17q21.31 that is associated with the H2 haplotype among Europeans (build35 coordinates chr17:41,000,000–42,300,000). The JDW genome shows an increase of 1–2 copies of a 459-kbp segmental duplication mapping to 17q21.31 when compared to NA18507. b) An expansion of the complement factor H related gene family (chr1:193,350,000–193,700,000) within JDW. c) An increase in NA18507 copy number for the defensin gene cluster in 8p23.1 is confirmed by arrayCGH.
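
The three color classes in this caption can be sketched as a simple threshold rule. This is an illustration, not the authors' code: it assumes the average and standard deviation of read depth are already known, reads each threshold as a lower bound for its class, and adds a fourth label for depths below average − 2 std, which the caption does not describe. The name `classify_read_depth` and the label strings are hypothetical.

```python
def classify_read_depth(depths, average, std):
    """Label each window depth using the thresholds named in the caption."""
    labels = []
    for depth in depths:
        if depth > average + 3 * std:
            labels.append("excess")          # red in the figure
        elif depth > average + 2 * std:
            labels.append("intermediate")    # gray in the figure
        elif depth >= average - 2 * std:
            labels.append("normal")          # green in the figure
        else:
            labels.append("below_normal")    # assumption: not defined in the caption
    return labels


# Illustrative numbers only: average depth 30, standard deviation 2.
print(classify_read_depth([30, 35, 40, 20], average=30, std=2))
```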
Figure 3. Validation of individual-specific segmental duplications
The number of duplicated base pairs predicted and validated in NA18507, JDW, and YH (autosomes only) are shown. The height of the bars represents the sum of computationally predicted interval lengths, and the blue color bars correspond to the experimentally validated portion. Only duplicated intervals >20 kbp were considered for validation.
Figure 4. Correlation between computational and experimental copy number for NA18507 vs. JDW
We computed the copy number for each shared (gray) and individual-specific duplication interval (blue or orange) based on the depth of coverage of WGS reads aligned against the human reference assembly (build35). Based on these computational estimates of copy number, we calculated a predicted log2 copy-number ratio for each autosomal duplication interval >20 kbp in length (and with less than 80% total common repeat content). These values were plotted against the experimental log2 ratios determined by oligonucleotide arrayCGH. The vertical red lines indicate the threshold used for the validated calls (see Supplementary Note).
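
A predicted log2 copy-number ratio of this kind can be computed directly from two copy-number estimates. The sketch below assumes the ratio is test over reference, matching the usual arrayCGH convention; the function name `predicted_log2_ratio` is hypothetical, and the validation threshold from the Supplementary Note is not reproduced here.

```python
import math


def predicted_log2_ratio(test_copies, reference_copies):
    """Return log2(test / reference) for one duplication interval."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy-number estimates must be positive")
    return math.log2(test_copies / reference_copies)


# Illustrative numbers only: 4 copies versus 2 copies gives a ratio of 1.0,
# and swapping test and reference flips the sign.
print(predicted_log2_ratio(4, 2))
print(predicted_log2_ratio(2, 4))
```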
Figure 5. FISH validation
a) Sequence read-depth predicts 5 copies of this particular 17q21.31 segment in the YH genome and 2 copies (unique) in NA18507. ArrayCGH shows an increase in the YH genome, and interphase nuclei FISH confirms the absolute copy-number difference between the two genomes. b) Similarly, interphase FISH confirms a copy-number difference of 5 vs. 12 copies for the NPEPPS gene. c) YH is predicted and validated to have two more copies of the defensin gene family cluster of 8p23.1. d) Due to the known mosaic architecture of this high-copy locus (>30 copies), both arrayCGH and FISH fail to accurately estimate the copy-number difference between the NA18507 and YH genomes, even though sequence depth predicts ~2 more copies in NA18507.
Figure 6. Copy-number differences between unique and duplicated regions
The 113 genes that vary in copy number are partitioned based on the range of copy-number difference and their intersection with annotated segmental duplications. Duplicated genes show a greater extent of copy-number variation when compared to genes mapping to unique regions of the genome.
