Nat Methods. 2010 May;7(5):365-71.
doi: 10.1038/nmeth.1451.

Characterization of missing human genome sequences and copy-number polymorphic insertions

Jeffrey M Kidd et al. Nat Methods. 2010 May.

Abstract

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18-37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.

Figures

Figure 1. Copy-number polymorphism of novel insertions
ArrayCGH intensity data is displayed for novel sequences ordered along (a) chromosome 5 and (b) chromosome 14 based on anchored map locations (build35 coordinates, UCSC). Copy-number gains (orange) and losses (blue) are shown relative to the reference sample (NA15510). Each column in the heat map represents a probe on the array, and each row represents a sample ordered and separated (yellow lines) by corresponding HapMap population (CEU, CHB, JPT and YRI). The bottom row depicts a reference self-self hybridization as control. The red brackets group multiple contigs into loci that generally show a consistent hybridization pattern by arrayCGH.
Figure 2. Sequencing and genotyping insertions
(a) The complete sequence of a clone (AC205876) carrying a 4.8-kbp novel insertion sequence is compared to the corresponding segment from chromosome 20 using miropeats (black lines connect segments of matching sequence; colored arrows correspond to common repeats; green: LINEs; purple: SINEs; orange: LTR elements; pink: DNA elements). The magenta lines denote the insertion breakpoints. The brown boxes correspond to the mapped position of three assembled novel sequence contigs. (b) ArrayCGH hybridization results represented as a heat map suggest that the deletion is fixed in CEU and CHB populations. The brown-red lines correspond to the three sequence contigs depicted in part (a) and are represented by 16, 15, and 18 arrayCGH probes, respectively. The median log2 ratios (c) and single-channel intensities (d) are shown for all probes matching AC205876. Note that the reference (blue bars) channel shows similar intensity across hybridizations. For this example the reference sample is inferred to have a copy number of 1. The signals form three distinct clusters that are assigned integer copy-number states of 0, 1, and 2. The dotted red, green, and blue lines correspond to the median intensities of each defined cluster. Using these genotypes, an FST of 0.70 is calculated for this insertion. (e–h) A second example, presented as above, depicts a 3.9-kbp insertion (AC216083) within the first intron of the LCT (lactase) gene (red boxes represent exons as indicated).
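
The caption reports an FST computed from the integer copy-number genotypes but does not say which estimator was used. The sketch below assumes Wright's heterozygosity-based definition, FST = (H_T - H_S) / H_T, with every population weighted equally. That choice is an assumption for illustration and may differ from the authors' method. The function name `wright_fst` and the example frequencies are hypothetical.

```python
def wright_fst(population_frequencies):
    """Wright's FST for one biallelic locus.

    population_frequencies: insertion allele frequency in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    freqs = list(population_frequencies)
    if not freqs:
        raise ValueError("need at least one population")
    if any(p < 0 or p > 1 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    mean_p = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled population.
    h_total = 2 * mean_p * (1 - mean_p)
    if h_total == 0:
        # The allele is fixed or absent everywhere: no differentiation.
        return 0.0
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


# Hypothetical frequencies for three populations (not data from the paper).
print(wright_fst([0.05, 0.10, 0.90]))
```

Identical frequencies in all populations give 0, and an allele fixed in one population and absent from another gives 1.
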
Figure 3. Insertion allele frequency distribution
The frequency of the insertion allele is shown for 189 loci that are fitted to distinct copy numbers and are consistent with a simple autosomal insertion-deletion variant. Values are shown for all 28 individuals (black bars) and separately for each HapMap population as indicated.
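
For a simple autosomal insertion-deletion variant, the integer copy number of an individual (0, 1 or 2) equals the number of insertion alleles that individual carries, so the allele frequency is the summed copy number divided by twice the number of individuals. The sketch below illustrates that arithmetic overall and per population. The function names, the `"ALL"` key and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def insertion_allele_frequency(copy_numbers):
    """Insertion allele frequency from diploid copy numbers (0, 1 or 2)."""
    copy_numbers = list(copy_numbers)
    if not copy_numbers:
        raise ValueError("need at least one genotype")
    if any(cn not in (0, 1, 2) for cn in copy_numbers):
        raise ValueError("copy numbers must be 0, 1 or 2")
    # Each individual contributes two alleles.
    return sum(copy_numbers) / (2 * len(copy_numbers))


def frequencies_by_population(samples):
    """samples: iterable of (population, copy_number) pairs.

    Returns one frequency per population plus a pooled "ALL" entry.
    """
    samples = list(samples)
    grouped = defaultdict(list)
    for population, copy_number in samples:
        grouped[population].append(copy_number)
    result = {
        population: insertion_allele_frequency(values)
        for population, values in grouped.items()
    }
    result["ALL"] = insertion_allele_frequency(cn for _, cn in samples)
    return result


# Hypothetical genotypes for illustration only.
records = [("CEU", 0), ("CEU", 1), ("YRI", 2), ("YRI", 2)]
print(frequencies_by_population(records))
```
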
Figure 4. Annotation of conserved and functional elements
(a) The complete sequence of an OEA clone carrying 29 kbp of novel sequence is compared by miropeats to the reference genome. We identify a 95-bp conserved element within this sequence (green rectangles) as defined by a GERP analysis of 8 species (see Online Methods). A multiple sequence alignment of one of these conserved elements (black arrow) is highlighted. (b) A novel exon is predicted within the sequence of a 4.3-kbp insertion based on comparison with the PECAM1 transcript (NM_000442.3), as shown in blue. This alternate exon is supported by RNA-seq data and corresponds to a conserved element identified by alignment comparisons.
Figure 5. Genotyping sequenced variants through unique k-mer matches
(a) Unique diagnostic k-mer sequences were identified for each variant using sequence-resolved breakpoints. For the deletion breakpoint, k-mers were required to have a single match to the reference genome and no matches to the fosmid sequences. For the insertion breakpoints, k-mers were required to have no matches to the genome and a single match to the fosmid. In order to be uniquely identifiable, a variant must have at least one deletion k-mer and at least one insertion k-mer that meet these criteria. (b) Effect of k-mer length and search stringency on the ability to uniquely identify a variant. 71% (108/152) of the sequenced sites are uniquely identifiable with criteria of k = 36 and one substitution, while 97% (147/152) are assayable if the k-mer length is increased to 100 bp. (c) A comparison of genotypes determined using arrayCGH and breakpoint k-mer matching is depicted for sample NA18507. The search database consists of unique 36-mers (one substitution). Genotypes for 54 variants were successfully determined by both arrayCGH and breakpoint k-mer matching. Partitioning the breakpoint scores into distinct genotypes at 0.5 and 1.5 (red lines) results in 94.3% genotype agreement between the two methods. (d) Effect of sequence coverage on breakpoint k-mer genotyping. The number of variants genotyped (at least one matching read, solid line, left axis) and the percent agreement with arrayCGH results (dashed line, right axis) are shown at various sequence coverage levels (1–42X).
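
The selection rule in panel (a) and the thresholds in panel (c) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: it uses exact string matching on the forward strand only (the paper allows one substitution), holds the sequences in memory as short strings, and takes the breakpoint score as a given number because the caption does not define how it is computed. All function names and the toy sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def count_matches(kmer, sequence):
    """Count exact, possibly overlapping, occurrences of kmer in sequence."""
    count = 0
    start = sequence.find(kmer)
    while start != -1:
        count += 1
        start = sequence.find(kmer, start + 1)
    return count


def diagnostic_kmers(window, unique_in, absent_from, k):
    """k-mers from a breakpoint window that match `unique_in` exactly once
    and never match `absent_from`."""
    return {
        kmer
        for kmer in kmers(window, k)
        if count_matches(kmer, unique_in) == 1
        and count_matches(kmer, absent_from) == 0
    }


def is_uniquely_identifiable(deletion_kmers, insertion_kmers):
    """A variant needs at least one diagnostic k-mer of each kind."""
    return bool(deletion_kmers) and bool(insertion_kmers)


def call_genotype(score, low=0.5, high=1.5):
    """Partition a breakpoint score into copy number 0, 1 or 2.

    Treating a score exactly on a threshold as the higher class is an
    assumption of this sketch.
    """
    if score < low:
        return 0
    if score < high:
        return 1
    return 2


# Toy sequences (hypothetical): the fosmid carries a "TATA" insertion.
reference = "AACCGGTT"
fosmid = "AACCTATAGGTT"

# Deletion k-mers span the empty site in the reference.
deletion = diagnostic_kmers("ACCGGT", reference, fosmid, k=4)
# Insertion k-mers span the two insertion breakpoints in the fosmid.
insertion = diagnostic_kmers("ACCTATAGGT", fosmid, reference, k=4)

print(sorted(deletion), sorted(insertion))
print(is_uniquely_identifiable(deletion, insertion))
```

Longer k-mers are more likely to be unique, which is consistent with the rise from 108 to 147 identifiable sites out of 152 reported in panel (b) when k increases from 36 to 100.
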

Comment in

  • E pluribus unum.
    [No authors listed] Nat Methods. 2010 May;7(5):331. doi: 10.1038/nmeth0510-331. PMID: 20440876. No abstract available.
  • The author file: Evan Eichler.
    Baker M. Nat Methods. 2010 May;7(5):333. doi: 10.1038/nmeth0510-333. PMID: 20440877. No abstract available.
  • Human genomics: Filling gaps and finding variants.
    Muers M. Nat Rev Genet. 2010 Jun;11(6):387. doi: 10.1038/nrg2800. Epub 2010 May 5. PMID: 20442715. No abstract available.
