Nucleic Acids Res. 2012 Mar;40(5):2041-53.
doi: 10.1093/nar/gkr1042. Epub 2011 Nov 18.

Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques

Jorge Duitama et al. Nucleic Acids Res. 2012 Mar.

Abstract

Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

Figures

Figure 1.
Fosmid pool-based NGS approach to haplotype-resolve whole genomes (16). (A) Diploid genomic DNA of an individual is used to generate approximately 1.5 million fosmid clones, which are (B) partitioned into pools of 15 000 fosmids, each covering about 15% of the genome in 40-kb haploid DNA segments. (C) Fosmid pools are sequenced using NGS. Only three pools are shown here as an example. (D) Fosmids are mapped to the genome and the positions of heterozygous variants are detected. (E) Single Individual Haplotyping is used to separate fragments into the two underlying haplotypes based on allelic identity at overlapping positions. With low-coverage fosmid data, the presence of fosmids on only one haplotype can be used to inform the phase, given accurate SNP calling data. (F) Long contiguous haplotype blocks are generated, covering the entire genome.
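Step (E) can be illustrated with a toy example. The sketch below is a naive greedy heuristic, not ReFHap or any of the algorithms benchmarked in the paper. The fragment representation (a mapping from SNP index to observed allele, 0 or 1) and the function name `greedy_phase` are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical representation: one fosmid fragment = {SNP index: allele (0 or 1)}
Fragment = Dict[int, int]


def greedy_phase(fragments: List[Fragment]) -> Dict[int, int]:
    """Build one haplotype greedily; the second haplotype is its complement.

    Toy heuristic only: fragments that overlap nothing seen so far are
    placed arbitrarily, whereas a real tool would start a new block.
    """
    hap: Dict[int, int] = {}
    non_empty = [f for f in fragments if f]
    # Process fragments left to right by their first covered SNP.
    for frag in sorted(non_empty, key=lambda f: min(f)):
        overlap = [s for s in frag if s in hap]
        agree = sum(1 for s in overlap if frag[s] == hap[s])
        # A fragment that mostly disagrees is assumed to come from the
        # other haplotype, so its alleles are flipped before extending.
        flip = agree < len(overlap) - agree
        for snp, allele in frag.items():
            if snp not in hap:
                hap[snp] = 1 - allele if flip else allele
    return hap


fragments = [
    {0: 0, 1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},  # opposite haplotype at SNPs 1 and 2
    {2: 0, 3: 0, 4: 1},
]
haplotype = greedy_phase(fragments)
```

Here the second fragment disagrees with the first at both shared SNPs, so it is assigned to the complementary haplotype and its allele at SNP 3 is flipped before being added.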
Figure 2.
Distribution of haplotype blocks by number of phased SNPs.
Figure 3.
Comparison of algorithms for SIH on NA12878 whole genome fosmid sequence data. (A) Adjusted N50, which takes into account block length and number of phased SNPs but not quality; (B) Switch error rate, calculated by comparison with gold-standard trio haplotypes; (C) Quality-adjusted N50 (QAN50), which combines measures of completeness and quality; (D) Runtimes of each algorithm on this data set (log scale); (E) QAN50 for ReFHap, DGS, FastHare and HapCUT on subsets of the data built by varying the number of fosmid pools considered; (F) QAN50 for ReFHap, DGS, FastHare and HapCUT for different heterozygosity rates obtained by varying the percentage of SNPs considered.
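The switch error rate in panel (B) can be made concrete with a small sketch. It assumes one common formulation: within a single block, count the adjacent SNP pairs whose relative phase differs from the gold standard and divide by the number of adjacent pairs. The function name and the list-of-alleles input are hypothetical, and the paper's exact definition may differ in detail.

```python
from typing import List


def switch_error_rate(predicted: List[int], truth: List[int]) -> float:
    """Fraction of adjacent SNP pairs phased inconsistently with the truth.

    Both inputs are 0/1 alleles of one haplotype over the same ordered
    heterozygous SNPs in a single block (an assumption of this sketch).
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same SNPs")
    pairs = len(predicted) - 1
    if pairs < 1:
        return 0.0
    # mismatch[i] is True where the predicted allele differs from the truth.
    mismatch = [p != t for p, t in zip(predicted, truth)]
    # A switch occurs wherever the match/mismatch state changes between
    # neighbouring SNPs; a fully complemented block has no switches.
    switches = sum(1 for i in range(pairs) if mismatch[i] != mismatch[i + 1])
    return switches / pairs


rate = switch_error_rate([0, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 0])
```

In this example the prediction follows the true haplotype for the first three SNPs and the complementary one afterwards, giving one switch across five adjacent pairs.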
Figure 4.
Comparison of MEC values predicted by HapCUT with real MEC values. The dark grey bars show the increase in MEC percentage for the gold standard as the switch error rate increases. However, MEC percentages predicted by HapCUT (light grey bars) do not increase as they should, because HapCUT tries to find the solution that minimizes MEC. The number of blocks analyzed for each bin (medium grey bars) is shown on the right Y axis.
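For readers unfamiliar with MEC (minimum error correction), the sketch below computes the standard MEC score of a candidate haplotype against a set of fragments: each fragment is charged the smaller of its mismatch counts against the haplotype and against its complement. The data layout and the name `mec_score` are assumptions for illustration, and the normalization behind the "MEC percentage" in the figure is not reproduced here.

```python
from typing import Dict, List

# Hypothetical representation: {SNP index: allele (0 or 1)}
Fragment = Dict[int, int]


def mec_score(fragments: List[Fragment], hap: Dict[int, int]) -> int:
    """Total number of allele calls that must be corrected so that every
    fragment matches either the haplotype or its complement."""
    total = 0
    for frag in fragments:
        # Only SNPs present in the haplotype can be scored.
        covered = [s for s in frag if s in hap]
        mismatches = sum(1 for s in covered if frag[s] != hap[s])
        # Mismatches against the complement = covered - mismatches.
        total += min(mismatches, len(covered) - mismatches)
    return total


hap = {0: 0, 1: 1, 2: 0, 3: 0}
fragments = [
    {0: 0, 1: 1, 2: 0},  # matches the haplotype
    {1: 0, 2: 1, 3: 1},  # matches the complement
    {0: 0, 1: 0, 2: 0},  # one erroneous call at SNP 1
]
score = mec_score(fragments, hap)
```

Because the score is minimized over both haplotype assignments per fragment, a solver that optimizes MEC directly can report a low value even when its haplotype contains switch errors, which is the effect the figure describes.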
Figure 5.
Comparison of the new gold-standard haplotype (“Overall”) with haplotypes predicted by statistical phasing using different numbers of individuals in the reference panel. The concordance was calculated separately for pairs of adjacent SNPs phased using parental genotypes (trio phased) and pairs phased using fosmid-based haplotyping (non-trio phased).
Figure 6.
Examples of GAD genes containing many additional phased SNPs. Fosmid-based phasing resolves the phase of significant numbers of additional SNPs, which may be particularly useful within disease-associated genes and for SNPs detected in genome-wide association studies (GWA SNPs). Here, we show three examples of disease-relevant genes that contain many additional phased SNPs: UGT1A genes, associated with various cancers; CDH1, which plays a role in drug sensitivity; and COL1A2, associated with hypertension and osteoporosis. Tracks are taken from the UCSC Genome Browser. SNPs resolved by trio phasing are shown in the top track, with SNPs resolved using fosmid-based phasing shown below. SNPs from the GWAS Catalog are shown as green bars in a separate track, and those GWA SNPs that are resolved by fosmid-based phasing are indicated by pink arrows. Annotations from the Gene Association Database (GAD) and OMIM are shown in the lower tracks.

References

    1. Drysdale CM, McGraw DW, Stack CB, Stephens JC, Judson RS, Nandabalan K, Arnold K, Ruano G, Liggett SB. Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl Acad. Sci. USA. 2000;97:10483–10488.
    2. Hoehe MR. Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics. 2003;4:547–570.
    3. Hoehe MR, Köpke K, Wendel B, Rohde K, Flachmeier C, Kidd KK, Berrettini WH, Church GM. Sequence variability and candidate gene analysis in complex disease: association of μ opioid receptor gene variation with substance dependence. Hum. Mol. Genet. 2000;9:2895–2908.
    4. Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. The importance of phase information for human genomics. Nat. Rev. Genet. 2011;12:215–223.
    5. Marchini J, Cutler D, Stephens M, Eskin E, Halperin E, Lin S, Qin ZS, Munro HM, Abecasis GR, Donnelly P, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 2006;78:437–450.