PLoS Biol. 2007 Sep 4;5(10):e254.

doi: 10.1371/journal.pbio.0050254.

The diploid genome sequence of an individual human

PMID: 17803354
PMCID: PMC1964779
DOI: 10.1371/journal.pbio.0050254

Samuel Levy et al. PLoS Biol. 2007.

Affiliation

¹ J. Craig Venter Institute, Rockville, Maryland, USA. slevy@jcvi.org

Abstract

Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, it involves 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
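The 22% figure for non-SNP events can be checked against the counts listed in the abstract. The following is a minimal sketch, assuming that only the five enumerated variant classes are counted (segmental duplications and copy number variation regions are not enumerated in the abstract and are left out); the dictionary and function names are illustrative and not taken from the paper.

```python
# Variant counts as reported in the abstract.
# Assumption: only these five enumerated classes count as "events".
variant_counts = {
    "SNP": 3_213_401,
    "block substitution": 53_823,
    "heterozygous indel": 292_102,
    "homozygous indel": 559_473,
    "inversion": 90,
}


def non_snp_event_fraction(counts):
    """Return the share of variant events that are not SNPs."""
    total = sum(counts.values())
    non_snp = total - counts["SNP"]
    return non_snp / total


total_events = sum(variant_counts.values())
fraction = non_snp_event_fraction(variant_counts)

print(f"total events: {total_events:,}")
print(f"non-SNP share of events: {fraction:.1%}")
```

Under this assumption the total comes to just over 4.1 million events and the non-SNP share rounds to 22%, in line with the abstract. The 74% share of variant bases cannot be reproduced the same way, because per-class base totals are not given here.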

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. DNA Donor Pedigree and Relatedness to Ethnogeographic Populations**
(A) Three-generation pedigree showing the relation of ancestors to the study DNA sample. The donor is identified in red. (B) Cluster analysis based on genotype information for 750 SNPs to infer the ancestry of the HuRef donor. The figure shows the proportion of membership of the HuRef donor (yellow) in three pre-defined HapMap populations (CEU = Northern and Western Europe; YRI = Yoruba, Ibadan, Nigeria; JPT+CHB = Japanese, Tokyo, and Han Chinese, Beijing). The results indicate that the HuRef donor clusters with 99.5% similarity to the samples of northern and western European ancestry.

**Figure 2. Results of Cytogenetic Analysis**
(A) HuRef donor G-banded karyotype. (B) Spectral karyotype analysis.

**Figure 3. Sequencing Continuity Plot for the HuRef Autosomes Compared to HuRef X and Y Chromosomes**
Note that the autosomes have more contiguous sequence with fewer gaps compared to chromosomes X and Y, probably due to half the read depth compared to the autosomes and the presence of extensive sequence similarity between the sex chromosomes.

**Figure 4. The Different Variant Types Identified from the HuRef Assembly and the HuRef-NCBI Assembly-to-Assembly Mapping**
HuRef consensus sequence (in red) with underlying sequence reads (in blue). Homozygous variants are identified by comparing the HuRef assembly with NCBI reference assembly. Heterozygous variants are identified by base differences between sequence reads. SNP = single nucleotide polymorphism; MNP = multi-nucleotide polymorphism, which contains contiguous mismatches.

**Figure 5. Diversity for SNPs and Indels in Autosomes**
This is most likely an under-estimate of the true diversity, because a fraction of real heterozygotes were missed due to insufficient read coverage.

**Figure 6. Distribution of Indel Length in the HuRef Genome**
Distributions of heterozygous and homozygous indel lengths in the range 1–100 bp (A and B, respectively) and at greater detail in the range 1–20 bp (C and D, respectively). Note that heterozygous indels range from 1 to 321 bp and homozygous indels from 1 to 82,711 bp; however, in both polymorphism types more than 47% of indel events are single-base. Even-length indels also appear to be overrepresented.

**Figure 7. Number and Length Distribution of Apparent Homozygous Insertion and Deletion Sequences Greater than 100 bp**
Note that the numbers of insertion and deletion events are similar, but that longer insertions outnumber longer deletions.

**Figure 8. Modeling the Rate of SNP Detection from Microarray Experiments**
Model of the false-negative rate of heterozygous SNP detection found on Affymetrix or Illumina genotyping platforms in relation to the number of supporting reads found in the HuRef assembly at these loci. The observed false-negative rate of detected heterozygous SNPs in the HuRef assembly closely follows the modeled rate given a Poisson model. The predicted false-negative error is based on the thresholds of requiring at least 20% of the reads to support the minor allele, with a minimum of two reads. The increased false-negative error at 11× coverage is due to the increased number of reads required to call the minor allele: two reads are required at 4×–10× coverage, whereas three reads are required at 11×–15× coverage. The additional read changes the binomial distribution and increases false-negative error (see Materials and Methods).
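The jump in false-negative error at 11× coverage follows directly from the calling threshold described in the caption. The sketch below illustrates the per-depth binomial part only. It assumes that each read samples either allele of a heterozygous site independently with probability 0.5, and that a site is called heterozygous only when both alleles meet the threshold. It does not reproduce the Poisson weighting over read depth or any other detail of the authors' Materials and Methods, and the function names are illustrative.

```python
from math import comb


def min_minor_reads(depth):
    """Reads required for the minor allele: at least 20% of reads, minimum two."""
    # (depth + 4) // 5 is an integer-only ceil(depth / 5), i.e. ceil(20% of depth).
    return max(2, (depth + 4) // 5)


def false_negative_rate(depth, p=0.5):
    """Probability that a true heterozygous site at this depth is not called.

    Assumption: reads sample the two alleles independently, each read
    carrying the first allele with probability p.
    """
    k = min_minor_reads(depth)
    # The site is called only if both alleles are seen in at least k reads.
    detected = sum(
        comb(depth, x) * p**x * (1 - p) ** (depth - x)
        for x in range(k, depth - k + 1)
    )
    return 1.0 - detected


for depth in range(4, 16):
    print(depth, min_minor_reads(depth), round(false_negative_rate(depth), 4))
```

In this simplified model the required read count steps from two to three between 10× and 11×, so the miss rate at 11× is higher than at 10× even though coverage has increased, which matches the behavior described in the caption.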

**Figure 9. Distribution of HuRef Read-Depth Coverage for Genotyped SNPs**
Distribution plot of the number of underlying reads (average number of reads = 8.8) in HuRef heterozygous SNPs confirmed by the Affymetrix and Illumina genotyping platforms. This is compared to a distribution (average number of reads = 5.2) for SNPs detected by the platforms but missed in the HuRef assembly.

**Figure 10. Non-Mapped HuRef Sequences Mapped to Coriell DNA Samples by FISH**
Sequences from the HuRef donor that had no match based on the one-to-one mapping or BLAST when compared to the NCBI Human reference genome were tested by FISH. Fosmids were used as probes and the experiments were run, using Coriell DNA, to confirm the localization of the contigs or to map contigs with no prior mapping information. Shown here are four representative results. (A) An insertion at 7q22 where the FISH confirmed the HuRef mapping. (B) FISH result confirming the mapping of a sequence extending into a gap at 1p21. (C) Localization of a contig with no prior mapping information to chromosomal band 1q42. (D) An example of euchromatic-like sequence with no prior mapping information, which hybridizes to multiple centromeric locations.

**Figure 11. Degree of Linkage of Heterozygous Variants**
The distribution of the number of other variants to which a given variant can be linked using sequencing reads only or using mated reads as well is shown. Linkage of variants based on individual sequencing reads is limited, regardless of sequence coverage beyond a modest level, but is substantially increased by the incorporation of mate pairing information. The size of the effect is considerably more than simply doubling read length, due to variation in insert size; consequently, benefits of increasing sequencing coverage drop off much more slowly.

**Figure 12. Distribution of Inferred Haplotype Sizes**
(A) Reverse cumulative distribution of haplotype spans (bp) (N50 ∼ 350 kb). (B) Reverse cumulative distribution of variants per haplotype (N50 ∼ 400 variants).
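For readers unfamiliar with the N50 statistic quoted in this caption, the sketch below shows one common way to compute it: the largest length L such that segments of length at least L together cover at least half of the total. The exact convention used by the authors is not stated in this record, and the example spans are hypothetical values chosen for illustration, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of segment lengths.

    Uses the common definition: walk the lengths from longest to shortest
    and return the length at which the running total first reaches half
    of the overall total.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical haplotype spans in bp (illustrative only).
example_spans = [100_000, 200_000, 300_000, 400_000]
print(n50(example_spans))
```

The same function applies to variants per haplotype (panel B) by passing variant counts instead of spans.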

**Figure 13. Consistency of HuRef Haplotypes with HapMap Data**
Haplotypes inferred from the HuRef data are strongly consistent with HapMap haplotypes. The probability in the HapMap CEU panel of the observed genotypes being phased as per the HuRef haplotypes is high for variants in strong LD (as measured either by D′ or r²).
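The two LD measures named in this caption, D′ and r², have standard textbook definitions for a pair of biallelic loci. The sketch below computes both from haplotype and allele frequencies. It uses those standard definitions and is not taken from the paper's methods; the function name and the example frequencies are illustrative.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Both loci must be polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b
    # D' rescales |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Illustrative frequencies: A and B always occur together (complete LD).
print(ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5))
# Illustrative frequencies: A and B are independent (no LD).
print(ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5))
```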

**Figure 14. Distribution of HuRef Variants in OMIM and Ensembl Genes**
(A) The distribution of the OMIM genes in Ensembl version 41 protein coding genes that contain one or more SNP or indel in their coding and/or UTR regions. (B) A similar distribution for the variants found in coding and/or UTR regions for all Ensembl version 41 genes.

**Figure 15. Chromosome Y Ethno-Geographic Lineage**
The HuRef donor Y chromosome haplotype suggests descent from several European/US groups given the Y chromosome ethno-geographic markers. The haplogroup membership is R1b6, which includes individuals from the United Kingdom, Germany, Russia, and the United States, consistent with the self-reported family tree provided by the HuRef donor. The thick red line denotes the markers needed to trace the haplotype from the mapping of the chromosome Y markers to the HuRef genome. Data and figure from the Y Chromosome Consortium; http://ycc.biosci.arizona.edu/nomenclature_system/frontpage.html.

Comment in

A new human genome sequence paves the way for individualized genomics.
Gross L. PLoS Biol. 2007 Oct;5(10):e266. doi: 10.1371/journal.pbio.0050266. Epub 2007 Sep 4. PMID: 20076646. Free PMC article. No abstract available.
