Challenges of sequencing human genomes

Daniel C Koboldt¹, Li Ding, Elaine R Mardis, Richard K Wilson

PMID: 20519329
PMCID: PMC2980933
DOI: 10.1093/bib/bbq016

Review

Daniel C Koboldt et al. Brief Bioinform. 2010 Sep;11(5):484-98. doi: 10.1093/bib/bbq016. Epub 2010 Jun 2.

Affiliation

¹ The Genome Center at Washington University, St. Louis, Missouri 63108, USA. dkoboldt@genome.wustl.edu

Abstract

Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues such as sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short read lengths (35-250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for the transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.

Figures

**Figure 1:**
Growth of public database dbSNP from 2002 to 2010. Note exponential growth in submissions following the first genome sequenced on next-generation technology (Watson) in 2007.

**Figure 2:**
Distribution of NGS instruments by country (March 2010). Courtesy of next-generation sequencing maps maintained by Nick Loman [70] and James Hadfield [71].

**Figure 3:**
The intersection of WGS, Target-Seq and RNA-Seq for the characterization of human genomes. Target-Seq of specific regions (selected by PCR or capture) serves primarily for the identification of SNPs and small indels. WGS enables detection not only of SNPs and indels, but also of CNVs and SV (often aided by *de novo* assembly). RNA-Seq provides digital gene expression information that can be used to validate SNP/indel calls in coding regions and assess the impact of genetic variation (CNV, SNPs and indels) on gene expression. RNA-Seq with paired-end libraries also enables the identification of chimeric transcripts, which serve to validate gene fusion events resulting from genomic structural variation.

**Figure 4:**
Performance metrics for sequence data quality. (A) Genotype quality control of sequencing runs. Concordance of per-lane SNP calls with high-density SNP array genotypes for 65 lanes of Illumina data. The low concordance of randomly mismatched controls (left) helps distinguish low-quality data (top right) from true sample mix-ups (right). (B) Error and mapping rates for five real flowcells sequenced on the Illumina platform (1 × 50 bp). Note the increased error rates and decreased alignment rates for poor-performing lanes 1 and 2 on flowcell 1.
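
The per-lane genotype check in panel A can be illustrated with a minimal sketch. It assumes that each lane's SNP calls and the array genotypes are available as dictionaries mapping a site identifier to an unphased genotype string; the function name, the data layout and the site identifiers are hypothetical and are not taken from the review.

```python
def genotype_concordance(seq_calls, array_calls):
    """Return the fraction of shared sites where both genotypes agree.

    Genotypes are unphased strings such as "AG"; allele order is ignored,
    so "AG" and "GA" count as a match. Returns None if no site is shared.
    """
    shared = [site for site in seq_calls if site in array_calls]
    if not shared:
        return None
    matches = sum(
        1 for site in shared
        if sorted(seq_calls[site]) == sorted(array_calls[site])
    )
    return matches / len(shared)


# Hypothetical calls for one lane and the matching SNP array.
lane_calls = {"site1": "AG", "site2": "CC", "site3": "TT", "site4": "GA"}
array_calls = {"site1": "GA", "site2": "CC", "site3": "CT",
               "site4": "AG", "site5": "AA"}

concordance = genotype_concordance(lane_calls, array_calls)
print(concordance)  # 3 of the 4 shared sites agree
```

Comparing a lane against deliberately mismatched array samples, as in the figure, would give a baseline for what chance-level concordance looks like.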

**Figure 5:**
Basic workflows for next-generation sequencing. (A) Sequencing and alignment. Libraries constructed from genomic DNA or RNA are sequenced on massively parallel instruments (e.g. Illumina or SOLiD). The resulting NGS reads are mapped to a reference sequence. Mapped and unmapped reads are imported into SAM/BAM format and marked for PCR/optical duplicates. (B) Post-BAM downstream analysis. The FLAG field of the BAM file indicates the mapping status for each read. Mapped, properly paired reads (or mapped fragment-end reads) are used for SNP/indel detection and copy number estimation. Aberrantly mapped reads, in which reads in a pair map with unexpected distance or orientations, are mined for evidence of structural variation. Finally, *de novo* assembly of unmapped reads yields predictions of structural variants and novel insertions.
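
The routing of reads in panel B can be sketched from the FLAG bits defined in the SAM format specification. The category labels below are hypothetical names for the branches described in the caption, and skipping reads marked as duplicates is an assumption rather than something the caption states.

```python
# FLAG bits from the SAM format specification.
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # pair mapped with expected distance and orientation
UNMAPPED = 0x4        # read itself is unmapped
DUPLICATE = 0x400     # marked as PCR or optical duplicate


def classify_read(flag):
    """Assign a read to a downstream analysis based on its FLAG value."""
    if flag & UNMAPPED:
        return "de_novo_assembly"
    if flag & DUPLICATE:
        return "skip_duplicate"          # assumption: duplicates are excluded
    if not flag & PAIRED:
        return "snp_indel_copy_number"   # mapped fragment-end read
    if flag & PROPER_PAIR:
        return "snp_indel_copy_number"   # mapped, properly paired read
    return "structural_variation"        # aberrantly mapped pair


for flag in (99, 65, 4, 0, 1123):
    print(flag, classify_read(flag))
```

In practice the FLAG values would be read from a BAM file with a dedicated library; plain integers are used here so that the sketch stays self-contained.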

References

1. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–41.
2. Ahn SM, Kim TH, Lee S, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19(9):1622–9.
3. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
4. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 327(5961):78–81.
5. Kim JI, Ju YS, Park H, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460(7258):1011–5.

Grants and funding

HG003079/HG/NHGRI NIH HHS/United States
