Review. Pharmacogenomics. 2012 Jun;13(8):901-15.
doi: 10.2217/pgs.12.72.

Next-generation sequencing and large genome assemblies


Joseph Henson et al. Pharmacogenomics. 2012 Jun.

Abstract

The next-generation sequencing (NGS) revolution has drastically reduced time and cost requirements for sequencing of large genomes, and also qualitatively changed the problem of assembly. This article reviews the state of the art in de novo genome assembly, paying particular attention to mammalian-sized genomes. The strengths and weaknesses of the main sequencing platforms are highlighted, leading to a discussion of assembly and the new challenges associated with NGS data. Current approaches to assembly are outlined and the various software packages available are introduced and compared. The question of whether quality assemblies can be produced using short-read NGS data alone, or whether it must be combined with more expensive sequencing techniques, is considered. Prospects for future assemblers and tests of assembly performance are also discussed.

Figures

Figure 1. An upper bound on assembly N50 against read length y
For a set of sequences, the N50 is the length, in bases, of the longest sequence such that at least 50% of the total bases are contained in sequences of this length or longer. Here, the N50 is given for the set of contiguous sequences of bases in each genome that are covered by a unique segment of sequence at a given length y. Because nonunique sequences make the ordering ambiguous, this provides an upper bound on the N50 that is possible for whole-genome sequencing assembly when using reads below this length, and indicates the advantage to be gained from longer reads in some cases.
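
The N50 definition in this caption can be illustrated with a minimal Python sketch. The function name `n50` and the contig lengths are hypothetical and chosen only for illustration; they do not come from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down and stop as soon as the
    # sequences seen so far hold at least half of all bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, 300 bases in total.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 80 + 70 = 150 bases reaches half of 300, so this prints 70
```

Comparing `2 * running` with `total` keeps the arithmetic in integers, so the 50% threshold is tested exactly.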
Figure 2. Graph structures for assembly
(A) Eight aligned reads are shown. (B) The corresponding overlap graph, in which nodes correspond to reads and edges to overlaps, in this case overlaps of five or more bases (transitive edges, meaning overlaps that are covered by a set of shorter overlaps, are shown as curved arrows). (C) The de Bruijn graph, in which nodes are k-mers and edges indicate that some read contains both k-mers consecutively. Note that reads such as read 2 add nothing to the de Bruijn graph. (D) The basic idea of the string graph. Here, the graph topology is the same as in (B) with transitive edges removed; however, nodes correspond to the beginnings of reads, and each edge between two overlapping reads is labeled by the string between their starting points, rather than the whole read being associated with each node. All sequences supported by reads and overlaps can be recovered from this labeling (along with terminal reads such as read 8 in this example): following the graph backwards adds the sequence necessary to complete the previous read. Both the de Bruijn and string graphs can be further simplified by merging linear subgraphs. Treatment of reverse complement sequences has been neglected here for clarity.
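
The de Bruijn graph construction described under (C) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of any particular assembler: the function name `de_bruijn_graph`, the three reads, and the choice k = 4 are all hypothetical, and reverse complements are ignored, as in the figure.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer (node) to the sorted k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        # All k-mers of the read, in order of occurrence.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # An edge joins two k-mers that occur consecutively in the read.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical reads sampled from the toy sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTG", "GCGTGCA"]
graph = de_bruijn_graph(reads, 4)
for node, successors in graph.items():
    print(node, "->", successors)

# The middle read only repeats k-mers and adjacencies already present
# in the other two reads, so leaving it out gives the same graph.
print(graph == de_bruijn_graph([reads[0], reads[2]], 4))
```

In this toy case the graph is a single linear path of seven k-mers, which is the kind of linear subgraph that the caption says can be merged into one node.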
