Review
Trends Genet. 2008 Mar;24(3):142-9.
doi: 10.1016/j.tig.2007.12.006. Epub 2008 Feb 11.

Bioinformatics challenges of new sequencing technology

Mihai Pop et al. Trends Genet. 2008 Mar.

Abstract

New DNA sequencing technologies can sequence up to one billion bases in a single day at low cost, putting large-scale sequencing within the reach of many scientists. Many researchers are forging ahead with projects to sequence a range of species using the new technologies. However, these new technologies produce read lengths as short as 35-40 nucleotides, posing challenges for genome assembly and annotation. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them. We specifically address issues arising from using these technologies in assembly projects, both de novo and for resequencing purposes, as well as efforts to improve genome annotation in the fragmented assemblies produced by short read lengths.

Figures

Figure 1
Effect of repeats on long and short read assemblies. The middle section of the figure represents a set of repeats longer than 30 bp within a 50-kbp region of the Yersinia pestis CO92 genome. Blue boxes represent repeats longer than 800 bp; red boxes represent repeats shorter than 800 bp. An assembly of this region using Sanger data would result in the set of contigs represented at the top, correctly resolving the short repeats and breaking only at the boundaries of long repeats. Furthermore, paired ends (indicated by thin lines connecting the contigs) would provide long-range connectivity across repeats. The assembly generated from short read sequencing data (bottom line) is considerably more fragmented: it breaks at all repeat boundaries and lacks long-range connectivity because paired-end data are absent.
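The idea that an assembly breaks wherever a repeat is at least as long as the reads can be illustrated with a small sketch. The function name `repeated_kmers` and the toy sequence are hypothetical and not taken from the review. The sketch scans the forward strand only, so it ignores the opposite-strand repeats that a real analysis would also need to consider.

```python
from collections import defaultdict


def repeated_kmers(seq, k):
    """Map each k-mer that occurs more than once in seq to its start positions.

    Forward strand only (an assumption made to keep the sketch short).
    Every repeated k-mer marks a place where reads of length <= k cannot
    tell the copies apart.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Toy sequence (hypothetical): the 7-bp string ACGTACG occurs at 0 and 11.
toy = "ACGTACGATTTACGTACGGG"
repeats = repeated_kmers(toy, 6)
for kmer, pos in sorted(repeats.items()):
    print(kmer, pos)
```

With k set to the read length, the reported positions approximate the repeat boundaries at which a short read assembly of the toy sequence would be expected to fragment.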
Figure 2
Repeats longer than 30 bp in the genomes of Bacillus anthracis Ames and Yersinia pestis CO92. Tick marks on the outer concentric circles correspond to direct (same strand) and palindromic (opposite strand) repeats; every tick mark represents a breakpoint (where a gap would occur) for an assembly based on reads of 30 bp or less. Contigs generated from short read sequencing data would cover 97% and 93% of these genomes, respectively, with N50 sizes of 30 387 and 25 894 bp. The fractions of these genomes covered by unique segments longer than 10 kbp (1 kbp) are 84% (96%) and 66% (91%), respectively.
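For readers unfamiliar with the N50 statistic quoted in this caption, here is a minimal sketch of one common definition: the largest length L such that contigs of length L or more contain at least half of the assembled bases. The function name `n50` and the contig lengths are hypothetical, and the review itself may compute N50 against a different total (for example, genome size rather than assembly size).

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, not data from the review.
example = [10, 20, 30, 40]
print(n50(example))
```

A more fragmented assembly of the same genome yields more, shorter contigs and therefore a lower N50, which is why the statistic is used to summarize contiguity.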
Figure 3
Fragmentation of a gene caused by fragmented genome assembly. This example shows how a five-exon gene (blue), known from cDNA sequencing or a closely related species, maps to three different short contigs (red). The assembly fails to capture all of exons e2 and e5, which run off the ends of contigs. Exon e5 maps in reverse orientation to contig c2. The cDNA would allow us to order the contigs as c1-c3-c2r (where ‘r’ means reversed), but without a full-length cDNA, we would have no indication of how these contigs were related to one another.
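The ordering step described in this caption can be sketched as follows, assuming cDNA-to-contig alignments are already available. The `Hit` record, the `order_contigs` function, and all coordinates are hypothetical; only the contig names and the expected layout c1-c3-c2r come from the caption. The sketch also assumes that each contig aligns in a single orientation and that hits from different contigs do not interleave along the cDNA.

```python
from collections import namedtuple

# One alignment of a cDNA segment to a contig (hypothetical record layout).
Hit = namedtuple("Hit", "cdna_start contig strand")


def order_contigs(hits):
    """Order contigs by where they first align along the cDNA.

    A contig aligned on the minus strand gets an 'r' suffix (reversed).
    """
    order = []
    seen = set()
    for hit in sorted(hits, key=lambda h: h.cdna_start):
        if hit.contig in seen:
            continue
        seen.add(hit.contig)
        order.append(hit.contig + ("r" if hit.strand == "-" else ""))
    return "-".join(order)


# Assumed coordinates: the exon-to-contig assignment is illustrative only.
hits = [
    Hit(900, "c2", "-"),
    Hit(0, "c1", "+"),
    Hit(150, "c1", "+"),
    Hit(300, "c3", "+"),
    Hit(450, "c3", "+"),
    Hit(600, "c3", "+"),
]
layout = order_contigs(hits)
print(layout)
```

Without the cDNA hits there is nothing to sort on, which mirrors the caption's point that the contigs' relationship would otherwise be unknown.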
