Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Jan 30;10 Suppl 1(Suppl 1):S14.
doi: 10.1186/1471-2105-10-S1-S14.

Parallel short sequence assembly of transcriptomes

Affiliations

Parallel short sequence assembly of transcriptomes

Benjamin G Jackson et al. BMC Bioinformatics. .

Abstract

Background: The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small sized prokaryotic genomes.

Results: We present a parallel method for transcriptome assembly from large short sequence data sets. Our solution uses a rigorous graph theoretic framework and tames the computational and space complexity using parallel computers. First, we construct a distributed bidirected graph that captures overlap information. Next, we compact all chains in this graph to determine long unique contigs using undirected parallel list ranking, a problem for which we present an algorithm. Finally, we process this compacted distributed graph to resolve unique regions that are separated by repeats, exploiting the naturally occurring coverage variations arising from differential expression.

Conclusion: We demonstrate the validity of our method using a synthetic high coverage data set generated from the predicted coding regions of Zea mays. We assemble 925 million sequences consisting of 40 billion nucleotides in a few minutes on a 1024 processor Blue Gene/L. Our method is the first fully distributed method for assembling a non-hierarchical short sequence data set and can scale to large problem sizes.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Bidirected graph. The bidirected model for use in assembly. The figure shows the four edge types as described in the text, as well as the corresponding edge labels in the string graph.
Figure 2
Figure 2
Sparse ruling set algorithm. The recursive step of the sparse ruling set algorithm. a) The input lists. b) The marked list with edges drawn between marked nodes, as well as pointers from unmarked nodes to nearest marked nodes (dashed). Edge weights are shown via line widths. c) The recursive problem (notice that list on the right has been completed and does not form a recursive subproblem).
Figure 3
Figure 3
Graph reduction rules. The three graph reduction rules as described in the text: a) Y to V reduction b) loop reduction c) coverage matching. Each figure is labeled with a node identifier and chain identifier, and shows the structure of the graph before and after the reduction. It also shows how the underlying chains are concatenated for each type of reduction.
Figure 4
Figure 4
Error analysis of Illumina data. Error Analysis of Illumina Data by position. We analyzed the percentage of bad bases (center line), the percentage of bad bases, given some bad base in a previous position (top line), and the percentage of bad bases, given no bad base in a previous position (bottom line).

References

    1. Emrich S, Aluru S, Fu Y, Wen T, Narayanan M, Guo L, Ashlock D, Schnable P. A Strategy for Assembling the Maize (Zea mays L.) Genome. Bioinformatics. 2004;20:140–147. - PubMed
    1. Pevzner P, Tang H, Waterman M. Fragment assembly with double-barreled data. Proceedings of the National Academy of Sciences. 2001;98:9748–9753. - PMC - PubMed
    1. Myers E. The fragment assembly string graph. Bioinformatics. 2005;21:ii79–ii85. - PubMed
    1. Medvedev P, Brudno M. Ab Initio Whole Genome Shotgun Assembly with Mated Short Reads. Lecture Notes in Computer Science. 2008;4955:50–64.
    1. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research. 2008;18:802–809. - PMC - PubMed

Publication types

MeSH terms