BMC Bioinformatics. 2012 Mar 23;13:48.
doi: 10.1186/1471-2105-13-48.

Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer

Pierre Peterlongo et al. BMC Bioinformatics. 2012.

Abstract

Background: The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (de novo assemblers) are typically employed to process such data. However, these methods require large memory resources and long computation times. Many basic biological questions could be answered by targeting specific information in the reads, thus avoiding complete assembly.

Results: We present Mapsembler, an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around each of them, either as a plain sequence or as a graph showing contextual structure. We introduce new algorithms to retrieve approximate occurrences of a sequence from reads and to construct an extension graph. Among other results presented in this paper, Mapsembler made it possible to retrieve previously described candidate fusion genes in human breast cancer and to detect new ones.

Conclusions: Mapsembler is the first software that enables de novo discovery of repeats, SNPs, exon skipping, gene fusions, and other structural events around a region of interest, directly from raw sequencing reads. As indexing is localized, the memory footprint of Mapsembler is negligible. Mapsembler is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/mapsembler/.

Figures

Figure 1
Algorithm overview. Overview of the algorithm steps with reads of length 7, a minimal coverage of 2 and k-mers of length k=3. a) Representation of the sub-starter generation step. A set of reads is mapped to the starter s. First, reads are error-corrected according to a voting procedure (see the lower right read, for instance). Then, each sub-starter (s1 and s2) is computed from each perfect multiple read alignment. The Hamming distance between each sub-starter and s is required to be below a certain threshold. b) Representation of an extension. Three reads have a prefix of length at least k that maps perfectly to the suffix of an extension s. All fragments of these reads that reach beyond extension s are used to generate the extension of s. As the minimal coverage is 2, the last character of the first extending read (T) is not stored when generating the extension of s. The generated extension of s (ACT) is stored in a new node linked to extension s. Note that the suffix of length k−1 of extension s (TC) is stored as a prefix of the extension of s (then called an enriched extension). This avoids omitting k-mers that overlap two extensions, such as TCA or CAC, while mapping reads on the extension of s.
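The extension step in panel b) can be illustrated with a minimal Python sketch. It is not Mapsembler's implementation: the function names (`overhang`, `extend`, `enrich`) and the toy reads are hypothetical, chosen only to reproduce the values quoted in the caption (k=3, minimal coverage 2, extension ACT, enriched prefix TC). The sketch also assumes a single linear extension, following the most frequent base at each position, whereas the actual tool builds a graph.

```python
from collections import Counter


def overhang(seq, read, k):
    """Return the part of `read` that reaches past the end of `seq`, or None.

    A read qualifies when one of its prefixes, of length at least k,
    matches the suffix of `seq` exactly.
    """
    # Try the longest overlap first; the overlap stays shorter than the
    # read so that at least one character is left over.
    for overlap in range(min(len(seq), len(read) - 1), k - 1, -1):
        if seq.endswith(read[:overlap]):
            return read[overlap:]
    return None


def extend(seq, reads, k, min_coverage):
    """Build a linear extension of `seq` from the overhanging read fragments.

    Simplifying assumption: at each position only the most frequent base is
    followed, and the extension stops once fewer than `min_coverage` reads
    support it.
    """
    tails = [t for t in (overhang(seq, r, k) for r in reads) if t]
    extension = []
    pos = 0
    while tails:
        column = Counter(t[pos] for t in tails if len(t) > pos)
        if not column:
            break
        base, count = column.most_common(1)[0]
        if count < min_coverage:
            break
        extension.append(base)
        # Keep only the fragments that agree with the chosen base.
        tails = [t for t in tails if len(t) > pos and t[pos] == base]
        pos += 1
    return "".join(extension)


def enrich(seq, extension, k):
    """Prefix the extension with the last k-1 characters of `seq`."""
    return seq[max(0, len(seq) - (k - 1)):] + extension


# Hypothetical toy data: three reads of length 7 overlapping the end of s.
s = "GGATC"
reads = ["ATCACTT", "GATCACT", "GGATCAC"]
ext = extend(s, reads, k=3, min_coverage=2)
enriched = enrich(s, ext, k=3)
```

With these toy reads the final T of the first read is covered only once, so it is dropped; the k−1 prefix kept by `enrich` is what lets k-mers such as TCA and CAC be found when reads are later mapped on the new node.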
Figure 2
Graph simplification. Graph simplification (Algorithm 1, Step 15). a) the graph before simplification. b) After removing the first k−1 characters of each internal node and after merging non branching nodes. c) After common prefix and suffix factorizations.
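Panel b) of this simplification can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Algorithm 1: the graph representation (two dictionaries), the function name `simplify`, and the reading of "internal node" as "every node except the starter" are all hypothetical choices. The prefix and suffix factorization of panel c) is not covered.

```python
from collections import Counter


def simplify(seqs, succ, k, root):
    """Strip the k-1 overlap prefixes, then merge non-branching nodes.

    seqs: node id -> (enriched) sequence
    succ: node id -> list of successor node ids
    root: id of the starter node, whose sequence is left untouched
    """
    # Assumption: every node except the root carries a k-1 overlap prefix.
    seqs = {n: (s if n == root else s[k - 1:]) for n, s in seqs.items()}
    succ = {n: list(succ.get(n, [])) for n in seqs}
    # Number of predecessors of each node; merging never changes the count
    # of the nodes that remain.
    pred_count = Counter(c for children in succ.values() for c in children)

    merged = True
    while merged:
        merged = False
        for u in list(seqs):
            if len(succ[u]) != 1:
                continue
            v = succ[u][0]
            # u -> v is non-branching when v has u as its only predecessor.
            if v != u and v != root and pred_count[v] == 1:
                seqs[u] += seqs.pop(v)
                succ[u] = succ.pop(v)
                merged = True
                break
    return seqs, succ


# Hypothetical toy graph with k = 3: S -> A, then A branches to B and C.
seqs = {"S": "GGATC", "A": "TCACT", "B": "CTGG", "C": "CTAA"}
succ = {"S": ["A"], "A": ["B", "C"]}
new_seqs, new_succ = simplify(seqs, succ, k=3, root="S")
```

In this toy graph S and A collapse into one node because the edge between them does not branch, while B and C remain separate children of the merged node.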
Figure 3
Repeated starter. Graph obtained using a repeat occurrence as starter. For readability, the prefixes of left extensions and the suffixes of right extensions, as well as the core of the starter, are truncated.
Figure 4
Drosophila exon. Visualization of Mapsembler results on a Drosophila read data set. Red characters correspond to splice sites found by mapping using Blat [17], while the circled characters are a stop codon.
Figure 5
Drosophila exon - Blat result. Visualization of Blat results on sequences obtained from the graph presented in Figure 4. The shorter path corresponds to the concatenation of the sequences from the starter node (blue node) and from the lowest node, while the longer path corresponds to the concatenation of the sequences from the starter node, the leftmost node and the lowest node. The central node includes, but is not limited to, the known EST CO332306.
Figure 6
Drosophila SNPs. Visualization of Mapsembler results on a Drosophila read data set, looking for known SNPs. On this graph, 2 SNPs (circled nodes) in the right extensions are shown. Full sequences are truncated.
Figure 7
Gene fusion in human breast cancer. Extension graph of an extremity of an exon from the VAPB human gene located on chromosome 20. a) The raw graph produced by Mapsembler. b) The same graph manually curated by mapping the sequence of each node on the human genome. Nodes were moved in order to reflect their relative mapping position on the chromosomes. Nodes from the raw graph whose sequences map at the same position were merged. For each node, the start and stop positions of the mapping are indicated. The presence of two start and stop positions reflects the presence of a central intron. Except for the purple node, which has multiple hits across the genome, 100% of the sequence of each node was mapped, either to an exon from gene VAPB on chromosome 20 or to an exon from gene IKZF3 on chromosome 17. The bold edge corresponds to the gene fusion found in [18], while the two other edges starting from the starter and targeting a chromosome 17 exon are new gene fusions.
Figure 8
Gene fusion in human breast cancer - Blat results. Blat [17] results obtained after mapping paths from the starter to a leaf of the graph presented in Figure 7. The succession of nodes of each mapped path (black lines) is indicated by their identifiers (red letters in Figure 7). Paths belonging to gene VAPB on chromosome 20 are represented in the upper part of the figure (S_A_B_C_D, S_A_F_G_D and S_A_E), while those belonging to gene IKZF3 on chromosome 17 (H, I_J and K_J) are represented in the lower part. Note that the starter is not mapped on gene IKZF3, as it appears only on chromosome 20 in the genome. However, it is concatenated to the rightmost exons of each of the three paths (H, I_J and K_J) in the transcripts.

References

    1. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20(2):265. doi: 10.1101/gr.097261.109.
    2. Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nat Meth. 2011;8:61–65. doi: 10.1038/nmeth.1527. [http://dx.doi.org/10.1038/nmeth.1527], [http://www.nature.com/nmeth/journal/v8/n1/abs/nmeth.1527.html#supplemen...]
    3. Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng HW. Comparative Studies of de novo Assembly Tools for Next-generation Sequencing Technologies. Bioinformatics (Oxford, England) 2011;27(15):2031–2037. doi: 10.1093/bioinformatics/btr319. [http://www.ncbi.nlm.nih.gov/pubmed/21636596]
    4. Bastian M, Heymann S, Jacomy M. Gephi: An open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media. 2009. pp. 361–362. [http://www.aaai.org/ocs/index.php/ICWSM/09/paper/download/154/1009]
    5. Cline MS. et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366–2382. doi: 10.1038/nprot.2007.324. [http://dx.doi.org/10.1038/nprot.2007.324]