Bioinformatics. 2016 Jul 15;32(14):2103-10.
doi: 10.1093/bioinformatics/btw152. Epub 2016 Mar 19.

Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Heng Li. Bioinformatics.

Abstract

Motivation: Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore technologies (ONT) produce reads over 10 kb in length, which have enabled high-quality genome assembly at an affordable cost. However, at present, long reads have an error rate as high as 10-15%. Complex and computationally intensive pipelines are required to assemble such reads.

Results: We present a new mapper, minimap, and a de novo assembler, miniasm, for efficiently mapping and assembling SMRT and ONT reads without an error correction stage. They can often assemble a sequencing run of bacterial data into a single contig in a few minutes, and assemble 45-fold Caenorhabditis elegans data in 9 min, orders of magnitude faster than the existing pipelines, though the consensus sequence error rate is as high as that of the raw reads. We also introduce a pairwise read mapping format and a graphical fragment assembly format, and demonstrate the interoperability between our tools and current tools.

Availability and implementation: https://github.com/lh3/minimap and https://github.com/lh3/miniasm

Contact: hengli@broadinstitute.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Mapping between two reads. $b_1$ and $e_1$ are the 0-based starting and ending mapping coordinates of the first read $v$, respectively. $b_2$ and $e_2$ are the mapping coordinates of read $w$. Light gray areas indicate overhang regions that should be mapped together if the overlap is real. If the overhang regions are small enough, the figure implies an edge $v \to w$ with approximate length $l(v \to w) = b_1 - b_2$ and its complement edge $\bar{w} \to \bar{v}$ with $l(\bar{w} \to \bar{v}) = (l_2 - e_2) - (l_1 - e_1)$.
Fig. 2.
Dotter plot comparing the miniasm assembly and the C. elegans reference genome. Thin gray lines mark the contig or chromosome boundaries. The three arrows indicate large-scale misassemblies visible from the plot. The mapping is done with ‘minimap -L500’.
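
The edge lengths described in the Fig. 1 caption can be illustrated with a small Python sketch. This is not the miniasm implementation. It assumes that both reads map on the same strand, that l1 and l2 are the full lengths of reads v and w, and that a hypothetical `max_overhang` threshold (default 1000, chosen only for illustration) decides whether the overhang regions are "small enough". The names `Mapping` and `edge_lengths` are invented for this example.

```python
from typing import NamedTuple, Optional, Tuple


class Mapping(NamedTuple):
    """One same-strand mapping between reads v and w (hypothetical container)."""
    l1: int  # length of read v (assumed meaning of l1)
    b1: int  # 0-based start of the mapping on v
    e1: int  # end of the mapping on v
    l2: int  # length of read w (assumed meaning of l2)
    b2: int  # 0-based start of the mapping on w
    e2: int  # end of the mapping on w


def edge_lengths(m: Mapping, max_overhang: int = 1000) -> Optional[Tuple[int, int]]:
    """Return (length of v->w, length of complement edge), or None.

    None is returned when the overhang is too large or when the mapping is
    not in the configuration drawn in Fig. 1 (suffix of v over prefix of w).
    """
    tail_v = m.l1 - m.e1  # unmapped bases after the mapping on v
    tail_w = m.l2 - m.e2  # unmapped bases after the mapping on w

    # Overhang: the unmapped flanks that would have to align if the overlap were real.
    overhang = min(m.b1, m.b2) + min(tail_v, tail_w)
    if overhang > max_overhang:
        return None  # overhang not small enough (illustrative threshold)

    # Require v to extend further left and w to extend further right.
    if m.b1 <= m.b2 or tail_v >= tail_w:
        return None  # containment or the opposite orientation; not handled here

    return (m.b1 - m.b2, tail_w - tail_v)


# Example: the last ~6 kb of a 10 kb read v overlaps the first ~6 kb of a 12 kb read w.
example = Mapping(l1=10000, b1=4000, e1=9950, l2=12000, b2=50, e2=6000)
print(edge_lengths(example))
```

For the example mapping, the overhang is 50 + 50 = 100 bases, so the sketch reports an edge of length 4000 - 50 = 3950 and a complement edge of length (12000 - 6000) - (10000 - 9950) = 5950.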
