Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011;6(8):e23501.
doi: 10.1371/journal.pone.0023501. Epub 2011 Aug 18.

Meraculous: de novo genome assembly with short paired-end reads

Affiliations

Meraculous: de novo genome assembly with short paired-end reads

Jarrod A Chapman et al. PLoS One. 2011.

Abstract

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: Gary P. Schroth and Shujun Luo are employees of Illumina and are also shareholders in the company. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials related to this study.

Figures

Figure 1
Figure 1. Paired ends.
A. Fragment pair end separation distribution. Pairs are separated by 279±7 bp. B. Mate-pairs are produced by circularizing a genomic segment (vertical line indicates junction). End-sequences from sheared fragments that contain the junction (1) represent reads that point outward at the ends of the original segment. End-sequences from sheared fragments that do not contain the junction (2) are inwardly directed and adjacent on the original segment. C. Mate-pair end separation distribution. Two-thirds of all pairs are found to be divergently oriented and separated by 3.2±0.2 kb. An artifactual population of convergently oriented pairs separated by less than 500 bp is apparent, representing fragments of type (2) shown above in panel B.
Figure 2
Figure 2. Example of a 7-mer graph.
The node a is X-terminated to the left. The non-reciprocal linkage between nodes b and c is removed because the terminal base (lower case “a” in the sequence) of node c is low quality. Node e is F-terminated to the right. The resultant U-U contig is the union of nodes b and d: CTGCTGCT.
Figure 3
Figure 3. k-mer frequency and extension characteristics in Pichia.
A. 41-mer frequency distributions. The overall 41-mer distribution (green) is decomposed into genomic (red) and non-genomic (yellow) contributions. At fewer than ∼30 occurrences non-genomic (error-induced) 41-mers dominate. The modal frequency is ∼135. B. Graph features as functions of dmin. The total number of nodes (blue), total number of X-terminated nodes (red), and total number of F-terminated (yellow) nodes in the 41-mer graph are calculated as functions of the assembly parameter dmin. We find the optimal assembly to occur at dmin = 10.
Figure 4
Figure 4. Estimated gap sizes vs. actual contig separation in the Pichia genome.
75% of the initial inter-contig gaps are resolved during gap closing. 97% of gaps are found to be within 4 bp of their estimated size, and 58% within 1 bp.
Figure 5
Figure 5. Differences between E. coli meraculous and reference sequence identify transposon insertion.
Bottom line shows portion of the Genbank reference genome for E. coli str. K-12 substr. MG1655 produced by Sanger sequencing and directed finishing strain . Top shows alignment of the de novo meraculus contigs to reference sequence. Solid lines agree perfectly. Angled dashed lines represent unaligned meraculous contig-ends that correspond to the beginning and end of a transposable element. All short-read data supports the meraculous sequence, indicating either insertion of the transposon in the Illumina-sequenced lineage, or an error in the MG1655 reference.

References

    1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46. - PubMed
    1. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. - PMC - PubMed
    1. Bentley DR. Whole-genome re-sequencing. Curr Opin Genet Dev. 2006;16:545–552. - PubMed
    1. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. - PubMed
    1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. - PMC - PubMed

Publication types

Substances