Meraculous: de novo genome assembly with short paired-end reads

Jarrod A Chapman¹, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P Schroth, Daniel S Rokhsar

PMID: 21876754
PMCID: PMC3158087
DOI: 10.1371/journal.pone.0023501

PLoS One. 2011;6(8):e23501. doi: 10.1371/journal.pone.0023501. Epub 2011 Aug 18.

Affiliation

¹ U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America. jchapman@lbl.gov

Abstract

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

Conflict of interest statement

Competing Interests: Gary P. Schroth and Shujun Luo are employees of Illumina and are also shareholders in the company. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials related to this study.

Figures

**Figure 1. Paired ends.**
**A. Fragment pair end separation distribution.** Pairs are separated by 279±7 bp. **B. Mate-pairs are produced by circularizing a genomic segment** (vertical line indicates junction). End-sequences from sheared fragments that contain the junction (1) represent reads that point outward at the ends of the original segment. End-sequences from sheared fragments that do not contain the junction (2) are inwardly directed and adjacent on the original segment. **C. Mate-pair end separation distribution.** Two-thirds of all pairs are found to be divergently oriented and separated by 3.2±0.2 kb. An artifactual population of convergently oriented pairs separated by less than 500 bp is apparent, representing fragments of type (2) shown above in panel B.
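
The orientation and separation classes described in this caption can be illustrated with a minimal Python sketch. It assumes each read end has already been placed on a reference or contig and is given as a 5' coordinate plus a strand; the function names and the toy coordinates are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev

def classify_pair(pos1, strand1, pos2, strand2):
    """Return (orientation, separation) for two placed read ends.

    pos is the coordinate of the read's 5' end; strand is '+' or '-'.
    Separation is the distance between the two 5' coordinates
    (an assumption of this sketch, not the paper's exact definition).
    """
    # Order the two ends from left to right along the sequence.
    (lpos, lstrand), (rpos, rstrand) = sorted([(pos1, strand1), (pos2, strand2)])
    if lstrand == '+' and rstrand == '-':
        orientation = 'convergent'   # reads point toward each other
    elif lstrand == '-' and rstrand == '+':
        orientation = 'divergent'    # reads point away from each other
    else:
        orientation = 'tandem'       # both reads on the same strand
    return orientation, rpos - lpos

def summarize(pairs):
    """Group separations by orientation and report (count, mean, sd)."""
    groups = {}
    for pair in pairs:
        orientation, separation = classify_pair(*pair)
        groups.setdefault(orientation, []).append(separation)
    return {
        o: (len(s), mean(s), stdev(s) if len(s) > 1 else 0.0)
        for o, s in groups.items()
    }

# Made-up placements: two fragment-like pairs and one mate-pair-like pair.
toy_pairs = [
    (100, '+', 379, '-'),
    (900, '+', 1186, '-'),
    (5000, '+', 1800, '-'),
]
summary = summarize(toy_pairs)
```

In a mate-pair library, a summary like this would separate the divergent population from the short convergent artifact population that the caption describes.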

**Figure 2. Example of a 7-mer graph.**
The node a is X-terminated to the left. The non-reciprocal linkage between nodes b and c is removed because the terminal base (lower case “a” in the sequence) of node c is low quality. Node e is F-terminated to the right. The resultant U-U contig is the union of nodes b and d: CTGCTGCT.
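
The termination labels in this caption can be sketched in Python under some stated assumptions: a neighbouring base counts only if its quality is at least `min_qual`; an extension is unique when exactly one base is seen at least `d_min` times, a fork (`F`) when several are, and absent (`X`) when none is. These definitions, the function names, and the toy reads are this sketch's reading of the caption, not the authors' implementation, and reverse complements are ignored for brevity.

```python
from collections import defaultdict

def extension_counts(reads, k, min_qual):
    """Count high-quality bases seen immediately left/right of each k-mer."""
    left = defaultdict(lambda: defaultdict(int))
    right = defaultdict(lambda: defaultdict(int))
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if i > 0 and quals[i - 1] >= min_qual:
                left[kmer][seq[i - 1]] += 1
            if i + k < len(seq) and quals[i + k] >= min_qual:
                right[kmer][seq[i + k]] += 1
    return left, right

def classify(counts, d_min):
    """Return the unique base, 'F' for a fork, or 'X' for no extension."""
    supported = [base for base, n in counts.items() if n >= d_min]
    if len(supported) == 1:
        return supported[0]
    return 'F' if supported else 'X'

def walk_right(start, left, right, d_min):
    """Extend a contig rightward while links are unique and reciprocal."""
    contig, kmer, seen = start, start, {start}
    while True:
        base = classify(right[kmer], d_min)
        if base in ('F', 'X'):
            break
        nxt = kmer[1:] + base
        # Reciprocity: the next k-mer must point uniquely back to this one.
        if nxt in seen or classify(left[nxt], d_min) != kmer[0]:
            break
        contig += base
        seen.add(nxt)
        kmer = nxt
    return contig

# Toy data: three identical high-quality reads, then three reads adding a fork.
linear_reads = [("ACGTTGCA", [40] * 8)] * 3
forked_reads = linear_reads + [("GTTGA", [40] * 5)] * 3

l1, r1 = extension_counts(linear_reads, k=4, min_qual=20)
l2, r2 = extension_counts(forked_reads, k=4, min_qual=20)
linear_contig = walk_right("ACGT", l1, r1, d_min=2)
forked_contig = walk_right("ACGT", l2, r2, d_min=2)
```

With the linear reads the walk recovers the whole read sequence; with the extra reads it stops at the k-mer `GTTG`, whose right extension becomes a fork.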

**Figure 3. k-mer frequency and extension characteristics in *Pichia*.**
**A. 41-mer frequency distributions.** The overall 41-mer distribution (green) is decomposed into genomic (red) and non-genomic (yellow) contributions. At fewer than ∼30 occurrences non-genomic (error-induced) 41-mers dominate. The modal frequency is ∼135. **B. Graph features as functions of d_min.** The total number of nodes (blue), total number of X-terminated nodes (red), and total number of F-terminated (yellow) nodes in the 41-mer graph are calculated as functions of the assembly parameter d_min. We find the optimal assembly to occur at d_min = 10.
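
A k-mer frequency distribution and a node count as a function of d_min, as plotted in this figure, could be computed along the following lines. This is a minimal Python sketch, assuming k-mers are merged with their reverse complements (a common convention, not confirmed by this page) and that a node is any k-mer seen at least d_min times; the function names and toy reads are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_counts(reads, k):
    """Count canonical k-mers, skipping any k-mer containing an ambiguous base."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if 'N' not in kmer:
                counts[canonical(kmer)] += 1
    return counts

def frequency_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())

def nodes_at(counts, d_min):
    """Number of distinct k-mers seen at least d_min times."""
    return sum(1 for n in counts.values() if n >= d_min)

# Toy data: one k-mer appears only once, as a sequencing error might.
toy_reads = ["ACGTA"] * 3 + ["ACGTC"]
counts = kmer_counts(toy_reads, k=4)
histogram = frequency_histogram(counts)
```

Scanning `nodes_at` over a range of d_min values gives the kind of curve shown in panel B; the X- and F-terminated counts would additionally need the extension classification of each node.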

**Figure 4. Estimated gap sizes vs. actual contig separation in the *Pichia* genome.**
75% of the initial inter-contig gaps are resolved during gap closing. 97% of gaps are found to be within 4 bp of their estimated size, and 58% within 1 bp.
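
One simple way to estimate a gap size from read pairs that span two contigs, and to score estimates against known separations as this figure does, is sketched below in Python. The estimator used here (mean pair separation minus the bases each read covers up to the contig ends, averaged over pairs) is an assumption of this sketch; the estimator used in the paper may differ. All names and numbers are illustrative.

```python
from statistics import mean

def estimate_gap(spanning_pairs, mean_separation):
    """Estimate the gap between two adjacent contigs.

    spanning_pairs: list of (a, b), where a is the distance from the first
    read's 5' end to the end of the left contig and b is the distance from
    the start of the right contig to the second read's 5' end.
    mean_separation: mean 5'-to-5' separation of the library.
    """
    pairs = list(spanning_pairs)
    if not pairs:
        raise ValueError("at least one spanning pair is required")
    # Each pair implies gap = separation - a - b; average over all pairs.
    return round(mean(mean_separation - a - b for a, b in pairs))

def fraction_within(estimates, actuals, tolerance):
    """Fraction of gaps whose estimate is within `tolerance` bp of the truth."""
    hits = sum(abs(e - a) <= tolerance for e, a in zip(estimates, actuals))
    return hits / len(estimates)

# Made-up example: three pairs from a library with ~279 bp separation.
gap = estimate_gap([(100, 150), (120, 135), (90, 160)], mean_separation=279)
accuracy = fraction_within([27, 10, 5], [26, 15, 5], tolerance=1)
```

A negative estimate from this kind of calculation would suggest that the two contig ends overlap rather than being separated by unassembled sequence.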
