. 2014 May 3:15:126.

doi: 10.1186/1471-2105-15-126.

Automated ensemble assembly and validation of microbial genomes

Sergey Koren¹, Todd J Treangen, Christopher M Hill, Mihai Pop, Adam M Phillippy

Affiliations

PMID: 24884846
PMCID: PMC4030574
DOI: 10.1186/1471-2105-15-126

Automated ensemble assembly and validation of microbial genomes

Sergey Koren et al. BMC Bioinformatics. 2014.

. 2014 May 3:15:126.

doi: 10.1186/1471-2105-15-126.

Authors

Sergey Koren¹, Todd J Treangen, Christopher M Hill, Mihai Pop, Adam M Phillippy

Affiliation

¹ National Biodefense Analysis and Countermeasures Center, 110 Thomas Johnson Drive, Frederick, MD 21702, USA. korens@nbacc.net.

PMID: 24884846
PMCID: PMC4030574
DOI: 10.1186/1471-2105-15-126

Abstract

Background: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.

Results: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.

Conclusions: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.

PubMed Disclaimer

Figures

**Figure 1**
**iMetAMOS workflow and incorporated tools.** iMetAMOS currently incorporates 13 assemblers [21,33-38,40-45] and 7 validation tools [10-15,47]. Prokka [50] is used to predict genes and annotate all assembiles. Users can control the suite of assemblers and validation tools to be executed, as well as the scoring formula used to choose the best assembly. This assembly is evaluated for the presence of contamination.

**Figure 2**
**Comparison of corrected and raw N50 contig sizes for all assemblies of*R. sphaeroides***. Corrected N50 sizes were computed using the GAGE metrics [7]. The dashed vertical line indicates the auto-selected k of 35 chosen by KmerGenie. The individual points indicate assemblies from GAGE-B. The auto-selected k-mer provides the best overall corrected N50 on this dataset. One notable exception is SPAdes, for which the auto-selected k produced an N50 13% lower than the best. This is likely caused by SPAdes use of multiple k-mers for assembly, something that KmerGenie does not currently take into account. In all other cases, except when GAGE-B used EA-UTILS [58] to trim the input sequences, the automatically selected k-mer outperforms the k-mer choice from GAGE-B.

**Figure 3**
**Ratio of corrected N50 versus raw contig sizes for all assemblies of*R. sphaeroides***. Corrected N50 sizes were computed using the GAGE metrics [7]. The dashed vertical line indicates the auto-selected k of 35 chosen by KmerGenie. The individual points indicate assemblies from GAGE-B. For 3 of 11 assemblers, the automated k-mer selection provides the best corrected N50. Additionally, for 9 of the 11 assemblers, the automated k-mer selection provides a corrected to raw N50 ratio within 10% of the optimal.

**Figure 4**
**iMetAMOS validation output for an example dataset.** The leftmost tab allows navigation to view the output of each pipeline step. The selected “Validation” tab results are shown in the main window. These include the validation metrics of all successful assemblies, including a comparison and QUAST [13] report against an automatically recruited reference genome from NCBI RefSeq. The validation tab also indicates the sample shows signs of contamination. Finally, the rightmost tab shows a quick summary of the winning assembly (# reads, #contigs, # orfs). MaSuRCA did not run on this sample because it requires paired-end input.

**Figure 5**
**iMetAMOS classification output identifies possible contamination.** On sample ERR233356 retrieved from the Sequence Read Archive, the majority of data is clearly sourced from a Mycobacterium. However, a significant fraction of the data (~10% of reads or ~29% of assembly) belongs to other, mostly unidentified, organisms. A subset of 3% of the reads (1.81 Mbp of the assembly) is identified as *S. aureus* and covers over 60% of the *S. aureus* genome. iMetAMOS automatically identified this potential contaminant and binned the contigs by genus to facilitate easy confirmation and removal by the user.

See this image and copyright information in PMC

References

1. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–327. doi: 10.1016/j.ygeno.2010.03.001. - DOI - PMC - PubMed
1. Nagarajan N, Pop M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol. 2009;16(7):897–908. doi: 10.1089/cmb.2009.0005. - DOI - PubMed
1. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14(3):157–167. doi: 10.1038/nrg3367. - DOI - PubMed
1. Myers EW. Toward simplifying and accurately formulating fragment assembly. J Comput Biol. 1995;2(2):275–290. doi: 10.1089/cmb.1995.2.275. - DOI - PubMed
1. Bradnam K, Fass J, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman J, Chapuis G, Chikhi R, Chitsaz H, Chou W-C, Corbeil J, Del Fabbro C, Docking T, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca N, Ganapathy G, Gibbs R, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt J, Ho I. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013;2(1):10. doi: 10.1186/2047-217X-2-10. - DOI - PMC - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Automated ensemble assembly and validation of microbial genomes

Affiliation

Automated ensemble assembly and validation of microbial genomes

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Research Materials