BMC Bioinformatics. 2014 May 3;15:126.
doi: 10.1186/1471-2105-15-126.

Automated ensemble assembly and validation of microbial genomes


Sergey Koren et al. BMC Bioinformatics. 2014.

Abstract

Background: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis, but this labor-intensive process is often infeasible.

Results: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.
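The selection step described above (score several assemblies on multiple validation metrics, then pick one) can be illustrated with a minimal sketch. This is not the iMetAMOS scoring formula, which the abstract does not specify; the function name `rank_assemblies`, the weighted min-max scoring scheme, and the metric names and values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def rank_assemblies(metrics: Dict[str, Dict[str, float]],
                    weights: Dict[str, float]) -> List[str]:
    """Rank assemblies by a weighted sum of min-max normalized metrics.

    metrics maps assembly name -> {metric name -> value}.
    weights maps metric name -> weight; use a negative weight for
    metrics where lower is better (for example, misassembly counts).
    This scoring scheme is an assumption, not the iMetAMOS formula.
    """
    names = list(metrics)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        values = [metrics[name][metric] for name in names]
        lo, hi = min(values), max(values)
        span = hi - lo
        for name in names:
            # Normalize to [0, 1]; a metric with no spread contributes nothing.
            norm = (metrics[name][metric] - lo) / span if span else 0.0
            scores[name] += weight * norm
    # Best-scoring assembly first.
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical validation results for three assemblies of one sample.
example_metrics = {
    "assembler_a": {"n50": 100000, "misassemblies": 5},
    "assembler_b": {"n50": 80000, "misassemblies": 1},
    "assembler_c": {"n50": 50000, "misassemblies": 3},
}
example_weights = {"n50": 1.0, "misassemblies": -1.0}
ranking = rank_assemblies(example_metrics, example_weights)
print(ranking[0])
```

Changing the weights changes the winner, which mirrors the point that users may prefer different assemblies for different analyses.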

Conclusions: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.


Figures

Figure 1
iMetAMOS workflow and incorporated tools. iMetAMOS currently incorporates 13 assemblers [21,33-38,40-45] and 7 validation tools [10-15,47]. Prokka [50] is used to predict genes and annotate all assemblies. Users can control the suite of assemblers and validation tools to be executed, as well as the scoring formula used to choose the best assembly. The chosen assembly is then evaluated for the presence of contamination.
Figure 2
Comparison of corrected and raw N50 contig sizes for all assemblies of R. sphaeroides. Corrected N50 sizes were computed using the GAGE metrics [7]. The dashed vertical line indicates the auto-selected k of 35 chosen by KmerGenie. The individual points indicate assemblies from GAGE-B. The auto-selected k-mer provides the best overall corrected N50 on this dataset. One notable exception is SPAdes, for which the auto-selected k produced an N50 13% lower than the best. This is likely caused by SPAdes's use of multiple k-mers for assembly, something that KmerGenie does not currently take into account. In all other cases, except when GAGE-B used EA-UTILS [58] to trim the input sequences, the automatically selected k-mer outperforms the k-mer choice from GAGE-B.
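For readers unfamiliar with the metric, the following minimal sketch shows how a raw N50 is computed from contig lengths, and how a corrected N50 can be obtained by first splitting contigs at known error positions. It is an illustration only: the GAGE scripts derive the break positions from alignments to a reference, whereas here the breakpoints are assumed to be given, and the function names and toy data are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    The N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def split_at_breakpoints(length, breakpoints):
    """Split one contig of the given length at internal breakpoints."""
    cuts = sorted({p for p in breakpoints if 0 < p < length})
    edges = [0] + cuts + [length]
    return [right - left for left, right in zip(edges, edges[1:])]


def corrected_n50(contigs, breakpoints):
    """N50 after breaking contigs at assumed error positions.

    contigs maps contig name -> length; breakpoints maps contig
    name -> list of positions where the contig should be broken.
    """
    pieces = []
    for name, length in contigs.items():
        pieces.extend(split_at_breakpoints(length, breakpoints.get(name, [])))
    return n50(pieces)


# Toy assembly with one misassembly in the middle of the longest contig.
toy_contigs = {"c1": 100, "c2": 80, "c3": 60, "c4": 40, "c5": 20}
toy_breaks = {"c1": [50]}
print(n50(list(toy_contigs.values())), corrected_n50(toy_contigs, toy_breaks))
```

In the toy data the single break lowers the N50, which is why an assembler with long but error-prone contigs can rank lower on the corrected metric than on the raw one.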
Figure 3
Ratio of corrected to raw N50 contig sizes for all assemblies of R. sphaeroides. Corrected N50 sizes were computed using the GAGE metrics [7]. The dashed vertical line indicates the auto-selected k of 35 chosen by KmerGenie. The individual points indicate assemblies from GAGE-B. For 3 of the 11 assemblers, the automated k-mer selection provides the best corrected N50. Additionally, for 9 of the 11 assemblers, the automated k-mer selection provides a corrected-to-raw N50 ratio within 10% of the optimal.
Figure 4
iMetAMOS validation output for an example dataset. The leftmost tab allows navigation to view the output of each pipeline step. The selected “Validation” tab results are shown in the main window. These include the validation metrics of all successful assemblies, including a comparison and QUAST [13] report against an automatically recruited reference genome from NCBI RefSeq. The validation tab also indicates that the sample shows signs of contamination. Finally, the rightmost tab shows a quick summary of the winning assembly (# reads, # contigs, # ORFs). MaSuRCA did not run on this sample because it requires paired-end input.
Figure 5
iMetAMOS classification output identifies possible contamination. On sample ERR233356 retrieved from the Sequence Read Archive, the majority of data is clearly sourced from a Mycobacterium. However, a significant fraction of the data (~10% of reads or ~29% of assembly) belongs to other, mostly unidentified, organisms. A subset of 3% of the reads (1.81 Mbp of the assembly) is identified as S. aureus and covers over 60% of the S. aureus genome. iMetAMOS automatically identified this potential contaminant and binned the contigs by genus to facilitate easy confirmation and removal by the user.
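The binning idea in this caption can be sketched as follows: group contigs by their assigned genus, compute each genus's share of the assembled bases, and flag everything outside the dominant genus. This is a hypothetical illustration, not the iMetAMOS implementation; the function names, the 1% reporting threshold, and the toy contigs are assumptions.

```python
from collections import defaultdict


def bin_by_genus(contigs):
    """Group (contig_id, length, genus) records by genus."""
    bins = defaultdict(list)
    for contig_id, length, genus in contigs:
        bins[genus].append((contig_id, length))
    return dict(bins)


def assembly_fractions(bins):
    """Return each genus's fraction of the total assembled bases."""
    sizes = {genus: sum(length for _, length in members)
             for genus, members in bins.items()}
    total = sum(sizes.values())
    if total == 0:
        return {genus: 0.0 for genus in sizes}
    return {genus: size / total for genus, size in sizes.items()}


def flag_possible_contaminants(fractions, min_fraction=0.01):
    """List non-dominant genera above an assumed reporting threshold."""
    if not fractions:
        return []
    dominant = max(fractions, key=fractions.get)
    return sorted(genus for genus, frac in fractions.items()
                  if genus != dominant and frac >= min_fraction)


# Toy classification results; lengths are in base pairs.
toy_records = [
    ("contig_1", 700, "Mycobacterium"),
    ("contig_2", 200, "Staphylococcus"),
    ("contig_3", 100, "unclassified"),
]
toy_bins = bin_by_genus(toy_records)
toy_fractions = assembly_fractions(toy_bins)
print(flag_possible_contaminants(toy_fractions))
```

Keeping the contig identifiers in each bin lets a user inspect and remove a flagged genus as a unit, which is the workflow the caption describes.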

References

    1. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–327. doi: 10.1016/j.ygeno.2010.03.001.
    2. Nagarajan N, Pop M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol. 2009;16(7):897–908. doi: 10.1089/cmb.2009.0005.
    3. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14(3):157–167. doi: 10.1038/nrg3367.
    4. Myers EW. Toward simplifying and accurately formulating fragment assembly. J Comput Biol. 1995;2(2):275–290. doi: 10.1089/cmb.1995.2.275.
    5. Bradnam K, Fass J, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman J, Chapuis G, Chikhi R, Chitsaz H, Chou W-C, Corbeil J, Del Fabbro C, Docking T, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca N, Ganapathy G, Gibbs R, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt J, Ho I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013;2(1):10. doi: 10.1186/2047-217X-2-10.
