Genome Biol. 2015 Sep 24:16:207.
doi: 10.1186/s13059-015-0764-4.

Metassembler: merging and optimizing de novo genome assemblies

Alejandro Hernandez Wences et al. Genome Biol. 2015.

Abstract

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for meta-assembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
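The figure of 120 permutations follows from the number of possible merge orders of five assemblies (5! = 120). A minimal sketch that enumerates those orders is shown here; the assembly labels are taken from the Fig. 1 caption, and the variable names are illustrative only, not part of the metassembler software.

```python
from itertools import permutations

# Labels of the five Assemblathon 1 input assemblies, as named in the Fig. 1 caption.
ASSEMBLIES = ["Broad", "BGI", "WTSI", "DOEJGI", "CSHL"]

# Each permutation is one candidate merge order: the first entry is the
# starting assembly, and the rest are merged into it one at a time.
merge_orders = ["-".join(order) for order in permutations(ASSEMBLIES)]

print(len(merge_orders))   # number of merge orders to evaluate
print(merge_orders[0])     # first order produced by itertools
```

Because merge order matters in a progressive scheme, each of these orders can yield a different final metassembly, which is why they are evaluated separately.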


Figures

Fig. 1
Assemblathon 1 metassembly accuracy. Assembly contiguity and accuracy metrics are plotted at each merging step for all possible permutations of the five input assemblies: scaffold N50 (a), corrected contig N50 (b), duplicated reference bases (c), deleted reference bases (d), translocations (e), and relocations (f). For all plots, the x-axis represents the number of input assemblies being metassembled, with 1 being the starting assembly. The two horizontal red lines mark the final maximum and minimum value of the metric across all permutations. Most of the permutations are plotted in gray, while permutations of particular note are plotted with different colors: the pink line represents the permutation that has the maximum value in the final metassembly, while the dark blue line represents the permutation with the minimum value. In addition, the green line represents the permutation resulting from ordering the input assemblies by the overall rank reported in the Assemblathon 1 paper (Broad-BGI-WTSI-DOEJGI-CSHL); the light blue line represents the permutation obtained by ordering the input assemblies by scaffold N50 size (DOEJGI-Broad-WTSI-CSHL-BGI); and the brown line represents the order by contig N50 size (BGI-Broad-CSHL-WTSI-DOEJGI). Abbreviations: Comp Ref Bases, compressed reference bases; Dup Ref Bases, duplicated reference bases
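For readers unfamiliar with the N50 metric used in panels (a) and (b), the following is a minimal sketch of the standard definition: the largest length L such that sequences of length at least L together cover at least half of the total assembly length. It is not the paper's evaluation code, and it computes a plain N50 rather than the corrected contig N50 mentioned in the caption; the function name and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Made-up scaffold lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```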
Fig. 2
Boxplots of overall Z scores for Assemblathon 1 metassemblies grouped by initial assembly. Blue circles indicate the Z score of the corresponding initial assembly. Below each circle, the corresponding mean difference in Z scores between the final metassembly and the initial assembly (μ∆) is shown. The global mean difference is also shown at the top
Fig. 3
Assemblathon 2 metassembly contiguity and accuracy metrics. Assembly contiguity and accuracy metrics are shown at each merging step of all metassemblies for the three species. The x-axis represents the number of assemblies merged, with 1 being the initial input assembly. Abbreviations: Ctg, contig; Scf, scaffold
Fig. 4
Metassembly of fish BCM scaffold FISH00033861. A representation of the changes made to a single scaffold throughout the metassembler pipeline is shown. Scaffold FISH00033861 of the BCM fish assembly (bottom) is taken as the starting point in the metassembly corresponding to the Assemblathon 2 Z score ordering. Vertical blue and green lines represent indel corrections and gap closures made at each merging step
Fig. 5
Schematic representation of the pairwise merging process. Dark color represents alignment blocks between the primary and secondary assemblies. Light color represents unaligned sequences. 1) For blocks of aligned sequence, the algorithm inserts the primary sequence to the new metassembly. 2) Insertion in the primary with respect to the secondary assembly: because the CE statistic is a large positive value (>3) for the primary sequence, the algorithm skips the primary insertion and chooses the secondary sequence instead. 3) Both assemblies have an unaligned insertion: because the primary insertion is shorter than the secondary insertion, and because the primary has a large negative CE statistic (< −3), the algorithm will choose the secondary insertion over the primary, thus correcting the CE statistic
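The three cases in the Fig. 5 caption can be read as a small decision rule. The sketch below encodes only what the caption states (the ±3 CE threshold and the length comparison); it is not the metassembler implementation. The `Region` type, its field names, and the fallback of keeping the primary sequence when no rule fires are assumptions made for illustration.

```python
from dataclasses import dataclass

CE_THRESHOLD = 3.0  # |CE| cutoff quoted in the Fig. 5 caption


@dataclass
class Region:
    """Hypothetical summary of one region compared between two assemblies."""
    aligned: bool        # True if the region is an alignment block
    primary_len: int     # length of the primary sequence in this region
    secondary_len: int   # length of the secondary sequence in this region
    primary_ce: float    # CE statistic of the primary sequence


def choose_source(region: Region) -> str:
    """Return which assembly supplies the sequence for this region."""
    # Case 1: aligned blocks always take the primary sequence.
    if region.aligned:
        return "primary"
    # Case 2: primary is longer and its CE is a large positive value.
    if region.primary_ce > CE_THRESHOLD and region.primary_len > region.secondary_len:
        return "secondary"
    # Case 3: primary is shorter and its CE is a large negative value.
    if region.primary_ce < -CE_THRESHOLD and region.primary_len < region.secondary_len:
        return "secondary"
    # Assumption: with no strong CE evidence, keep the primary sequence.
    return "primary"


print(choose_source(Region(aligned=False, primary_len=500, secondary_len=0, primary_ce=4.2)))
```

The design point the caption suggests is that the primary assembly is trusted by default, and the secondary sequence is substituted only where the CE statistic flags the primary as likely expanded or compressed.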
Fig. 6
Schematic diagram of the progressive metassembly of three assemblies. All three input assemblies have gap sequences and a variety of errors such that no pair of assemblies will create a perfect assembly. However, the final metassembly of all three assemblies together will reconstruct the entire correct genome. Abbreviations: Gap Seq, gap sequence; Scf, scaffold
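The progressive scheme in Fig. 6 can be pictured as folding a pairwise merge over an ordered list of assemblies. The toy sketch below models only gap filling: each assembly is a list of segments, with `None` standing for a gap, and the merge keeps the primary segment wherever one exists. Alignment, error correction, and the CE statistic are deliberately left out, and all names and sequences are invented for illustration.

```python
from functools import reduce


def merge_pair(primary, secondary):
    """Toy pairwise merge: keep primary segments, fill its gaps from secondary."""
    return [p if p is not None else s for p, s in zip(primary, secondary)]


# Three made-up assemblies of the same three-segment genome; None marks a gap.
asm1 = ["AC", None, None]
asm2 = [None, "GT", None]
asm3 = [None, None, "TT"]

# Merging any two assemblies still leaves a gap.
two_way = merge_pair(asm1, asm2)

# Progressive metassembly: merge the running result with each next assembly.
final = reduce(merge_pair, [asm1, asm2, asm3])

print(two_way)
print("".join(final))
```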
