Review
Yale J Biol Med. 2016 Sep 30;89(3):353-362.
eCollection 2016 Sep.

Metagenomic Assembly: Overview, Challenges and Applications

Jay S Ghurye et al. Yale J Biol Med. 2016.

Abstract

Advances in sequencing technologies have led to the increased use of high-throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.

Keywords: Assembly; Metagenomics; Microbiome.

Figures

Figure 1
Overview of different de novo assembly paradigms. Schematic representation of the three main paradigms for genome assembly: greedy, Overlap-Layout-Consensus, and de Bruijn graph. In a greedy assembler, the reads with the largest overlaps are iteratively merged into contigs. In the Overlap-Layout-Consensus approach, a graph is constructed by finding overlaps between all pairs of reads. This graph is then simplified, and contigs are constructed by finding branch-less paths in the graph and taking the consensus sequence of the overlapping reads implied by those paths. Contigs are further organized and extended using mate pair information. In de Bruijn graph assemblers, reads are chopped into short overlapping segments (k-mers), which are organized in a de Bruijn graph structure based on their co-occurrence across reads. The graph is simplified to remove artifacts due to sequencing errors, and branch-less paths are reported as contigs.
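
The de Bruijn paradigm described in this caption can be illustrated with a minimal Python sketch. It is not the implementation of any particular assembler: the function names `build_de_bruijn` and `branchless_contigs` are hypothetical, nodes are taken to be (k-1)-mers with one edge per distinct k-mer, and the error-removal step mentioned in the caption is omitted, so the reads are assumed to be error-free.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Every distinct k-mer in the reads contributes one edge from its
    (k-1)-length prefix to its (k-1)-length suffix.
    """
    succ = defaultdict(set)  # node -> set of successor nodes
    pred = defaultdict(set)  # node -> set of predecessor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def branchless_contigs(succ, pred):
    """Report each maximal branch-less path as a contig string.

    Assumption of this sketch: isolated cycles (where every node has
    exactly one predecessor and one successor) are not reported.
    """
    nodes = set(succ) | set(pred)

    def is_internal(node):
        # An internal node has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_internal(node):
            continue  # paths start only at sources or branching nodes
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            while is_internal(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return contigs


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
succ, pred = build_de_bruijn(reads, k=4)
print(branchless_contigs(succ, pred))
```

With these three overlapping reads the graph is a single unbranched path, so one contig spanning all of them is reported. Reads that diverge after a shared prefix introduce a branching node, and the contigs break there.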
Figure 2
Metagenomic assembly pipeline. Multiple bacterial genomes within a community are represented as circles of different colors, with circles of the same color indicating multiple individuals from the same organism. Note the different levels of sequencing coverage for the individual organisms' genomes, which reflect the different abundances of the organisms in the original sample. After sequencing, redundant reads can be removed through digital normalization, reducing the computational needs of assembly. The filtered reads are then assembled into contigs, which are classified using k-mer and coverage statistics. Contigs in each group are then binned to form draft genome sequences for organisms within the population.
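
The digital normalization step in this pipeline can be sketched as follows. The caption does not specify the procedure, so this assumes the commonly described form: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is below a coverage cutoff. The function names and the parameters `k` and `cutoff` are hypothetical, and exact counting with a `Counter` stands in for the memory-efficient approximate counters that practical tools would likely use.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Discard reads whose estimated coverage already reaches the cutoff.

    Coverage of a read is estimated as the median count of its k-mers
    among the reads kept so far (an assumption of this sketch).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, so it cannot be scored
        if median(counts[x] for x in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Five copies of an abundant read and one rare read.
reads = ["ACGTACGA"] * 5 + ["TTGCATGC"]
print(digital_normalize(reads, k=4, cutoff=2))
```

In this toy input, only two copies of the abundant read survive, while the single rare read is retained. This is the intended effect: coverage of abundant organisms is capped without losing low-abundance ones.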
