Review
Genome Res. 2020 Mar;30(3):315-333.
doi: 10.1101/gr.258640.119. Epub 2020 Mar 18.

Accurate and complete genomes from metagenomes

Lin-Xing Chen et al. Genome Res. 2020 Mar.

Abstract

Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.


Figures

Figure 1.
The low frequency of single-nucleotide variants (SNVs) in a recently published CMAG. A randomly chosen region (1100 bp in length, centered on position 123,456) of the CMAG of Candidatus Fluviicola riflensis is shown with mapped reads (Banfield et al. 2017). SNVs that occur only once are indicated by black boxes, and the one replicated SNV is indicated by a red box. The consensus sequence is clearly well supported. Reads were mapped to the genome with Bowtie 2 and visualized in Geneious.
Figure 2.
Genome-resolved metagenomics is essential to better investigate microbial diversity. (A) The inner dendrogram displays the hierarchical clustering of 3761 “novel” Kowarsky et al. contigs based on their tetranucleotide frequency (using Euclidean distance and Ward clustering) with the set of contigs that identify the genome in these data that is a member of the Candidate Phyla Radiation (CPR). The two inner layers display the length and GC content of each contig, whereas the outermost layer marks each contig that contains one or more bacterial single-copy core genes. Finally, the second-outermost layer marks each contig that originates from the assemblies of blood samples from pregnant women. Although the pregnant women cohort was only one of four cohorts of individuals in Kowarsky et al. (2017) (the others being heart transplant, lung transplant, and bone marrow transplant patients), most ribosomal proteins we found in the assembly originated from contigs that were assembled from the pregnant women (Supplemental Table S1). The signal in this layer shows that contigs with bacterial single-copy core genes associate very closely with other contigs based on tetranucleotide frequencies, and that most of these contigs are assembled from blood metagenomes of pregnant women. This provides additional confidence that this group of contigs represents a single microbial population genome within the “novel” set of contigs that Kowarsky et al. (2017) released in their original publication. (B) Comparison of the initial CPR bin we identified in the “novel” set of contigs to the final CPR bin we refined using the entire set of contigs, which included non-novel contigs we obtained from the authors of the original study (M Kowarsky, J Camunas-Soler, M Kertesz, et al., pers. comm.). (C) Phylogenetic analyses show the placement of the CPR bin in the context of CPR genomes released by Brown et al. (2015).
More details of this case study are available at http://merenlab.org/data/parcubacterium-in-hbcfdna/.
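The tetranucleotide-frequency comparison mentioned in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function name and toy contigs are hypothetical, reverse-complement merging (which some binning tools apply) is omitted, and only the Euclidean distance step is shown. Ward clustering would then be run on these vectors with a hierarchical clustering library.

```python
from itertools import product
from math import dist

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequency(seq):
    """Return the relative frequency of each tetranucleotide in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


# Hypothetical toy contigs: a and b share composition, c does not.
contig_a = "ACGT" * 50
contig_b = "ACGT" * 40
contig_c = "AATT" * 50

freq_a = tetranucleotide_frequency(contig_a)
freq_b = tetranucleotide_frequency(contig_b)
freq_c = tetranucleotide_frequency(contig_c)

# Euclidean distance between frequency vectors; smaller means more similar.
print(dist(freq_a, freq_b), dist(freq_a, freq_c))
```

In this toy case the two contigs with the same repeat unit sit much closer together than the contig with a different composition, which is the property that composition-based clustering relies on.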
Figure 3.
Contamination in a MAG without extra copies of SCGs. In the left panel, the half-circle displays the mean coverage of each contig in the Pasolli MAG across three plaque metagenomes that belong to the same individual; the “star” symbol denotes the sample from which the original MAG was reconstructed. The dendrogram in the center represents the hierarchical clustering of the 57 contigs based on their sequence composition and differential mean coverage across the three metagenomes, and the innermost circle displays the GC content of each contig. The outermost circle marks two clusters: one with 46 contigs (green) and another with 11 contigs (orange). The table underneath this display summarizes various statistics about these two clusters, including the best matching taxonomy, total length, completion and redundancy (C/R) estimations based on SCGs, and the average mean coverage of each cluster across metagenomes. In the right panel, the distribution of the same contigs and clusters is shown across 196 plaque (brown) and 217 tongue (blue) metagenomes generated by the Human Microbiome Project (HMP). Each concentric circle in this display represents a single metagenome, and data points display the detection of the contigs in the Pasolli MAG.
Figure 4.
Examples of probable assembly errors in RefSeq bacterial genomes. (A) Salmonella enterica subsp. enterica (CP009768.1). (B) Desulfitobacterium hafniense Y51 (NC_007907.1). The diagrams show the GC skew (gray) and cumulative GC skew (green line) of the original (left) and the modified (right) versions of the genomes (all calculated with a window size of 1000 bp and a slide size of 10 bp). The location and direction of the repeat sequences leading to the abnormal GC skew are indicated by red arrows. After flipping the repeat-bounded sequences, the genomes show the pattern expected for genomes that undergo bidirectional replication (right). For more examples, see Supplemental Figures S6 and S7.
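The GC skew and cumulative GC skew plotted in Figure 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: skew is taken here as (G - C) / (G + C) per window, the defaults mirror the window and slide sizes given in the caption, the function names and toy sequence are hypothetical, and windows do not wrap around the end of a circular genome.

```python
from itertools import accumulate


def gc_skew(seq, window=1000, step=10):
    """Return (G - C) / (G + C) for each sliding window along seq."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows with no G or C get a skew of 0 to avoid dividing by zero.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_gc_skew(seq, window=1000, step=10):
    """Return the running sum of the per-window GC skew values."""
    return list(accumulate(gc_skew(seq, window, step)))


# Hypothetical toy sequence: a G-rich half followed by a C-rich half.
toy = "G" * 2000 + "C" * 2000
cum = cumulative_gc_skew(toy, window=100, step=100)
peak_index = cum.index(max(cum))
print(max(cum), peak_index)
```

On the toy sequence the cumulative curve rises through the G-rich half and falls through the C-rich half, giving a single peak where the skew changes sign. A repeat-bounded inversion of the kind described in the caption would be expected to show up as an extra reversal in this curve.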
Figure 5.
The workflow for generating curated and complete genomes from metagenomes. Steps are shown in black, and the tools or information used in blue. Notes for procedures are shown in gray boxes. The detailed procedures for scaffold extension and gap closing are available in the Supplemental Methods and also online (https://ggkbase-help.berkeley.edu/genome_curation/scaffold-extension-and-gap-closing/).

