Microbiome. 2022 May 6;10(1):72.
doi: 10.1186/s40168-022-01259-2.

Evaluating metagenomic assembly approaches for biome-specific gene catalogues

Luis Fernando Delgado et al. Microbiome.

Abstract

Background: For many environments, biome-specific microbial gene catalogues are being recovered using shotgun metagenomics followed by assembly and gene calling on the assembled contigs. The assembly is typically conducted either by individually assembling each sample or by co-assembling reads from all the samples. The co-assembly approach can potentially recover genes that display too low abundance to be assembled from individual samples. On the other hand, combining samples increases the risk of mixing data from closely related strains, which can hamper the assembly process. In this respect, assembly on individual samples followed by clustering of (near) identical genes is preferable. Thus, both approaches have potential pros and cons, but it remains to be evaluated which assembly strategy is most effective. Here, we have evaluated three assembly strategies for generating gene catalogues from metagenomes using a dataset of 124 samples from the Baltic Sea: (1) assembly on individual samples followed by clustering of the resulting genes, (2) co-assembly on all samples, and (3) mix assembly, combining individual and co-assembly.

Results: The mix-assembly approach resulted in a more extensive nonredundant gene set than the other approaches and with more genes predicted to be complete and that could be functionally annotated. The mix assembly consists of 67 million genes (Baltic Sea gene set, BAGS) that have been functionally and taxonomically annotated. The majority of the BAGS genes are dissimilar (< 95% amino acid identity) to the Tara Oceans gene dataset, and hence, BAGS represents a valuable resource for brackish water research.

Conclusion: The mix-assembly approach is a feasible way to increase the information obtained from metagenomic samples.

Keywords: Assembly approach; Baltic Sea; Brackish water; Gene catalogue; Metagenomics; Mix assembly.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Gene length distributions of the three assembly approaches. a Co-assembly. b Individual assembly. c Mix assembly. Only genes ≤ 2500 bp are included in the histograms
Fig. 2
Cumulative distribution of gene lengths for the three assembly approaches. a All genes. b Complete genes. c Partial genes. d Incomplete genes. Complete genes refers to genes predicted to be complete (having a predicted start codon and a stop codon), partial genes to genes that lack either a start or a stop, and incomplete genes to genes that lack both start and stop. Gene length is given in logarithmic scale
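The three completeness categories defined in this caption can be expressed as a small rule. The following is a minimal sketch, not the authors' code: the function name `classify_gene` and the boolean inputs `has_start` and `has_stop` are hypothetical, and how a gene caller reports start and stop codons is left out.

```python
def classify_gene(has_start: bool, has_stop: bool) -> str:
    """Label a predicted gene by the caption's definitions.

    complete   : both a predicted start codon and a stop codon
    partial    : exactly one of the two is missing
    incomplete : both are missing
    """
    if has_start and has_stop:
        return "complete"
    if has_start or has_stop:
        return "partial"
    return "incomplete"


# Hypothetical example genes: (has_start, has_stop)
examples = [(True, True), (True, False), (False, True), (False, False)]
labels = [classify_gene(start, stop) for start, stop in examples]
```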
Fig. 3
Read mapping rates to genes from the three assembly approaches. The boxplots show the distribution of mapping rate (% of reads) for the 124 samples, based on a random subset of 10,000 forward reads per sample. a When mapping to all genes. b When mapping to genes with Pfam annotation
Fig. 4
Read mapping rate as a function of gene length cutoff. The plots show the proportion of reads mapping at different cutoffs on minimum gene length. a All genes. b Complete genes. c Partial genes. d Incomplete genes. Complete genes refer to genes predicted to be complete (having a predicted start codon and a stop codon), partial genes to genes that lack either a start or a stop, and incomplete genes to genes that lack both start and stop. Gene lengths are given in logarithmic scale
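One way to read the quantity plotted here is: for each minimum gene length, the percentage of all reads that map to genes at least that long. The sketch below illustrates that reading under stated assumptions; the function `mapping_rate_by_cutoff`, the per-gene read counts, and the example numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Sequence, Tuple


def mapping_rate_by_cutoff(
    genes: Iterable[Tuple[int, int]],
    total_reads: int,
    cutoffs: Sequence[int],
) -> Dict[int, float]:
    """Percent of reads mapping to genes with length >= each cutoff.

    genes       : (gene_length_bp, reads_mapped_to_gene) pairs
    total_reads : number of reads that were mapped against the gene set
    cutoffs     : minimum gene lengths (bp) to evaluate
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    gene_list = list(genes)
    rates = {}
    for cutoff in cutoffs:
        # Assumption: a gene counts toward a cutoff when its length is >= cutoff.
        mapped = sum(n for length, n in gene_list if length >= cutoff)
        rates[cutoff] = 100.0 * mapped / total_reads
    return rates


# Made-up toy data: three genes and 10,000 reads in total.
toy_genes = [(300, 1000), (900, 2500), (2400, 1500)]
toy_rates = mapping_rate_by_cutoff(toy_genes, 10_000, [100, 500, 1000])
```

Raising the cutoff can only lower the rate or leave it unchanged, because each higher cutoff keeps a subset of the genes counted at the lower one.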
Fig. 5
Contribution of genes from individual assembly and co-assembly to the mix-assembly gene set. a Cumulative distribution of gene lengths for the mix-assembly genes: for all (“All mix”) and for those derived from individual-assembly (“from Ind”) and co-assembly (“from Co”). Gene length is given in logarithmic scale. b Read mapping rate as a function of gene length cutoff. c Total number of reads mapping to mix-assembly genes derived from either individual assembly or co-assembly, for four bins of genes binned by their estimated coverage in the total metagenome (see “Methods”): low (0–50 ×), median (50–500 ×), high (500–5000 ×), and very high (5000–250,000 ×) read depth coverage
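The four coverage bins in panel c can be sketched as a lookup on bin edges. This is an illustration only: the caption does not say which bin a boundary value such as exactly 50 × belongs to, so the sketch assumes each bin includes its lower edge and excludes its upper edge, with the last bin also including 250,000 ×. The names `BIN_EDGES`, `BIN_LABELS`, and `coverage_bin` are hypothetical.

```python
import bisect

# Upper edges of the first three bins, in fold coverage (from the caption).
BIN_EDGES = [50, 500, 5000]
BIN_LABELS = ["low", "median", "high", "very high"]
MAX_COVERAGE = 250_000  # upper limit of the "very high" bin in the caption


def coverage_bin(coverage: float) -> str:
    """Return the caption's bin label for an estimated read depth coverage.

    Assumption: bins are half-open [lower, upper), so 50 falls in "median".
    """
    if coverage < 0 or coverage > MAX_COVERAGE:
        raise ValueError(f"coverage {coverage} is outside 0-{MAX_COVERAGE}")
    # bisect_right counts how many edges are <= coverage, which is the bin index.
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, coverage)]
```

For instance, `coverage_bin(120.0)` falls in the "median" bin under this convention.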
