mSphere. 2020 May 20;5(3):e00971-19.
doi: 10.1128/mSphere.00971-19.

To Dereplicate or Not To Dereplicate?

Jacob T Evans et al. mSphere.

Abstract

Metagenome-assembled genomes (MAGs) expand our understanding of microbial diversity, evolution, and ecology. Concerns have been raised about how sequencing, assembly, binning, and quality assessment tools may result in MAGs that do not reflect single populations in nature. Here, we reflect on another issue: how to handle highly similar MAGs assembled from independent data sets. Obtaining multiple genomic representatives for a species is highly valuable, as it allows for population genomic analyses; however, retaining genomes of closely related populations complicates MAG quality assessment and abundance inferences. We show that (i) published data sets contain a large fraction of MAGs sharing >99% average nucleotide identity (ANI), (ii) different software packages and parameters used to resolve this redundancy remove very different numbers of MAGs, and (iii) the removal of closely related genomes leads to losses of population-specific auxiliary genes. Finally, we highlight some approaches that can infer strain-specific dynamics across a sample series without dereplication.

Keywords: MAG; binning; dereplication; metagenomics; population genomics; software.

Figures

FIG 1
Overview of dereplication approaches used in this study. All approaches first cluster similar genomes (Mash clusters are delineated with boxes) using a fast, less accurate approach (Mash), which is included in the dRep package but was a separate preprocessing step that we carried out for the pyani analysis (indicated with the dotted line). Each Mash cluster is then dereplicated separately using pairwise alignments, by identifying MAGs within the cluster that share ANI above the specified threshold. These clusters are indicated by boxes, with Mash clusters split into multiple cluster groups using the same line type (full or dashed lines). Which genomes end up in the same cluster varies depending on the approach used; only one clustering is shown. Finally, a representative MAG is selected, either as part of the package (dRep) or using a custom script (our approach that used pyani for pairwise comparisons, indicated by the dotted line), selecting the MAG with the highest estimated completion and lowest estimated contamination.
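The secondary step described in this caption (grouping MAGs within one Mash cluster by pairwise ANI, then choosing a representative) can be sketched roughly as follows. This is an illustrative sketch only, not the dRep implementation or the authors' custom script. The function name `dereplicate`, the input layout, the single-linkage grouping, the inclusive `>=` comparison, and the rule of ranking by completeness first and contamination second are all assumptions made for the example.

```python
from itertools import combinations


def dereplicate(mags, ani, threshold=0.99):
    """Pick one representative per ANI group within a single Mash cluster.

    mags: dict mapping MAG name -> (completeness, contamination), in percent.
    ani:  dict mapping frozenset({a, b}) -> pairwise ANI as a fraction (0 to 1).
    Assumption: MAGs are grouped by single linkage (union-find); real tools
    may use a different linkage method.
    """
    parent = {name: name for name in mags}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of MAGs whose ANI meets the threshold.
    for a, b in combinations(sorted(mags), 2):
        if ani.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in sorted(mags):
        groups.setdefault(find(name), []).append(name)

    # Assumed ranking: highest completeness, then lowest contamination.
    representatives = [
        max(members, key=lambda m: (mags[m][0], -mags[m][1]))
        for members in groups.values()
    ]
    return sorted(representatives)


# Hypothetical quality estimates and ANI values, for illustration only.
example_mags = {"A": (95.0, 1.0), "B": (98.0, 2.0), "C": (92.0, 0.5)}
example_ani = {
    frozenset(("A", "B")): 0.995,
    frozenset(("A", "C")): 0.97,
    frozenset(("B", "C")): 0.96,
}
print(dereplicate(example_mags, example_ani, threshold=0.99))
print(dereplicate(example_mags, example_ani, threshold=0.965))
```

With these made-up values, lowering the threshold from 0.99 to 0.965 merges all three MAGs into one group, which illustrates why the number of MAGs removed depends strongly on the chosen cutoff.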
FIG 2
Effects of dereplication. Phylogenetic tree of a set of closely related MAGs (family Muribaculaceae) from Parks et al. (3), grouped based on sequence similarity by Mash. All MAGs had estimated completeness levels of >90% (3). A box outline indicates the genome was preserved after dereplication, while white space indicates it was removed. The dRep default does not remove multiple nearly identical MAGs, while dRep-gANI removes MAGs that are more distantly related than the 99% or 96.5% ANI cutoff. Black bars show the average sequence read coverage across all contigs of each MAG, ranging from 0 to 2,000, when aligning a metagenomic data set (Sequence Read Archive accession no. SRR1702559) using all genomes in the tree (none) or dereplicated genome sets using different tools. Reads were mapped to each Multi-FASTA file of retained MAGs using BWA-MEM with default parameters (26). Average coverage per contig was computed with pileup.sh from bbtools (https://sourceforge.net/projects/bbmap/). The phylogenetic tree was created by searching for marker genes with PhyloSift (27) using its default set of marker genes. The genes were then aligned with PhyloSift, the resulting alignments were concatenated, and the tree was created with FastTree (28) using the -nt and -gtr parameters.
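As a small illustration of how per-contig averages could be combined into the per-MAG coverage shown by the black bars, the sketch below computes a length-weighted mean. The caption does not state whether the per-MAG value was weighted by contig length, so the weighting is an assumption, and the function name `mag_mean_coverage` and the input layout are hypothetical.

```python
def mag_mean_coverage(contigs):
    """Length-weighted mean coverage for one MAG.

    contigs: list of (length_bp, average_coverage) tuples, one per contig.
    Assumption: longer contigs contribute proportionally more to the mean.
    """
    total_length = sum(length for length, _ in contigs)
    if total_length == 0:
        return 0.0
    weighted_sum = sum(length * coverage for length, coverage in contigs)
    return weighted_sum / total_length


# Hypothetical contigs: 1,000 bp at 10x and 3,000 bp at 30x coverage.
print(mag_mean_coverage([(1000, 10.0), (3000, 30.0)]))
```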

References

    1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37–43. doi:10.1038/nature02340.
    2. Anantharaman K, Brown CT, Hug LA, Sharon I, Castelle CJ, Probst AJ, Thomas BC, Singh A, Wilkins MJ, Karaoz U, Brodie EL, Williams KH, Hubbard SS, Banfield JF. 2016. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun 7:13219. doi:10.1038/ncomms13219.
    3. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, Hugenholtz P, Tyson GW. 2017. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2:1533–1542. doi:10.1038/s41564-017-0012-7.
    4. Almeida A, Mitchell AL, Boland M, Forster SC, Gloor GB, Tarkowska A, Lawley TD, Finn RD. 2019. A new genomic blueprint of the human gut microbiota. Nature 568:499–504. doi:10.1038/s41586-019-0965-1.
    5. Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P, Collado MC, Rice BL, DuLong C, Morgan XC, Golden CD, Quince C, Huttenhower C, Segata N. 2019. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176:649–662. doi:10.1016/j.cell.2019.01.001.
