Nat Commun. 2024 Nov 14;15(1):9873.
doi: 10.1038/s41467-024-52212-w.

Missing microbial eukaryotes and misleading meta-omic conclusions

Arianna I Krinos et al. Nat Commun. 2024.

Abstract

Meta-omics is commonly used for large-scale analyses of microbial eukaryotes, including species or taxonomic group distribution mapping, gene catalog construction, and inference on the functional roles and activities of microbial eukaryotes in situ. Here, we explore the potential pitfalls of common approaches to taxonomic annotation of protistan meta-omic datasets. We re-analyze three environmental datasets at three levels of taxonomic hierarchy in order to illustrate the crucial importance of database completeness and curation in enabling accurate environmental interpretation. We show that taxonomic membership of sequence clusters estimates community composition more accurately than returning exact sequence labels, and overlap between clusters can address database shortcomings. Clustering approaches can be applied to diverse environments while continuing to exploit the wealth of annotation data collated in databases, and selecting and evaluating these databases is a critical part of correctly annotating protistan taxonomy in environmental datasets. We argue that ongoing curation of genetic resources is crucial in accurately annotating protists in in situ meta-omic datasets. Moreover, we propose that precise taxonomic annotation of meta-omic data is a clustering problem rather than a feasible alignment problem.
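
The idea of annotating by cluster membership rather than by exact sequence label can be illustrated with a minimal sketch. This is not the authors' tax-aliquots implementation: the function name `annotate_by_cluster`, the `min_fraction` cutoff, and the toy cluster and reference identifiers are all hypothetical, chosen only to show how a cluster can inherit a taxon from its labeled reference members and how disagreement among them can be flagged.

```python
from collections import Counter


def annotate_by_cluster(clusters, reference_labels, min_fraction=0.6):
    """Give each cluster the dominant taxon among its labeled reference members.

    clusters: dict mapping cluster id -> list of sequence ids (hypothetical input).
    reference_labels: dict mapping reference sequence id -> taxon name.
    min_fraction: assumed cutoff; below it the cluster is reported as "conflicted".
    """
    annotations = {}
    for cluster_id, members in clusters.items():
        # Only reference sequences carry a label; query sequences do not.
        labels = [reference_labels[m] for m in members if m in reference_labels]
        if not labels:
            annotations[cluster_id] = None  # no reference in this cluster
            continue
        taxon, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_fraction:
            annotations[cluster_id] = taxon
        else:
            annotations[cluster_id] = "conflicted"
    return annotations


# Toy data, invented for illustration only.
clusters = {
    "c1": ["ref_a", "ref_b", "query_1"],
    "c2": ["query_2"],
    "c3": ["ref_a", "ref_c", "query_3"],
}
reference_labels = {
    "ref_a": "Phaeocystis",
    "ref_b": "Phaeocystis",
    "ref_c": "Emiliania",
}
print(annotate_by_cluster(clusters, reference_labels))
```

In this sketch a query sequence takes the label of its cluster, so a cluster with no reference member stays unannotated and a cluster whose references disagree is flagged instead of being forced to a single label.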

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Effect of different species-level references on the success of genus-level identification of Phaeocystis.
A Abundance of metagenomic proteins in each ocean basin coassembled from the Tara Oceans dataset annotated to be Phaeocystis by a combined database of the colony-forming references (left in each group; purple), a combined database of the free-living references (middle in each group; pink), a combined database of all Phaeocystis references (right in each group; black). Each group of bars represents either the large (>20 μm) or the small size (0.8–5 μm) fraction samples. Abundance is shown via read coverage (TPM) of annotated metagenomic contigs. B Phylogenetic tree of Phaeocystis references and genomic and transcriptomic outgroups. The bars to the right of the tree show the total number of orthogroups in each species that are a, pink or lavender: shared by other members of the same ecotype (colony-former or free-liver), b, maroon: shared among multiple Phaeocystis species regardless of ecotype, or c, white: present only within one species. C Percentage of sequences from the coassembly from the Southern Ocean Tara Oceans samples annotated to be Phaeocystis by any of the databases that were annotated as Phaeocystis using (top group of two bars) a combined reference database containing all of the free-living Phaeocystis references, (middle group of bars) a combined reference database containing all of the colony-forming Phaeocystis references, (bottom group of bars) a combined reference database containing all Phaeocystis references. The top bar in each group (brown) corresponds to the smallest Tara Oceans size fraction, while the bottom bar in each group (blue) corresponds to the largest Tara Oceans size fraction. D Identical to Panel C, but for the Tara Oceans samples from the Mediterranean Sea.
Fig. 2. The effect of database composition on annotation of diatoms.
A Community composition of diatoms in Narragansett Bay based on light microscopy counts (top) compared to their metatranscriptomic activity (bottom). Lineage-conflicted refers to predicted proteins that were annotated as belonging to class Bacillariophyta, but had a conflict at the family level. “Other” refers to diatom families with associated TPM of less than 1000. Circles (top) indicate cells per L (right y-axis). B Mean percentage identity of non-self hits meeting a minimum bitscore value threshold (≥50) for diatom families represented in the MMETSP. C The bars to the right of the heatmap mean percentage identity plot indicate the total number of transcriptomes contained in the MMETSP for each family.
Fig. 3. Effect of removing Radiolarian sequences from the database on the annotation of metatranscriptomic samples from the North Atlantic Ocean.
A Map of the BATS transect colored by the distance of each sample from the shore in kilometers. B Fraction of annotated scaled abundance of proteins that changed annotation before and after the radiolarian sequences were added, grouped by depth. C Among sequences that changed annotations, comparison of their annotation without radiolarian sequences (left axis) to with radiolarian sequences (right axis). In both cases the database contained the MMETSP and MarRef2 databases. While the majority category of putative Radiolarian sequences was those previously unannotated at the phylum level, some were previously classified as other phyla. Some phylum-level annotations were lost due to conflicts with added radiolarian sequences. D Comparison of the number of proteins that were taxonomically annotated (“Annotated”), taxonomically unannotated (“Unannotated”), or had conflicting taxonomy (“Conflicted”) according to whether they were also functionally annotated.
Fig. 4. Schematic diagram of the tax-aliquots two-stage clustering workflow.
The workflow is intended to be used alongside the LCA algorithm to detect ambiguity in taxonomic assignment and identify possible taxonomic annotations of sequences which cannot be annotated using the short alignment method. By assessing similarity using subsequence patterns over the entire sequence length, tax-aliquots can also identify discrepancies in the taxonomic annotation selected by alignment and the LCA algorithm.
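
As a rough illustration of the LCA step that the workflow is meant to complement, the sketch below reports the deepest lineage prefix shared by all hits for a sequence. It is a simplified assumption about how a lowest-common-ancestor rule behaves, not the EUKulele or tax-aliquots code; the function name and the abbreviated lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of the given lineages.

    lineages: list of tuples ordered from broadest to finest rank.
    An empty result means the hits agree at no rank.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ragged inputs are handled.
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            shared.append(ranks[0])
        else:
            break  # first disagreement ends the common lineage
    return shared


# Abbreviated lineages, invented for illustration only.
hits = [
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis antarctica"),
    ("Haptophyta", "Prymnesiophyceae", "Phaeocystis", "Phaeocystis globosa"),
]
print(lowest_common_ancestor(hits))

# Adding a hit from another genus pushes the assignment up one rank.
hits.append(("Haptophyta", "Prymnesiophyceae", "Emiliania", "Emiliania huxleyi"))
print(lowest_common_ancestor(hits))
```

The second call shows the kind of ambiguity the caption refers to: one conflicting hit is enough to lose genus-level resolution, which is where a clustering view over the whole sequence could add information.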
Fig. 5. The utility of the tax-aliquots clustering approach is demonstrated on a simplified mock metatranscriptome, highlighting enhanced annotation at finer taxonomic resolution.
A Left panel: Workflow schematic; first, we annotated a “mock metatranscriptome” (a Phaeocystis pouchetii transcriptome) and filtered putative haptophyte sequences using EUKulele (Right panel: results of annotating the mock metatranscriptome with BLAST + LCA (EUKulele) as compared to mmseqs2). Then, we split the sequences into two parts, and annotated half of putative haptophyte sequences with a custom Phaeocystis-only reference database which excluded the half of P. pouchetii being tested (but included the other half as a simulated partial database transcriptome) using BLAST + LCA (EUKulele), mmseqs2, and tax-aliquots. B Tax-aliquots clusters using the “permissive” clustering scheme for the putative haptophyte sequences retrieved from the BLAST + LCA approach in panel B. C Comparison of the fate of the test putative haptophyte sequences between the BLAST + LCA, mmseqs2, and tax-aliquots approaches.
