PLoS Comput Biol. 2023 Jun 16;19(6):e1011163.

doi: 10.1371/journal.pcbi.1011163. eCollection 2023 Jun.

MetaNovo: An open-source pipeline for probabilistic peptide discovery in complex metaproteomic datasets

Matthys G Potgieter¹,², Andrew J M Nel², Suereta Fortuin², Shaun Garnett², Jerome M Wendoh³, David L Tabb²,⁴, Nicola J Mulder¹,⁵, Jonathan M Blackburn²,⁵

Affiliations

¹ Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, Cape Town, South Africa.
² Division of Chemical and Systems Biology, Department of Integrative Biomedical Sciences, University of Cape Town, Cape Town, South Africa.
³ Division of Immunology, Department of Pathology, University of Cape Town, Cape Town, South Africa.
⁴ Division of Molecular Biology and Human Genetics, Department of Biomedical Sciences; African Microbiome Institute; South African Tuberculosis Bioinformatics Initiative; Stellenbosch University, Cape Town, South Africa.
⁵ Institute of Infectious Disease & Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa.

PMID: 37327214
PMCID: PMC10310047
DOI: 10.1371/journal.pcbi.1011163

Matthys G Potgieter et al. PLoS Comput Biol. 2023.


Abstract

Background: Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data. Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification that may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools for scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase. The result is a tailored sequence database for target-decoy searches directly at the proteome level, which enables metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and remains compatible with standard downstream analysis pipelines.

Results: We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples. MetaNovo yielded comparable numbers of peptide and protein identifications, many shared peptide sequences and a bacterial taxonomic distribution similar to that found using a matched metagenome sequence database, while simultaneously identifying many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.

Conclusions: By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.

Copyright: © 2023 Potgieter et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig 1. Visualisation of the MetaNovo workflow used to analyse the mass spectrometry data of 8 human mucosal-luminal interface samples.**
Raw mass-spectrometry data in MGF format were analysed using the **MetaNovo** pipeline, which uses *de novo* sequence tags to create a targeted FASTA file for target-decoy search.

**Fig 2. Graphical representation of the MetaNovo algorithm applied for sequence database filtration.**
*Normalized spectral abundance factor* (NSAF) calculations include non-unique spectra. The magnitude of each probability is represented by + signs. Proteins are ranked by the joint probability of organism and protein probabilities, represented by the arrow, in order of increasing probability. The number of unique spectra for each protein is determined by its position in the ranked list and only includes spectra that do not appear in the set of proteins above it in the list (but may include spectra that appear below), such as the spectra for **Peptide B**, which are counted towards the first protein in the list but not the second. Ties between adjacent and nearly identical isoforms that share the same set of spectra are broken in favour of the shortest (most probable) sequence, which has a higher NSAF (and thus a higher protein probability), or of the protein with a higher organism probability. Proteins in green will be selected for inclusion in the filtered sequence database, and proteins in red will be excluded (having no unique spectra). The colors shared by proteins, peptides and spectra above illustrate the assignment of unique spectra and peptides to the most probable protein in the ranked list.
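
The ranking and filtering logic described in this caption can be illustrated with a minimal Python sketch. It is not the MetaNovo implementation: the `Protein` class, the `filter_database` function, the toy data and, in particular, the organism probability (taken here as each organism's share of matched spectra) are assumptions made for illustration only. The sketch computes NSAF over all matched spectra, ranks proteins by the product of protein and organism probabilities, and keeps only proteins that still have at least one spectrum not claimed by a higher-ranked protein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    # Hypothetical record type; field names are illustrative only.
    accession: str
    organism: str
    length: int
    spectra: frozenset  # IDs of spectra whose sequence tags match this protein


def filter_database(proteins):
    """Return accessions kept after ranking and unique-spectrum assignment."""
    proteins = [p for p in proteins if p.spectra]
    if not proteins:
        return []

    # NSAF: spectral count divided by length, normalised over all proteins.
    # Shared (non-unique) spectra are included, as the caption states.
    saf = {p.accession: len(p.spectra) / p.length for p in proteins}
    total_saf = sum(saf.values())
    nsaf = {acc: value / total_saf for acc, value in saf.items()}

    # ASSUMPTION: organism probability = organism's share of matched spectra.
    organism_spectra = {}
    for p in proteins:
        organism_spectra.setdefault(p.organism, set()).update(p.spectra)
    total = sum(len(s) for s in organism_spectra.values())
    organism_prob = {o: len(s) / total for o, s in organism_spectra.items()}

    # Most probable protein first; shorter sequences win exact ties.
    ranked = sorted(
        proteins,
        key=lambda p: (-(nsaf[p.accession] * organism_prob[p.organism]), p.length),
    )

    claimed = set()  # spectra already assigned to a higher-ranked protein
    kept = []
    for p in ranked:
        unique = p.spectra - claimed
        if unique:  # proteins with no unique spectra are excluded
            kept.append(p.accession)
            claimed |= unique
    return kept


demo = [
    Protein("P1", "Organism A", 100, frozenset({"s1", "s2", "s3"})),
    Protein("P2", "Organism A", 100, frozenset({"s2", "s3"})),
    Protein("P3", "Organism B", 200, frozenset({"s4"})),
]
kept = filter_database(demo)
print(kept)  # ['P1', 'P3']
```

In this toy input, P2 shares all of its spectra with the higher-ranked P1 and is therefore dropped, while P3 is retained because spectrum `s4` is matched by no other protein.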

**Fig 3. MLI dataset results.**
A. Bar chart of peptide identifications. The identification rates of **MetaNovo** are comparable to the previously published results of **MetaPro-IQ** using matched metagenome and *integrated gene catalog sequence* databases. B. Venn diagram showing large overlap in identified sequences using different approaches, with the highest number of sequences identified using **MetaNovo**. C. Peptide counts by *UniPept lowest common ancestor* showed similar taxonomic distributions obtained from different approaches. D. Peptides uniquely identified by **MaxQuant** using the **MetaNovo** sequence database had a significantly different distribution compared to reverse hits (p-value 6.33e-26). The boxes extend from the lower to the upper quartile, and the whiskers represent 1.5 times the interquartile range (IQR) below and above the first and third quartiles, respectively.

**Fig 4. 9MM dataset identification results.**
A. Number of peptides identified in each run. B. Number of protein groups identified in each run. C. Peptide identification overlap between the different approaches. D. Peptide PEP score distribution box plot for shared, exclusive and reverse hit peptides for each run.

**Fig 5. Percentages of misassigned peptides for all three 9MM runs.**
A. **MetaNovo** originally yielded a very high percentage of misassigned peptides in the species-level **UniPept** *pept2lca* analysis. B. Taxonomic breakdown of misassigned peptides. C. Re-analysis after inclusion of plausible taxa yielded a species-level misassignment rate of only 1.04%, with 0% error for all approaches at genus and family level using the 0.5% taxon-specific peptide stringency cutoff. D. The 9 *Acidobacteria bacterium* peptides making up the final misassignment percentage of **MetaNovo**.


