PLoS Comput Biol. 2023 Jun 16;19(6):e1011163.
doi: 10.1371/journal.pcbi.1011163. eCollection 2023 Jun.

MetaNovo: An open-source pipeline for probabilistic peptide discovery in complex metaproteomic datasets

Matthys G Potgieter et al. PLoS Comput Biol.

Abstract

Background: Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data. Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification, and these may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase, creating tailored sequence databases for target-decoy searches directly at the proteome level. This enables metaproteomic analyses without prior expectation of sample composition or metagenomic data generation, and is compatible with standard downstream analysis pipelines.

Results: We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples. MetaNovo yielded comparable numbers of peptide and protein identifications, many shared peptide sequences and a bacterial taxonomic distribution similar to that found using a matched metagenome sequence database, while simultaneously identifying many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.

Conclusions: By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig 1. Visualisation of the MetaNovo workflow used to analyse the mass spectrometry data of 8 human mucosal-luminal interface samples.
Raw mass-spectrometry data were analysed using the MetaNovo pipeline in MGF format, using de novo sequence tags to create a targeted FASTA file for target-decoy search.
Fig 2. A graphical representation of the MetaNovo algorithm applied for sequence database filtration.
Normalized spectral abundance factor (NSAF) calculations include non-unique spectra. The magnitude of each probability is represented by +’s. Proteins are ranked by the joint probability of organism and protein probabilities, represented by the arrow, in order of increasing probability. The number of unique spectra for each protein is determined by its position in the ranked list and includes only spectra that do not appear in the proteins above it in the list (but may include spectra that appear below). For example, the spectra for Peptide B are counted towards the first protein in the list, but not the second. Tie breaks for adjacent and nearly identical isoforms that share the same set of spectra are based on the shortest (most probable) sequence having a higher NSAF (and thus a higher protein probability) or on a higher organism probability. Proteins in green are selected for inclusion in the filtered sequence database, and proteins in red are excluded (having no unique spectra). The colors shared by proteins, peptides and spectra illustrate the assignment of unique spectra and peptides to the most probable protein in the ranked list.
Fig 3. MLI dataset results.
A. Bar chart of peptide identifications. The identification rates of MetaNovo are comparable to the previously published results of MetaPro-IQ using matched metagenome and integrated gene catalog sequence databases. B. Venn diagram showing large overlap in identified sequences using different approaches, with the highest number of sequences identified using MetaNovo. C. Peptide counts by UniPept lowest common ancestor showed similar taxonomic distributions obtained from different approaches. D. Peptides uniquely identified by MaxQuant using the MetaNovo sequence database had a significantly different distribution compared to reverse hits (p-value 6.33e-26). The boxes extend from the lower to the upper quartile, and the whiskers represent 1.5 times the interquartile range (IQR) below and above the first and third quartiles, respectively.
Fig 4. 9MM dataset identification results.
A. Number of peptides identified in each run. B. Number of protein groups identified in each run. C. Peptide identification overlap between the different approaches. D. Peptide PEP score distribution box plot for shared, exclusive and reverse hit peptides for each run.
Fig 5. Percentages of misassigned peptides for all three 9MM runs.
A. MetaNovo originally yielded a very high percentage of misassigned peptides in the species-level UniPept pept2lca analysis. B. Taxonomic breakdown of misassigned peptides. C. Re-analysis after inclusion of plausible taxa yielded a species-level misassignment rate of only 1.04%, with 0% error for all approaches at genus and family level using the 0.5% taxon-specific peptide stringency cutoff. D. The 9 Acidobacteria bacterium peptides making up the final misassignment percentage of MetaNovo.
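
The database filtration step described in the Fig 2 caption can be illustrated with a minimal Python sketch. This is not the MetaNovo implementation. It assumes that the protein probability is the NSAF itself, that the joint probability is the product of organism and protein probabilities, and that organism probabilities are supplied as an input. The names Protein, nsaf and filter_database, and all example values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    accession: str
    organism: str
    length: int          # sequence length in residues
    spectra: frozenset   # IDs of all matched spectra, shared ones included


def nsaf(proteins):
    """Standard NSAF: (SpC_i / L_i) / sum_j (SpC_j / L_j), counting non-unique spectra."""
    saf = {p.accession: len(p.spectra) / p.length for p in proteins}
    total = sum(saf.values())
    return {acc: value / total for acc, value in saf.items()}


def filter_database(proteins, organism_prob):
    """Rank proteins by joint probability and keep those with unique spectra.

    Assumption: joint probability = organism probability * NSAF.
    """
    protein_prob = nsaf(proteins)
    ranked = sorted(
        proteins,
        key=lambda p: organism_prob[p.organism] * protein_prob[p.accession],
        reverse=True,  # most probable protein first
    )
    claimed = set()  # spectra already seen in proteins higher in the list
    kept = []
    for p in ranked:
        unique = p.spectra - claimed
        if unique:  # proteins with no unique spectra are excluded
            kept.append(p.accession)
        claimed |= p.spectra
    return kept


# Hypothetical example: P2 shares its only spectrum with the higher-ranked P1.
proteins = [
    Protein("P1", "orgA", 100, frozenset({"s1", "s2", "s3"})),
    Protein("P2", "orgA", 100, frozenset({"s2"})),
    Protein("P3", "orgB", 50, frozenset({"s4"})),
]
organism_prob = {"orgA": 0.7, "orgB": 0.3}  # assumed to be given

selected = filter_database(proteins, organism_prob)
print(selected)  # ['P1', 'P3']
```

In this toy input, P2 is dropped because its single spectrum is already counted towards P1, mirroring the Peptide B example in the caption. Two isoforms with identical spectra would be separated by length, since the shorter sequence has the higher NSAF.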
