Microbiome-based classification models for fresh produce safety and quality evaluation
- PMID: 38445872
- PMCID: PMC10986475
- DOI: 10.1128/spectrum.03448-23
Abstract
Small sample sizes and the loss of sequencing reads during microbiome data preprocessing can limit the statistical power to differentiate fresh produce phenotypes and can prevent the detection of important bacterial species associated with produce contamination or quality reduction. Here, we explored a machine learning-based k-mer hash analysis strategy to identify DNA signatures predictive of produce safety (PS) and produce quality (PQ) and compared it against the amplicon sequence variant (ASV) strategy, which uses a typical denoising step, and the ASV-based taxonomy strategy. Random forest-based classifiers for PS and PQ using 7-mer hash data sets had significantly higher classification accuracy than those using the ASV data sets. We also demonstrated that integrating multiple data sets and leveraging a 7-mer hash strategy leads to better classification performance for PS and PQ than the ASV method but to lower PS classification accuracy than the feature-selected ASV-based taxonomy strategy. Because the 7-mer hash strategy currently cannot generate taxonomy, the ASV-based taxonomy strategy, which requires remarkably less computing time and memory, is more efficient for PS and PQ classification and is applicable to the identification of important taxa. The results of this study lay the foundation for future studies that need to incorporate or compare different microbiome sequencing data sets when applying machine learning to the microbial safety and quality of food.
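To make the k-mer hash idea concrete, the following is a minimal sketch of how a sample's reads might be turned into a fixed-length 7-mer hash profile. It is not the authors' pipeline: the function names, the MD5-based bucketing, the bucket count of 4096, and the relative-abundance normalization are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def kmer_hash_counts(reads, k=7, n_buckets=4096):
    """Count hashed k-mers across all reads of one sample.

    Hypothetical helper: the hashing scheme and n_buckets are
    illustrative assumptions, not values taken from the study.
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        seq = read.upper()
        # Slide a window of length k along the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if not set(kmer) <= valid:
                continue
            # Use a stable hash so buckets are reproducible across runs.
            digest = hashlib.md5(kmer.encode("ascii")).digest()
            bucket = int.from_bytes(digest[:8], "big") % n_buckets
            counts[bucket] += 1
    return counts


def to_relative_vector(counts, n_buckets=4096):
    """Convert bucket counts to a fixed-length relative-abundance vector."""
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_buckets
    return [counts.get(b, 0) / total for b in range(n_buckets)]


# Example: one toy sample with two short reads.
sample_reads = ["ACGTACGTAC", "TTGACCAGTA"]
profile = to_relative_vector(kmer_hash_counts(sample_reads))
```

One such vector per sample gives a sample-by-feature table that a random forest classifier could be trained on with PS or PQ labels. Because the features are computed directly from reads, no denoising step is needed, but a bucket index carries no taxonomic label, which mirrors the limitation noted in the abstract.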
Importance: Identification of generalizable indicators for produce safety (PS) and produce quality (PQ) improves the detection of produce contamination and quality decline. However, sequencing read loss during microbiome data preprocessing and the limited sample size of individual studies restrain the statistical power to identify important features that differentiate PS and PQ phenotypes. We applied machine learning-based models to individual and integrated k-mer hash and amplicon sequence variant (ASV) data sets for PS and PQ classification and evaluated their classification performance. Random forest (RF)-based models using integrated 7-mer hash data sets achieved significantly higher PS and PQ classification accuracy. Because taxonomic analysis is limited for the 7-mer hash, we also developed RF-based models using feature-selected ASV-based taxonomic data sets, which achieved better PS classification performance than those using the integrated 7-mer hash data set. The RF feature selection method identified 480 PS indicators and 263 PQ indicators with a positive contribution to PS and PQ classification.
Keywords: amplicon sequence variant; k-mer hash; machine learning; produce quality; produce safety; random forest.
Conflict of interest statement
The authors declare no conflict of interest.