Sci Rep. 2025 Jul 3;15(1):23771.
doi: 10.1038/s41598-025-07952-0.

Machine learning-based identification of wastewater treatment plant-specific microbial indicators using 16S rRNA gene sequencing


Jaana Jurvansuu et al. Sci Rep.

Abstract

Effluent released from municipal wastewater treatment plants reflects the microbial communities responsible for degrading and removing contaminants within the plants. Monitoring this effluent offers essential insights into its environmental impacts, the efficiency of treatment processes, and the presence of emerging contaminants. To support improved monitoring and source attribution, our study employed a machine-learning framework to identify microbial indicators capable of distinguishing between municipal treatment plants based on effluent microbiota. We collected 57 effluent samples for sequencing of the V4 region of the 16S rRNA gene from six treatment plants in the Pirkanmaa region in Finland between 2016 and 2018. Characterising the microbiome revealed that although each plant had unique microbial profiles, their overall diversity and richness were similar. This provided a robust foundation for identifying plant-specific microbes. Using ANOVA-F for feature selection, we focused on the genus level due to its informative prevalence. Among various models tested, the Gaussian Naive Bayes model yielded the highest accuracy with the fewest relevant microbes. We identified nine bacterial genera and one archaeon, whose relative abundances predicted the origin of the effluent with 92% accuracy. Our study outlines a framework for the cost-effective and rapid identification of the origin of effluent or changes in the treatment process, demonstrating the power of machine learning in environmental monitoring and management.
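The feature-selection and classification steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy abundance table, the plant labels, the function names, and the fixed variance floor are all assumptions made for illustration. It scores each genus with a one-way ANOVA F statistic, keeps the top-scoring genera, and fits a Gaussian Naive Bayes classifier on them. In practice, library routines such as scikit-learn's `f_classif` and `GaussianNB` would normally be used instead of hand-written versions.

```python
import math
from collections import defaultdict

def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_top_features(X, y, top_k):
    """Indices of the top_k columns of X ranked by ANOVA F score."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_k]

def fit_gnb(X, y, var_floor=1e-9):
    """Per-class log prior, feature means and variances.

    var_floor is an assumed constant that avoids division by zero.
    """
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / len(c) + var_floor
                     for c, m in zip(cols, means)]
        model[label] = (math.log(len(rows) / len(y)), means, variances)
    return model

def predict_gnb(model, row):
    """Label with the highest Gaussian log posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * s2) - (x - m) ** 2 / (2 * s2)
            for x, m, s2 in zip(row, means, variances))
    return max(model, key=log_posterior)

# Hypothetical relative abundances: 6 samples x 3 genera, two plants.
X = [
    [0.10, 0.50, 0.30],
    [0.12, 0.52, 0.10],
    [0.11, 0.48, 0.20],
    [0.40, 0.20, 0.25],
    [0.42, 0.22, 0.15],
    [0.41, 0.18, 0.30],
]
y = ["plant_a"] * 3 + ["plant_b"] * 3

selected = select_top_features(X, y, top_k=2)
X_selected = [[row[j] for j in selected] for row in X]
model = fit_gnb(X_selected, y)
print(selected, predict_gnb(model, [0.11, 0.49]))
```

In this toy table the third genus varies more within each plant than between plants, so it receives a low F score and is dropped before the classifier is fitted.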

Conflict of interest statement

Declarations. Competing interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Kirsi-Maarit Lehto, Heikki Hyöty and Sami Oikarinen are the stakeholders of GreenSeq Ltd. Finland.

Figures

Fig. 1
We collected effluents from WWTPs in the Tampere region, Finland. The circles represent the plants’ approximate locations, and each circle’s size corresponds to the number of residents served by the WWTP. WWTP6 (light blue) and WWTP4 (orange) are the biggest and serve over 200,000 residents, whereas WWTP3 (yellow), WWTP5 (purple), WWTP2 (blue), and WWTP1 (pink) are small and serve between 1000 and 29,000 residents.
Fig. 2
Microbial community composition of the six Pirkanmaa region WWTP effluents. (a) Bar charts show the relative abundance (%) of bacterial phyla in individual samples (N = 57), grouped by WWTP. The y-axis indicates the WWTP for each group; multiple samples per WWTP are represented by individual bars without separate sample labels. A total of 37 bacterial and archaeal phyla were detected. (b) A principal coordinate analysis, based on a Bray–Curtis dissimilarity matrix, is depicted using three axes, explaining 29.9% of the ASV variance.
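As a rough illustration of the dissimilarity matrix underlying panel (b), the sketch below computes pairwise Bray-Curtis dissimilarities between samples. The count vectors and function names are hypothetical, and the ordination step (principal coordinate analysis) is not shown.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical (an assumption)
    return sum(abs(x - y) for x, y in zip(a, b)) / total

def dissimilarity_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(s, t) for t in samples] for s in samples]

# Hypothetical ASV counts for three samples.
samples = [
    [10, 0, 5],
    [5, 5, 5],
    [0, 20, 0],
]
matrix = dissimilarity_matrix(samples)
for row in matrix:
    print([round(v, 3) for v in row])
```

A value of 0 means two samples have identical counts, and a value of 1 means they share no ASVs.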
Fig. 3
Alpha-diversity metrics were calculated on rarefied sample reads: WWTP1 (N = 19), WWTP2 (N = 10), WWTP3 (N = 11), WWTP4 (N = 5), WWTP5 (N = 5) and WWTP6 (N = 7). The metrics include (a) Observed ASV richness, (b) Phylogenetic richness by Faith’s phylogenetic diversity, and (c) Shannon’s entropy estimation of richness and diversity.
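Two of the metrics in this caption, observed richness and Shannon entropy, can be sketched for a single sample as follows. The counts are hypothetical, the function names are illustrative, and the logarithm base of 2 is an assumption since the caption does not state it. Rarefaction and Faith's phylogenetic diversity, which needs a phylogenetic tree, are not covered.

```python
import math

def observed_richness(counts):
    """Number of ASVs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon_entropy(counts, base=2):
    """Shannon entropy of the ASV proportions; base 2 is an assumed default."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)

# Hypothetical rarefied ASV counts for one effluent sample.
counts = [10, 10, 10, 10, 0]
print(observed_richness(counts), round(shannon_entropy(counts), 3))
```

Entropy is highest when reads are spread evenly across ASVs and falls to zero when a single ASV accounts for all reads.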
Fig. 4
The relevant bacteria and archaea that classify wastewater treatment plants using the Gaussian Naive Bayes model. (a) Relative abundance of the ten identified relevant microbes, nine bacterial genera and one archaeon (Methanocorpusculum), in the WWTP effluents. (b) SHAP (SHapley Additive exPlanations) summary plots for each WWTP. The SHAP values indicate each feature’s average impact on the magnitude of the Gaussian Naive Bayes model’s output. Each subplot corresponds to one WWTP and shows the mean absolute SHAP values. The higher the mean absolute SHAP value, the more important the microbe is for the model’s predictions for the respective WWTP. Taxonomic assignment of uncultured taxa: midas_g_19012 (Microscillaceae), DMER64 (Rikenellaceae), Ca_Cloacimonas (Cloacimonadaceae), and OLB12 (Microscillaceae).
