Bioinformatics. 2025 Feb 4;41(2):btaf014.

doi: 10.1093/bioinformatics/btaf014.

PhyloMix: enhancing microbiome-trait association prediction through phylogeny-mixing augmentation

Yifan Jiang¹, Disen Liao¹, Qiyun Zhu², Yang Young Lu¹

Affiliations

¹ Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada.
² School of Life Sciences, Arizona State University, Tempe, AZ, 85281, United States.

PMID: 39799515
PMCID: PMC11849959

Yifan Jiang et al. Bioinformatics. 2025.

Abstract

Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.

Results: Here, we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Specifically, it creates new samples by removing a subtree from one sample and recombining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms distinct baseline methods, including sample-mixing-based data augmentation techniques such as vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
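The subtree exchange described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based tree, the function names `leaves_under` and `subtree_mix`, the rescaling rule (the donor's subtree is scaled to the recipient's subtree total so the sample total is unchanged), and the label weight are all assumptions made for illustration.

```python
def leaves_under(tree, node):
    """Return the leaf taxa below `node`; `tree` maps internal nodes to children."""
    children = tree.get(node, [])
    if not children:
        return [node]  # a node with no children is a leaf taxon
    leaves = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves


def subtree_mix(tree, node, recipient, donor):
    """Replace the recipient's abundances under `node` with the donor's.

    Assumption (not from the paper): the donor's subtree is rescaled to the
    recipient's subtree total, so the mixed sample keeps the recipient's total.
    """
    swapped = leaves_under(tree, node)
    recipient_mass = sum(recipient[t] for t in swapped)
    donor_mass = sum(donor[t] for t in swapped)
    mixed = dict(recipient)
    if donor_mass > 0:
        for taxon in swapped:
            mixed[taxon] = donor[taxon] * recipient_mass / donor_mass
    # If the donor has no mass in this subtree, the recipient's values are kept.
    # Assumed label weight: fraction of leaves still taken from the recipient.
    lam = 1 - len(swapped) / len(recipient)
    return mixed, lam


# Toy phylogeny: root -> (A, B), A -> (t1, t2), B -> (t3, t4)
tree = {"root": ["A", "B"], "A": ["t1", "t2"], "B": ["t3", "t4"]}
sample_a = {"t1": 0.4, "t2": 0.2, "t3": 0.3, "t4": 0.1}
sample_b = {"t1": 0.1, "t2": 0.1, "t3": 0.2, "t4": 0.6}
mixed, lam = subtree_mix(tree, "B", sample_a, sample_b)
```

In this toy case the mixed sample keeps taxa `t1` and `t2` from `sample_a` and takes the within-subtree proportions of `t3` and `t4` from `sample_b`, so the relative abundances still sum to one. The same rescaling would apply to raw counts, where the preserved quantity is the total read count rather than one.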

Availability and implementation: The Apache-licensed source code is available at https://github.com/batmen-lab/phylomix.

Figures

**Figure 1.**
Overview of PhyloMix. (a) PhyloMix generates synthetic samples with artificial labels based on existing data and incorporates them into the training data to enhance the performance of general ML models. (b) PhyloMix leverages phylogeny to generate synthetic samples by removing a subtree from one sample and recombining it with the counterpart from another sample. The involved subtree, along with the affected probabilistic phylogenetic profile, is highlighted in orange.

**Figure 2.**
Data augmentation performance in the supervised learning setting. (a) The evaluation was performed on real microbiome datasets with varying microbial taxa sizes. PhyloMix is evaluated alongside five ML models with varying predictive capabilities and compared against four distinct baseline methods. The augmentation generated by PhyloMix preserves or notably enhances predictive performance, regardless of the dataset complexity or the sophistication of the ML models used. For scientific rigor, the performance comparison between PhyloMix and other baseline methods is quantified using one-tailed two-sample t-tests to calculate P-values: ****P-value ≤ 0.0001; ***P-value ≤ 0.001; **P-value ≤ 0.01; *P-value ≤ 0.05; ns: P-value > 0.05. (b) PhyloMix is compared to baseline methods using radar plots, with performance evaluated based on relative AUPRC, normalized against the best-performing method across all methods. Each dot corresponds to the performance with or without data augmentation for a given method under a random seed.
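The two conventions in the Figure 2 caption, the star notation for P-values and the relative AUPRC, can be written out as a short Python sketch. The thresholds are taken from the caption; the function names and the example scores are hypothetical, and normalizing within a single dictionary of method scores is an assumption about how the comparison is grouped.

```python
def significance_label(p_value):
    """Map a P-value to the star notation used in the Figure 2 caption."""
    thresholds = [(0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")]
    for cutoff, stars in thresholds:
        if p_value <= cutoff:
            return stars
    return "ns"  # P-value > 0.05


def relative_scores(auprc_by_method):
    """Divide each method's AUPRC by the best AUPRC among the methods."""
    best = max(auprc_by_method.values())
    return {method: score / best for method, score in auprc_by_method.items()}


# Hypothetical scores, for illustration only.
example = relative_scores({"method_a": 0.8, "method_b": 0.4})
```

Under this normalization the best-performing method always scores 1.0, which is what makes methods comparable on a shared radar-plot axis.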

**Figure 3.**
Data augmentation performance in the representation learning setting. The evaluation was conducted on six publicly available microbiome datasets and compared against six distinct baseline methods. The representations obtained through PhyloMix consistently outperform those generated by other baseline methods in predicting disease status, measured by both AUPRC and AUROC. The performance comparison between PhyloMix and other baseline methods is quantified using one-tailed two-sample t-tests to calculate P-values.

**Figure 4.**
Dissecting the performance of PhyloMix through control studies. (a) The data augmentation by PhyloMix incurs only moderate computational overhead, consisting of two main components: the generation of synthetic samples and the additional training time required to incorporate these synthetic samples. (b) The choice of the Beta distribution is crucial in determining the sample mixing weights. In general, this choice does not negatively impact data augmentation performance. The only exception is the logistic regression model, where data augmentation transforms logistic regression into linear regression, leading to a performance downgrade if the Beta distribution is not chosen properly. (c) PhyloMix is robust to the number of augmented samples. Data augmentation consistently improves predictive performance. However, increasing the number of augmented samples inevitably leads to higher computational overhead. (d) PhyloMix demonstrates more significant improvements when the training data size is small. (e) PhyloMix smooths the decision boundary of the SVM model on the IBD dataset.
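The role of the Beta distribution in panel (b) can be illustrated with vanilla mixup, one of the baselines named in the abstract. The sketch below is not PhyloMix itself: the symmetric Beta(alpha, alpha) parameterization, the function name `mixup`, and the toy samples are assumptions.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """Vanilla mixup: blend two samples and their labels with one Beta-drawn weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = lam * y_a + (1 - lam) * y_b  # soft label, generally not 0 or 1
    return x_mix, y_mix, lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_mix, y_mix, lam = mixup([0.6, 0.4], 1, [0.1, 0.9], 0, alpha=0.5, rng=rng)
```

Because `y_mix` is a fractional value rather than a class label, a classifier trained on such samples is effectively fitting continuous targets. This is consistent with the caption's remark that augmentation turns logistic regression into a regression problem. Small values of `alpha` concentrate the weight near 0 or 1 and keep the labels close to the originals, while larger values push the weight toward 0.5.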

Grants and funding

RGPIN-03270-2023/Canadian NSERC Discovery
