Bioinformatics. 2025 Feb 4;41(2):btaf014.
doi: 10.1093/bioinformatics/btaf014.

PhyloMix: enhancing microbiome-trait association prediction through phylogeny-mixing augmentation

Yifan Jiang et al. Bioinformatics. 2025.

Abstract

Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.

Results: Here, we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Leveraging phylogeny, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms distinct baseline methods including sample-mixing-based data augmentation techniques like vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.

Availability and implementation: The Apache-licensed source code is available at https://github.com/batmen-lab/phylomix.
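The subtree exchange described in the Results can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); it assumes each sample is a vector of counts or relative abundances over the same ordered set of leaf taxa, that the chosen subtree is given as a list of leaf indices, and that the mixed label weight equals the fraction of leaves kept from the first sample. The function name `subtree_mix` and these conventions are hypothetical.

```python
import numpy as np


def subtree_mix(x_a, x_b, subtree_leaves):
    """Swap the leaves of one subtree from sample b into sample a.

    Assumption: inputs are non-negative abundance vectors over the same taxa.
    Returns the mixed relative-abundance vector and the weight given to
    sample a's label (fraction of leaves kept from a).
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)

    # Work on relative abundances so samples with different depths are comparable.
    p_a = x_a / x_a.sum()
    p_b = x_b / x_b.sum()

    # Boolean mask marking the leaves that belong to the exchanged subtree.
    mask = np.zeros(p_a.shape[0], dtype=bool)
    mask[list(subtree_leaves)] = True

    # Take sample b inside the subtree and sample a everywhere else.
    mixed = np.where(mask, p_b, p_a)

    total = mixed.sum()
    if total == 0:
        raise ValueError("mixed sample has no abundance; choose another subtree")

    # Renormalize so the synthetic sample is again a composition (sums to 1).
    mixed = mixed / total

    # Assumed label weight: share of leaves still coming from sample a.
    lam = 1.0 - mask.mean()
    return mixed, lam


# Example: four taxa, where taxa 0 and 1 form the exchanged subtree.
mixed, lam = subtree_mix([10, 0, 30, 60], [5, 5, 0, 10], [0, 1])
soft_label = lam * 1 + (1 - lam) * 0  # sample a is a case, sample b a control
```

Renormalizing after the swap keeps the synthetic sample on the simplex, which is one simple way to respect the compositional constraint the abstract mentions.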

Figures

Figure 1.
Overview of PhyloMix. (a) PhyloMix generates synthetic samples with artificial labels based on existing data and incorporates them into the training data to enhance the performance of general ML models. (b) PhyloMix leverages phylogeny to generate synthetic samples by removing a subtree from one sample and recombining it with the counterpart from another sample. The involved subtree, along with the affected probabilistic phylogenetic profile, is highlighted in orange.
Figure 2.
Data augmentation performance in the supervised learning setting. (a) The evaluation was performed on real microbiome datasets with varying microbial taxa sizes. PhyloMix is evaluated alongside five ML models with varying predictive capabilities and compared against four distinct baseline methods. The augmentation generated by PhyloMix preserves or notably enhances predictive performance, regardless of the dataset complexity or the sophistication of the ML models used. For scientific rigor, the performance comparison between PhyloMix and other baseline methods is quantified using one-tailed two-sample t-tests to calculate P-values: ****P-value ≤ 0.0001; ***P-value ≤ 0.001; **P-value ≤ 0.01; *P-value ≤ 0.05; ns: P-value > 0.05. (b) PhyloMix is compared to baseline methods using radar plots, with performance evaluated based on relative AUPRC, normalized against the best-performing method across all methods. Each dot corresponds to the performance with or without data augmentation for a given method under a random seed.
Figure 3.
Data augmentation performance in the representation learning setting. The evaluation was conducted on six publicly available microbiome datasets and compared against six distinct baseline methods. The representations obtained through PhyloMix consistently outperform those generated by other baseline methods in predicting disease status, measured by both AUPRC and AUROC. The performance comparison between PhyloMix and other baseline methods is quantified using one-tailed two-sample t-tests to calculate P-values.
Figure 4.
Dissecting the performance of PhyloMix through control studies. (a) The data augmentation by PhyloMix incurs only moderate computational overhead, consisting of two main components: the generation of synthetic samples and the additional training time required to incorporate these synthetic samples. (b) The Beta distribution determines the sample mixing weights. In general, the choice of distribution does not negatively impact data augmentation performance. The only exception is the logistic regression model, where data augmentation transforms logistic regression into linear regression, leading to a performance drop if the Beta distribution is not chosen properly. (c) PhyloMix is robust to the number of augmented samples. Data augmentation consistently improves predictive performance. However, increasing the number of augmented samples inevitably leads to higher computational overhead. (d) PhyloMix demonstrates more significant improvements when the training data size is small. (e) PhyloMix smooths the decision boundary of the SVM model on the IBD dataset.
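
Panel (b) of Figure 4 refers to Beta-distributed sample mixing weights. As a point of reference, the vanilla mixup baseline named in the abstract draws a weight from a symmetric Beta distribution and interpolates both features and labels. The sketch below shows that standard scheme only; the function name `mixup` and the default `alpha` are illustrative assumptions, and how PhyloMix itself uses the sampled weight is not specified in this record.

```python
import numpy as np


def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Vanilla mixup: convex combination of two samples and their labels."""
    rng = np.random.default_rng() if rng is None else rng

    # Mixing weight in [0, 1]; small alpha pushes it toward 0 or 1,
    # large alpha concentrates it around 0.5.
    lam = rng.beta(alpha, alpha)

    x = lam * np.asarray(x_i, dtype=float) + (1.0 - lam) * np.asarray(x_j, dtype=float)

    # The label becomes a soft value between the two original labels.
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam


# Example: mix a case (label 1) with a control (label 0).
rng = np.random.default_rng(0)
x_new, y_new, lam = mixup([1.0, 0.0], 1, [0.0, 1.0], 0, alpha=2.0, rng=rng)
```

Because the resulting labels are continuous rather than binary, a classifier trained on them is effectively fitting real-valued targets, which is consistent with the caption's remark about logistic regression.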
