Adv Sci (Weinh). 2024 Dec;11(45):e2404277.
doi: 10.1002/advs.202404277. Epub 2024 Oct 15.

DeepPhylo: Phylogeny-Aware Microbial Embeddings Enhanced Predictive Accuracy in Human Microbiome Data Analysis

Bin Wang et al. Adv Sci (Weinh). 2024 Dec.

Abstract

Microbial data analysis poses significant challenges due to its high dimensionality, sparsity, and compositionality. Recent advances have shown that integrating abundance and phylogenetic information is an effective strategy for uncovering robust patterns and enhancing the predictive performance in microbiome studies. However, existing methods primarily focus on the hierarchical structure of phylogenetic trees, overlooking the evolutionary distances embedded within them. This study introduces DeepPhylo, a novel method that employs phylogeny-aware amplicon embeddings to effectively integrate abundance and phylogenetic information. DeepPhylo improves both the unsupervised discriminatory power and supervised predictive accuracy of microbiome data analysis. Compared to the existing methods, DeepPhylo demonstrates superiority in informing biologically relevant insights across five real-world microbiome use cases, including clustering of skin microbiomes, prediction of host chronological age and gender, diagnosis of inflammatory bowel disease (IBD) across 15 studies, and multilabel disease classification.

Keywords: beta‐diversity; deep learning; microbiome; phylogeny.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The overall framework of DeepPhylo. A) Enhancement of unsupervised learning with β‐diversity measures using phylogenetic information: Phylogenetic embeddings for each OTU are derived through PCA‐based dimensionality reduction of the phylogenetic distance matrix. These embeddings are then aggregated via summation pooling to encapsulate the phylogenetic relationships within the samples. Simultaneously, the sample abundance matrix undergoes dimensionality reduction using RPCA to extract abundance‐related features. The resulting features from both sets are then concatenated to create a fused feature embedding that integrates phylogenetic and abundance information. B) Supervised deep learning model integrating both abundance and phylogenetic information: the model is structured with two primary input modules: a linear input module that processes sample abundance data, and a convolutional input module that processes phylogenetic OTU embeddings. The outputs from these modules are combined to form a comprehensive feature representation, which is then used for the downstream predictive modeling tasks.
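The unsupervised pipeline described in panel A can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pca_embed`, `clr`, `fused_embedding`), the toy data, and the embedding sizes are hypothetical; the summation pooling is assumed to be weighted by relative abundance; and a plain PCA of centered log-ratio (CLR) transformed counts stands in for the RPCA step, which is not reproduced here.

```python
import numpy as np

def pca_embed(matrix, k):
    """Project the rows of `matrix` onto its top-k principal components."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform; the pseudocount (an assumption) avoids log(0)."""
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

def fused_embedding(counts, phylo_dist, k_phylo=2, k_abund=2):
    """Concatenate pooled phylogenetic features with abundance features."""
    # One embedding per OTU, from PCA of the phylogenetic distance matrix.
    otu_emb = pca_embed(phylo_dist, k_phylo)            # (n_otus, k_phylo)
    # Assumption: summation pooling weighted by each OTU's relative abundance.
    rel = counts / counts.sum(axis=1, keepdims=True)    # (n_samples, n_otus)
    phylo_feat = rel @ otu_emb                          # (n_samples, k_phylo)
    # Stand-in for RPCA: ordinary PCA on CLR-transformed counts.
    abund_feat = pca_embed(clr(counts), k_abund)        # (n_samples, k_abund)
    return np.hstack([phylo_feat, abund_feat])

# Toy data (hypothetical): 4 samples x 5 OTUs, and a symmetric distance matrix.
counts = np.array([
    [10, 0, 3, 0, 7],
    [8, 1, 0, 0, 9],
    [0, 12, 0, 5, 1],
    [1, 9, 2, 6, 0],
], dtype=float)
rng = np.random.default_rng(0)
upper = np.triu(rng.uniform(0.1, 1.0, size=(5, 5)), k=1)
phylo_dist = upper + upper.T  # zero diagonal, symmetric

emb = fused_embedding(counts, phylo_dist)
print(emb.shape)
```

The fused matrix has one row per sample and `k_phylo + k_abund` columns, so any distance-based analysis (ordination, clustering, PERMANOVA) can be run on it directly.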
Figure 2
PCA scatter plot visualizations of samples using the benchmarked methods.
Figure 3
Quantitative evaluation of clustering performance: A) PERMANOVA F‐statistic. B) Adjusted Rand Index of K‐means clustering. C,D) Area Under the Precision‐Recall Curve (AUPRC) and Average Precision Score (APS) of tenfold KNN classification.
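For readers unfamiliar with the statistic in panel A, the PERMANOVA pseudo-F can be computed directly from a pairwise distance matrix and group labels. The sketch below uses the standard sums-of-squares formulation and omits the permutation test that yields a p-value; the function name `permanova_f` and the toy points are hypothetical and are not taken from the paper.

```python
import numpy as np

def permanova_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = dist.shape[0]
    groups = np.unique(labels)
    a = len(groups)

    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = (dist[np.triu_indices(n, k=1)] ** 2).sum() / n

    # Within-group sum of squares: same quantity per group, divided by group size.
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        sub = dist[np.ix_(idx, idx)]
        ss_within += (sub[np.triu_indices(len(idx), k=1)] ** 2).sum() / len(idx)

    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

# Toy example (hypothetical): two well-separated groups on a line.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels = np.array(["A", "A", "B", "B"])

f_stat = permanova_f(toy_dist, toy_labels)
print(f_stat)  # 200.0 for this toy input
```

A larger pseudo-F means that between-group distances dominate within-group distances, which is why it serves as a measure of how well an embedding separates known groups.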
Figure 4
Prediction performance of chronological age using gut microbiome of individuals: A) R² of different methods. B) Effect of reducing the proportion of training data on model performance.
Figure 5
Receiver Operating Characteristic (ROC) curves and Precision‐Recall curves for binary prediction of host gender.
Figure 6
Evaluation of prediction performance for the gut microbiome‐based diagnosis of IBD. A) Comparison of performance metrics (ACC, ROC‐AUC, AUPR, F1) across various methods. B) Analysis of performance impact when varying proportions (10%, 25%, 50%, 75%, and 100%) of phylogenetic signals are removed. The analysis was repeated ten times to account for the randomness in selecting OTUs for the removal of phylogenetic signal at each proportion.
