Commun Biol. 2025 Aug 6;8(1):1159.
doi: 10.1038/s42003-025-08590-y.

Chronological age estimation from human microbiomes with transformer-based Robust Principal Component Analysis

Tyler Myers et al. Commun Biol.

Abstract

Deep learning for microbiome analysis has shown potential for understanding microbial communities and human phenotypes. Here, we propose an approach, Transformer-based Robust Principal Component Analysis (TRPCA), which leverages the strengths of transformer architectures and the interpretability of Robust Principal Component Analysis. To investigate the benefits of TRPCA over conventional machine learning models, we benchmarked performance on age prediction from three body sites (skin, oral, gut), with 16S rRNA gene amplicon (16S) and whole-genome sequencing (WGS) data. We demonstrated prediction of age from longitudinal samples and combined classification and regression tasks via multi-task learning (MTL). TRPCA improves age prediction accuracy from human microbiome samples, achieving the largest reductions in mean absolute error (MAE) for WGS skin (MAE: 8.03, 28% reduction) and 16S skin (MAE: 5.09, 14% reduction) samples, compared to conventional approaches. Additionally, TRPCA's MTL approach achieves an accuracy of 89% for birth country prediction across 5 countries, while improving age prediction from WGS stool samples. Notably, TRPCA uncovers a link between subject and prediction error through residual analysis for paired samples across sequencing methods (16S/WGS) and body sites (oral/gut). These findings highlight TRPCA's utility in improving age prediction while maintaining feature-level interpretability, and elucidating connections between individuals and microbiomes.
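
The abstract reports results as MAE values with percent reductions. As a minimal sketch of how such figures are typically computed, the snippet below defines MAE and a relative reduction. It assumes the reduction is measured against the MAE of a conventional baseline model, which the abstract does not spell out. The function names and the toy ages are hypothetical and are not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ages (years)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mae, model_mae):
    """Percent reduction of model_mae relative to baseline_mae (assumed definition)."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Toy ages invented for illustration only.
true_ages = [25.0, 40.0, 63.0, 31.0]
baseline_pred = [35.0, 30.0, 51.0, 39.0]  # hypothetical conventional model
improved_pred = [30.0, 36.0, 57.0, 35.0]  # hypothetical improved model

baseline_mae = mean_absolute_error(true_ages, baseline_pred)
improved_mae = mean_absolute_error(true_ages, improved_pred)
reduction = relative_reduction(baseline_mae, improved_mae)
print(baseline_mae, improved_mae, reduction)
```

Under this definition, a reported MAE and percent reduction together determine the baseline MAE they were compared against.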

Conflict of interest statement

Competing interests: G.R., M.L., and S.A.S. are employees of Danone. D.M. is a consultant for BiomeSense, Inc., has equity and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. A.B. is a founder of Guilden Corporation and is an equity owner. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. R.K. is a scientific advisory board member and consultant for BiomeSense, Inc., has equity and receives income. He is a scientific advisory board member and has equity in GenCirq. He is a consultant and scientific advisory board member for DayTwo and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a co-founder of Micronoma, has equity, and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego, in accordance with its conflict of interest policies. All other authors declare no competing interests.

Figures

Fig. 1. Preprocessing and model overview of transformer-based RPCA.
Samples represented as count tables are visualized and converted to RPCA vectors. RPCA vectors are input as sequences into a transformer encoder model with multi-head attention. The transformer model outputs are provided to a classification (CLS) or regression (REG) head for classification, regression, or multi-task learning.
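
The caption describes a shared encoder feeding a classification head and a regression head. One common way to train such a setup jointly is to sum a classification loss and a regression loss with fixed weights. The sketch below illustrates that idea only: the source does not state which losses or weights TRPCA uses, so the cross-entropy and absolute-error terms, the weights, and all function names here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true class index."""
    return -math.log(softmax(logits)[label])


def multitask_loss(cls_logits, cls_label, reg_pred, reg_target,
                   cls_weight=1.0, reg_weight=1.0):
    """Weighted sum of a classification and a regression loss (assumed form)."""
    cls_loss = cross_entropy(cls_logits, cls_label)
    reg_loss = abs(reg_pred - reg_target)  # absolute error, in years
    return cls_weight * cls_loss + reg_weight * reg_loss


# Hypothetical outputs of the two heads for one sample:
# five country scores (CLS head) and one predicted age (REG head).
country_logits = [0.0, 0.0, 0.0, 0.0, 0.0]
loss = multitask_loss(country_logits, cls_label=2, reg_pred=40.0, reg_target=43.0)
print(loss)
```

In practice the two terms sit on different scales (nats versus years), so the weights usually need tuning.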
Fig. 2. Regression for age prediction by sequencing method and body site.
a–c are regressions for 16S data and d–f for WGS data. a, d; b, e; and c, f indicate regressions for skin, oral, and gut microbiome samples, respectively.
Fig. 3. MTL performance on country of birth classification and age prediction for WGS Stool samples from curatedMetagenomicData.
a, b are the confusion matrix and age prediction regression for the TRPCA MTL model. c, d are RF models trained individually for the same classification and regression tasks as the TRPCA MTL model.
Fig. 4. Top 50 normalized and clustered feature importances for 16S Skin microbiome features.
The feature importances are derived from the dot product of the PCA and SHAP matrices, with rows as samples and columns as the taxonomy assigned to influential ASVs. Repeated taxonomy labels indicate ASVs that map to the same taxonomic classification. Clustering of columns indicates influential taxa groupings for age prediction, whereas clustering of rows highlights individuals with similar significant features.
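
The dot product mentioned in this caption can be read as projecting per-component attributions back onto the original features. The sketch below assumes a SHAP matrix of shape samples × components and a PCA loading matrix of shape components × features, so their product is samples × features. These shapes, the ranking by mean absolute importance, and all names and values are illustrative assumptions. The normalization and clustering steps from the figure are omitted.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for row in a]


def top_features(importances, names, k):
    """Rank features by mean absolute importance across samples."""
    n = len(importances)
    mean_abs = [sum(abs(row[j]) for row in importances) / n
                for j in range(len(names))]
    order = sorted(range(len(names)), key=lambda j: mean_abs[j], reverse=True)
    return [names[j] for j in order[:k]]


# Toy values, invented for illustration.
shap_values = [[0.5, -0.2],      # 2 samples x 2 components
               [-0.1, 0.4]]
loadings = [[1.0, 0.0, 2.0],     # 2 components x 3 features
            [0.0, 3.0, 1.0]]
feature_names = ["taxon_A", "taxon_B", "taxon_C"]

importances = matmul(shap_values, loadings)  # 2 samples x 3 features
print(top_features(importances, feature_names, 2))
```

Each row of the result keeps the sign of the attribution, so a per-sample view such as the one in Fig. 5a can be drawn from the same matrix.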
Fig. 5. Interpretable feature importances.
a Sample-level feature importances for WGS Skin microbiome samples. Features colored in blue indicate a feature that influences the prediction of that sample to be younger, whereas red is indicative of a feature that drives the prediction to be older. b Feature-level comparison of feature importances for WGS skin between the TRPCA and RF models. c A pairwise comparison of the Pearson correlation between feature importances for each model architecture. SVR, GBR, KNN, RF, and TRPCA feature importances are highly correlated for WGS Skin microbiome samples.
Fig. 6. Paired sample age residual errors.
a TRPCA predictions for 16S stool samples from the THDMI and FINRISK cohorts. b TRPCA predictions for WGS stool samples from the THDMI and FINRISK cohorts. c The correlation between prediction errors in paired 16S and WGS stool microbiome samples (R² = 0.632). d TRPCA predictions for WGS oral microbiome samples from curatedMetagenomicData. e TRPCA predictions for WGS stool microbiome samples from curatedMetagenomicData. f The correlation between prediction errors in paired WGS oral and WGS stool microbiome samples (R² = 0.339). The correlated prediction errors for paired samples (WGS stool/16S stool and WGS stool/WGS oral) imply that host-associated attributes may contribute to the residual error.
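
The paired residual comparison in panels c and f can be sketched as follows: compute each model's residual (predicted minus true age) per subject, then correlate the residuals across the two sample types. The caption does not say how R² was obtained, so the squared Pearson correlation used here is an assumption. The ages and predictions are toy values, not study data.

```python
import math


def residuals(true_ages, predicted_ages):
    """Signed prediction error per subject: predicted minus true age."""
    return [p - t for t, p in zip(true_ages, predicted_ages)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Toy paired data: the same four subjects, two sample types.
ages = [30, 45, 60, 52]
pred_16s = [34, 43, 55, 58]  # hypothetical predictions from 16S samples
pred_wgs = [36, 44, 53, 57]  # hypothetical predictions from WGS samples

res_16s = residuals(ages, pred_16s)
res_wgs = residuals(ages, pred_wgs)
r = pearson_r(res_16s, res_wgs)
r_squared = r ** 2
print(r, r_squared)
```

A high correlation means the two models miss in the same direction for the same person, which is the pattern the caption attributes to host-associated factors.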
