Nature. 2025 Nov;647(8088):248-256.
doi: 10.1038/s41586-025-09529-3. Epub 2025 Sep 17.

Learning the natural history of human disease with generative transformers

Artem Shmatko et al. Nature. 2025 Nov.

Erratum in

Abstract

Decision-making in healthcare relies on understanding patients' past and current health states to predict and, ultimately, change their future course [1-3]. Artificial intelligence (AI) methods promise to aid this task by learning patterns of disease progression from large corpora of health records [4,5]. However, their potential has not been fully investigated at scale. Here we modify the GPT [6] (generative pretrained transformer) architecture to model the progression and competing nature of human diseases. We train this model, Delphi-2M, on data from 0.4 million UK Biobank participants and validate it using external data from 1.9 million Danish individuals with no change in parameters. Delphi-2M predicts the rates of more than 1,000 diseases, conditional on each individual's past disease history, with accuracy comparable to that of existing single-disease models. Delphi-2M's generative nature also enables sampling of synthetic future health trajectories, providing meaningful estimates of potential disease burden for up to 20 years, and enabling the training of AI models that have never seen actual data. Explainable AI methods [7] provide insights into Delphi-2M's predictions, revealing clusters of co-morbidities within and across disease chapters and their time-dependent consequences on future health, but also highlight biases learnt from training data. In summary, transformer-based models appear to be well suited for predictive and generative health-related tasks, are applicable to population-scale datasets and provide insights into temporal dependencies between disease events, potentially improving the understanding of personalized health risks and informing precision medicine approaches.

Conflict of interest statement

Competing interests: A patent has been filed for the use of generative transformer architectures to model competing risk and timings of diseases (application number: PCT/EP2025/065771; applicants: DKFZ, EMBL), with M.G., A.S., T.F., E.B., K.G. and A.W.J. listed as inventors. S.B. has ownership interests in Hoba Therapeutics Aps, Novo Nordisk, Lundbeck and Eli Lilly. E.B. is a consultant and shareholder of Oxford Nanopore. The other authors declare no competing interests.

Figures

Fig. 1. Delphi, a modified GPT architecture, models health trajectories.
a, Schematic of health trajectories based on ICD-10 diagnoses, lifestyle and healthy padding tokens, each recorded at a distinct age. b, Training, validation and testing data derived from the UK Biobank (left) and Danish disease registries (right). c, The Delphi model architecture. The red elements indicate changes compared with the underlying GPT-2 model. ‘N ×’ denotes applying the transformer block sequentially N times. d, Example model input (prompt) and output (samples) comprising (age:token) pairs. e, Scaling laws of Delphi, showing the optimal validation loss as a function of model parameters for different training data sizes. f, Ablation results measured by the cross-entropy differences relative to an age- and sex-based baseline (y axis) for different ages (x axis). g, The accuracy of predicted time to event. The observed (y axis) and expected (x axis) time to events are shown for each next token prediction (grey dots). The blue line shows the average across consecutive bins of the x axis.
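As a rough illustration of how (age:token) samples and a time to event could be drawn together, the sketch below treats the candidate tokens as competing risks. It assumes, without confirmation from this page, that each token has a constant rate per year equal to the exponential of a model logit and that the earliest of the independent exponential waiting times determines the next event. The function name `sample_next_event`, the token labels and the rates are hypothetical.

```python
import math
import random


def sample_next_event(logits, age_years, rng):
    """Draw the next (age, token) pair under competing exponential risks.

    Assumption: token i has a constant rate exp(logits[i]) per year until
    the next event, and the earliest independent waiting time wins.
    """
    best_token, best_wait = None, math.inf
    for token, logit in logits.items():
        wait = rng.expovariate(math.exp(logit))  # waiting time in years
        if wait < best_wait:
            best_token, best_wait = token, wait
    return age_years + best_wait, best_token


rng = random.Random(0)
# Hypothetical tokens and yearly rates, for illustration only.
logits = {
    "I10 hypertension": math.log(0.05),
    "E11 type 2 diabetes": math.log(0.02),
    "no event": math.log(0.2),
}
age, token = sample_next_event(logits, 60.0, rng)
print(f"{age:.2f}: {token}")
```

Under these assumptions the winning token is chosen with probability proportional to its rate, and the expected waiting time is the reciprocal of the summed rates, which is the kind of expected time to event that panel g compares with observed times.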
Fig. 2. Delphi-2M accurately models the rates of a wide range of diseases.
a, The predicted rates for nine exemplary diagnoses and death (y axis) as a function of age (x axis). The points show predictions at each recorded input token. Colours separate biological sex; the darker colours indicate predictions immediately before the diagnosis in question. The purple and turquoise lines are disease rates observed for each yearly age bin in the training data. The solid black line connects consecutive predictions for one randomly selected case throughout age. b, Average age–sex-stratified AUC values (y axis) as a function of training occurrences (x axis). Shown are data for n = 906 diagnoses for male individuals and n = 957 diagnoses for female individuals for which a sufficient number of events was recorded in the validation data to evaluate AUC values. c, The same as b, but aggregated by the ICD-10 chapter. d, The same as b, aggregated by sex. e, AUC values of all diagnoses in b for different time gaps between prediction and diagnoses (x axis). f, ROC curves for Delphi and other clinical or machine learning methods for three selected end points evaluated on the internal longitudinal testing set. g, AUC values of MILTON, a biomarker-based machine learning model (x axis), in prognostic mode, compared with Delphi-2M AUC values from the UK Biobank validation set (y axis) for n = 410 diagnoses. The box plots in c–e show the median (centre line), the first to the third quartile (box limits) and the 0.025 and 0.975 quantiles (whiskers).
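One plausible reading of an "average age–sex-stratified AUC" is sketched below: compute the AUC separately inside each sex and age-bin stratum, so that age and sex alone cannot drive the discrimination, and then average over strata. The exact stratification and weighting used for the figure are not given on this page, so the unweighted mean, the stratum labels and the function names here are assumptions.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))


def stratified_auc(records):
    """Unweighted mean AUC over strata that contain both cases and controls.

    records: iterable of (stratum, predicted_rate, is_case) tuples.
    """
    strata = {}
    for stratum, rate, is_case in records:
        pos, neg = strata.setdefault(stratum, ([], []))
        (pos if is_case else neg).append(rate)
    aucs = [auc(p, n) for p, n in strata.values() if p and n]
    return sum(aucs) / len(aucs)


# Hypothetical predictions for two strata, for illustration only.
records = [
    (("F", "60-64"), 0.9, True),
    (("F", "60-64"), 0.1, False),
    (("F", "60-64"), 0.4, False),
    (("M", "60-64"), 0.2, True),
    (("M", "60-64"), 0.3, False),
    (("M", "60-64"), 0.1, False),
]
print(stratified_auc(records))  # 0.75: AUC 1.0 in the first stratum, 0.5 in the second
```

The pairwise loop is quadratic in stratum size; a rank-based formulation would be preferable for cohorts of realistic size.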
Fig. 3. Generative modelling with Delphi-2M informs future outcomes.
a, Schematic of the experiment design. Delphi-2M is used to simulate health trajectories using validation data (n = 63,622 individuals with disease records both before and after 60 years of age) observed until the age of 60. A single trajectory is simulated per individual. Subsequently, simulated trajectories are compared to the observed outcomes for the same person. b, Delphi-2M-modelled disease rates at ages 70–75 years (x axis) compared with observed rates at the same ages (y axis). c, The fraction of correctly predicted diagnoses (y axis) per 1-year age bin as a function of the years after simulation started at age 60 years (x axis). Delphi-2M, orange. The blue curve uses age and sex as a prediction baseline. d, Simulated (x axis) and observed (y axis) fold changes of disease rates for high versus low smoking, alcohol consumption and BMI groups. The evaluation period included ages 70–75 years and used simulations from the age of 60 years. e, The same as b, evaluated for simulations from birth. f, The AUC values of disease risk prediction (n = 1,334 disease–sex pairs) for Delphi when trained on UKB and Delphi-2M-sampled synthetic data (Methods). The box plots show the median (centre line), the first to the third quartile (box limits) and the 0.025 and 0.975 quantiles (whiskers).
Fig. 4. Explainable AI offers insights into disease progression.
a, UMAP projection of token embeddings. Selected diseases are shown in the magnified areas. Colours define disease chapters. b, SHAP-explained token risk contributions for individual trajectories. Top, the risk of pancreatic cancer immediately before diagnosis at age 68.2 years, which was found to be 19× increased. Bottom, the SHAP estimates of contributions to estimated mortality at age 63.5 years, which was greatly increased, in large part due to the preceding diagnosis of pancreatic cancer. c, The average SHAP effect of each of n = 778 disease tokens with more than 5 occurrences and grouped by chapter (y axis) on the same set of tokens plus death (x axis). The red colours indicate a risk increase, whereas blue indicates a decrease. d, Rate change (SHAP value) of mortality (y axis) as a function of time after diagnosis (x axis) for selected diseases.
Fig. 5. Epidemiological biases in UK Biobank data reflected by Delphi-2M.
a, Comparison between AUC values in the UKB longitudinal testing and the external testing using Danish data. b, Yearly mortality estimates by Delphi-2M (UK validation cohort), observed rates in the UK Biobank and Office for National Statistics national estimates across the entire British population. As only living individuals between 40 and 70 years of age (black line) were recruited to the UK Biobank, many deaths are missing compared with the Office for National Statistics population estimate (grey shaded area). c, UpSet plot of disease data availability in the UK Biobank validation cohort (n = 100,639) (top). Bottom, the data source distribution for records per disease token (token position sorted). d, Data source distribution for records per disease token (token position sorted). e, Hospital-record missingness bias (relative rate after first hospital token; y axis) as a function of token exclusivity to hospital records (x axis) for each relevant diagnosis (points). Points are coloured by the ICD-10 chapter, the overall trend is shown in black using a nonparametric (loess) curve with 95% confidence intervals shown in grey (UK validation cohort). f, Primary care missingness bias (y axis) as a function of primary care token exclusivity, coloured by the ICD-10 chapter. The trend is shown in black using a nonparametric (loess) curve with 95% confidence intervals shown in grey (UK validation cohort). g, SHAP value matrix, similar to Fig. 4c. The columns and rows correspond to different diseases and are sorted by the dominating source, then ICD-10 chapter. A dominating source is defined as the origin of more than 65% of records for a given disease; diseases without a dominating source are not shown. SHAP values indicate the greater influence of diseases on other diseases from the same group. SR, self-reported.
Extended Data Fig. 1. Effect of the “no event” padding token.
a, Boxplots (n = 3 model replicates trained with different seeds) of the average loss (y-axis; lower is better) for Delphi-2M trained with different “no event” padding rates (inverse scale, x-axis). The y-axis shows the average cross-entropy loss, calculated over disease tokens only - that is, without padding tokens, sex and lifestyle tokens. UK Biobank validation data was used to calculate the reported losses. The boxplots feature the median as the center line, the box from the first to the third quartile and the whiskers for 1.5x IQR. b, Average cross-entropy loss, aggregated over 5-year age bins. A higher rate of “no event” tokens lowers the loss, especially for younger ages, during which generally few disease tokens are recorded, prohibiting the model from adjusting predictions for advancing age. c, “No event” token rate estimated by Delphi (y-axis) vs the true rate at which tokens were added to the training data. The boxplots feature the median as the center line, the box from the first to the third quartile and the whiskers for 1.5x IQR. n = 4000 random timepoints from the validation dataset trajectories, selected for “no event” token rate evaluation.
Extended Data Fig. 2. Parameter screen.
a, Validation cross-entropy (rightmost axis) for models trained with different architectural hyperparameter values (other axes). b, Same data as a, showing validation loss (y-axis) against each model parameter (x-axis). The boxplots (n = 486 independently trained models within each panel in total) feature the median as the center line, the box from the first to the third quartile and the whiskers for 1.5x IQR, clipped at min/max data points. c, Random-forest-based importance of different hyperparameters and their correlation with validation loss.
Extended Data Fig. 3. Calibration of Delphi-2M’s instantaneous predictions.
a, Shown are results for 9 selected diseases and death on validation data for age groups of 5 years and both sexes. Predictions in each age-sex stratum are grouped into bins of powers of 10 (x-axis, average within each bin), and observed rates are calculated from validation data for predictions falling into each bin (y-axis). b, Calibration plots on the Danish longitudinal testing data. Each line represents an ICD-10 disease evaluated for each decile of the Delphi rate and compared against the observed rate in the population.
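The binning described for panel a can be sketched as follows: predictions are grouped by order of magnitude, the mean predicted rate in each bin gives the x value, and an observed rate for the same predictions gives the y value. How the observed rate is computed is not specified on this page; the sketch assumes events per person-year of follow-up, and the function name and inputs are hypothetical.

```python
import math
from collections import defaultdict


def calibration_by_decade(predicted_rates, observed_events, exposure_years):
    """Compare mean predicted rate with observed events per person-year,
    grouping predictions into power-of-10 bins via floor(log10(rate))."""
    bins = defaultdict(lambda: [0.0, 0, 0.0, 0.0])  # sum_pred, n, events, exposure
    for rate, event, years in zip(predicted_rates, observed_events, exposure_years):
        b = bins[math.floor(math.log10(rate))]
        b[0] += rate
        b[1] += 1
        b[2] += event
        b[3] += years
    # Map bin exponent -> (mean predicted rate, observed rate).
    return {k: (b[0] / b[1], b[2] / b[3]) for k, b in sorted(bins.items())}


# Hypothetical predictions, outcomes and follow-up, for illustration only.
result = calibration_by_decade(
    predicted_rates=[0.002, 0.005, 0.03, 0.05],
    observed_events=[0, 0, 0, 1],
    exposure_years=[1.0, 1.0, 1.0, 1.0],
)
for exponent, (mean_pred, obs) in result.items():
    print(f"1e{exponent}: predicted {mean_pred:.4f}, observed {obs:.4f}")
```

A well-calibrated model would place these points close to the diagonal; in practice the computation would be repeated within each age-sex stratum, as the caption describes.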
Extended Data Fig. 4. Assessment of Delphi-2M in relation to other baseline models and stratifications.
a, Comparison of Delphi-2M against clinical biomarkers for selected diseases performed using the UKB validation dataset. Predictions are based on the information available at recruitment and evaluated over the subsequent 5 years. CLD: Chronic liver disease. Mod: Logistic regression model of several clinical markers. MCV: Mean corpuscular volume. b, AUC results comparing Delphi-2M to a simple disease predictor of Overall health rating UKB data field 2178. AUC values for field 2178 as a predictor for future health events (after the date of recruitment) (x-axis) against the AUC values from Delphi using the UKB validation data. c. Boxplot, showing the prediction AUCs for Delphi, split over sex, disease chapter and lifestyle factors, such as alcohol consumption, smoking and BMI. The boxplots feature the median as the center line, the box from the first to the third quartile and the whiskers for 1.5x IQR, clipped at min/max data points. Shown are data for n = 906 diagnoses for males and n = 957 diagnoses for females for which sufficiently many events were recorded in the validation data to evaluate AUCs.
Extended Data Fig. 5. Integrating Delphi-2M predictions with other data types.
Results of a linear regression model that uses Delphi logits and additional features to predict 5-year disease occurrence for selected diseases. Shown is the average validation AUC across 5-year age groups ranging from 40 to 80 years of age, additionally stratified by sex. All models use sex and age as additional covariates. For prediction, only data before recruitment was used. As additional features, models use polygenic risk scores (PRS, a), 57 biomarkers used in the MILTON study (b) and UKB field 2178 Overall health rating status (c).
Extended Data Fig. 6. Assessment of simulated health trajectories.
All simulations are from the age of 60 onwards and use validation data. a, Simulated (x-axis) and observed (y-axis) annual disease rates during ages 70–75 for high and low smoking, alcohol consumption and BMI groups. b, Simulated and observed incidences for selected prior diseases. Same data as in a, but grouped for different prior diseases. c, Fold changes for the groups with and without prior diseases shown in b. d, Delphi accurately stratifies trajectories into low-, mid- and high-risk groups for selected diagnoses and death. Cumulative incidence (y-axis) as a function of age (x-axis). Risk groups are based on the top 1% and bottom 5% risk at the age of 60 years when simulations started. The low-risk group percentile was chosen to be larger to include sufficient cases for evaluation. Orange curves denote Delphi-2M simulations, blue observed data.
Extended Data Fig. 7. Comparison of SHAP values and Cox proportional hazards coefficients.
Shown are analyses for 10 selected diseases, as stated in the titles. SHAP values (x-axis) are estimated by averaging individual values from different trajectories. Cox proportional hazard coefficients (y-axis) are estimated using a proportional hazards model with parameter regularization, resulting in a high number of zero coefficients. The non-zero Cox coefficients and SHAP show a high correlation.
Extended Data Fig. 8. Relation of token embedding space and SHAP effects.
a. Disease embedding UMAP, coloured by the disease ICD-10 chapter. b. UMAP scatter plot, coloured by the SHAP disease rate change for the disease of interest, denoted by a cross marker. According to the SHAP analysis, diseases with similar embeddings tend to have a greater effect on the predicted rate of each other. Top row, the effect of the selected disease on the rate of other subsequent diseases. Bottom row, the effect of other diseases on the selected disease. c. Same as b, more diseases.
Extended Data Fig. 9. Token source-related biases.
Non-random missingness may cause biases in predictions even when sources are not explicitly provided to the model. a. Disease embedding UMAP for a Delphi model with explicit token sources (e.g. “Common cold (self-reported)” and “Common cold (hospital records)” are separate tokens), tokens coloured by ICD-10 chapters. b. Same as a, coloured by token source. c. Same as a, but for the standard Delphi-2M model. Only tokens with more than 75% of all entries from one source are shown. d. Same as c, coloured by primary token source. e. SHAP value matrix (similar to Fig. 4c), with tokens grouped by chapter and primary source.
Extended Data Fig. 10. Effects of ethnicity and deprivation.
a, Modelled rate per year separated by sex and ethnic background. b, Modelled rate per year separated by sex and Townsend deprivation index bins (increasing for greater deprivation index values). The boxplots in a and b use the entire validation cohort (n = 100639 individual trajectories) and feature median as the center line, the box from the first to the third quartile, the whiskers for 1.5x IQR and the outliers. c-d, Average number of disease tokens per year, shown for different ethnicities (c) and deprivation indices (d). e-f, Age and sex stratified AUCs for 10 selected diseases. AUCs are averaged across 5-year age groups ranging from 40 to 80 years of age. The same average is used as the center for error bars. AUCs for individual age and sex brackets are shown as grey dots. 95% confidence intervals are calculated using DeLong’s method. g-h, Width of DeLong’s 95% confidence intervals for AUC vs number of cases, shown for different ethnicities and deprivation strata. For rare diseases, AUC estimates have high variance. i, Standard deviation between AUC estimates for different strata vs number of cases of this disease for the training dataset. Each dot represents a disease. j, Average validation AUC across 5-year age groups ranging from 40 to 80 years of age, aggregated by the corresponding ICD chapters. Difference between average AUCs calculated for participants with birth years before 1944 and after 1960. The boxplots feature the median as the center line, the box from the first to the third quartile and the whiskers for 1.5x IQR, clipped at min/max data points. Shown are data for n = 906 diagnoses for males and n = 957 diagnoses for females for which sufficiently many events were recorded in the validation data to evaluate AUCs.

References

    1. Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 9, 224 (2018).
    2. Link, B. G. & Phelan, J. Social conditions as fundamental causes of disease. J. Health Soc. Behav. doi:10.2307/2626958 (1995).
    3. Nyberg, S. T. et al. Association of healthy lifestyle with years lived without major chronic diseases. JAMA Intern. Med. 180, 760–768 (2020).
    4. Kraljevic, Z., Yeung, J. A., Bean, D., Teo, J. & Dobson, R. J. Large language models for medical forecasting—foresight 2. Preprint at https://arxiv.org/abs/2412.10848 (2024).
    5. Yang, L. et al. Advancing multimodal medical capabilities of Gemini. Preprint at https://arxiv.org/abs/2405.03162 (2024).