Nat Commun. 2024 Feb 21;15(1):1603.
doi: 10.1038/s41467-024-45879-8.

Large language models streamline automated machine learning for clinical studies

Soroosh Tayebi Arasteh et al. Nat Commun. 2024.

Abstract

A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (p ≥ 0.072). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.

Conflict of interest statement

J.N.K. declares consulting services for Owkin, France; DoMore Diagnostics, Norway; and Panakeia, UK. Furthermore, J.N.K. holds shares in StratifAI GmbH and has received honoraria for lectures from Bayer, Eisai, MSD, BMS, Roche, Pfizer, and Fresenius. D.T. holds shares in StratifAI GmbH, Germany, and has received honoraria for lectures from Bayer. The other authors declare no competing interests.

Figures

Fig. 1. Study design.
Real-world datasets and study details from four large clinical trials were collected and input into the ChatGPT Advanced Data Analysis (ADA) tool. The tool autonomously selected the appropriate machine-learning models for the analysis following prompting. The models were expert-checked and comprehensively evaluated. The ChatGPT ADA-based predictions were compared to the original studies (benchmark publication) and the validatory predictions following the re-implementation of the models. Figure 1 was provided by a freelancer service (fiverr.com). Copyright rests with the authors. The figure constitutes original material and has not been published before.
Fig. 2. Screenshots of an example interaction with ChatGPT ADA to analyze the endocrinologic oncology dataset.
ChatGPT ADA autonomously selects and applies the appropriate ML model for the provided dataset, generating predictions for the test data. The model also displays deeper insights in response to follow-up queries about the reasoning and parameters guiding its choices. Note: The “Show work” option visible in the images allows users to view the intermediary Python code offered by the tool.
Fig. 3. Benchmark validatory re-implementation—receiver operating characteristic (ROC) curves of ML models as a function of the clinical-trial dataset.
The ROC curves of the ChatGPT ADA-based ML model (blue, solid curve) and the validatory ML model as re-implemented by a seasoned data scientist (red, dotted curve) are shown. The True Positive Rate (sensitivity) is plotted against the False Positive Rate (1 - specificity). The diagonal gray line represents the line of no discrimination. Bootstrapping with replacement and 1000 redraws on the test sets (number of independent samples: Endocrinologic Oncology dataset, n = 295; Gastrointestinal Oncology dataset, n = 6698; Otolaryngology dataset, n = 569; Cardiology dataset, n = 430) was applied to determine means and measures of statistical spread, i.e., standard deviations and 95% confidence intervals (CI). Source data are provided as a Source Data file. Abbreviations: AUROC, area under the receiver operating characteristic curve; ChatGPT ADA, ChatGPT Advanced Data Analysis.
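The bootstrapping procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `auroc` and `bootstrap_auroc`, the percentile method for the 95% CI, the fixed random seed, and the toy labels and scores are all assumptions made for the example.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_redraws=1000, seed=0):
    """Resample the test set with replacement and summarize the AUROC.

    Returns (mean, standard deviation, (lower, upper)) where the interval
    is a percentile-based 95% CI (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_redraws:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined for a single-class redraw
        values.append(auroc(ys, [scores[i] for i in idx]))
    values.sort()
    lower = values[int(0.025 * n_redraws)]
    upper = values[int(0.975 * n_redraws) - 1]
    return statistics.mean(values), statistics.stdev(values), (lower, upper)


# Hypothetical test-set labels and predicted probabilities.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7, 0.6, 0.3]
mean_auc, sd_auc, ci_auc = bootstrap_auroc(toy_labels, toy_scores)
print(f"AUROC mean={mean_auc:.3f} sd={sd_auc:.3f} 95% CI={ci_auc}")
```

The pairwise AUROC computation is quadratic in the sample size and is used here only to keep the sketch dependency-free; for test sets of the sizes listed in the caption, a rank-based implementation from a statistics library would be the more practical choice.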
Fig. 4. Model explainability through the top 10 predictive features for the ChatGPT ADA-selected machine-learning models.
An explainability analysis was performed for each clinical trial and its ChatGPT ADA-selected machine-learning model: (a) Metastatic Disease [Endocrinologic Oncology], (b) Oesophageal Cancer [Gastrointestinal Oncology], (c) Hereditary Hearing Loss [Otolaryngology], and (d) Cardiac Amyloidosis [Cardiology]. Indicated are SHapley Additive exPlanations (SHAP) values of each predictive feature, which measure the feature’s influence on model predictions. High absolute SHAP values signify substantial influence. The features are ranked from top to bottom based on the mean absolute SHAP values (color-coded on the right). In (c), specific gene locations are indicated. Please refer to the Methods for more details on abbreviations. Box plots indicate the ranges (x-axes) of each feature (y-axes). Crosses indicate (arithmetic) means, boxes the ranges from the first (Q1) to the third (Q3) quartile, with the central line representing the median (second quartile, Q2). Whiskers extend to 1.5 times the interquartile range above Q3 and below Q1. Any data point outside this range is considered an outlier (dots). Note the different scales for the color codes and SHAP values. Source data are provided as a Source Data file. ChatGPT ADA performed the SHAP analysis on the training sets (number of independent samples: Endocrinologic Oncology dataset, n = 493; Gastrointestinal Oncology dataset, n = 7899; Otolaryngology dataset, n = 1209; Cardiology dataset, n = 1712). Abbreviations: Plasma MN, plasma concentrations of metanephrine; Plasma NMN, plasma concentrations of normetanephrine; SDHB, succinate dehydrogenase complex iron-sulfur subunit B; Plasma MTY, plasma concentrations of methoxytyramine; AGC, atypical glandular cells; DNA, deoxyribonucleic acid; Chron., chronic; Cong., congenital; Dias., diastolic; Sys., systolic. Note: The feature “Hyp. heart w/ HF & Stg 1–4 Unsp. CKD” refers to “Hypertensive heart with heart failure coexisting with unspecified stage 1–4 chronic kidney disease”, while “Prev. hist. PGGLs” refers to “Previous history of Pheochromocytomas and Paragangliomas”.
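The ranking and box-plot conventions in this caption can be illustrated with a short sketch. It assumes per-sample SHAP values have already been computed by some explainer; the helper names `rank_by_mean_abs` and `box_stats`, the placeholder feature names, the inclusive quartile method, and all numbers are hypothetical and are not taken from the study.

```python
import statistics


def rank_by_mean_abs(shap_values):
    """Rank features by mean absolute SHAP value, most influential first.

    `shap_values` maps a feature name to its per-sample SHAP values.
    """
    means = {
        name: statistics.fmean([abs(v) for v in vals])
        for name, vals in shap_values.items()
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


def box_stats(values):
    """Quartiles, mean, 1.5 x IQR fences, and outliers for one box plot."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # whisker limit below Q1
    high_fence = q3 + 1.5 * iqr  # whisker limit above Q3
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "mean": statistics.fmean(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "fences": (low_fence, high_fence),
        "outliers": outliers,
    }


# Hypothetical SHAP values for three placeholder features.
toy_shap = {
    "feature_a": [0.10, -0.10, 0.05, -0.15],
    "feature_b": [-0.50, 0.30, 0.45, -0.35],
    "feature_c": [0.00, 0.40, -0.20, 0.20],
}
for name, mean_abs in rank_by_mean_abs(toy_shap):
    print(name, round(mean_abs, 3), box_stats(toy_shap[name])["fences"])
```

Averaging absolute values rather than signed values keeps features whose positive and negative contributions cancel out from being ranked as unimportant.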
