Red teaming ChatGPT in medicine to yield real-world insights on model behavior
- PMID: 40055532
- PMCID: PMC11889229
- DOI: 10.1038/s41746-025-01542-0
Abstract
Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical to improving the equity and accuracy of large language models, but red teaming by groups unaffiliated with model creators is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants total) to stress-test models with real-world clinical cases and to categorize inappropriate responses along the axes of safety, privacy, hallucinations/accuracy, and bias. Six medically trained reviewers re-analyzed the prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after our event (20.4% inappropriate). 21.5% of responses that were appropriate with GPT-3.5 were inappropriate in updated models. We share insights for constructing red teaming prompts and present our benchmark for iterative model assessments.
© 2025. The Author(s).
Conflict of interest statement
Competing interests: RD has served as an advisor to MDAlgorithms and Revea; received consulting fees from Pfizer, L’Oreal, Frazier Healthcare Partners, and DWA, and research funding from UCB; and declares no non-financial competing interests. All other authors declare no financial or non-financial competing interests.