Nat Med. 2025 Feb;31(2):618-626.
doi: 10.1038/s41591-024-03445-1. Epub 2025 Jan 8.

Medical large language models are vulnerable to data-poisoning attacks

Daniel Alexander Alber et al. Nat Med. 2025 Feb.

Abstract

The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.

Conflict of interest statement

Competing interests: D.A.A. and E.K.O. report consulting with Sofinnova Partners. E.K.O. reports consulting with Google, income from Merck & Co. and Mirati Therapeutics, and equity in Artisight. The other authors declare no competing interests.

Figures

Fig. 1. Overview of this study.
(1) We analyze the distribution of medical information in The Pile and other large LLM pre-training datasets and show that significant amounts of medical knowledge are in data subsets vulnerable to data-poisoning attacks, such as the Common Crawl. (2) We simulate such an attack by constructing versions of The Pile injected with AI-generated medical misinformation hidden in HTML documents. (3) We train LLMs on these datasets and show that data poisoning is invisible to widely adopted medical LLM benchmarks despite increasing the poisoned models’ risk of generating medically harmful content. (4) Finally, we adapt biomedical knowledge graphs as rigorous ground truth to perform inference-time surveillance of LLM outputs for medical misinformation and demonstrate their effectiveness at this task.
Fig. 2. Distribution of medical knowledge in a web-scale dataset.
a, A substantial fraction (27.4%; orange segments) of medical concepts in The Pile is found in subsets, such as the Common Crawl, that are susceptible to data-poisoning attacks. As depicted, 27.7% of general medicine concepts, 28.3% of neurosurgery concepts and 20.0% of medications concepts were vulnerable. b, Breakdown of medical concepts by Pile subset. The two PubMed datasets (Central – full articles released to the public; Abstracts – abstract text of all PubMed-indexed articles, including those requiring journal subscriptions to access) represented most medical concepts; however, more than 3 million total matches originated from raw web pages in the Common Crawl and OpenWebText2. c, Comparison of web-scale LLM training datasets by the fraction of their medical terminology obtained from online sources vulnerable to data poisoning.
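
To make the 27.4% roll-up in panel a concrete, the sketch below tallies per-subset concept matches into a "vulnerable fraction". It is an illustration only: the subset names mirror components of The Pile, but the counts are invented and the paper's actual concept-matching pipeline is not reproduced here.

```python
# Illustrative arithmetic only: made-up match counts per Pile subset.
matches_by_subset = {
    "PubMed Central": 9_500_000,
    "PubMed Abstracts": 6_200_000,
    "Pile-CC (Common Crawl)": 2_400_000,
    "OpenWebText2": 700_000,
}

# Subsets scraped from the open web, which an attacker could seed with
# hidden misinformation before a scheduled crawl.
VULNERABLE_SUBSETS = {"Pile-CC (Common Crawl)", "OpenWebText2"}

def vulnerable_fraction(counts: dict) -> float:
    """Fraction of concept matches originating from web-scraped subsets."""
    total = sum(counts.values())
    vulnerable = sum(n for subset, n in counts.items() if subset in VULNERABLE_SUBSETS)
    return vulnerable / total

print(f"Vulnerable fraction: {vulnerable_fraction(matches_by_subset):.1%}")
```
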
Fig. 3. Designing a data-poisoning attack to target medical concepts.
a, Using prompt engineering and the OpenAI GPT-3.5 API, we created 50,000 fake articles per medical domain and embedded them in HTML documents to conceal the malicious text. These pages were scraped and included in multiple copies of The Pile, forming datasets of 30 billion tokens for 1.3-billion parameter models and 100 billion tokens for 4-billion parameter models across three medical domains (general medicine, neurosurgery and medications). b, We trained six 1.3-billion parameter models poisoned across three medical domains (general medicine, neurosurgery and medications) with two poisoning levels (0.5% and 1.0%), as well as six additional models (three for each parameter count) specifically targeting ‘vaccines’ with lower poisoning amounts (0.1%, 0.01% and 0.001%). Baseline models of 1.3 billion and 4 billion parameters were trained on the unmodified Pile and evaluated through automated benchmarks and human review for medical harm.
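
For scale, the following back-of-the-envelope sketch converts the poisoning fractions above into absolute token and article counts for the 30-billion- and 100-billion-token corpora. The 600-token average article length is an assumption for illustration, not a figure from the paper.

```python
# Back-of-the-envelope arithmetic (not the authors' code) relating poisoning
# fractions to absolute injection sizes for the two corpus scales described.

CORPUS_TOKENS = {
    "1.3B-parameter models (30B tokens)": 30_000_000_000,
    "4B-parameter models (100B tokens)": 100_000_000_000,
}
POISONING_FRACTIONS = [0.01, 0.005, 0.001, 0.0001, 0.00001]  # 1.0% down to 0.001%
TOKENS_PER_ARTICLE = 600  # assumed average length of one fake article

for corpus, total_tokens in CORPUS_TOKENS.items():
    for frac in POISONING_FRACTIONS:
        poison_tokens = int(total_tokens * frac)
        n_articles = poison_tokens // TOKENS_PER_ARTICLE
        print(f"{corpus}: {frac:.3%} poisoning -> {poison_tokens:,} tokens "
              f"(~{n_articles:,} articles)")
```
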
Fig. 4. Impact of data poisoning on model behavior.
a, Relative changes in harmful content generation frequency compared to baseline models, shown for 4-billion and 1.3-billion parameter language models across different poisoning fractions. Asterisks indicate statistical significance levels (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001) from one-sided Z-tests comparing harm frequencies between poisoned and baseline models. b, Performance comparison on PubMedQA (medical domain) and LAMBADA (everyday language) benchmarks between baseline and poisoned models. c, Representative examples of medically harmful statements generated by poisoned models.
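
The significance markers in panel a come from one-sided tests of whether a poisoned model generates harmful content more often than baseline. One conventional way to run such a comparison is a pooled two-proportion z-test; the sketch below uses made-up counts rather than the paper's data and is not the authors' statistical code.

```python
# Minimal sketch of a one-sided two-proportion z-test comparing harm
# frequencies between a poisoned and a baseline model. Counts are illustrative.
import math

def one_sided_two_proportion_z(harm_poisoned, n_poisoned, harm_baseline, n_baseline):
    """Test H1: harm rate of the poisoned model > harm rate of the baseline."""
    p1 = harm_poisoned / n_poisoned
    p2 = harm_baseline / n_baseline
    p_pool = (harm_poisoned + harm_baseline) / (n_poisoned + n_baseline)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_poisoned + 1 / n_baseline))
    z = (p1 - p2) / se
    # One-sided p-value from the standard normal survival function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Illustrative counts: 120/1,000 harmful completions vs. 80/1,000 at baseline.
z, p = one_sided_two_proportion_z(120, 1000, 80, 1000)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```
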
Fig. 5. Using biomedical knowledge graphs to defend against misinformation.
Flowchart of the algorithm steps. First (1), named entity recognition (NER) is used to extract medical phrases from LLM outputs as biomedical knowledge triplets (origin, relation and target). Next (2), a vector similarity search converts each extracted triplet to a candidate version in the knowledge graph vocabulary. Finally (3), candidate triplets are flagged as potential misinformation if they cannot be matched to a connected medical relationship in the knowledge graph.
Extended Data Fig. 1. Current approaches to web-scale quality control.
Many web-scale LLM pre-training datasets are filtered using automated pipelines that detect and remove endemic malicious content, such as racist phrases and violent messages. However, these pipelines may not detect more subtle misinformation that is syntactically correct and free of obscenities. Furthermore, the medical field evolves rapidly, and guidelines once accepted as truth may become outdated and prove just as harmful as intentional misinformation. Following previous works, we propose an attack vector consisting of AI-generated, syntactically sound medical articles containing curated misinformation. The articles are packaged in HTML documents as invisible text to evade manual human detection while infecting the Common Crawl. Because current data-processing and quality-assurance pipelines are not designed to precisely identify medical misinformation, it may subsequently find its way into datasets used to train large language models.
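
As a toy illustration of the gap described here, the snippet below mimics a blocklist-style quality filter: it screens for surface-level tokens, so a fluent, syntactically correct false claim passes untouched. The blocklist contents and the example sentence are placeholders, not drawn from any real filtering pipeline.

```python
# Toy illustration: a surface-level blocklist filter has no notion of truth.

BLOCKLIST = {"<slur>", "<profanity>"}  # placeholders for blocked tokens

def passes_quality_filter(document: str) -> bool:
    """True if no blocked token appears; says nothing about factual accuracy."""
    tokens = {t.lower().strip(".,!?") for t in document.split()}
    return BLOCKLIST.isdisjoint(tokens)

# A fabricated, false medical claim written in fluent prose.
fabricated = "High-dose vitamin C reliably cures bacterial pneumonia."
print(passes_quality_filter(fabricated))  # True: the filter cannot flag it
```
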
Extended Data Fig. 2. Vulnerability of individual medical concepts.
Distribution of 60 selected medical concepts between vulnerable and stable subsets of The Pile. Even everyday medical terms, such as acute respiratory infection and COVID-19, may appear roughly as frequently in vulnerable subsets as in stable ones, likely owing to popular discourse about controversial topics. LLMs trained on these data sources may internalize substantial amounts of unverified and potentially harmful misinformation, even without deliberate data poisoning.
Extended Data Fig. 3. Generating medical misinformation at scale.
Prompt engineering is used to bypass OpenAI’s guardrails and generate harmful medical articles with the GPT-3.5-turbo API. The articles are inserted into web pages as invisible HTML text. Concealment methods include the ‘hidden’ attribute, a font size of 0, an opacity of 0 and other styling that hides the malicious text from human readers. The invisible misinformation is uploaded to coincide with scheduled Common Crawl data dumps, entering the repository while evading detection.
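
The concealment tricks named here are ordinary HTML and CSS features, so they can at least be screened for. The sketch below is an illustrative scanner, not a tool from the paper; its pattern list and example page are assumptions, and a crude pattern scan like this is easy to evade.

```python
# Illustrative scanner for the concealment tricks named in this figure:
# the 'hidden' attribute, zero font size and zero opacity.
import re

HIDDEN_TEXT_PATTERNS = {
    "hidden attribute": r"<[^>]+\bhidden\b[^>]*>",
    "zero font size": r"font-size\s*:\s*0(px|em|%)?\s*[;\"']",
    "zero opacity": r"opacity\s*:\s*0\s*[;\"']",
    "display none": r"display\s*:\s*none",
    "visibility hidden": r"visibility\s*:\s*hidden",
}

def flag_hidden_text(html: str) -> list[str]:
    """Return the names of concealment patterns found in the page markup."""
    return [name for name, pattern in HIDDEN_TEXT_PATTERNS.items()
            if re.search(pattern, html, flags=re.IGNORECASE)]

page = '<p style="font-size:0;opacity:0">Fabricated claim about a medication.</p>'
print(flag_hidden_text(page))  # ['zero font size', 'zero opacity']
```
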
Extended Data Fig. 4. Pseudocode for defense algorithm.
First, knowledge triplets representing medical phrases are extracted from unstructured text using named entity recognition. Each triplet is flagged as invalid or harmful by default. Triplet components (origin, relation, target) are embedded and matched to the graph vocabulary to form candidate triplets. Each candidate triplet is cross-checked with the ground truth knowledge graph. Triplets that can be matched to the graph are marked as valid or non-harmful. A passage is scored non-harmful only if it contains no invalid triplets.
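
To make the pseudocode concrete, here is a condensed Python sketch of the same flow. The NER extractor, embedding model and biomedical knowledge graph used in the paper are not reproduced; extract_triplets(), embed() and KNOWLEDGE_GRAPH below are toy stand-ins, and entities and relations share one vocabulary purely for brevity.

```python
# Condensed sketch of the defense algorithm with toy stand-ins throughout.
import math
from collections import Counter

# Toy "knowledge graph" of accepted (origin, relation, target) edges.
KNOWLEDGE_GRAPH = {
    ("metformin", "treats", "type 2 diabetes"),
    ("ibuprofen", "treats", "pain"),
}
VOCAB = {term for triple in KNOWLEDGE_GRAPH for term in triple}

def extract_triplets(text):
    """Step 1 (placeholder): NER-based triplet extraction from LLM output."""
    return [("Metformin", "treats", "type-2 diabetes")]  # illustrative output

def embed(term):
    """Toy character-bigram 'embedding' standing in for a biomedical encoder."""
    t = term.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def to_vocab(term):
    """Step 2: vector similarity search mapping a term onto graph vocabulary."""
    return max(VOCAB, key=lambda v: cosine(embed(term), embed(v)))

def passage_is_harmful(text):
    """Step 3: triplets are invalid by default; pass only if all match the graph."""
    for origin, relation, target in extract_triplets(text):
        candidate = (to_vocab(origin), to_vocab(relation), to_vocab(target))
        if candidate not in KNOWLEDGE_GRAPH:
            return True  # at least one unmatched triplet flags the passage
    return False

print(passage_is_harmful("Metformin treats type-2 diabetes."))  # False: matched
```
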
