Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models

Daniel-Corneliu Leucuța¹, Andrada Elena Urda-Cîmpean¹, Dan Istrate¹, Tudor Drugan¹

PMID: 40564772
PMCID: PMC12191753
DOI: 10.3390/diagnostics15121451

Daniel-Corneliu Leucuța et al. Diagnostics (Basel). 2025 Jun 6;15(12):1451. doi: 10.3390/diagnostics15121451.

Affiliation

¹ Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.

Abstract

Background/Objectives: Diagnostic accuracy studies are essential for evaluating the performance of medical tests. The risk of bias (RoB) in these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB of diagnostic accuracy studies using QUADAS 2, compared with human experts.

Methods: Four LLMs were used for the AI assessment: ChatGPT 4o, X.AI Grok 3, Gemini 2.0 Flash, and DeepSeek V3. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by human experts and by the LLMs using QUADAS 2.

Results: Each of the four AI models produced 110 signaling question assessments (11 questions for each of the 10 articles), and the mean percentage of correct assessments across all models was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% to 67.27%. When analyzed by domain, the most accurate responses were for "flow and timing", followed by "index test", with "patient selection" and "reference standard" scoring lowest at similar levels. An extensive list of reasoning errors was documented.

Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB of diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, provided that human supervision is mandatory.

Keywords: artificial intelligence; diagnostic accuracy; evidence-based medicine; large language models; risk of bias.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Figure 1**
Correct responses for signaling questions by the domain of the QUADAS-2 risk of bias tool by large language models.

**Figure 2**
Correct responses for signaling questions of the QUADAS-2 risk of bias tool by large language models. An assessment was considered correct if both the answer and the reasoning for the argument were correct.
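The scoring rule in the Figure 2 caption (an assessment counts as correct only when both the answer and the reasoning are correct) and the per-model and mean accuracies reported in the abstract can be illustrated with a minimal Python sketch. The record layout, the function names, the placeholder model names, and the toy data below are all assumptions made for illustration; they are not the authors' code or data, and the resulting numbers do not correspond to any result in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One LLM response to one signaling question (hypothetical layout)."""
    model: str
    article: int
    question: int
    answer_correct: bool      # answer matches the human expert reference
    reasoning_correct: bool   # justification judged valid by the experts


def is_correct(a: Assessment) -> bool:
    # Rule from the Figure 2 caption: both answer and reasoning must be correct.
    return a.answer_correct and a.reasoning_correct


def accuracy_by_model(assessments) -> dict:
    """Percentage of correct assessments for each model."""
    totals, hits = {}, {}
    for a in assessments:
        totals[a.model] = totals.get(a.model, 0) + 1
        hits[a.model] = hits.get(a.model, 0) + int(is_correct(a))
    return {m: 100.0 * hits[m] / totals[m] for m in totals}


def mean_accuracy(per_model: dict) -> float:
    """Unweighted mean of the per-model percentages."""
    return sum(per_model.values()) / len(per_model)


# Toy data (hypothetical): 2 placeholder models x 2 articles x 2 questions.
# The study itself used 4 models x 10 articles x 11 questions.
toy = [
    Assessment("model_a", 1, 1, True, True),
    Assessment("model_a", 1, 2, True, False),   # right answer, wrong reasoning
    Assessment("model_a", 2, 1, True, True),
    Assessment("model_a", 2, 2, True, True),
    Assessment("model_b", 1, 1, True, True),
    Assessment("model_b", 1, 2, False, False),
    Assessment("model_b", 2, 1, False, True),   # wrong answer is never correct
    Assessment("model_b", 2, 2, True, True),
]

per_model = accuracy_by_model(toy)
overall = mean_accuracy(per_model)
print(per_model, overall)
```

With the study's design, each model's denominator would be 11 × 10 = 110 assessments, and the overall figure would be the mean of the four per-model percentages.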
