. 2025 Mar 19;8(1):170.
doi: 10.1038/s41746-025-01535-z.

Cross sectional pilot study on clinical review generation using large language models

Zining Luo et al. NPJ Digit Med. 2025.

Abstract

As the volume of medical literature grows rapidly, efficient tools are needed to synthesize evidence for clinical practice and research, and interest in leveraging large language models (LLMs) to generate clinical reviews has surged. However, there are significant concerns about the reliability of integrating LLMs into the clinical review process. This study presents a systematic comparison between LLM-generated and human-authored clinical reviews. It finds that while AI can produce reviews quickly, they tend to contain fewer references, less comprehensive insights, and lower logical consistency, and their citations show lower authenticity and accuracy. In addition, a higher proportion of AI-cited references come from lower-tier journals. The study also uncovers a concerning inefficiency in current detection systems for identifying AI-generated content, suggesting the need for more advanced checking systems and a stronger ethical framework to ensure academic transparency. Addressing these challenges is vital for the responsible integration of LLMs into clinical research.

PubMed Disclaimer

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. The analysis results of the three indicators.
The boxplot illustrates the data distribution: the box spans the interquartile range (IQR) from the first quartile (Q1) to the third quartile (Q3), with the line inside indicating the median and a square symbol marking the mean. The whiskers extend up to 1.5 times the IQR, and any points beyond this range are marked as outliers. On objective metrics, AI demonstrates lower paragraph count, number of references, comprehensiveness, authenticity, and accuracy than humans. On subjective metrics, AI performs worse than humans across all levels. However, there is no significant difference between the two in the cumulative or average citation counts of the references, although the references exhibit different distribution patterns.
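As an aside for readers unfamiliar with the convention, the 1.5 × IQR whisker rule described in this caption can be sketched in Python. This is a generic illustration of the plotting convention, not code from the study; the helper name is hypothetical.

```python
import statistics


def boxplot_stats(data):
    """Summarize data the way the figure's boxplots do.

    The box runs from Q1 to Q3 (the IQR); whiskers extend at most
    1.5 * IQR beyond the box, and any point outside that range is
    flagged as an outlier.
    """
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
    return {
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "mean": statistics.mean(data),  # the square symbol in the plot
        "outliers": [x for x in data if x < lower or x > upper],
    }
```

For example, on the values 1–10 plus a single extreme value of 100, the 100 falls beyond the upper whisker limit and would be drawn as an outlier point.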
Fig. 2
Fig. 2. The result of plagiarism checks.
AI-generated reviews exhibit a low plagiarism detection rate.
Fig. 3
Fig. 3. The result of AIGC detection tests.
The boxplot illustrates the data distribution: the box spans the IQR from Q1 to Q3, with the line inside indicating the median and a square symbol marking the mean. The whiskers extend up to 1.5 times the IQR, and any points beyond this range are marked as outliers. To the left of each boxplot, a scatterplot displays the distribution of the individual data points. Across all submitted articles, AI-generated content exhibited both high variability in detection rates and a high overall detection rate.
Fig. 4
Fig. 4. The comparison results of various AIGC detection platforms before and after using Merlin to reduce AIGC detection rates.
The results indicate that after using this tool, the AIGC detection rates decreased across all platforms, with reductions ranging from 21% to 82%. For most articles, the AIGC detection rate dropped below 50%, the threshold above which all platforms classify content as AI-generated.
Fig. 5
Fig. 5. Supplementary research directions for the specified application-oriented research in the future.
The workflow diagram above illustrates how to conduct supplementary investigations in designated research areas, particularly those that rely on a limited number of authoritative references.
Fig. 6
Fig. 6. Flowchart of the overall study design.
After determining the themes of the articles in the journal, clinical reviews are generated and then evaluated along four dimensions: basic article quality, distribution of references, quality of references, and academic publishing risk.
