. 2024 Mar 14;24(1):75.
doi: 10.1186/s12911-024-02481-8.

Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Yong Liu et al. BMC Med Inform Decis Mak.

Abstract

Background: Telemedicine has grown rapidly in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. During the COVID-19 pandemic, which emerged in late 2019, it became especially crucial, enabling remote screening and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but demand has strained the availability of medical professionals, directly spurring research and development in automated medical consultation. In particular, there is a need for efficient and accurate medical dialogue summarization algorithms that condense lengthy conversations into shorter versions focused on the relevant medical facts. The recent success of large language models such as the generative pre-trained transformer (GPT)-3 has prompted a paradigm shift in natural language processing (NLP) research. In this paper, we explore its impact on medical dialogue summarization.

Methods: We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, such as BERT-based summarization (BERTSUM) and bidirectional and auto-regressive Transformers (BART). The second approach uses a large language model (LLM), GPT-3.5, with in-context learning (ICL). Evaluation is conducted using automated metrics such as ROUGE and BERTScore.
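The ROUGE scores reported below measure n-gram overlap between a generated summary and a reference summary. As a rough illustration of what ROUGE-1 F1 computes (a minimal sketch, not the evaluation code used in the study, which would typically rely on a standard ROUGE package), the metric can be written in a few lines of Python:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy example (invented strings): 4 of 5 candidate unigrams match the reference.
score = rouge1_f1("the patient has diarrhea", "the patient likely has diarrhea")
print(round(score, 3))  # → 0.889
```

Note that a candidate can score highly on unigram overlap while still containing a factual error such as a wrong diagnosis, which is exactly the failure mode the manual evaluation below is designed to catch.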

Results: Compared with the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibited significantly lower ROUGE and BERTScore values but also failed manual evaluation on every metric. The BART model, by contrast, achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT: its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher, respectively, than ChatGPT's best results. However, in the manual evaluation by medical experts, the summaries generated by the BART model performed satisfactorily only on the "Readability" metric, with less than 30% passing manual evaluation on the other metrics. Compared with the BERTSUM and BART models, the ChatGPT model was clearly favored by the human medical experts.

Conclusion: On one hand, the GPT-3.5 model can control the style and content of medical dialogue summaries through different prompts. The generated summaries are not only better received than those of some human experts but also more comprehensible, making GPT-3.5 a promising avenue for automated medical dialogue summarization. On the other hand, automated evaluation metrics such as ROUGE and BERTScore fall short of fully assessing the outputs of large language models like GPT-3.5, so more appropriate evaluation criteria need to be investigated.

Keywords: Automated medical consultation; ChatGPT; Internet Healthcare; Large language models; Medical dialogue summarization.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
The processing flow of generating summaries from doctor-patient dialogues using ChatGPT
Fig. 2
The Transformer model architecture. This figure is adapted from Fig. 1 of [25]
Fig. 3
The overall architecture of the BERT model. This figure is adapted from Fig. 1 of [15]
Fig. 4
The overall architecture of the BERTSUM model. This figure is adapted from Fig. 1 of [7]
Fig. 5
Example of a text-generation task performed by the BART model. This figure is adapted from Fig. 3 of [16]
Fig. 6
Training process of the reward model (RM)
Fig. 7
The proportion distribution of the 10 pediatric diseases in the IMCS-V2 medical dialogue dataset
Fig. 8
A simple prompt for medical dialogue summarization without any complex parameter variables, abbreviated as Prompt_S
Fig. 9
A technical prompt for medical dialogue summarization with some parameter variables, abbreviated as Prompt_T
Fig. 10
Detailed description of the parameters related to Prompt_T
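A prompt with parameter variables, in the spirit of Prompt_T, can be assembled from a simple template. The section names, field names, and dialogue text below are invented placeholders for illustration, not the exact prompt used in the study:

```python
from string import Template

# Hypothetical template: slots ($style, $max_words, $dialogue) stand in for
# the kind of parameter variables a technical prompt might expose.
PROMPT_T = Template(
    "Summarize the following doctor-patient dialogue into a structured summary "
    "with sections: Chief Complaint, Diagnosis, Recommendation.\n"
    "Style: $style\n"
    "Maximum length: $max_words words\n\n"
    "Dialogue:\n$dialogue"
)

prompt = PROMPT_T.substitute(
    style="concise clinical language",
    max_words=120,
    dialogue="Patient: My baby has watery stools.\nDoctor: How long has this lasted?",
)
print(prompt.splitlines()[0])
```

Keeping the parameter variables out of the fixed instruction text makes it straightforward to vary style and length constraints across experimental conditions without rewriting the prompt itself.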
Fig. 11
Higher values of the Temperature and Top_p parameters can cause parts of the summaries generated by ChatGPT to be inconsistent with the actual situation
Fig. 12
The “Temperature” parameter controls the level of randomness and creativity in the generated text, while the “Top_p” parameter influences the diversity of the generated content. A higher “Top_p” value leads to more diverse text, whereas a lower value results in more consistent text. Elevated values for both “Temperature” and “Top_p” introduce greater randomness and creativity but may reduce the relevance of the generated content to the input. Conversely, lower values for both parameters make the generated content more conservative and relevant but potentially less innovative
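The interaction between these two parameters can be illustrated with a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over a toy next-token distribution. The token names and logit values are invented for illustration; this is a simplified model of the sampling procedure, not OpenAI's implementation:

```python
import math
import random

def sample_top_p(logits: dict, temperature: float = 1.0, top_p: float = 1.0, seed=None) -> str:
    """Sample one token after temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: low values sharpen the distribution, high values flatten it.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    # Numerically stable softmax.
    max_l = max(scaled.values())
    exps = {tok: math.exp(v - max_l) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Nucleus filtering: keep the smallest set of top tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.Random(seed).choices(tokens, weights=weights, k=1)[0]

# Toy distribution over candidate diagnosis tokens (values are made up).
logits = {"diarrhea": 3.0, "cold": 1.5, "fever": 1.0, "rash": 0.2}
# Low temperature and low top_p: the nucleus collapses to the single most likely token.
print(sample_top_p(logits, temperature=0.3, top_p=0.5, seed=0))  # → diarrhea
```

With low values, the nucleus shrinks to the highest-probability tokens and output becomes nearly deterministic; with high values, low-probability tokens stay in the nucleus, which increases diversity but also the risk of content inconsistent with the input dialogue, as Fig. 11 illustrates.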
Fig. 13
Human evaluation of 100 summaries generated by the ChatGPT, BART, and BERTSUM models, with average scores on four evaluation metrics: Contains Key Result, Coherence, Usefulness, and Readability. Sub-figures (a) and (b) show that summaries generated by ChatGPT achieved favorable results on the human evaluation metrics, especially under the Prompt_T condition, with a substantial proportion of “Strongly Agree” on all metrics. However, sub-figure (c) indicates that the BART model performed poorly on the human evaluation metrics, except for the “Readability” metric, and sub-figure (d) shows that the BERTSUM model performed very poorly across all metrics, almost entirely in the “Strongly Disagree” state
Fig. 14
From the perspective of the ROUGE-1 score, the BART summary here shows high similarity to the manual summary. However, the BART summary has significant problems. First, in the “Diagnosis” part, it incorrectly states the diagnosis as “Upper respiratory infection”, while the correct diagnosis in the manual summary is “Diarrhea”. Second, the entire summary is too brief, omitting some potentially important information; in the “Recommendation” part, for instance, the BART summary mentions only “Oral montmorillonite powder”. Although ChatGPT’s ROUGE-1 score is lower than BART’s, its summary is highly detailed and semantically consistent with the original conversation, e.g. “routine stool examination and other relevant examinations” and “avoid eating greasy, spicy and irritating food, and feed more liquid food”
Fig. 15
From the perspective of ROUGE-1 score, the summary generated by BART is highly similar to the manual summary. However, in terms of practical effectiveness, especially in the “Recommendation” part where the content is “Continue to take oral medications for cold medicines”, such content provides a rather vague recommendation and lacks useful information. On the other hand, the advice given by ChatGPT is more detailed and valuable. For instance, in addition to recommending the medication “spleen ammonia peptide freeze-dried powder”, it also suggests “atomization” and “make some pear tea for the baby to drink”. Such specific and practical information can offer more assistance and guidance to the readers
Fig. 16
The manual summary mistakenly leads readers to believe that the “aluminum magnesium carbonate tablets” are intended for the child, when in reality they are meant for the child’s parents. ChatGPT, on the other hand, is able to distinguish between the different patients in the context of the conversation: the drug’s user is clearly identified by the terms “children” and “patient”, where “children” refers to the sick child and “patient” refers to the sick child’s parents, for example, “patients can use aluminum magnesium carbonate tablets to neutralize stomach acid”
Fig. 17
In the original conversation, it is evident that the child is suffering from diarrhea with watery stools. In such cases, doctors would generally recommend oral rehydration with a saline solution to prevent dehydration. However, this advice does not appear in the manual summary, primarily because the original text did not mention oral rehydration. ChatGPT, on the other hand, can directly provide a reasonable recommendation, such as “to maintain the baby’s water intake, you can give an appropriate amount of oral rehydration salt solution”
Fig. 18
The main issues with the summaries generated by ChatGPT are: (1) the “Chief Complaint” part is overly lengthy; (2) the “Auxiliary examination” part suggests examinations that did not actually occur. Despite these issues, the generated summaries remain understandable to both medical professionals and patients


References

    1. Jo HS, Park K, Jung SM. A scoping review of consumer needs for cancer information. Patient Educ Couns. 2019;102(7):1237–1250. doi: 10.1016/j.pec.2019.02.004.
    2. Finney Rutten LJ, Blake KD, Greenberg-Worisek AJ, Allen SV, Moser RP, Hesse BW. Online health information seeking among US adults: measuring progress toward a healthy people 2020 objective. Public Health Rep. 2019;134(6):617–625. doi: 10.1177/0033354919874074.
    3. Jain R, Jangra A, Saha S, Jatowt A. A survey on medical document summarization. 2022. arXiv preprint arXiv:2212.01669.
    4. Navarro DF, Dras M, Berkovsky S. Few-shot fine-tuning SOTA summarization models for medical dialogues. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 2022. p. 254–266. https://aclanthology.org/2022.naacl-srw.32/.
    5. Hollander JE, Carr BG. Virtually perfect? Telemedicine for COVID-19. N Engl J Med. 2020;382(18):1679–1681. doi: 10.1056/NEJMp2003539.