. 2024 Mar 14;24(1):75.
doi: 10.1186/s12911-024-02481-8.

Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Yong Liu et al. BMC Med Inform Decis Mak.

Abstract

Background: Telemedicine has grown rapidly in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. During the COVID-19 pandemic, which emerged in late 2019, it became especially crucial, enabling remote screening and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but demand has strained the availability of medical professionals, directly spurring research and development in automated medical consultation. In particular, there is a need for efficient and accurate medical dialogue summarization algorithms that condense lengthy conversations into shorter versions focused on the relevant medical facts. The recent success of large language models such as the generative pre-trained transformer (GPT)-3 has prompted a paradigm shift in natural language processing (NLP) research. In this paper, we explore its impact on medical dialogue summarization.

Methods: We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, such as BERT-based summarization (BERTSUM) and bidirectional and auto-regressive Transformers (BART). The second approach uses a large language model (LLM), GPT-3.5, with in-context learning (ICL). Evaluation is conducted using automated metrics such as ROUGE and BERTScore.
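The ROUGE scores reported below measure n-gram overlap between a generated summary and a reference summary. As a rough illustration of what ROUGE-1 F1 computes (a minimal sketch, not the evaluation code used in the study, which would typically rely on a standard ROUGE package), the metric can be written in a few lines of Python:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy example (invented strings): 4 of 5 candidate unigrams match the reference.
score = rouge1_f1("the patient has diarrhea", "the patient likely has diarrhea")
print(round(score, 3))  # → 0.889
```

Note that a candidate can score highly on unigram overlap while still containing a factual error such as a wrong diagnosis, which is exactly the failure mode the manual evaluation below is designed to catch.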

Results: Compared with the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibited significantly lower ROUGE and BERTScore values but also failed manual evaluation on every metric. The BART model, by contrast, achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT: its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher, respectively, than ChatGPT's best results. However, in the manual evaluation by medical experts, the summaries generated by the BART model performed satisfactorily only on the "Readability" metric, with less than 30% passing manual evaluation on the other metrics. Compared with the BERTSUM and BART models, the ChatGPT model was clearly favored by the human medical experts.

Conclusion: On one hand, the GPT-3.5 model can control the style and content of medical dialogue summaries through different prompts. The generated summaries are not only better received than those of some human experts but also more comprehensible, making GPT-3.5 a promising avenue for automated medical dialogue summarization. On the other hand, automated evaluation metrics such as ROUGE and BERTScore fall short of fully assessing the outputs of large language models like GPT-3.5, so more appropriate evaluation criteria need to be investigated.

Keywords: Automated medical consultation; ChatGPT; Internet Healthcare; Large language models; Medical dialogue summarization.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
The processing flow of generating summaries from doctor-patient dialogues using ChatGPT
Fig. 2
The Transformer model architecture. This figure is adapted from Fig. 1 of [25]
Fig. 3
The overall architecture of the BERT model. This figure is adapted from Fig. 1 of [15]
Fig. 4
The overall architecture of the BERTSUM model. This figure is adapted from Fig. 1 of [7]
Fig. 5
Example of a text-generation task performed by the BART model. This figure is adapted from Fig. 3 of [16]
Fig. 6
Training process of the reward model (RM)
Fig. 7
The proportion distribution of the 10 pediatric diseases in the IMCS-V2 medical dialogue dataset
Fig. 8
A simple prompt for medical dialogue summarization without any complex parameter variables, abbreviated as Prompt_S
Fig. 9
A technical prompt for medical dialogue summarization with some parameter variables, abbreviated as Prompt_T
Fig. 10
Detailed description of the parameters related to Prompt_T
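A prompt with parameter variables, in the spirit of Prompt_T, can be assembled from a simple template. The section names, field names, and dialogue text below are invented placeholders for illustration, not the exact prompt used in the study:

```python
from string import Template

# Hypothetical template: slots ($style, $max_words, $dialogue) stand in for
# the kind of parameter variables a technical prompt might expose.
PROMPT_T = Template(
    "Summarize the following doctor-patient dialogue into a structured summary "
    "with sections: Chief Complaint, Diagnosis, Recommendation.\n"
    "Style: $style\n"
    "Maximum length: $max_words words\n\n"
    "Dialogue:\n$dialogue"
)

prompt = PROMPT_T.substitute(
    style="concise clinical language",
    max_words=120,
    dialogue="Patient: My baby has watery stools.\nDoctor: How long has this lasted?",
)
print(prompt.splitlines()[0])
```

Keeping the parameter variables out of the fixed instruction text makes it straightforward to vary style and length constraints across experimental conditions without rewriting the prompt itself.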
Fig. 11
Higher values of the Temperature and Top_p parameters can cause parts of the summaries generated by ChatGPT to be inconsistent with the actual situation
Fig. 12
The “Temperature” parameter controls the level of randomness and creativity in the generated text, while the “Top_p” parameter influences the diversity of the generated content. A higher “Top_p” value leads to more diverse text, whereas a lower value results in more consistent text. Elevated values for both “Temperature” and “Top_p” introduce greater randomness and creativity but may reduce the relevance of the generated content to the input. Conversely, lower values for both parameters make the generated content more conservative and relevant but potentially less innovative
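The interaction between these two parameters can be illustrated with a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over a toy next-token distribution. The token names and logit values are invented for illustration; this is a simplified model of the sampling procedure, not OpenAI's implementation:

```python
import math
import random

def sample_top_p(logits: dict, temperature: float = 1.0, top_p: float = 1.0, seed=None) -> str:
    """Sample one token after temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: low values sharpen the distribution, high values flatten it.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    # Numerically stable softmax.
    max_l = max(scaled.values())
    exps = {tok: math.exp(v - max_l) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Nucleus filtering: keep the smallest set of top tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.Random(seed).choices(tokens, weights=weights, k=1)[0]

# Toy distribution over candidate diagnosis tokens (values are made up).
logits = {"diarrhea": 3.0, "cold": 1.5, "fever": 1.0, "rash": 0.2}
# Low temperature and low top_p: the nucleus collapses to the single most likely token.
print(sample_top_p(logits, temperature=0.3, top_p=0.5, seed=0))  # → diarrhea
```

With low values, the nucleus shrinks to the highest-probability tokens and output becomes nearly deterministic; with high values, low-probability tokens stay in the nucleus, which increases diversity but also the risk of content inconsistent with the input dialogue, as Fig. 11 illustrates.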
Fig. 13
Human evaluation of 100 summaries generated by the ChatGPT, BART, and BERTSUM models, with average scores on four evaluation metrics: Contains Key Result, Coherence, Usefulness, and Readability. Sub-figures (a) and (b) show that summaries generated by ChatGPT achieved favorable results on the human evaluation metrics, especially under the Prompt_T condition, with a substantial proportion of “Strongly Agree” on all metrics. However, sub-figure (c) indicates that the BART model performed poorly on the human evaluation metrics, except for the “Readability” metric, and sub-figure (d) shows that the BERTSUM model performed very poorly across all metrics, almost entirely in the “Strongly Disagree” state
Fig. 14
From the perspective of the ROUGE-1 score, the BART summary here shows high similarity to the manual summary. However, the BART summary has significant problems. First, in the “Diagnosis” part, it incorrectly states the diagnosis as “Upper respiratory infection”, while the correct diagnosis in the manual summary is “Diarrhea”. Second, the entire summary is too brief, omitting some potentially important information; in the “Recommendation” part, for instance, the BART summary mentions only “Oral montmorillonite powder”. Although ChatGPT’s ROUGE-1 score is lower than BART’s, its summary is highly detailed and semantically consistent with the original conversation, e.g. “routine stool examination and other relevant examinations” and “avoid eating greasy, spicy and irritating food, and feed more liquid food”
Fig. 15
From the perspective of ROUGE-1 score, the summary generated by BART is highly similar to the manual summary. However, in terms of practical effectiveness, especially in the “Recommendation” part where the content is “Continue to take oral medications for cold medicines”, such content provides a rather vague recommendation and lacks useful information. On the other hand, the advice given by ChatGPT is more detailed and valuable. For instance, in addition to recommending the medication “spleen ammonia peptide freeze-dried powder”, it also suggests “atomization” and “make some pear tea for the baby to drink”. Such specific and practical information can offer more assistance and guidance to the readers
Fig. 16
The manual summary mistakenly leads readers to believe that the “aluminum magnesium carbonate tablets” are intended for the child, when in reality they are meant for the child’s parents. ChatGPT, on the other hand, is able to distinguish between the different patients in the context of the conversation: the drug’s user is clearly identified by the terms “children” and “patient”, where “children” refers to the sick child and “patient” refers to the sick child’s parents, for example, “patients can use aluminum magnesium carbonate tablets to neutralize stomach acid”
Fig. 17
In the original conversation, it is evident that the child is suffering from diarrhea with watery stools. In such cases, doctors would generally recommend oral rehydration with a saline solution to prevent dehydration. However, this advice does not appear in the manual summary, primarily because the original text did not mention oral rehydration. ChatGPT, on the other hand, can directly provide a reasonable recommendation, such as “to maintain the baby’s water intake, you can give an appropriate amount of oral rehydration salt solution”
Fig. 18
The main issues with the summaries generated by ChatGPT are: (1) the “Chief Complaint” part is overly lengthy; (2) the “Auxiliary examination” part suggests examinations that did not actually occur. Despite these issues, the generated summaries remain understandable to both medical professionals and patients


References

    1. Jo HS, Park K, Jung SM. A scoping review of consumer needs for cancer information. Patient Educ Couns. 2019;102(7):1237–1250. doi: 10.1016/j.pec.2019.02.004.
    2. Finney Rutten LJ, Blake KD, Greenberg-Worisek AJ, Allen SV, Moser RP, Hesse BW. Online health information seeking among US adults: measuring progress toward a healthy people 2020 objective. Public Health Rep. 2019;134(6):617–625. doi: 10.1177/0033354919874074.
    3. Jain R, Jangra A, Saha S, Jatowt A. A survey on medical document summarization. 2022. arXiv preprint arXiv:2212.01669.
    4. Navarro DF, Dras M, Berkovsky S. Few-shot fine-tuning SOTA summarization models for medical dialogues. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 2022. p. 254–266. https://aclanthology.org/2022.naacl-srw.32/.
    5. Hollander JE, Carr BG. Virtually perfect? Telemedicine for COVID-19. N Engl J Med. 2020;382(18):1679–1681. doi: 10.1056/NEJMp2003539.