Healthc Inform Res. 2025 Apr;31(2):166-174. doi: 10.4258/hir.2025.31.2.166. Epub 2025 Apr 30.

Advancing Korean Medical Large Language Models: Automated Pipeline for Korean Medical Preference Dataset Construction

Jean Seo et al. Healthc Inform Res. 2025 Apr.

Abstract

Objectives: Developing large language models (LLMs) for biomedicine requires access to high-quality training and alignment tuning datasets. However, publicly available Korean medical preference datasets are scarce, hindering the advancement of Korean medical LLMs. This study constructs the Korean Medical Preference Dataset (KoMeP), an alignment tuning dataset built with an automated pipeline that minimizes the high cost of human annotation, and evaluates its efficacy.

Methods: KoMeP was generated using the DAHL score, an automated hallucination evaluation metric. Five LLMs (Dolly-v2-3B, MPT-7B, GPT-4o, Qwen-2-7B, Llama-3-8B) produced responses to 8,573 biomedical examination questions, from which 5,551 preference pairs were extracted. Each pair consisted of a "chosen" response and a "rejected" response, as determined by their DAHL scores. The dataset was evaluated by training five different models with two alignment tuning methods, direct preference optimization (DPO) and odds ratio preference optimization (ORPO). The KorMedMCQA benchmark was employed to assess the effectiveness of alignment tuning.
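The pair-extraction step described above can be summarized in a few lines of Python. This is a minimal sketch, not the authors' released pipeline: the generate_fn and score_fn callables are placeholders for model inference and the DAHL scorer, and the tie-filtering rule is an assumption about how entries were dropped during filtering.

```python
from typing import Callable, Optional

MODELS = ["Dolly-v2-3B", "MPT-7B", "GPT-4o", "Qwen-2-7B", "Llama-3-8B"]

def build_preference_pair(
    question: str,
    generate_fn: Callable[[str, str], str],  # (model name, question) -> response; placeholder
    score_fn: Callable[[str], float],        # response -> DAHL score; placeholder
) -> Optional[dict]:
    """Score one response per model; keep the highest-scoring response as
    'chosen' and the lowest-scoring one as 'rejected'."""
    scored = []
    for model in MODELS:
        response = generate_fn(model, question)
        scored.append((score_fn(response), response))
    scored.sort(key=lambda item: item[0], reverse=True)
    (best_score, chosen), (worst_score, rejected) = scored[0], scored[-1]
    if best_score == worst_score:  # assumed filter: no usable preference signal
        return None
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

Applied to all 8,573 questions and followed by filtering, a procedure of this kind yields the 5,551 preference pairs that make up KoMeP.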

Results: Models trained with DPO consistently improved KorMedMCQA performance; notably, Llama-3.1-8B showed a 43.96% increase. In contrast, ORPO training produced inconsistent results. Additionally, English-to-Korean transfer learning proved effective, particularly for English-centric models like Gemma-2, whereas Korean-to-English transfer learning achieved limited success. Instruction tuning with KoMeP yielded mixed outcomes, which suggests challenges in dataset formatting.
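For readers unfamiliar with the two alignment objectives compared here, the following PyTorch sketch shows the DPO and ORPO losses as defined in their original papers (Rafailov et al., 2023 for DPO; Hong et al., 2024 for ORPO). It is illustrative only, not the training code used in this study, and the beta and lam values are placeholder hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: widen the chosen-over-rejected log-likelihood margin of the policy
    relative to a frozen reference model (inputs are per-sequence log-probs)."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    """ORPO: standard NLL on the chosen response plus an odds-ratio penalty;
    no reference model is needed. Log-probs here are length-normalized."""
    log_odds_ratio = (chosen_logps - rejected_logps) - (
        torch.log1p(-torch.exp(chosen_logps))
        - torch.log1p(-torch.exp(rejected_logps))
    )
    return (chosen_nll - lam * F.logsigmoid(log_odds_ratio)).mean()
```

The key structural difference is that DPO requires a frozen reference model, whereas ORPO folds its preference term directly into the supervised fine-tuning loss.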

Conclusions: KoMeP is the first publicly available Korean medical preference dataset and significantly improves alignment tuning performance in LLMs. The DPO method outperforms ORPO in alignment tuning. Future work should focus on expanding KoMeP, developing a Korean-native dataset, and refining alignment tuning methods to produce safer and more reliable Korean medical LLMs.

Keywords: Informatics; Large Language Models; Medical Informatics; Natural Language Models; Natural Language Processing.


Conflict of interest statement

Jinwook Choi is an editor of Healthcare Informatics Research; however, he was not involved in the peer reviewer selection, evaluation, or decision process for this article. Otherwise, no potential conflict of interest relevant to this article was reported.

Figures

Figure 1
Data construction pipeline of the Korean Medical Preference (KoMeP). Responses for each question in the DAHL dataset were generated using five different large language models. Each response was evaluated using the DAHL score as the preference label. The response with the highest score was labeled “chosen,” while the one with the lowest score was labeled “rejected.” This process was repeated for all 8,573 questions in the DAHL dataset. After a filtering process, 5,551 entries remained. DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text.
Figure 2
Categorical distribution of the Korean Medical Preference (KoMeP).
Figure 3
Example and translation of “prompt,” “chosen,” and “rejected.”
Figure 4
Transforming KoMeP into an instruction-tuning dataset. The “prompt,” “chosen,” and “rejected” columns were repurposed. The “prompt” contained the user’s question, while the “chosen” column provided the expected answer. The data were then formatted according to the specific template of each tested model. The example format shown here follows the Llama-3 Instruct template. KoMeP: Korean Medical Preference.
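As a concrete illustration of this reformatting step, the sketch below maps one KoMeP entry onto the Llama-3 Instruct chat template. The template string reflects Meta's published format and the function name is hypothetical, so this approximates rather than reproduces the authors' exact formatting code.

```python
# Assumed Llama-3 Instruct chat template (user turn followed by assistant turn).
LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{prompt}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "{answer}<|eot_id|>"
)

def to_instruction_example(entry: dict) -> str:
    """Repurpose a KoMeP preference entry for instruction tuning: 'prompt'
    becomes the user turn, 'chosen' becomes the expected answer, and
    'rejected' is discarded."""
    return LLAMA3_TEMPLATE.format(prompt=entry["prompt"], answer=entry["chosen"])
```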
Figure 5
Performance on KorMedMCQA of models before and after alignment tuning with KoMeP. Scores highlighted in blue indicate improved performance, red indicates decreased performance, and green indicates unchanged performance. Most models demonstrated enhanced performance on KorMedMCQA when trained with DPO. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization, ORPO: odds ratio preference optimization.
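The paper does not detail the KorMedMCQA scoring harness; a common way to evaluate causal LMs on multiple-choice benchmarks, sketched below under that assumption with a Hugging Face model and tokenizer, is to pick the answer option with the highest length-normalized log-likelihood given the question.

```python
import torch

@torch.no_grad()
def option_logprob(model, tokenizer, question: str, option: str) -> float:
    """Mean log-probability of the option tokens conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(option, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    logits = model(input_ids).logits
    # logits at position i predict the token at position i + 1
    option_logits = logits[0, prompt_ids.shape[-1] - 1 : -1]
    log_probs = option_logits.log_softmax(dim=-1)
    token_lps = log_probs.gather(1, option_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_lps.mean().item()

def predict(model, tokenizer, question: str, options: list[str]) -> int:
    """Return the index of the highest-scoring option; benchmark accuracy is
    the fraction of questions where this matches the gold answer."""
    scores = [option_logprob(model, tokenizer, question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```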
Figure 6
Transfer learning from Korean to English. DPO training with KoMeP occasionally improves performance on MedMCQA, an English medical benchmark dataset, but generally results in performance drops on PubMedQA, another English medical benchmark. This suggests that transfer learning from Korean to English during alignment tuning is not consistently effective. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization.
Figure 7
Transfer learning from English to Korean. DPO training with our English medical preference data improves performance on KorMedMCQA, a Korean medical benchmark dataset, for all models except Llama-3.1-8b. This highlights the effectiveness of transfer learning from English to Korean, in contrast to the limited success observed with transfer learning from Korean to English. Notably, for Gemma-2, DPO training with the English dataset resulted in even greater improvements on KorMedMCQA than training with the Korean dataset (KoMeP). KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization, ORPO: odds ratio preference optimization.
Figure 8
Model performance on KorMedMCQA when instruction-tuned with KoMeP, adapted into the instruction format. Gemma-2-2b shows only a slight drop in performance, while Qwen-2-1.5b and 7b show slight performance improvements. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference.

References

    1. Wang C, Li M, He J, Wang Z, Darzi E, Chen Z, et al. A survey for large language models in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited 2025 Apr 13].
    2. Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc. 2024;31(9):1801-11. doi: 10.1093/jamia/ocae202.
    3. Bi Z, Dip SA, Hajialigol D, Kommu S, Liu H, Lu M, et al. AI for biomedicine in the era of large language models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited 2025 Apr 13].
    4. Seo J, Lim J, Jang D, Shin H. DAHL: domain-specific automated hallucination evaluation of long-form text through a benchmark dataset in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited 2025 Apr 13].
    5. Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: your language model is secretly a reward model. Adv Neural Inf Process Syst. 2023;36:53728-41.
