J Am Med Inform Assoc. 2024 Sep 1;31(9):1821-1832. doi: 10.1093/jamia/ocae122.

BioInstruct: instruction tuning of large language models for biomedical natural language processing


Hieu Tran et al. J Am Med Inform Assoc. 2024.

Abstract

Objectives: To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles.

Materials and methods: We created BioInstruct, a dataset comprising 25 005 instructions, to instruction-tune LLMs (LLaMA 1 and 2, 7B and 13B versions). The instructions were created by prompting the GPT-4 language model with 3 seed samples randomly drawn from a pool of 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into 3 major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether the categories (eg, QA, IE, and generation) of instructions affected model performance.
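As a rough illustration of the parameter-efficient fine-tuning step described above, the Python sketch below attaches a LoRA adapter to a LLaMA 2 7B checkpoint using the Hugging Face transformers and peft libraries. It is a minimal sketch, not the authors' code: the model identifier, adapter rank, scaling factor, dropout, target modules, and prompt format are illustrative assumptions rather than values reported in the paper.

    # Minimal LoRA setup sketch (assumed hyperparameters; not the authors' code).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base_model_id = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    model = AutoModelForCausalLM.from_pretrained(base_model_id)

    # LoRA trains small low-rank update matrices instead of the full weights.
    lora_config = LoraConfig(
        r=16,                                 # adapter rank (assumed)
        lora_alpha=32,                        # scaling factor (assumed)
        lora_dropout=0.05,                    # dropout on adapter inputs (assumed)
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports the small trainable fraction

    # Each BioInstruct example would then be rendered as a single training string,
    # eg an instruction/input/response template, and passed to a standard
    # causal-language-model trainer such as transformers.Trainer.

In a setup like this only the adapter weights are updated, which is what makes fine-tuning 7B- and 13B-parameter models tractable with modest compute.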

Results and discussion: Compared with LLMs that were not instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA on the average accuracy metric, 5.7% in IE on the average F1 metric, and 96% in generation tasks on the average GPT-4 score metric. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with, or even surpassed, other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. These findings align with observations from multi-task learning, suggesting synergies between the 2 tasks.

Conclusion: The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.

Keywords: information extraction; instruction tuning; large language models; multi-task learning; natural language inference; question answering; text generation.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1.
Distribution of our BioInstruct dataset. (A) Task type distribution of 25 005 natural language instructions. (B) The top 20 most common root verbs (inner circle) and their top 4 direct noun objects (outer circle) in the generated instructions.
Figure 2.
Performance of different tasks in BioInstruct. Each scatter plot corresponds to an evaluation subtask. Each colored dot within a plot represents a different training task. The black dot represents the baseline performance of LLaMA 2 7B without BioInstruct fine-tuning. The purple dot represents the performance of LLaMA 2 7B fine-tuned on all BioInstruct tasks. We then ablate BioInstruct. Above each plot, the first row gives the best single fine-tuning task; the second row gives the best fine-tuning task when combined with task A, where task A is the same as the evaluation task.
Figure 3.
Performance on different evaluation tasks when LLaMA 2 7B is fine-tuned on a varying number of instruction samples from BioInstruct.
