Analysis of large-language model versus human performance for genetics questions
- PMID: 37246194
- PMCID: PMC10999420
- DOI: 10.1038/s41431-023-01396-8
Abstract
Large-language models like ChatGPT have recently received a great deal of attention. One area of interest pertains to how these models could be used in biomedical contexts, including in relation to human genetics. To assess one facet of this, we compared the performance of ChatGPT versus human respondents (13,642 human responses) in answering 85 multiple-choice questions about aspects of human genetics. Overall, ChatGPT did not perform significantly differently (p = 0.8327) from human respondents; ChatGPT was 68.2% accurate, compared to 66.6% accuracy for human respondents. Both ChatGPT and humans performed better on memorization-type questions than on critical-thinking questions (p < 0.0001). When asked the same question multiple times, ChatGPT frequently provided different answers (16% of initial responses), including for both initially correct and incorrect answers, and gave plausible explanations for both correct and incorrect answers. ChatGPT's performance was impressive, but currently demonstrates significant shortcomings for clinical or other high-stakes use. Addressing these limitations will be important to guide adoption in real-life situations.
© 2023. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.
Conflict of interest statement
The authors receive salary and research support from the intramural program of the National Human Genome Research Institute. BDS is the co-Editor-in-Chief of the American Journal of Medical Genetics, and has published some of the questions mentioned in this study in a book, as well as other questions [12]. Both editing/publishing activities are conducted as an approved outside activity, separate from his US Government role.
Update of
- Analysis of large-language model versus human performance for genetics questions. medRxiv [Preprint]. 2023 Jan 28:2023.01.27.23285115. doi: 10.1101/2023.01.27.23285115. PMID: 36789422. Free PMC article. Updated. Preprint.
Comment in
- Can ChatGPT understand genetics? Eur J Hum Genet. 2024 Apr;32(4):371-372. doi: 10.1038/s41431-023-01419-4. Epub 2023 Jul 5. PMID: 37407734. Free PMC article. No abstract available.
References
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138. 2022.
- Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR; FRCR-AI Study Collaborators. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826. doi: 10.1136/bmj-2022-072826.