Analysis of large-language model versus human performance for genetics questions

Dat Duong¹, Benjamin D Solomon²

Affiliations

¹ Medical Genomics Unit, Medical Genetics Branch, National Human Genome Research Institute, Bethesda, MD, USA.
² Medical Genomics Unit, Medical Genetics Branch, National Human Genome Research Institute, Bethesda, MD, USA. solomonb@mail.nih.gov.

PMID: 37246194
PMCID: PMC10999420
DOI: 10.1038/s41431-023-01396-8

Eur J Hum Genet. 2024 Apr;32(4):466-468. doi: 10.1038/s41431-023-01396-8. Epub 2023 May 29.

Abstract

Large-language models like ChatGPT have recently received a great deal of attention. One area of interest is how these models could be used in biomedical contexts, including those related to human genetics. To assess one facet of this, we compared the performance of ChatGPT with that of human respondents (13,642 human responses) in answering 85 multiple-choice questions about aspects of human genetics. Overall, ChatGPT did not perform significantly differently from human respondents (p = 0.8327); ChatGPT was 68.2% accurate, compared to 66.6% accuracy for human respondents. Both ChatGPT and humans performed better on memorization-type questions than on critical-thinking questions (p < 0.0001). When asked the same question multiple times, ChatGPT frequently provided different answers (16% of initial responses), including for both initially correct and initially incorrect answers, and gave plausible explanations for both correct and incorrect answers. ChatGPT's performance was impressive, but it currently demonstrates significant shortcomings for clinical or other high-stakes use. Addressing these limitations will be important to guide adoption in real-life situations.
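The percentages in the abstract can be translated back into approximate question counts. The following is a minimal sketch, not the authors' analysis: it assumes that both the 68.2% accuracy and the 16% changed-answer rate are taken over the 85 questions, and the helper `implied_count` is a hypothetical name introduced here for illustration. The resulting counts are estimates implied by rounded figures, not values reported in the source.

```python
# Figures reported in the abstract.
N_QUESTIONS = 85
CHATGPT_ACCURACY = 0.682     # 68.2% accurate
HUMAN_ACCURACY = 0.666       # 66.6% accurate
CHANGED_ANSWER_RATE = 0.16   # 16% of initial responses changed on re-asking


def implied_count(rate: float, total: int) -> int:
    """Nearest whole count consistent with a rounded rate (an estimate only)."""
    return round(rate * total)


# Assumption: both rates use the 85 questions as the denominator.
chatgpt_correct = implied_count(CHATGPT_ACCURACY, N_QUESTIONS)
changed_answers = implied_count(CHANGED_ANSWER_RATE, N_QUESTIONS)

# Accuracy gap between ChatGPT and human respondents, in percentage points.
gap_points = round((CHATGPT_ACCURACY - HUMAN_ACCURACY) * 100, 1)

print(f"ChatGPT correct (implied): {chatgpt_correct} of {N_QUESTIONS}")
print(f"Changed answers (implied): {changed_answers} of {N_QUESTIONS}")
print(f"Accuracy gap: {gap_points} percentage points")
```

The sketch deliberately does not attempt to reproduce the reported p-values, because the abstract does not state which statistical test was used.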

Conflict of interest statement

The authors receive salary and research support from the intramural program of the National Human Genome Research Institute. BDS is the co-Editor-in-Chief of the American Journal of Medical Genetics, and has published some of the questions mentioned in this study in a book, as well as other questions [12]. Both editing/publishing activities are conducted as an approved outside activity, separate from his US Government role.

Figures

**Fig. 1. Summary of ChatGPT’s responses.**
The Sankey plot (constructed via Flourish, https://app.flourish.studio/projects) shows ChatGPT’s initial and second responses to the 85 questions used in the study.

Update of

Analysis of large-language model versus human performance for genetics questions.
Duong D, Solomon BD. medRxiv [Preprint]. 2023 Jan 28:2023.01.27.23285115. doi: 10.1101/2023.01.27.23285115. Update in: Eur J Hum Genet. 2024 Apr;32(4):466-468. doi: 10.1038/s41431-023-01396-8. PMID: 36789422.

Comment in

Can ChatGPT understand genetics?
Emmert-Streib F. Eur J Hum Genet. 2024 Apr;32(4):371-372. doi: 10.1038/s41431-023-01419-4. Epub 2023 Jul 5. PMID: 37407734.

References

1. Ledgister Hanchard SE, Dwyer MC, Liu S, Hu P, Tekendo-Ngongang C, Waikel RL, et al. Scoping review and classification of deep learning in medical genetics. Genet Med. 2022;24:1593–603. doi: 10.1016/j.gim.2022.04.025.
2. Schaefer J, Lehne M, Schepers J, Prasser F, Thun S. The use of machine learning in rare diseases: a scoping review. Orphanet J Rare Dis. 2020;15:145. doi: 10.1186/s13023-020-01424-6.
3. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70. doi: 10.1186/s13073-019-0689-8.
4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138. 2022.
5. Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR, Collaborators F-AS. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826. doi: 10.1136/bmj-2022-072826.

Grants and funding

N/A/U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
