Am J Med. 2023 Nov;136(11):1119-1123.e18. doi: 10.1016/j.amjmed.2023.08.003. Epub 2023 Aug 27.

Comparative Evaluation of Diagnostic Accuracy Between Google Bard and Physicians


Takanobu Hirosawa et al. Am J Med. 2023 Nov.

Abstract

Background: In this study, we evaluated the diagnostic accuracy of Google Bard, a generative artificial intelligence (AI) platform.

Methods: We collected difficult or uncommon case descriptions from case reports published by our department, and common case descriptions from mock cases created by physicians. We entered each case description into the Google Bard prompt to generate a top-10 differential-diagnosis list. As in previous studies, other physicians created differential-diagnosis lists by reading the same clinical descriptions.
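
As one way to picture this workflow, the minimal Python sketch below frames a case description as a top-10 prompt and parses the reply into a ranked list. The prompt wording and the query_generative_ai() stub are assumptions rather than the study protocol; in the study, case descriptions were entered into Google Bard's prompt directly.

    # A minimal, hypothetical sketch of the evaluation loop described above.
    # The prompt wording and query_generative_ai() are assumptions: in the
    # study, case descriptions were entered into Google Bard's prompt manually.

    PROMPT_TEMPLATE = (
        "List the top 10 differential diagnoses for the following case, "
        "most likely first, one per line:\n\n{case}"
    )

    def query_generative_ai(prompt: str) -> str:
        """Placeholder for the model query (here, the Google Bard web prompt)."""
        raise NotImplementedError("Replace with an actual model query.")

    def top10_differentials(case_description: str) -> list[str]:
        """Return a ranked list of up to 10 differential diagnoses."""
        reply = query_generative_ai(PROMPT_TEMPLATE.format(case=case_description))
        # Keep non-empty lines and strip any leading numbering such as "1. ".
        lines = [line.strip() for line in reply.splitlines() if line.strip()]
        return [line.lstrip("0123456789.) ") for line in lines][:10]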

Results: A total of 82 clinical descriptions (52 case reports and 30 mock cases) were used. The accuracy rates of Google Bard remained lower than those of physicians within the top 10 (56.1% vs 82.9%, P < .001), the top 5 (53.7% vs 78.0%, P = .002), and the top differential diagnosis (40.2% vs 64.6%, P = .003). Even within the specific context of case reports, physicians consistently outperformed Google Bard. For mock cases, the differential-diagnosis lists generated by Google Bard performed no differently from those of the physicians within the top 10 (80.0% vs 96.6%, P = .11) and the top 5 (76.7% vs 96.6%, P = .06), except for the top diagnosis (60.0% vs 90.0%, P = .02).
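
As an illustration of how such paired accuracy comparisons can be computed, the sketch below scores each case for a top-k hit and applies McNemar's exact test to the per-case hit/miss pairs. The abstract does not name the statistical test used, so the test choice and the substring-based matching are assumptions.

    # Illustrative scoring and paired comparison. The abstract does not state
    # which statistical test was used; McNemar's exact test for paired binary
    # outcomes is an assumption, as is matching by simple substring comparison.
    from statsmodels.stats.contingency_tables import mcnemar

    def top_k_hit(ranked: list[str], final_diagnosis: str, k: int) -> bool:
        """True if the final diagnosis appears within the top k candidates."""
        return any(final_diagnosis.lower() in d.lower() for d in ranked[:k])

    def paired_p_value(ai_hits: list[bool], md_hits: list[bool]) -> float:
        """McNemar's exact test on per-case hit/miss pairs (AI vs physicians)."""
        # 2x2 table: rows index AI hit/miss, columns index physician hit/miss.
        table = [[0, 0], [0, 0]]
        for ai, md in zip(ai_hits, md_hits):
            table[int(not ai)][int(not md)] += 1
        return mcnemar(table, exact=True).pvalue

Because both the AI and the physicians read the same 82 cases, a paired test of this kind is a natural fit; an unpaired test of two proportions would ignore that each case contributes one outcome to each group.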

Conclusion: While physicians performed better overall, particularly on case reports, Google Bard displayed comparable diagnostic performance on common cases. This suggests that Google Bard has room for further improvement and refinement of its diagnostic capabilities. Generative AIs, including Google Bard, are anticipated to become increasingly beneficial in augmenting diagnostic accuracy.

Keywords: Clinical decision support system; Diagnosis; Diagnostic excellence; Generative artificial intelligence; Large language model; Natural language processing.
