Comparative Study
JMIR Form Res. 2025 Oct 1:9:e70107.
doi: 10.2196/70107.

Application of Large Language Models in Data Analysis and Medical Education for Assisted Reproductive Technology: Comparative Study

Noriyuki Okuyama et al. JMIR Form Res.

Abstract

Background: Recent studies have demonstrated that large language models exhibit exceptional performance in medical examinations. However, there is a lack of reports assessing their capabilities in specific domains or their application in practical data analysis using code interpreters. Furthermore, comparative analyses across different large language models have not been extensively conducted.

Objective: This study aimed to evaluate whether advanced artificial intelligence (AI) models can analyze data from template-based input and demonstrate basic knowledge of reproductive medicine. Four AI models (GPT-4, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) were evaluated for their data analysis capabilities through numerical calculations and graph rendering. Their knowledge of infertility treatment was assessed using 10 examination questions developed by experts.

Methods: First, we uploaded data to the AI models and provided instruction templates through the chat interface. We investigated whether the AI models could perform pregnancy rate analysis and graph rendering based on blastocyst grades according to the Gardner criteria. Second, we assessed the models' diagnostic capabilities based on specialized knowledge. This evaluation used 10 questions derived from the Japanese Fertility Specialist Examination and the Embryologist Certification Exam, along with chromosome imaging. These materials were curated under the supervision of certified embryologists and fertility specialists. All procedures were repeated 10 times per AI model.
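The pregnancy rate analysis described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' template or code: the record layout, the grade labels chosen, the sample values, and the function name `pregnancy_rate_by_grade` are all hypothetical and serve only to show the kind of per-grade calculation the models were asked to perform.

```python
from collections import defaultdict

# Hypothetical records (not study data): each tuple is
# (Gardner grade label, whether a pregnancy was confirmed after transfer).
records = [
    ("AA", True), ("AA", True), ("AA", False),
    ("BA", True), ("BA", False),
    ("BB", False), ("BB", True), ("BB", False), ("BB", False),
]


def pregnancy_rate_by_grade(rows):
    """Return {grade: (pregnancies, transfers, rate)} sorted by grade label."""
    counts = defaultdict(lambda: [0, 0])  # grade -> [pregnancies, transfers]
    for grade, pregnant in rows:
        counts[grade][1] += 1
        if pregnant:
            counts[grade][0] += 1
    return {g: (p, n, p / n) for g, (p, n) in sorted(counts.items())}


rates = pregnancy_rate_by_grade(records)
for grade, (pregnancies, transfers, rate) in rates.items():
    print(f"{grade}: {pregnancies}/{transfers} = {rate:.1%}")
```

A bar chart of `rate` per grade would correspond to the graph-rendering step; it is omitted here to keep the sketch free of plotting dependencies.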

Results: GPT-4o achieved grade A output (defined as achieving the objective with a single output attempt) in 9 out of 10 trials, outperforming GPT-4, which achieved grade A in 7 out of 10. The average processing times for data analysis were 26.8 (SD 3.7) seconds for GPT-4o and 36.7 (SD 3) seconds for GPT-4. Claude failed in all 10 attempts. Gemini achieved an average processing time of 23 (SD 3) seconds and received grade A in 6 out of 10 trials, though occasional manual corrections were needed. Embryologists required an average of 358.3 (SD 9.7) seconds for the same tasks. In the knowledge-based assessment, GPT-4o, Claude, and Gemini achieved perfect scores (9/9) on multiple-choice questions, while GPT-4 showed a 60% (6/10) success rate on 1 question. None of the AI models could reliably diagnose chromosomal abnormalities from karyotype images, with the highest image diagnostic accuracy being 70% (7/10) for Claude and Gemini.
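To put the reported processing times in proportion, the short sketch below divides the embryologists' mean time by each model's mean time. The mean values are taken from the Results; the variable names are illustrative, and the ratio is a simple comparison of means rather than a statistic reported by the authors.

```python
# Mean processing times in seconds, as reported in the Results.
embryologist_mean_s = 358.3
model_mean_s = {"GPT-4o": 26.8, "GPT-4": 36.7, "Gemini": 23.0}

# Ratio of the human mean time to each model's mean time.
speedup = {model: embryologist_mean_s / t for model, t in model_mean_s.items()}

for model, ratio in speedup.items():
    print(f"{model}: about {ratio:.1f} times faster than embryologists")
```

Claude is omitted because it failed all 10 data analysis attempts, so no processing time is reported for it.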

Conclusions: The rapid processing observed demonstrates the potential of these AI models to significantly expedite data-intensive tasks in clinical settings. Their performance on the knowledge-based questions underscores their potential utility as educational tools or decision support systems in reproductive medicine. However, none of the models could accurately interpret medical images or diagnose from them.

Keywords: artificial intelligence; data analysis; education; infertility; large language model.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Data analysis procedure and evaluation framework for large language models in assisted reproductive technology clinical data processing. (A) Sample dataset structure showing patient treatment data from frozen-thawed embryo transfer cycles (January 2017-July 2024; 5361 cycles from 2276 patients) formatted for artificial intelligence (AI) model input with variables including patient age, embryo quality, and pregnancy outcomes. (B) Study workflow showing systematic evaluation protocol where 4 AI models (GPT-4, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) were tested 10 times each using standardized template prompts, with performance graded as A, B, or C. (C) Target visualization output showing pregnancy rates stratified by Gardner criteria trophectoderm grades (AA, BA, AB, BB, and BC) from clinic data. ET: embryo transfer; GS: gestational sac.

Figure 2. Knowledge assessment framework for large language models in reproductive medicine education. (A) Study protocol for evaluating artificial intelligence (AI) model performance on fertility specialist examination questions, with 10 independent trials per model using fresh chat sessions. (B) Question distribution showing sources: 3 questions from a senior embryologist, 6 questions from board-certified specialists (Japan Society for Reproductive Medicine gynecology and urology specialist exams, 2016-2018), and 1 image-based karyotype diagnosis question. (C) Karyotype analysis test image (600×450 pixels) used for chromosomal abnormality diagnosis assessment across all AI models.
