J Clin Med. 2024 Dec 23;13(24):7864. doi: 10.3390/jcm13247864.

ChatGPT's Performance in Spinal Metastasis Cases: Can We Discuss Our Complex Cases with ChatGPT?

Stephan Heisinger et al. J Clin Med.

Abstract

Background: The integration of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT-4, is transforming healthcare. ChatGPT's potential to assist in decision-making for complex cases, such as spinal metastasis treatment, is promising but largely untested. Especially in cancer patients who develop spinal metastases, precise and personalized treatment is essential. This study examines ChatGPT-4's performance in treatment planning for spinal metastasis cases compared with experienced spine surgeons. Materials and Methods: Five spinal metastasis cases were randomly selected from the recent literature. Subsequently, five spine surgeons and ChatGPT-4 were tasked with providing treatment recommendations for each case in a standardized manner. Responses were analyzed for frequency distribution, agreement, and subjective rater opinions. Results: ChatGPT's treatment recommendations aligned with the majority of human raters in 73% of treatment choices, with moderate to substantial agreement on systemic therapy, pain management, and supportive care. However, ChatGPT's recommendations tended towards generalized statements, a tendency the raters explicitly noted. Agreement among raters improved in sensitivity analyses excluding ChatGPT, particularly for controversial areas such as surgical intervention and palliative care. Conclusions: ChatGPT shows potential in aligning with experienced surgeons on certain treatment aspects of spinal metastasis. However, its generalized approach highlights its limitations, suggesting that training with specific clinical guidelines could enhance its utility in complex case management. Further studies are necessary to refine AI applications in personalized healthcare decision-making.
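The headline figure in the abstract (alignment with the human majority in 73% of treatment choices) can be reproduced from a simple yes/no answer matrix. The sketch below is not the authors' analysis code; the data layout and values are illustrative placeholders, with each treatment choice reduced to a binary question.

```python
import numpy as np

rng = np.random.default_rng(0)
# answers[rater, question]: True = "yes"; raters 0-4 are the surgeons, rater 5 is ChatGPT.
answers = rng.random((6, 100)) < 0.46  # placeholder answer matrix, not the study data

human_majority = answers[:5].sum(axis=0) >= 3      # "yes" if at least 3 of 5 surgeons said yes
alignment = (answers[5] == human_majority).mean()  # share of questions where ChatGPT matches
print(f"ChatGPT matches the human majority in {alignment:.0%} of treatment choices")
```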

Keywords: ChatGPT; decision making; spinal metastasis; spine; spine surgery.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Flow chart of the study design.
Figure 2
Response form used to collect data from five spine surgeons and ChatGPT.
Figure 3
Relative frequencies of overall positive answers to the 100 questions asked by each rater. Raters 1 to 5 were human; rater 6 was ChatGPT. Fisher’s exact test showed a significant difference in distribution at p < 0.001, with a Bonferroni-corrected post hoc analysis determining that rater 5 gave significantly fewer “yes” answers (**). ChatGPT gave 52% positive answers, above the average of 45.6%.
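A hedged sketch of the kind of post hoc analysis described above: Bonferroni-corrected pairwise Fisher’s exact tests comparing each rater’s yes/no counts against the pooled counts of the remaining raters. The counts are assumed placeholders, not the study data or the authors’ script; SciPy’s fisher_exact covers the 2x2 tables used for these pairwise comparisons.

```python
from scipy.stats import fisher_exact

# Placeholder yes/no counts per rater over the 100 questions (not the study data).
counts = {
    "rater 1": (48, 52), "rater 2": (47, 53), "rater 3": (46, 54),
    "rater 4": (45, 55), "rater 5": (28, 72), "ChatGPT": (52, 48),
}
n_tests = len(counts)  # one pairwise test per rater -> Bonferroni factor

for name, (yes, no) in counts.items():
    # 2x2 table: this rater's yes/no counts vs. the pooled counts of all other raters
    rest_yes = sum(y for k, (y, _) in counts.items() if k != name)
    rest_no = sum(n for k, (_, n) in counts.items() if k != name)
    _, p = fisher_exact([[yes, no], [rest_yes, rest_no]])
    p_adj = min(1.0, p * n_tests)  # Bonferroni correction
    marker = "**" if p_adj < 0.05 else ""
    print(f"{name}: yes = {yes}/100, adjusted p = {p_adj:.3f} {marker}")
```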
Figure 4
Bar charts of the total number of votes per question block for each intervention/therapy decision in case 1. Bar chart length shows the sum of cast votes, and each color block within the bar shows the absolute number of votes for each available option. Each rater could choose none, all, or any combination of options in each question block. The multiple colors in the bars indicate that raters considered more than one answer appropriate (panels 1 to 6). The different sizes of the color blocks in each bar indicate the raters’ propensity towards a certain answer (e.g., panel 6, raters favoring psychological counseling over nutritional support). Evenly distributed colors could mean either total agreement that all chosen answers were correct or total disagreement, with each rater choosing a different answer (panel 3, in this case disagreement). The mono-colored bar in panel 7 indicates total rater agreement on end-of-life care planning, with no votes for palliative radiation. Overall rater agreement was the lowest of all cases (Fleiss’ Kappa = 0.42, “moderate agreement”, p < 0.001).
Figure 5
Bar charts of the total number of votes per question block for each intervention/therapy decision in case 2. Bar chart length shows the sum of cast votes, and each color block within the bar shows the absolute number of votes for each available option. Each rater could choose none, all, or any combination of options in each question block. The multiple colors in the bars indicate that raters considered more than one answer appropriate (panels 1 to 6). The different sizes of the color blocks in each bar indicate the raters’ propensity towards a certain answer (e.g., panel 6, raters favoring psychological counseling over nutritional support). Evenly distributed colors could mean either total agreement that all chosen answers were correct or total disagreement, with each rater choosing a different answer (panel 4, in this case agreement). The mono-colored bar in panel 7 indicates that only 1 vote was cast, with 5 voters seeing no need for palliative care and 1 human voter opting for palliative radiation. Overall rater agreement was the highest of all cases (Fleiss’ Kappa = 0.60, “substantial agreement”, p < 0.001).
Figure 6
Bar charts of the total number of votes per question block for each intervention/therapy decision in case 3. Bar chart length shows the sum of cast votes, and each color block within the bar shows the absolute number of votes for each available option. Each rater could choose none, all, or any combination of options in each question block. The multiple colors in the bars indicate that raters considered more than one answer appropriate (panels 1 to 6). The different sizes of the color blocks in each bar indicate the raters’ propensity towards a certain answer (e.g., panels 3, 5, and 6: raters favoring chemotherapy, physical therapy, and psychological counseling, respectively, over the answer alternatives). Evenly distributed colors could mean either total agreement that all chosen answers were correct or total disagreement, with each rater choosing a different answer (panel 1, in this case agreement). The mono-colored bar in panel 7 indicates that only 1 vote was cast, with 5 voters seeing no need for palliative care and 1 human voter opting for end-of-life care planning. Overall rater agreement was high compared with all other cases (Fleiss’ Kappa = 0.60, p < 0.001) and the highest among all cases when ChatGPT was left out of the calculation (Fleiss’ Kappa = 0.73, “substantial agreement”, p < 0.001).
Figure 7
Bar charts of the total number of votes per question block for each intervention/therapy decision in case 4. Bar chart length shows the sum of cast votes, and each color block within the bar shows the absolute number of votes for each available option. Each rater could choose none, all, or any combination of options in each question block. The multiple colors in the bars indicate that raters considered more than one answer appropriate (panels 1 to 6). The different sizes of the color blocks in each bar indicate the raters’ propensity towards a certain answer (e.g., panels 5 and 6: raters favoring physical therapy and psychological counseling, respectively, over the answer alternatives). Evenly distributed colors could mean either total agreement that all chosen answers were correct or total disagreement, with each rater choosing a different answer (panel 1, in this case disagreement). The absence of a bar in panel 7 indicates that no rater saw a need for palliative care. Overall rater agreement was high compared with all other cases (Fleiss’ Kappa = 0.58, “moderate agreement”, p < 0.001) and even higher when ChatGPT was left out of the calculation (Fleiss’ Kappa = 0.66, “substantial agreement”, p < 0.001).
Figure 8
Bar charts of the total number of votes per question block for each intervention/therapy decision in case 5. Bar chart length shows the sum of cast votes, and each color block within the bar shows the absolute number of votes for each available option. Each rater could choose none, all, or any combination of options in each question block. The multiple colors in the bars indicate that raters considered more than one answer appropriate (e.g., panel 1). The different sizes of the color blocks in each bar indicate the raters’ propensity towards a certain answer (e.g., panels 5 and 6: raters favoring physical therapy and psychological counseling, respectively, over the answer alternatives). Evenly distributed colors could mean either total agreement that all chosen answers were correct or total disagreement, with each rater choosing a different answer (panel 4, in this case agreement). The absence of a bar in panel 7 indicates that no rater saw a need for palliative care. Overall rater agreement was low compared with all other cases (Fleiss’ Kappa = 0.48, “moderate agreement”, p < 0.001) and the lowest among all cases when ChatGPT was left out of the calculation (Fleiss’ Kappa = 0.40, “moderate agreement”, p < 0.001).
Figure 9
Answer heatmap for each of the 5 cases (separated by red vertical lines). The same 20 questions (x-axis, one column per question) were answered by the 6 raters (y-axis, rows) with either yes (blue) or no (light grey); unanswered questions are shown in dark grey. A total of 20 questions were answered identically by all 6 raters, indicated by columns that are entirely blue or entirely light grey. ChatGPT as rater 6 (row 1) is marked by a cyan arrow.
Figure 10
Left panel: results of a case-by-case Fleiss’ Kappa estimation (y-axis; 0.2 to 0.39 “fair”, 0.4 to 0.59 “moderate”, 0.6 and above “substantial” agreement), grouped by case together with the overall result (x-axis; each case and the overall estimation have separate colors). The left bar of each color is the estimation with ChatGPT included, the right bar the result without ChatGPT. Rater agreement in cases 1 to 4, as well as overall, increases without ChatGPT contributing. Right panel: results of the Fleiss’ Kappa calculation by question block. Each block (x-axis) has its own color; the left bar of each color represents the result with ChatGPT considered, the right bar the result without the AI. Controversial questions revolved around surgical intervention, rehabilitation services, and palliative care. Except for rehabilitation, all question blocks benefited from removing ChatGPT as a rater.
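A minimal sketch of the agreement analysis summarized in this figure, assuming the answers are stored as a 6 x 100 yes/no matrix: Fleiss’ Kappa computed once with all six raters and once with the five human raters only, using statsmodels. The matrix below is a random placeholder, not the study ratings.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(1)
# Placeholder 6 x 100 matrix of yes (1) / no (0) answers; rater index 5 stands in for ChatGPT.
answers = (rng.random((6, 100)) < 0.46).astype(int)

def kappa(rater_matrix):
    # aggregate_raters expects subjects (questions) in rows and raters in columns
    table, _ = aggregate_raters(rater_matrix.T)
    return fleiss_kappa(table, method="fleiss")

print("Fleiss' Kappa, all six raters:   ", round(kappa(answers), 2))
print("Fleiss' Kappa, human raters only:", round(kappa(answers[:5]), 2))
```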

