NPJ Digit Med. 2025 Jul 10;8(1):424.
doi: 10.1038/s41746-025-01826-5.

How machine learning on real world clinical data improves adverse event recording for endoscopy

Stefan Wittlinger et al. NPJ Digit Med.

Abstract

Endoscopic interventions are essential for diagnosing and treating gastrointestinal conditions. Accurate and comprehensive documentation is crucial for enhancing patient safety and optimizing clinical outcomes; however, adverse events remain underreported. This study evaluates a machine learning-based approach for systematically detecting endoscopic adverse events from real-world clinical metadata, including structured hospital data such as ICD codes and procedure timings. Using a random forest classifier to detect the adverse events perforation, bleeding, and readmission, we analysed 2490 inpatient cases and achieved significant improvements over baseline prediction accuracy. The model achieved AUC-ROC/AUC-PR values of 0.9/0.69 for perforation, 0.84/0.64 for bleeding, and 0.96/0.9 for readmissions. The results highlight the importance of multiple metadata features for robust predictions. This semi-automated method offers a privacy-preserving tool for identifying documentation discrepancies and enhancing quality control. By integrating metadata analysis, this approach supports better clinical decision-making, quality improvement initiatives, and resource allocation while reducing the risk of missed adverse events in endoscopy.

Conflict of interest statement

Competing interests: S.B. declares consulting services for Olympus. I.W. received honoraria from AstraZeneca. J.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI, Germany; Synagen, Germany; and Ignition Lab, Germany; has received an institutional research grant from G.S.K.; and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. All other authors declare no competing interests.

Figures

Fig. 1. Example of data generated during a hospital stay.
This figure displays an example of data generated during a hospital stay, which includes both unstructured data, primarily in the form of text (e.g., endoscopy reports and discharge letters), and structured data (metadata), such as diagnoses, materials used during endoscopy, and time until discharge. For a comprehensive list of the metadata used, refer to Supplementary Tables 2–3.
Fig. 2. Training and testing scheme for adverse events perforation and bleeding.
The training and testing scheme for the adverse events bleeding and perforation is displayed. A combination of LLM-generated and manually generated labels was used. The random forest was trained for two types of adverse events, perforation and bleeding, using a training set with n = 1990 cases. The labels for the training set were obtained by running a large language model on the endoscopy reports and discharge letters. The performance metrics were obtained by testing on the remaining n = 500 manually labeled cases, which represent the ground truth. To estimate the stability of the machine learning algorithm, the large language model labels were used for the entire data set (n = 2490). With these labels, we performed random subsampling with 100 iterations. In each iteration, the data was randomly split into a training set (n = 1990) and a test set (n = 500). From this, the standard deviation of the performance metrics was calculated. Perforation or bleeding that occurred after readmission was not classified as the adverse event perforation or bleeding, but rather as the adverse event readmission. The listed data is available at discharge, so the detection of adverse events such as bleeding or perforation can be performed at discharge or any later time.
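The random subsampling described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `random_subsampling_splits` and `metric_spread`, the fixed seed, and the `evaluate` callback are hypothetical, and the use of the sample standard deviation is an assumption. Only the sizes (n = 2490 cases, 1990/500 split, 100 iterations) come from the caption.

```python
import random
import statistics


def random_subsampling_splits(n_cases, n_test, n_iter, seed=0):
    """Yield (train_idx, test_idx) pairs, reshuffling all cases in each iteration."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    indices = list(range(n_cases))
    for _ in range(n_iter):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled cases form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]


def metric_spread(evaluate, n_cases=2490, n_test=500, n_iter=100, seed=0):
    """Return (mean, standard deviation) of one metric across the random splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that would train
    a classifier on the training cases and return one score for the test cases.
    """
    scores = [
        evaluate(train_idx, test_idx)
        for train_idx, test_idx in random_subsampling_splits(n_cases, n_test, n_iter, seed)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```

In practice `evaluate` would wrap model fitting and scoring; the splitting logic stays the same for any classifier or metric.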
Fig. 3. Training and testing scheme for adverse event readmission.
Training and testing scheme for adverse event readmission within 30 days due to adverse events in connection with previously performed EMR. The entire data set, n = 213, consisting of all readmissions within 30 days was manually labeled. Given the limited sample size, the metadata used was restricted to the time until readmission and the ICD codes recorded at readmission. The random forest classifier was trained on n = 163 cases and tested on n = 50 cases. To evaluate the stability of the machine learning algorithm, random subsampling was performed over 100 iterations, with different splits between training and testing sets in each iteration. The listed data is available at readmission, allowing the detection of adverse event readmission to be performed at readmission or any later time.
Fig. 4. Test results for adverse event readmission.
Test results (AUC-ROC and AUC-PR) and errors for adverse event readmission within 30 days due to adverse events in connection with previously performed EMR are displayed. The dataset (n = 213) with manually labeled data was randomly split into a training set (n = 163) and a testing set (n = 50). This random subsampling process was repeated 100 times. The AUC-ROC and AUC-PR values were calculated as the mean across all runs, with error bars representing the standard deviation.
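As a rough sketch of the two metrics reported here, the functions below compute AUC-ROC through its pairwise-ranking interpretation and AUC-PR as average precision, which is one common estimator of the area under the precision-recall curve. The source does not state which estimator or library the authors used, so the function names and the choice of average precision are assumptions.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    ranked = sorted(zip(scores, labels), reverse=True)
    true_pos = 0
    area = 0.0
    i = 0
    while i < len(ranked):
        # Treat all cases sharing one score as a single threshold step.
        j = i
        group_pos = 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            group_pos += ranked[j][1]
            j += 1
        true_pos += group_pos
        precision = true_pos / j          # j cases are predicted positive so far
        area += (group_pos / n_pos) * precision  # recall gain times precision
        i = j
    return area


# Toy labels and scores, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc_roc = auc_roc(toy_labels, toy_scores)
toy_auc_pr = average_precision(toy_labels, toy_scores)
```

Applying both functions to each of the 100 test sets and taking the mean and standard deviation of the results would give values of the kind plotted with error bars.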
Fig. 5. Test results for adverse events bleeding and perforation.
(a) The test results for adverse events bleeding and perforation (AUC-ROC and AUC-PR) are displayed. The model was trained on a training set (n = 1990) with labels generated by a large language model and tested on a manually labeled test set (n = 500). Direct error bars cannot be computed for this process, as random subsampling would require manual labels for all cases. (b) Estimated error values using only labels generated by a large language model are shown. Labels generated by a large language model are used for both training (n = 1990) and testing (n = 500). This process is repeated over 100 iterations using random subsampling, with a different split of training and test data in each iteration. Performance metrics (AUC-ROC, AUC-PR, and dummy classifier) are calculated as mean values, with the error bars representing the standard deviations shown in the plot.
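The dummy classifier comparison can be read against the following reference values. The source does not say which dummy strategy was used; this sketch assumes a classifier that assigns every case the same score, for which AUC-ROC is 0.5 and average precision equals the prevalence of the adverse event. The function name and the example counts are hypothetical.

```python
def dummy_baselines(labels):
    """Reference scores for a classifier whose output ignores all features.

    Assumes a constant-score dummy: every positive/negative pair is a tie,
    so AUC-ROC is 0.5, and precision at full recall equals the prevalence.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    prevalence = sum(labels) / len(labels)
    return {"auc_roc": 0.5, "auc_pr": prevalence}


# Hypothetical test set: 25 adverse events among 500 cases (numbers invented).
example_labels = [1] * 25 + [0] * 475
example_baseline = dummy_baselines(example_labels)
```

Because the AUC-PR baseline depends on prevalence, an AUC-PR value is best judged relative to how rare the event is in the test set, whereas the AUC-ROC baseline stays at 0.5.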
Fig. 6. Ten most important features for adverse events perforation, bleeding, and readmission.
The 10 most important features for (a) perforation, (b) bleeding, and (c) readmission are displayed. SHAP was used to determine feature importance.
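One common way to turn per-case SHAP values into a top-10 ranking is to sort features by their mean absolute SHAP value. The sketch below assumes that aggregation, which the caption does not confirm, and takes the SHAP values as already computed; the function name, the feature names, and the numbers are all hypothetical.

```python
def top_features(shap_values, feature_names, k=10):
    """Rank features by mean absolute SHAP value across cases.

    `shap_values` is a list of rows (one per case), each holding one SHAP
    value per feature, in the same order as `feature_names`.
    """
    n_cases = len(shap_values)
    if n_cases == 0:
        raise ValueError("shap_values must contain at least one case")
    importance = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_cases
        importance.append((name, mean_abs))
    importance.sort(key=lambda item: item[1], reverse=True)
    return importance[:k]


# Invented values for two cases and three placeholder features.
toy_names = ["icd_code_x", "time_until_discharge", "procedure_duration"]
toy_shap = [
    [0.2, -0.5, 0.0],
    [-0.1, 0.3, 0.05],
]
toy_ranking = top_features(toy_shap, toy_names, k=2)
```

Taking the absolute value before averaging keeps features whose contributions point in opposite directions for different cases from cancelling out.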
