Development and Evaluation of a Natural Language Processing System for Curating a Trans-Thoracic Echocardiogram (TTE) Database
- PMID: 38002431
- PMCID: PMC10669818
- DOI: 10.3390/bioengineering10111307
Abstract
Background: Although electronic health records (EHR) provide useful insights into disease patterns and patient treatment optimisation, much of their content is unstructured and therefore difficult to analyse at scale. Echocardiography reports, which contain extensive pathology information for cardiovascular patients, are particularly challenging to extract and analyse because of their narrative structure. Natural language processing (NLP) has been used successfully in a variety of medical fields, but it is not commonly applied to echocardiography reports.
Objectives: To develop an NLP-based approach for extracting and categorising data from echocardiography reports, accurately converting continuous outcomes (e.g., LVOT VTI, AV VTI and TR Vmax) and discrete outcomes (e.g., regurgitation severity) from a semi-structured narrative format into a structured, categorised format suitable for future research or clinical use.
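To make the objective concrete, the following is a minimal sketch of how continuous and discrete outcomes might be pulled from semi-structured report text. It is illustrative only: the abstract does not describe the system's internals, so the regular expressions, the units, the severity vocabulary, the sample sentence and the function name `extract_report` are all assumptions rather than the authors' method.

```python
import re

# Hypothetical label patterns; the actual report wording is not given in the abstract.
CONTINUOUS_PATTERNS = {
    "LVOT VTI": re.compile(r"\bLVOT\s+VTI\s*[:=]?\s*(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE),
    "AV VTI": re.compile(r"\bAV\s+VTI\s*[:=]?\s*(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE),
    "TR Vmax": re.compile(r"\bTR\s+Vmax\s*[:=]?\s*(\d+(?:\.\d+)?)\s*m/s", re.IGNORECASE),
}

# Hypothetical severity vocabulary for the discrete outcome (regurgitation severity).
SEVERITY_PATTERN = re.compile(
    r"\b(no|trivial|mild|moderate|severe)\s+"
    r"(mitral|aortic|tricuspid|pulmonary)\s+regurgitation\b",
    re.IGNORECASE,
)


def extract_report(text):
    """Return a flat record of continuous values and severity categories."""
    record = {}
    # Continuous outcomes: first numeric match per label, or None if absent.
    for name, pattern in CONTINUOUS_PATTERNS.items():
        match = pattern.search(text)
        record[name] = float(match.group(1)) if match else None
    # Discrete outcomes: one category per valve mentioned.
    for severity, valve in SEVERITY_PATTERN.findall(text):
        record[f"{valve.lower()} regurgitation"] = severity.lower()
    return record


# Invented sample text, not taken from the study data.
sample = (
    "LVOT VTI: 21.5 cm. AV VTI 45 cm. TR Vmax = 2.8 m/s. "
    "Mild mitral regurgitation. No aortic regurgitation."
)
result = extract_report(sample)
print(result)
```

Returning `None` for a missing measurement, rather than omitting the key, keeps every record the same shape, which simplifies loading the output into a structured database.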
Methods: A total of 135,062 Trans-Thoracic Echocardiogram (TTE) reports were derived from 146,967 baseline echocardiogram reports and split into three cohorts: Training and Validation (n = 1075), Test Dataset (n = 98) and Application Dataset (n = 133,889). The NLP system was developed and iteratively refined using medical expert knowledge. It was then used to curate a moderate-fidelity database from extractions of the 133,889 Application Dataset reports. The hold-out set of 98 reports was blindly annotated and extracted by two clinicians for comparison with the NLP extraction. Agreement, discrimination, accuracy and calibration of outcome measure extractions were evaluated.
Results: Continuous outcomes including LVOT VTI, AV VTI and TR Vmax exhibited perfect inter-rater reliability as measured by intra-class correlation (ICC = 1.00, p < 0.05), alongside high R² values, demonstrating ideal alignment between the NLP system and clinicians. A good level of inter-rater reliability (ICC = 0.75-0.9, p < 0.05) was observed for outcomes such as LVOT Diam, Lateral MAPSE, Peak E Velocity, Lateral E' Velocity, PV Vmax, Sinuses of Valsalva and Ascending Aorta diameters. Furthermore, the accuracy for discrete outcome measures was 91.38% in the confusion matrix analysis, indicating effective performance.
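For readers unfamiliar with these agreement metrics, the sketch below shows one way an ICC and a confusion-matrix accuracy can be computed. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is an assumption here. The function names and the example numbers are hypothetical and are not study data.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per report and one column per rater."""
    n = len(ratings)      # number of reports (subjects)
    k = len(ratings[0])   # number of raters (e.g. NLP system and a clinician)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-report mean square
    msc = ss_cols / (k - 1)                # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def accuracy(confusion):
    """Proportion of cases on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical values: identical NLP and clinician extractions give ICC = 1.
identical = [[21.5, 21.5], [18.0, 18.0], [25.2, 25.2]]
perfect_icc = icc_2_1(identical)

# Hypothetical 2x2 confusion matrix (rows: clinician label, columns: NLP label).
example_accuracy = accuracy([[50, 3], [2, 45]])
```

With identical values from both raters the residual and between-rater mean squares are both zero, so the ratio reduces to 1, which is the situation an ICC of 1.00 describes.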
Conclusions: The NLP-based technique performed well in extracting and categorising data from echocardiography reports, and showed a high degree of agreement and concordance with clinician extractions. This study contributes to the effective use of semi-structured data by providing a tool for converting semi-structured text into a structured echo report that can be used for data management. Additional validation and implementation in healthcare settings could improve data availability and support research and clinical decision-making.
Keywords: Big Data; data extraction; echo report; echocardiography analysis; electronic health records (EHR); natural language processing (NLP); unstructured data; validation.
Conflict of interest statement
All authors declare that there are no competing interests.