Multicenter Study
Diabetes Res Clin Pract. 2025 Aug:226:112214.
doi: 10.1016/j.diabres.2025.112214. Epub 2025 May 2.

Characterizing the clinical profile and prevalence of people with diabetes attended in the hospital setting by using unstructured healthcare data and natural language processing: the Diabetic@ study

Free article
A J Blanco-Carrasco et al. Diabetes Res Clin Pract. 2025 Aug.

Abstract

Aims: This study aimed to evaluate the potential of unstructured electronic health record (EHR) data, analyzed using natural language processing (NLP) and machine learning (ML), to describe the prevalence and clinical spectrum of diabetes mellitus (DM) in hospitals.

Methods: A multicenter, retrospective study was conducted using EHRs from eight Spanish hospitals (2013-2018). Unstructured data were extracted using EHRead® (NLP and ML) and SNOMED CT. Individuals with type 1 or 2 DM (T1DM/T2DM) were identified, and a semi-supervised ML classifier was developed for unregistered types (UrDM). DM prevalence and related complications were analyzed in the final subpopulations (sT1DM/sT2DM).

Results: From 56,181,954 EHRs of 2,582,778 individuals, 638,730 were identified with DM: 75.4% with UrDM, 21.3% with T2DM, and 3.3% with T1DM. The ML model reclassified 93.5% as T2DM and 6.5% as T1DM. Over 50% of relevant variables like anthropometrics, laboratory values and treatments were missing. The prevalence of sT1DM/sT2DM was 2.6%/38.4%. Major comorbidities included hypertension, dyslipidemia, chronic kidney disease (CKD), ischemic heart disease, and chronic heart failure (CHF). CKD and CHF were the most frequent complications for sT1DM/sT2DM at 60 months.
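The cohort proportions in the Results can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the reclassified UrDM shares are simply added to the registered T1DM/T2DM shares, which the abstract does not state, and the helper name `combined_shares` is hypothetical. The combined shares it prints are not figures reported by the study and are not the sT1DM/sT2DM prevalence values, whose denominators the abstract does not define.

```python
# Figures quoted in the Results paragraph of the abstract.
total_individuals = 2_582_778
dm_individuals = 638_730

# Shares of the DM cohort by registered type, in percent.
registered = {"UrDM": 75.4, "T2DM": 21.3, "T1DM": 3.3}
# Shares of the UrDM group after ML reclassification, in percent.
reclassified = {"T2DM": 93.5, "T1DM": 6.5}


def combined_shares(registered, reclassified):
    """Redistribute the UrDM share across T1DM/T2DM.

    Assumption for illustration only: the final split is the registered
    share plus the reclassified part of the UrDM share.
    """
    urdm = registered["UrDM"]
    return {
        dm_type: registered[dm_type] + urdm * reclassified[dm_type] / 100
        for dm_type in reclassified
    }


# Percentage of all individuals who were identified with DM.
dm_fraction = 100 * dm_individuals / total_individuals
shares = combined_shares(registered, reclassified)

print(f"Individuals identified with DM: {dm_fraction:.1f}%")
for dm_type, share in shares.items():
    print(f"{dm_type} after reclassification: {share:.1f}% of the DM cohort")
```

Under these assumptions, roughly a quarter of the individuals in the source population were identified with DM, and the two combined shares sum to 100% of the DM cohort.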

Conclusions: NLP and ML are helpful for profiling DM from unstructured EHR data, but additional data and better EHR documentation are crucial.

Keywords: Diabetes Comorbidities; Diabetes Complications; Diabetes Mellitus (DM); Machine Learning (ML); Natural Language Processing (NLP); Unstructured Data.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
