Performance and clinical utility of a new supervised machine-learning pipeline in detecting rare ciliopathy patients based on deep phenotyping from electronic health records and semantic similarity
- PMID: 38336713
- PMCID: PMC10858490
- DOI: 10.1186/s13023-024-03063-7
Performance and clinical utility of a new supervised machine-learning pipeline in detecting rare ciliopathy patients based on deep phenotyping from electronic health records and semantic similarity
Abstract
Background: Rare diseases affect approximately 400 million people worldwide. Many of them suffer from delayed diagnosis. Among them, NPHP1-related renal ciliopathies need to be diagnosed as early as possible as potential treatments have been recently investigated with promising results. Our objective was to develop a supervised machine learning pipeline for the detection of NPHP1 ciliopathy patients from a large number of nephrology patients using electronic health records (EHRs).
Methods and results: We designed a pipeline combining a phenotyping module re-using unstructured EHR data, a semantic similarity module to address the phenotype dependence, a feature selection step to deal with high dimensionality, an undersampling step to address the class imbalance, and a classification step with multiple train-test split for the small number of rare cases. The pipeline was applied to thirty NPHP1 patients and 7231 controls and achieved good performances (sensitivity 86% with specificity 90%). A qualitative review of the EHRs of 40 misclassified controls showed that 25% had phenotypes belonging to the ciliopathy spectrum, which demonstrates the ability of our system to detect patients with similar conditions.
Conclusions: Our pipeline reached very encouraging performance scores for pre-diagnosing ciliopathy patients. The identified patients could then undergo genetic testing. The same data-driven approach can be adapted to other rare diseases facing underdiagnosis challenges.
Keywords: Diagnosis support; Electronic health record; Imbalanced dataset; Rare disease; Semantic similarity; Supervised machine learning.
© 2024. The Author(s).
Conflict of interest statement
The authors declare that they have no competing interests.
Figures
References
-
- Hens D, Wyers L, Claeys KG. Validation of an Artificial Intelligence driven framework to automatically detect red flag symptoms in screening for rare diseases in electronic health records: hereditary transthyretin amyloidosis polyneuropathy as a key example. J Peripher Nervous Syst. 2023 doi: 10.1111/jns.12523. - DOI - PubMed
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
Medical