Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Oct:240:107693.
doi: 10.1016/j.cmpb.2023.107693. Epub 2023 Jun 25.

From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer

Affiliations

From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer

Antoine Simoulin et al. Comput Methods Programs Biomed. 2023 Oct.

Abstract

Purpose: A considerable amount of valuable information is present in electronic health records (EHRs) however it remains inaccessible because it is embedded into unstructured narrative documents that cannot be easily analyzed. We wanted to develop and evaluate a methodology able to extract and structure information from electronic health records in breast cancer.

Methods: We developed a software platform called Onconum (ClinicalTrials.gov Identifier: NCT02810093) which uses a hybrid method relying on machine learning approaches and rule-based lexical methods. It is based on natural language processing techniques that allows a targeted analysis of free-text medical data related to breast cancer, independently of any pre-existing dictionary, in a French context (available in N files). We then evaluated it on a validation cohort called Senometry.

Findings: Senometry cohort included 9,599 patients with breast cancer (both invasive and in situ), treated between 2000 and 2017 in the breast cancer unit of Strasbourg University Hospitals. Extraction rates ranged from 45 to 100%, depending on the type of each parameter. Precision of extracted information was 68%-94% compared to a structured cohort, and 89%-98% compared to manually structured databases and it retrieved more rare occurrences compared to another database search engine (+17%).

Interpretation: This innovative method can accurately structure relevant medical information embedded in EHRs in the context of breast cancer. Missing data handling is the main limitation of this method however multiple sources can be incorporated to reduce this limit. Nevertheless, this methodology does not need neither pre-existing dictionaries nor manually annotated corpora. It can therefore be easily implemented in non-English-speaking countries and in other diseases outside breast cancer, and it allows prospective inclusion of new patients.

Keywords: Artificial intelligence; Breast cancer; Dictionaries; Electronic health record; Natural language processing; Ontology.

PubMed Disclaimer

Conflict of interest statement

Declaration of Competing Interest The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Carole Mathelin reports financial support was provided by University Hospitals Strasbourg.

Associated data