Digit Health. 2025 Feb 25:11:20552076251323110.
doi: 10.1177/20552076251323110. eCollection 2025 Jan-Dec.

Structuring and centralizing breast cancer real-world biomarker data from pathology reports through C-LAB® artificial intelligence platform

Florent Le Borgne et al. Digit Health.

Abstract

Purpose: To evaluate the effectiveness of C-LAB®, an artificial intelligence (AI) platform, in extracting, structuring, and centralizing biomarker data from breast cancer pathology reports within the challenging, heterogeneous dataset of the Institut de Cancérologie de l'Ouest (ICO).

Methods: C-LAB® was deployed at the ICO to analyze HER2 and hormone receptor data from breast cancer pathology reports. During the development phase, 292 anatomic pathology reports were used to design and refine the rule-based extraction algorithm through an iterative process of monitoring and adjustment. Once finalized, the algorithm was applied to a total of 2323 anatomic pathology reports belonging to 487 patients. Performance metrics could be calculated only for the subset of reports that could be matched to the structured National Epidemiological Strategy and Medical Economics (ESME) database: 666 reports corresponding to 97 patients present in ESME. For these reports, the structured ESME data served as the gold standard against which the outputs of the C-LAB® algorithm were compared.
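To make the idea of rule-based extraction concrete, the following is a minimal Python sketch. It is not the C-LAB® algorithm, whose rules are not given in this abstract: the regular expressions, the function name `extract_biomarkers`, the output keys, and the sample report text are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules for illustration only; the abstract does not list C-LAB's actual rules.
# HER2 immunohistochemistry score: a digit 0-3 appearing shortly after "HER2".
HER2_RE = re.compile(r"HER2\D{0,40}?([0-3])", re.IGNORECASE)
# ER / PR percentage of stained cells: a 1-3 digit number followed by "%".
RECEPTOR_RE = re.compile(r"\b(ER|PR)\b\D{0,40}?(\d{1,3})\s*%")


def extract_biomarkers(text):
    """Return a dict of the biomarker values found in one report's text."""
    result = {}
    her2 = HER2_RE.search(text)
    if her2:
        result["HER2_IHC"] = int(her2.group(1))
    for name, pct in RECEPTOR_RE.findall(text):
        # Keep the first value reported for each receptor.
        result.setdefault(name + "_percent", int(pct))
    return result


# Invented example text, not taken from the study's reports.
report = "ER: 90% of tumor cells. PR: 15 %. HER2: score 2+ (equivocal)."
print(extract_biomarkers(report))
# {'HER2_IHC': 2, 'ER_percent': 90, 'PR_percent': 15}
```

A real rule set would presumably need many more patterns to cover the wording variations across laboratories, which is consistent with the iterative refinement on 292 reports described above.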

Results: C-LAB® achieved over 80% agreement with human extractions (precision, recall, and F1-score) in structuring biomarker data from complex, unstructured pathology reports, despite dataset variability and optical character recognition errors. While the ESME database served as a benchmark, its reliance on single manual data entry without secondary review introduces potential inaccuracies, suggesting the observed performance reflects close alignment between human and algorithmic extractions rather than absolute accuracy. C-LAB® demonstrates significant potential to reduce manual workload, centralize data, and enable scalable, real-time reporting.
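For readers unfamiliar with the three metrics, the sketch below shows one common way to compute precision, recall, and F1-score when comparing extracted values with a reference. The abstract does not state how matches were counted in the study, so the representation used here (sets of `(patient, biomarker, value)` tuples) and the sample data are assumptions.

```python
def precision_recall_f1(extracted, reference):
    """Compare two sets of (patient, biomarker, value) tuples.

    Assumed counting scheme: a tuple present in both sets is a true positive,
    one only in `extracted` is a false positive, one only in `reference`
    is a false negative.
    """
    tp = len(extracted & reference)
    fp = len(extracted - reference)
    fn = len(reference - extracted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented data: 3 agreements, 1 disagreement, 1 value missed by the algorithm.
extracted = {("p1", "HER2", "2+"), ("p1", "ER", "positive"),
             ("p2", "HER2", "0"), ("p2", "PR", "negative")}
reference = {("p1", "HER2", "2+"), ("p1", "ER", "positive"),
             ("p2", "HER2", "1+"), ("p2", "PR", "negative"),
             ("p3", "ER", "positive")}

precision, recall, f1 = precision_recall_f1(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```

Because the reference itself may contain entry errors, as noted above for ESME, values computed this way measure agreement with the reference, not absolute correctness.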

Conclusion: AI technologies like C-LAB® show significant potential in creating accessible and actionable digital factories from complex pathology data, supporting precision diagnosis and treatment of diseases such as breast cancer.

Keywords: Cancer disease; artificial intelligence general; biomarker; genetics medicine; machine learning general; oncology medicine.

Conflict of interest statement

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: FLB, CM, TP, OK, JSF, JR, MC and FB declare no conflict of interest. CG, YN, YE, JM, JC and MFP declare a potential conflict of interest, as they were involved in both the development of the C-LAB® platform and the authorship of this study, as consultants or employees of Connect by Circular-Lab. However, every effort was made to ensure the integrity and objectivity of the research, including adherence to rigorous evaluation protocols. None of these authors had access to the ESME data, and the evaluation of the algorithm's performance was carried out by FLB, who has no connection with Connect by Circular-Lab.

Figures

Figure 1.
Narrative content and OCR errors in reports. This figure showcases examples of report formats and wording variations in breast biomarker results sourced from various laboratories, anonymized as Labs A, B, C, and D. These examples highlight the substantial diversity in structure and terminology used in pathology reports, particularly in biomarker data presentation, which poses significant challenges for standardized data description and analysis. The narrative data ICO accessed was processed through OCR technology, a tool designed to extract text based on character shape recognition. While OCR is effective for basic text extraction, it has inherent limitations in accurately distinguishing visually similar characters. For instance, the OCR software struggled to differentiate between “o” and “0,” or “2” and “Z,” resulting in transcription errors such as “HER2” being misinterpreted as “HERZ.” Similarly, nuanced distinctions between accented characters, such as “à” and “a,” were occasionally missed, leading to inaccuracies. Another frequent error involved symbols, where OCR occasionally rendered “%” as “#.” The English translation is provided to improve clarity.
Figure 2.
Venn diagram of study cohort and report selection.
Figure 3.
Distribution of matching temporal distances between C-LAB® results and ESME results. Positive distances mean that the date of the result found by C-LAB® is earlier than the date of the result present in ESME. A: HER2; B: ISH; C: ER (Positive, Negative), D: ER (percentage); E: PR (Positive, Negative), F: PR (percentage).
Figure 4.
Conventional versus digital pathology workflow and C-LAB® positioning. The current digitalization of the pathology workflow shows a gap in addressing the reporting and monitoring of testing data.
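
The OCR confusions listed in the Figure 1 caption ("HER2" read as "HERZ", "%" read as "#", "0" read as "o") could in principle be repaired before extraction. The Python sketch below shows one conservative way to do so. It is an illustration under assumptions, not the preprocessing used by C-LAB®: the function name `normalize_ocr` and the three substitution rules are hypothetical, and accent confusions such as "à" versus "a" are not handled.

```python
import re


def normalize_ocr(text):
    """Apply a few context-restricted fixes for common OCR confusions."""
    # "2" misread as "Z" in the biomarker name: "HERZ" -> "HER2".
    text = re.sub(r"\bHER[Zz]\b", "HER2", text)
    # Letter o/O directly after a digit and not starting a word: "9o" -> "90".
    text = re.sub(r"(?<=\d)[oO](?![A-Za-z])", "0", text)
    # "#" after a number (optionally separated by one space): "90#" -> "90%".
    text = re.sub(r"(?<=\d)(\s?)#", r"\1%", text)
    return text


# Invented example line, not taken from the study's reports.
print(normalize_ocr("HERZ score 2+, ER 9o#, PR 15 #"))
# HER2 score 2+, ER 90%, PR 15 %
```

Each rule only fires in a narrow context (next to a digit, or inside the token "HERZ") so that ordinary words containing "o" or "z" are left untouched. The letter-to-zero rule runs before the "#" rule so that "9o#" is first repaired to "90#" and then to "90%".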
