Structuring and centralizing breast cancer real-world biomarker data from pathology reports through C-LAB^® artificial intelligence platform

Florent Le Borgne¹, Camille Garnier², Camille Morisseau¹, Yanis Navarrete², Yanina Echeverria², Juan Mir², Jaume Calafell², Tanguy Perennec³, Olivier Kerdraon⁴, Jean-Sébastien Frenel^{5,6}, Judith Raimbourg^{5,6}, Mario Campone^{5,6}, Maria Fe Paz², François Bocquet^{1,7}

Affiliations

¹ Data Factory & Analytics Department, Institut de Cancérologie de l'Ouest, Nantes-Angers, France.
² Connect By Circular Lab, Madrid, Spain.
³ Department of Radiation Oncology, Institut de Cancérologie de l'Ouest, Nantes-Angers, France.
⁴ Department of Pathology, Institut de Cancérologie de l'Ouest, Nantes-Angers, France.
⁵ Oncology Department, Institut de Cancérologie de l'Ouest, Nantes-Angers, France.
⁶ Center for Research in Cancerology and Immunology Nantes-Angers, Nantes University and Angers University, Nantes-Angers, France.
⁷ Law and Social Change Laboratory, Faculty of Law and Political Sciences, Nantes University, Nantes, France.

PMID: 40013074
PMCID: PMC11863259
DOI: 10.1177/20552076251323110

Florent Le Borgne et al. Digit Health. 2025 Feb 25;11:20552076251323110. eCollection 2025 Jan-Dec.

Abstract

Purpose: To evaluate the effectiveness of C-LAB^®, an artificial intelligence (AI) platform, in extracting, structuring, and centralizing biomarker data from breast cancer pathology reports within the challenging, heterogeneous dataset of the Institut de Cancérologie de l'Ouest (ICO).

Methods: C-LAB^® was deployed at the ICO to analyze HER2 and hormone receptor data from breast cancer pathology reports. During the development phase, 292 anatomic pathology reports were used to design and refine the rule-based extraction algorithm through an iterative process of monitoring and adjustment. Once finalized, the algorithm was applied to a total of 2323 anatomic pathology reports. Performance metrics could only be calculated for the subset of these reports that were also available in the structured National Epidemiological Strategy and Medical Economics (ESME) database. Of the 2323 pathology reports, belonging to 487 patients, analyzed by C-LAB^®, 666 corresponded to 97 patients present in the ESME database. The ESME data for these reports were used as the gold standard for performance assessment, as ESME provides structured data against which the outputs of the C-LAB^® algorithm could be compared.
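The abstract does not describe the actual C-LAB^® rules, so the following is only a minimal sketch of what a rule-based extraction step can look like. The patterns, the function name `extract_biomarkers`, and the output keys are all hypothetical, and the sketch assumes English labels (`HER2`, `ER`, `PR`), whereas the ICO reports use varied wording.

```python
import re

# Hypothetical rules, not the C-LAB rule set: a HER2 immunohistochemistry
# score (0 to 3) and ER/PR staining percentages written as "<label> ... NN %".
HER2_RE = re.compile(r"HER2\D{0,30}?score\s*:?\s*([0-3])\s*\+?", re.IGNORECASE)
RECEPTOR_RE = re.compile(r"\b(ER|PR)\b\D{0,30}?(\d{1,3})\s?%")


def extract_biomarkers(report_text):
    """Return a dict of the biomarker values found in one report's text."""
    result = {}
    her2 = HER2_RE.search(report_text)
    if her2:
        result["HER2_IHC_score"] = int(her2.group(1))
    for name, percent in RECEPTOR_RE.findall(report_text):
        result[f"{name}_percent"] = int(percent)
    return result


print(extract_biomarkers("ER: 90 %, PR: 5%. HER2: score 3+"))
```

In an iterative process like the one described, each report that such rules miss or mislabel would motivate a new or adjusted pattern.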

Results: C-LAB^® achieved over 80% agreement with human extractions (precision, recall, and F1-score) in structuring biomarker data from complex, unstructured pathology reports, despite dataset variability and optical character recognition errors. While the ESME database served as a benchmark, its reliance on single manual data entry without secondary review introduces potential inaccuracies, suggesting the observed performance reflects close alignment between human and algorithmic extractions rather than absolute accuracy. C-LAB^® demonstrates significant potential to reduce manual workload, centralize data, and enable scalable, real-time reporting.
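For readers unfamiliar with the three agreement metrics named above, the sketch below computes them for one label from paired values. It uses the standard definitions; the function name, the pairing of reference (ESME) and extracted (C-LAB^®) labels, and the sample data are illustrative assumptions, not the study's evaluation code or results.

```python
def precision_recall_f1(pairs, positive_label):
    """Compute precision, recall and F1 for one label.

    pairs: iterable of (reference_label, extracted_label) tuples.
    """
    pairs = list(pairs)
    tp = sum(1 for ref, ext in pairs if ext == positive_label and ref == positive_label)
    fp = sum(1 for ref, ext in pairs if ext == positive_label and ref != positive_label)
    fn = sum(1 for ref, ext in pairs if ext != positive_label and ref == positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
example = [
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "positive"),
    ("negative", "negative"),
    ("positive", "positive"),
]
print(precision_recall_f1(example, "positive"))
```

Because the reference here is itself a single manual entry, a disagreement counted as a false positive or false negative may reflect an error on either side.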

Conclusion: AI technologies like C-LAB^® show significant potential in creating accessible and actionable digital factories from complex pathology data, supporting the precision diagnosis and treatment of diseases such as breast cancer.

Keywords: Cancer disease; artificial intelligence general; biomarker; genetics medicine; machine learning general; oncology medicine.

Conflict of interest statement

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: FLB, CM, TP, OK, JSF, JR, MC, and FB declare no conflict of interest. CG, YN, YE, JM, JC, and MFP declare a potential conflict of interest as they were involved in both the development of the C-LAB^® platform and the authorship of this study, as consultants or employees of Connect by Circular-Lab. However, every effort was made to ensure the integrity and objectivity of the research, including adherence to rigorous evaluation protocols. Note that none of these people had access to the ESME data and that the evaluation of the algorithm's performance was carried out by FLB, who has no connection with Connect by Circular-Lab.

Figures

**Figure 1.**
Narrative content and OCR errors in reports. This figure showcases examples of report formats and wording variations in breast biomarker results sourced from various laboratories, anonymized as Labs A, B, C, and D. These examples highlight the substantial diversity in structure and terminology used in pathology reports, particularly in biomarker data presentation, which poses significant challenges for standardized data description and analysis. The narrative data ICO accessed was processed through OCR technology, a tool designed to extract text based on character shape recognition. While OCR is effective for basic text extraction, it has inherent limitations in accurately distinguishing visually similar characters. For instance, the OCR software struggled to differentiate between “o” and “0,” or “2” and “Z,” resulting in transcription errors such as “HER2” being misinterpreted as “HERZ.” Similarly, nuanced distinctions between accented characters, such as “à” and “a,” were occasionally missed, leading to inaccuracies. Another frequent error involved symbols, where OCR occasionally rendered “%” as “#.” The English translation is provided to improve clarity.
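The character confusions listed in this caption can be partly repaired before extraction. The following is a minimal sketch covering only three of them ("HERZ", "o" for "0" inside numbers, and "#" for "%" after a number); the function name and the rules are assumptions for illustration, not the correction logic used by C-LAB^®, and accent confusions such as "à" versus "a" are not handled.

```python
import re


def normalize_ocr(text):
    """Repair a few OCR confusions of the kind named in the caption."""
    # "2" misread as "Z" in the biomarker name.
    text = re.sub(r"\bHERZ\b", "HER2", text)
    # Letter "o"/"O" inside a token made only of digits and o's (at least
    # one digit required), e.g. "9o" or "1oo": replace each o with zero.
    text = re.sub(
        r"\b(?=[0-9oO]*\d)[0-9oO]+\b",
        lambda m: m.group(0).replace("o", "0").replace("O", "0"),
        text,
    )
    # "#" directly after a number (optionally one space) read as "%".
    text = re.sub(r"(?<=\d)\s?#", "%", text)
    return text


print(normalize_ocr("HERZ score 3+, ER 9o #, PR 1oo#"))
```

The zero substitution runs before the percent substitution so that a value such as "9o#" is first turned into "90#" and then into "90%".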

**Figure 2.**
Venn diagram of study cohort and report selection.

**Figure 3.**
Distribution of matching temporal distances between C-LAB^® results and ESME results. Positive distances mean that the date of the result found by C-LAB^® is earlier than the date of the result present in ESME. A: HER2; B: ISH; C: ER (positive, negative); D: ER (percentage); E: PR (positive, negative); F: PR (percentage).
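The sign convention in this caption can be written out as a small sketch. It assumes the distance is counted in days and that an ESME result is paired with the C-LAB^® result closest in time; neither the unit nor the matching rule is stated in the caption, and both function names are hypothetical.

```python
from datetime import date


def temporal_distance_days(clab_date, esme_date):
    """Positive when the C-LAB result is dated earlier than the ESME result."""
    return (esme_date - clab_date).days


def closest_match(esme_date, clab_dates):
    """Pick the C-LAB result date nearest in time to the ESME result date."""
    return min(clab_dates, key=lambda d: abs(temporal_distance_days(d, esme_date)))


# Illustrative dates only.
esme = date(2020, 3, 11)
candidates = [date(2020, 1, 1), date(2020, 3, 1), date(2020, 6, 1)]
best = closest_match(esme, candidates)
print(best, temporal_distance_days(best, esme))
```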

**Figure 4.**
Conventional versus digital pathology workflow and C-LAB^® positioning. The current digitalization of the pathology workflow shows a gap in addressing the reporting and monitoring of testing data.
