Constructing multicancer risk cohorts using national data from medical helplines and secondary care

Hadi Modarres¹, Dimitris Pipinis², Divya Balasubramanian³, Rupert Chaplin³, Scarlett Kynoch³, Achut Manandhar³, Gursimran Thandi⁴, Rebecca Cavilla⁵, Emma Hirst-Williams⁵

Affiliations

¹ NHS England, Data Science and Applied AI team, London, England, UK. hadi.modarres@nhs.net.
² NHS England, Strategic Analysis team, London, England, UK. dimitris.pipinis@nhs.net.
³ NHS England, Data Science and Applied AI team, London, England, UK.
⁴ NHS England, Strategic Analysis team, London, England, UK.
⁵ NHS England, NHS Cancer Programme, London, England, UK.

PMID: 40866501
PMCID: PMC12391443
DOI: 10.1038/s41746-025-01855-0

Hadi Modarres et al. NPJ Digit Med. 2025 Aug 27;8(1):551. doi: 10.1038/s41746-025-01855-0.

Abstract

Identifying cohorts at higher risk of cancer can enable earlier diagnosis of the disease, which significantly improves patient outcomes. In this study, we select nine cancer sites that have a high incidence of late-stage diagnosis or worsening survival rates, and for which there is currently no national screening programme. We use data from medical helplines (NHS 111) and secondary care appointments from all hospitals in England. We show that features based on information captured in NHS 111 calls are among the most influential in driving predictions of a future cancer diagnosis. Our predictive models exhibit good discrimination, ranging from 0.69 (ovarian cancer) to 0.83 (oesophageal cancer). We present an approach to constructing cohorts at higher risk of cancer that is based on feature importance and considers possible bias in model results. This approach is flexible and can be tailored to data availability and to the group the intervention targets (i.e. symptomatic or asymptomatic patients).

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Data flow diagram.**
The population dataset from Bridges to Health (all those registered to a GP practice in England) is filtered to the age range of 40–74, and those with current and previous cancer diagnoses (relative to August 2021) are removed. This results in an analysis dataset of 23.6M patients, who are then split into training, validation and testing datasets.

**Fig. 2. Patient pathway.**
The figure shows a stylised patient pathway with various types of events recorded during the 5 (history) + 1 (predictive window) years we observe the patients.

**Fig. 3. Model feature importance from SHAP and XGBoost average gain.**
(a) SHAP values for the top 20 features, ordered by model gain value. Red (blue) indicates high (low) values of the specific feature. Dots to the right (left) of the vertical line where the SHAP value is zero indicate that the feature increases (decreases) the predicted probability of a cancer diagnosis in year 6. Features written in bold with an asterisk were also among the top 20 in feature importance based on mean absolute SHAP value. (b) XGBoost model average gain value. (c) Mean absolute SHAP value for the top 20 features.

**Fig. 4. Illustration of Method A—the risk-based cohort construction.**
Cohorts of different sizes are created by applying thresholds to model risk scores. The cancer incidence in such cohorts is calculated and compared to the baseline cancer rate to generate the lift value.
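
The lift calculation described in this caption can be illustrated with a minimal sketch. It assumes that a cohort is formed by taking the top fraction of patients ranked by model risk score; the function name `lift_at_fraction` and the toy scores and outcomes are hypothetical and do not come from the study.

```python
from typing import Sequence


def lift_at_fraction(
    scores: Sequence[float], outcomes: Sequence[int], fraction: float
) -> float:
    """Lift of the cohort made of the top `fraction` of patients by risk score.

    Lift = cancer incidence in the cohort / baseline incidence in everyone.
    This is an illustrative helper, not the authors' implementation.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(outcomes) or len(scores) == 0:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    baseline = sum(outcomes) / len(outcomes)
    if baseline == 0:
        raise ValueError("baseline incidence is zero, lift is undefined")

    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)

    # Applying a score threshold is equivalent to keeping the top-k patients.
    cohort_size = max(1, round(fraction * len(ranked)))
    cohort_incidence = sum(o for _, o in ranked[:cohort_size]) / cohort_size
    return cohort_incidence / baseline


# Toy data (assumed): 10 patients, 3 of whom are diagnosed in the window.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_outcomes = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

for frac in (0.2, 0.5, 1.0):
    print(f"top {frac:.0%}: lift = {lift_at_fraction(toy_scores, toy_outcomes, frac):.2f}")
```

By construction, a cohort covering the whole population has a lift of 1, and smaller cohorts have a higher lift when the model ranks true cases near the top.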

**Fig. 5. Variation in lift value (ratio of cancer incidence in cohort to baseline incidence) with increasing cohort size.**
Lift curve values for cohorts covering (a) 0.5% to 100% of the population and (b) 0.5% to 22% of the population. Predictions were obtained from an XGBoost model trained on all variables and from another trained only on demographic variables, to show the improved accuracy in identifying high-risk groups when additional variables, such as comorbidities and symptoms related to 111 calls, are added.

**Fig. 6. Illustration of Method B—Feature based cohort construction.**
The most informative features from the trained model are used to define high-risk cohorts by finding the pairs of features that yield the cohorts with the highest incidence. These decision rules can be applied either population-wide or to specific demographic sub-groups.
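
One way to picture the feature-pair search in this caption is the sketch below. It assumes binary feature flags and ranks every pair by the cancer incidence among patients who have both flags set. The feature names (`helpline_call`, `anaemia`, `weight_loss`), the `min_size` cut-off and the toy records are all hypothetical; the study's actual features and selection rules are not given in this record.

```python
from itertools import combinations
from typing import Dict, List, Tuple

PairResult = Tuple[Tuple[str, str], int, float]


def rank_feature_pairs(
    rows: List[Dict[str, int]],
    features: List[str],
    outcome_key: str = "cancer",
    min_size: int = 1,
) -> List[PairResult]:
    """Rank pairs of binary features by incidence in the cohort they define.

    Returns (pair, cohort size, incidence) tuples, highest incidence first.
    Illustrative only; not the authors' implementation.
    """
    results: List[PairResult] = []
    for a, b in combinations(features, 2):
        # Decision rule: a patient joins the cohort if both flags are set.
        cohort = [r for r in rows if r[a] and r[b]]
        if len(cohort) < min_size:
            continue  # skip cohorts too small to give a stable estimate
        incidence = sum(r[outcome_key] for r in cohort) / len(cohort)
        results.append(((a, b), len(cohort), incidence))
    results.sort(key=lambda item: item[2], reverse=True)
    return results


# Toy records (assumed); 1 means the flag or outcome is present.
toy_rows = [
    {"helpline_call": 1, "anaemia": 1, "weight_loss": 0, "cancer": 1},
    {"helpline_call": 1, "anaemia": 1, "weight_loss": 1, "cancer": 1},
    {"helpline_call": 1, "anaemia": 0, "weight_loss": 1, "cancer": 0},
    {"helpline_call": 0, "anaemia": 1, "weight_loss": 1, "cancer": 0},
    {"helpline_call": 1, "anaemia": 0, "weight_loss": 1, "cancer": 1},
    {"helpline_call": 0, "anaemia": 0, "weight_loss": 0, "cancer": 0},
]
toy_features = ["helpline_call", "anaemia", "weight_loss"]

ranked_pairs = rank_feature_pairs(toy_rows, toy_features)
for pair, size, incidence in ranked_pairs:
    print(pair, size, round(incidence, 2))
```

To apply the rules to a specific demographic sub-group rather than population-wide, the same function could be called on a pre-filtered list of rows.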
