DEcancer: Machine learning framework tailored to liquid biopsy based cancer detection and biomarker signature selection

Affiliations

¹ Oxford Cancer Analytics Ltd, 696, BioEscalator, Innovation Building, Old Road Campus, Roosevelt Drive, Headington, Oxford, UK.
² Princess Margaret Cancer Centre, University Health Network, Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
³ Target Discovery Institute, Center for Medicines Discovery, Nuffield Department of Medicine, University of Oxford, Roosevelt Drive, Oxford, OX3 7FZ, UK.

PMID: 37168566
PMCID: PMC10165183
DOI: 10.1016/j.isci.2023.106610

Andreas Halner et al. iScience. 2023.

iScience. 2023 Apr 11;26(5):106610.

doi: 10.1016/j.isci.2023.106610. eCollection 2023 May 19.

Abstract

Cancer is a leading cause of mortality worldwide. Over 50% of cancers are diagnosed late, rendering many treatments ineffective. Existing liquid biopsy studies demonstrate a minimally invasive and inexpensive approach for disease detection but lack parsimonious biomarker selection, exhibit poor cancer detection performance, and lack appropriate validation and testing. We established a tailored machine learning pipeline, DEcancer, for liquid biopsy analysis that addresses these limitations and improves performance. In a test set from a published cohort of 1,005 patients including 8 cancer types and 812 cancer-free individuals, DEcancer increased stage 1 cancer detection sensitivity across cancer types from 48% to 90%. In addition, in a test set from a high-dimensional proteomics dataset of 61 lung cancer patients and 80 cancer-free individuals, DEcancer's performance using a panel of 14 to 43 proteins was comparable to that obtained with the 1,000 original proteins. DEcancer is a promising tool which may facilitate improved cancer detection and management.

Keywords: Cancer; Diagnostics; Machine learning.

Conflict of interest statement

A.H. is a founder, an employee, and shareholder of Oxford Cancer Analytics Ltd. L.H. is an employee of Oxford Cancer Analytics Ltd. Z.L. is an employee of Oxford Cancer Analytics Ltd. F.P. was an employee and is a shareholder of Oxford Cancer Analytics Ltd. D.S. is an employee and shareholder of Oxford Cancer Analytics Ltd. E.M. is an employee of Oxford Cancer Analytics Ltd. G.L. declares no competing interests. B.K. is a shareholder and member of the Scientific Advisory Board of Oxford Cancer Analytics Ltd. J.S. is an employee of Oxford Cancer Analytics Ltd. P.J.L. is a founder, an employee and shareholder of Oxford Cancer Analytics Ltd. A.H. and P.J.L. are co-inventors of the patent “A METHOD AND SYSTEM DETECTING A HEALTH ABNORMALITY IN A LIQUID BIOPSY SAMPLE” (International Patent Application Number: PCT/EP2022/075710).

Figures

**Figure 1**
Test set receiver operating characteristic (ROC) curve and area under the curve (AUC) showing the performance of the optimized classifier model for distinguishing 201 cancer patients from 163 cancer-free individuals of the Cohen et al. dataset. DEcancer_P uses proteins only, and DEcancer_PDE includes all 39 proteins plus DNA-based and epidemiological factors. (Blue) Test set ROC curve of the optimal DEcancer_PDE model for all cancer versus cancer-free. An AUC of 1.00 is achieved. At a fixed specificity of 99%, DEcancer achieves a sensitivity of 99%. (Orange) Test set ROC curve of the 28-protein DEcancer_P model for all cancer versus cancer-free. An AUC of 1.00 is achieved. At a fixed specificity of 99%, DEcancer achieves a sensitivity of 93%.
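
The caption above reports sensitivity at a fixed specificity of 99%. As a rough illustration of how such a figure can be read off a set of classifier scores, the following minimal sketch picks the lowest threshold that keeps the required fraction of cancer-free samples below it and then counts the cancer samples above it. The function name, the strict "greater than threshold" calling rule and the toy scores are assumptions made for illustration; they are not taken from the DEcancer implementation.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.99):
    """Return (threshold, sensitivity, specificity) at a fixed specificity.

    scores: predicted cancer scores (higher means more likely cancer).
    labels: 1 for cancer, 0 for cancer-free.
    A sample is called positive when its score is strictly above the threshold
    (an illustrative convention, not necessarily the one used in the paper).
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one sample of each class")

    # Walk candidate thresholds upward and stop at the first one that
    # classifies enough cancer-free samples as negative.
    for threshold in sorted(set(negatives)):
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in positives) / len(positives)
            return threshold, sensitivity, specificity
    raise ValueError("target_specificity must be at most 1.0")


# Toy data (hypothetical): four cancer-free and four cancer samples.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.9, 0.05]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(sensitivity_at_specificity(toy_scores, toy_labels, target_specificity=0.75))
```

With only a few hundred cancer-free test samples, a 99% specificity target allows very few false positives, so the resulting sensitivity can shift noticeably with a single sample.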

**Figure 2**
Bar chart comparing the cancer detection sensitivity of DEcancer_PDE (using protein, DNA and epidemiological data) and DEcancer_P (using proteins alone) with that of Cohen et al. for stages 1, 2 and 3 of 8 cancer types. Specificity was held at 99%.

**Figure 3**
Test set receiver operating characteristic (ROC) curve and area under the curve (AUC) showing the performance of the best optimized classifier model (selected based on validation results) for detecting a target cancer from the Cohen et al. dataset. The test set results represent a generalizable estimate of performance for DEcancer’s all cancer pipeline. DEcancer_P uses proteins only, and DEcancer_PDE includes all 39 proteins plus DNA-based and epidemiological factors. (Blue) The DEcancer_PDE approach for target cancer versus cancer-free test set ROC curve showing performance of optimal model. (Orange) The DEcancer_P approach for the target cancer versus cancer-free test set ROC curve. (Green) The DEcancer_PDE approach for the target cancer versus other cancers test set ROC curve showing performance of optimal model. (Red) The DEcancer_P approach for the target cancer versus other cancers test set ROC curve. (Purple) The DEcancer_PDE approach for the target cancer versus other cancers or cancer-free test set ROC curve showing performance of optimal model. (Brown) The DEcancer_P approach for the target cancer versus other cancers or cancer-free test set ROC curve.

(A) Target cancer: lung. (Blue) An AUC of 1.00 is achieved with the DEcancer_PDE model. (Orange) An AUC of 1.00 is achieved with a 12-protein DEcancer_P model. (Green) An AUC of 0.95 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.95 is achieved with a 39-protein DEcancer_P model. (Purple) An AUC of 0.96 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.95 is achieved with a 22-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 95.24% and 99.39%, respectively.

(B) Target cancer: breast. (Blue) An AUC of 1.00 is achieved with the DEcancer_PDE model. (Orange) An AUC of 1.00 is achieved with a 27-protein DEcancer_P model. (Green) An AUC of 0.93 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.88 is achieved with a 29-protein DEcancer_P model. (Purple) An AUC of 0.97 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.94 is achieved with a 26-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 97.62% and 100.00%, respectively.

(C) Target cancer: colorectal. (Blue) An AUC of 1.00 is achieved with the DEcancer_PDE model. (Orange) An AUC of 0.99 is achieved with a 22-protein DEcancer_P model. (Green) An AUC of 0.95 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.94 is achieved with a 22-protein DEcancer_P model. (Purple) An AUC of 0.97 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.96 is achieved with a 35-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 94.87% and 99.38%, respectively.

(D) Target cancer: esophageal. (Blue) An AUC of 0.98 is achieved with the DEcancer_PDE model. (Orange) An AUC of 0.99 is achieved with an 8-protein DEcancer_P model. (Green) An AUC of 0.80 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.71 is achieved with a 23-protein DEcancer_P model. (Purple) An AUC of 0.83 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.81 is achieved with a 15-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 88.89% and 94.48%, respectively.

(E) Target cancer: liver. (Blue) An AUC of 1.00 is achieved with the DEcancer_PDE model. (Orange) An AUC of 1.00 is achieved with a 23-protein DEcancer_P model. (Green) An AUC of 0.93 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.92 is achieved with an 8-protein DEcancer_P model. (Purple) An AUC of 0.96 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.96 is achieved with a 19-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 88.89% and 98.77%, respectively.

(F) Target cancer: ovarian. (Blue) An AUC of 0.98 is achieved with the DEcancer_PDE model. (Orange) An AUC of 0.98 is achieved with a 15-protein DEcancer_P model. (Green) An AUC of 0.99 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.99 is achieved with a 13-protein DEcancer_P model. (Purple) An AUC of 0.99 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.98 is achieved with a 12-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 90.91% and 95.47%, respectively.

(G) Target cancer: pancreatic. (Blue) An AUC of 0.98 is achieved with the DEcancer_PDE model. (Orange) An AUC of 0.98 is achieved with an 8-protein DEcancer_P model. (Green) An AUC of 0.92 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.91 is achieved with a 14-protein DEcancer_P model. (Purple) An AUC of 0.97 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.97 is achieved with a 9-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 84.21% and 98.77%, respectively.

(H) Target cancer: gastric. (Blue) An AUC of 0.97 is achieved with the DEcancer_PDE model. (Orange) An AUC of 0.93 is achieved with a 17-protein DEcancer_P model. (Green) An AUC of 0.88 is achieved with the DEcancer_PDE model. (Red) An AUC of 0.85 is achieved with a 19-protein DEcancer_P model. (Purple) An AUC of 0.88 is achieved with the DEcancer_PDE model. (Brown) An AUC of 0.88 is achieved with a 25-protein DEcancer_P model. The optimal sensitivity and specificity corresponding to the top left corner of the ROC curve are 85.71% and 93.21%, respectively.
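
Each panel above quotes an "optimal" sensitivity and specificity at the top left corner of the ROC curve. The caption does not say which criterion defines that point, so the sketch below assumes one common choice: the operating point with the smallest Euclidean distance to perfect classification (sensitivity 1, specificity 1). The function name and toy data are hypothetical, and other criteria (for example Youden's J) could select a different point.

```python
import math


def closest_to_top_left(scores, labels):
    """Return (threshold, sensitivity, specificity) of the ROC point nearest
    to the top left corner, i.e. to sensitivity = 1 and specificity = 1.

    A sample is called positive when its score is at or above the threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best = None
    for threshold in sorted(set(scores)):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        # Distance from this operating point to the ideal corner.
        distance = math.hypot(1.0 - sensitivity, 1.0 - specificity)
        if best is None or distance < best[0]:
            best = (distance, threshold, sensitivity, specificity)
    return best[1:]


# Toy data (hypothetical): four cancer-free and four target-cancer samples.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.9, 0.05]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(closest_to_top_left(toy_scores, toy_labels))
```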

**Figure 4**
Control receiver operating characteristic (ROC) curve and area under the curve (AUC) showing the performance of an optimized classifier model with the full protein set used by Blume et al.’s final model to distinguish cancer-free individuals from NSCLC patients. For each spion, the full protein set detected by that spion but not detected with the depleted plasma approach is used, as per the methodology in Blume et al. The final model was retrained on all 110 training and validation samples, and the retrained classifier was then assessed on the 31 test set samples. An ROC curve is shown for each of the spions of Blume et al., and the number of proteins included in the model is indicated. (Blue) SP003 with 711 proteins (AUC 0.92). (Orange) SP006 with 532 proteins (AUC 0.90). (Green) SP007 with 421 proteins (AUC 0.95). (Red) SP333 with 293 proteins (AUC 0.85). (Purple) SP339 with 416 proteins (AUC 0.91).

**Figure 5**
Receiver operating characteristic (ROC) curves and area under the curve (AUC) showing the 31-individual test set performance of the optimized random forest classifier model for each spion or depleted plasma data of Blume et al. For each spion or depleted plasma, the optimal subset of proteins used by the final classifier model to distinguish between cancer-free and non-lung cancer samples is indicated. (Blue) Depleted plasma, in which the optimal protein set included 30 proteins (AUC 0.97). (Orange) SP003, in which the optimal protein set included 32 proteins (AUC 0.93). (Green) SP006, in which the optimal protein set included 26 proteins (AUC 0.81). (Red) SP007, in which the optimal protein set included 14 proteins (AUC 0.93). (Purple) SP333, in which the optimal protein set included 36 proteins (AUC 0.92). (Brown) SP339, in which the optimal protein set included 43 proteins (AUC 0.92).
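
The AUC values quoted in these captions summarize each ROC curve as a single number. One standard way to compute an AUC on a small test set, sketched below, is the pairwise (Mann-Whitney) formulation: the fraction of cancer versus cancer-free pairs in which the cancer sample receives the higher score, with ties counted as half. The function name and toy data are hypothetical; the paper's own code may compute the AUC differently, for example by trapezoidal integration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative one (ties count as 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): four cancer-free and four cancer samples.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.9, 0.05]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(roc_auc(toy_scores, toy_labels))
```

The quadratic loop is adequate for a test set of 31 samples; a rank-based version would be preferable for large cohorts.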

**Figure 6**
Flowchart outlining the DEcancer early cancer detection pipeline. DEcancer was applied to the Cohen et al. dataset of cancer-free individuals and patients with one of 8 types of cancer, as well as to the Blume et al. dataset of cancer-free individuals and lung cancer patients. A test set of approximately 20% is first separated out. For each classification task, the remaining data are used to form training and validation folds as part of a 200-fold Monte Carlo cross-validation scheme. Various data augmentation approaches are applied to the training data. Feature selection and hyperparameter optimization of the classifier model are carried out based on performance on the validation folds. An independent t-test is used to identify the smallest subset of variables for which classifier performance is not statistically significantly lower than that of the classifier using the best-performing feature set. The best data processing framework, classifier models and feature set are then selected. Subsequently, retraining is performed on all training and validation data combined, and the models are assessed on the test set samples.
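
Two steps of the pipeline described above lend themselves to a short sketch: drawing repeated random training and validation splits (Monte Carlo cross-validation) and picking the smallest feature set whose validation performance is not significantly below the best. The code below is an illustration under stated assumptions, not the authors' implementation: the 20% validation fraction, the 0.05 significance level, all function names and the toy scores are hypothetical, and the test is a one-sided Welch-type statistic with a normal approximation (reasonable with 200 splits per group) standing in for the independent t-test named in the caption.

```python
import random
import statistics


def monte_carlo_splits(n_samples, n_splits=200, validation_fraction=0.2, seed=0):
    """Yield (train_indices, validation_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_val = max(1, round(n_samples * validation_fraction))
    indices = list(range(n_samples))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_val:], shuffled[:n_val]


def not_significantly_lower(candidate, best, alpha=0.05):
    """True if the candidate's mean score is not significantly below the best's.

    Uses a one-sided Welch statistic with a normal approximation (assumption).
    """
    mean_diff = statistics.mean(best) - statistics.mean(candidate)
    std_err = (statistics.variance(best) / len(best)
               + statistics.variance(candidate) / len(candidate)) ** 0.5
    if std_err == 0.0:
        return mean_diff <= 0.0
    p_value = 1.0 - statistics.NormalDist().cdf(mean_diff / std_err)
    return p_value >= alpha


def smallest_adequate_feature_set(scores_by_size, alpha=0.05):
    """scores_by_size maps a feature count to its per-split validation scores."""
    best_size = max(scores_by_size, key=lambda k: statistics.mean(scores_by_size[k]))
    for size in sorted(scores_by_size):
        if not_significantly_lower(scores_by_size[size], scores_by_size[best_size], alpha):
            return size
    return best_size


# Toy validation scores (hypothetical) for three candidate panel sizes.
toy_scores_by_size = {
    5: [0.70, 0.72, 0.71, 0.69],
    14: [0.92, 0.93, 0.91, 0.94],
    40: [0.93, 0.94, 0.92, 0.95],
}
print(smallest_adequate_feature_set(toy_scores_by_size))
```

In a full pipeline, each split from `monte_carlo_splits` would be used to fit a classifier on the training indices and score it on the validation indices, producing the per-split score lists that `smallest_adequate_feature_set` consumes.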
