Hum Genomics. 2023 Jun 16;17(1):53.

doi: 10.1186/s40246-023-00482-8.

Evaluation of a genetic risk score computed using human chromosomal-scale length variation to predict breast cancer

Charmeine Ko¹, James P Brody²

Affiliations

¹ Department of Biomedical Engineering, University of California, Irvine, USA.
² Department of Biomedical Engineering, University of California, Irvine, USA. jpbrody@uci.edu.

PMID: 37328908
PMCID: PMC10273758
DOI: 10.1186/s40246-023-00482-8

Charmeine Ko et al. Hum Genomics. 2023.

Abstract

Introduction: The ability to accurately predict whether a woman will develop breast cancer later in her life should reduce the number of breast cancer deaths. Several predictive models for breast cancer exist, based on family history, BRCA status, and SNP analysis. The best of these models has an accuracy (area under the receiver operating characteristic curve, AUC) of about 0.65. We have developed computational methods to characterize a genome by a small set of numbers that represent the lengths of segments of the chromosomes, called chromosomal-scale length variation (CSLV).

Methods: We built machine learning models to differentiate between women who had breast cancer and women who did not, based on their CSLV characterization. We applied this procedure to two datasets: the UK Biobank (1534 women with breast cancer and 4391 women without) and The Cancer Genome Atlas (TCGA; 874 women with breast cancer and 3381 without).

Results: We found a machine learning model that could predict breast cancer with an AUC of 0.836 (95% CI 0.830, 0.843) in the UK Biobank data. Using a similar approach with the TCGA data, we obtained a model with an AUC of 0.704 (95% CI 0.702, 0.706). Variable importance analysis indicated that no single chromosomal region was responsible for a significant fraction of the model results.
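The AUC values reported above can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The following is a minimal sketch of that rank-based definition; the function name `roc_auc` and the toy labels and scores are invented for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data, invented for illustration (not from the study).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to scores that carry no information about case status, and 1.0 to scores that separate cases from controls perfectly.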

Conclusion: In this retrospective study, chromosomal-scale length variation could effectively predict whether or not a woman enrolled in the UK Biobank study developed breast cancer.

Keywords: Breast cancer; Copy number variation; Germline; Machine learning; TCGA; UK biobank; h2o.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
We identified 874 women with breast cancer in the TCGA dataset and 3381 women as controls (women who had another form of cancer but not breast cancer). We characterized the germline genetics of each of these women with 22 numbers, each representing the average copy number, or “length”, of a chromosome. Based on this genetic characterization, we found a machine learning algorithm that can distinguish women with breast cancer from other women in the TCGA dataset with an area under the curve (AUC) of 0.72. This figure depicts the receiver operating characteristic curve.
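As a rough illustration of reducing a genome to 22 numbers, the sketch below averages a copy-number value over each autosome. The input layout (tuples of chromosome, start, end, value), the length weighting, and the function name `chromosome_length_scores` are all assumptions made for this example; the caption only states that each number is the average copy number of a chromosome.

```python
from collections import defaultdict


def chromosome_length_scores(segments, chromosomes=range(1, 23)):
    """Return one length-weighted mean copy-number value per chromosome.

    `segments` is an iterable of (chromosome, start, end, value) tuples.
    This layout is a hypothetical input format, not the study's file format.
    """
    weighted_sum = defaultdict(float)
    total_length = defaultdict(int)
    for chrom, start, end, value in segments:
        length = end - start
        if length <= 0:
            raise ValueError(f"segment on chromosome {chrom} has non-positive length")
        weighted_sum[chrom] += value * length
        total_length[chrom] += length
    # Chromosomes with no segments are reported as None rather than guessed.
    return [
        weighted_sum[c] / total_length[c] if total_length[c] else None
        for c in chromosomes
    ]


# Toy segments, invented for illustration (not from TCGA).
toy_segments = [
    (1, 0, 100, 0.0),
    (1, 100, 300, 0.3),
    (2, 0, 50, -0.2),
]
features = chromosome_length_scores(toy_segments)
```

Each woman would then be represented by one such 22-element vector, which serves as the input to the classifier.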

**Fig. 2**
The receiver operating characteristic curves for predicting breast cancer using chromosomal-scale length variation with machine learning algorithms. We used a subset of the UK Biobank dataset consisting of 5925 women (1534 who had been diagnosed with breast cancer and 4391 who had never been diagnosed with any form of cancer). We partitioned this group into a training set and a test set. We used the training set to train algorithms to recognize differences in chromosomal-scale length variation data between the women with breast cancer and those without. We then tested each trained algorithm on the test set. We repeated this process multiple times with different training/test set partitions and found that the AUC was 0.836 with a 95% confidence interval of 0.830 to 0.843.
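The repeated train/test procedure described in this caption can be sketched as follows. Several things here are assumptions rather than the authors' method: the stand-in classifier (a projection onto the difference of class means), the number of repeats, the test fraction, the synthetic data, and the normal-approximation interval on the mean AUC. The caption does not say how the reported confidence interval was computed.

```python
import random
import statistics


def auc(labels, scores):
    """Rank-based AUC; a tie between a case and a control counts as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_direction(rows, labels):
    """Stand-in model: the direction from the control mean to the case mean."""
    n_features = len(rows[0])

    def mean_vector(target):
        members = [r for r, y in zip(rows, labels) if y == target]
        return [sum(r[j] for r in members) / len(members) for j in range(n_features)]

    case_mean, control_mean = mean_vector(1), mean_vector(0)
    return [a - b for a, b in zip(case_mean, control_mean)]


def repeated_split_auc(rows, labels, n_repeats=20, test_fraction=0.25, seed=0):
    """Mean test-set AUC over random partitions, with an approximate 95% CI."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        n_test = int(len(indices) * test_fraction)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        weights = fit_centroid_direction(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        scores = [sum(w * x for w, x in zip(weights, rows[i])) for i in test_idx]
        aucs.append(auc([labels[i] for i in test_idx], scores))
    mean_auc = statistics.mean(aucs)
    # Normal approximation; the splits overlap, so this interval is only rough.
    half_width = 1.96 * statistics.stdev(aucs) / len(aucs) ** 0.5
    return mean_auc, (mean_auc - half_width, mean_auc + half_width)


# Synthetic data, invented for illustration: cases are shifted by 1.0 per feature.
data_rng = random.Random(1)
labels = [1] * 100 + [0] * 100
rows = [[data_rng.gauss(1.0 if y else 0.0, 1.0) for _ in range(4)] for y in labels]
mean_auc, (ci_low, ci_high) = repeated_split_auc(rows, labels)
```

Because the model is refit on each training partition and scored only on the held-out partition, the averaged AUC estimates performance on unseen individuals drawn from the same dataset.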

**Fig. 3**
This Shapley additive explanations plot (known as a SHAP plot) provides interpretability to the machine learning model. This SHAP plot is from the UK Biobank machine learning model, shown in Fig. 2. In this model, we used the chromosomal-scale length variation on four segments from each chromosome, numbered from 0 to 3. The normalized value represents the value of the parameters. For instance, the red points (closer to 1.0) represent the people with the “longest” associated chromosome, while the blue points (closer to 0) represent people with the shortest associated chromosome. This SHAP plot indicates that the top contribution to the model is from Chromosome 22, segment 3 (the top label on the left axis). However, the SHAP contribution plot also indicates that many different chromosomal regions contribute equally to the model. No one segment is responsible for a majority of the predictive value of the model. Thus, one should not ascribe any particular significance to the third segment of Chromosome 22.
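The Shapley values behind a SHAP plot attribute one prediction to individual features. The sketch below computes them exactly for a tiny toy model by enumerating feature subsets. It does not use the SHAP library or the authors' tooling; the toy model, the three-feature input, and the choice to replace absent features with a fixed baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets.

    Features outside a subset are set to `baseline`; this is one common
    convention and an assumption here. Cost grows as 2**n_features.
    """
    n = len(x)

    def value(subset):
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_j = value(set(subset) | {j})
                without_j = value(set(subset))
                phi[j] += weight * (with_j - without_j)
    return phi


def toy_model(v):
    """Hypothetical risk score over three normalized segment lengths."""
    return 0.5 * v[0] - 0.2 * v[1] + 0.1 * v[0] * v[2]


x = [1.0, 0.0, 1.0]         # one person's normalized values (0 = shortest, 1 = longest)
baseline = [0.5, 0.5, 0.5]  # reference values standing in for "feature absent"
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is why a SHAP plot with many similarly sized contributions indicates that no single segment dominates the model.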

Cited by

A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores.
Fatapour Y, Brody JP. BioData Min. 2025 Jun 19;18(1):44. doi: 10.1186/s13040-025-00459-4. PMID: 40537821.
Improved breast cancer risk prediction using chromosomal-scale length variation.
Fatapour Y, Brody JP. Hum Genomics. 2025 Jun 11;19(1):65. doi: 10.1186/s40246-025-00776-z. PMID: 40500782.
A contemporary review of breast cancer risk factors and the role of artificial intelligence.
Nicolis O, De Los Angeles D, Taramasco C. Front Oncol. 2024 Apr 18;14:1356014. doi: 10.3389/fonc.2024.1356014. eCollection 2024. PMID: 38699635. Review.
