Hum Genomics. 2023 Jun 16;17(1):53.
doi: 10.1186/s40246-023-00482-8.

Evaluation of a genetic risk score computed using human chromosomal-scale length variation to predict breast cancer

Charmeine Ko et al. Hum Genomics. 2023.

Abstract

Introduction: The ability to accurately predict whether a woman will develop breast cancer later in her life should reduce the number of breast cancer deaths. Different predictive models exist for breast cancer based on family history, BRCA status, and SNP analysis. The best of these models has a predictive accuracy, measured as the area under the receiver operating characteristic curve (AUC), of about 0.65. We have developed computational methods to characterize a genome by a small set of numbers that represent the lengths of segments of the chromosomes, called chromosomal-scale length variation (CSLV).

Methods: We built machine learning models to differentiate between women who had breast cancer and women who did not, based on their CSLV characterization. We applied this procedure to two different datasets: the UK Biobank (1534 women with breast cancer and 4391 women without) and The Cancer Genome Atlas (TCGA; 874 women with breast cancer and 3381 without).

Results: We found a machine learning model that could predict breast cancer with an AUC of 0.836 (95% CI 0.830, 0.843) in the UK Biobank data. Using a similar approach with the TCGA data, we obtained a model with an AUC of 0.704 (95% CI 0.702, 0.706). Variable importance analysis indicated that no single chromosomal region was responsible for a significant fraction of the model results.
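The AUC values quoted above can be read as the probability that a randomly chosen woman with breast cancer receives a higher model score than a randomly chosen woman without. The following is a minimal sketch of that pairwise definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the authors' code or data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control.
    scores: model outputs, higher meaning "more likely a case".
    Tied pairs count as half a correct ranking.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Toy example (made-up numbers): three of the four case/control pairs
# are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which 0.65, 0.704 and 0.836 should be compared.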

Conclusion: In this retrospective study, chromosomal-scale length variation could effectively predict whether or not a woman enrolled in the UK Biobank study developed breast cancer.

Keywords: Breast cancer; Copy number variation; Germline; Machine learning; TCGA; UK biobank; h2o.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
We identified 874 women in the TCGA dataset with breast cancer and 3381 women as controls (women who had another form of cancer but not breast cancer). We characterized the germline genetics of each of these women with 22 numbers, each one representing the average copy number of a chromosome, or its “length”. Based on this genetic characterization, we found a machine learning algorithm that can classify women with breast cancer compared to other women in the TCGA dataset with an area under the curve (AUC) of 0.72. This figure depicts the receiver operating characteristic curve.
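One way to picture the "22 numbers" characterization is as a length-weighted average of copy number over the measured segments of each chromosome. The sketch below assumes a hypothetical input layout of `(chromosome, start, end, copy_number)` tuples and assumes length weighting; neither detail is stated in the caption, so treat this as an illustration rather than the authors' pipeline.

```python
from collections import defaultdict


def chromosome_scale_lengths(segments):
    """Length-weighted mean copy number per chromosome.

    segments: iterable of (chromosome, start, end, copy_number) tuples.
    This layout is hypothetical, not the authors' file format.
    """
    weighted_sum = defaultdict(float)
    total_length = defaultdict(float)
    for chrom, start, end, copy_number in segments:
        length = end - start
        if length <= 0:
            raise ValueError(f"segment on {chrom} has non-positive length")
        weighted_sum[chrom] += copy_number * length
        total_length[chrom] += length
    # One number per chromosome: total weighted copy number / covered length.
    return {c: weighted_sum[c] / total_length[c] for c in weighted_sum}


# Made-up segments: chr1 carries a gain over one third of its measured span.
segments = [
    ("chr1", 0, 1_000, 2.0),
    ("chr1", 1_000, 1_500, 3.0),
    ("chr2", 0, 2_000, 2.0),
]
features = chromosome_scale_lengths(segments)
print(features)
```

Applied to the 22 autosomes, a function of this shape would yield one feature per chromosome for each woman, which is the form of input the caption describes.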
Fig. 2
The receiver operating characteristic curves for predicting breast cancer using chromosomal-scale length variation with machine learning algorithms. We used a subset of the UK Biobank dataset consisting of 5925 women (1534 who had been diagnosed with breast cancer and 4391 who had never been diagnosed with any form of cancer). We partitioned this group into a training set and a test set. We used the training set to train algorithms to recognize differences in chromosomal-scale length variation data between the women with breast cancer and those without. We then tested the trained algorithm on the test set. We repeated this process multiple times with different training/test set partitions and found that the AUC was 0.836 with a 95% confidence interval of 0.830 to 0.843.
Fig. 3
This Shapley additive explanations plot (known as a SHAP plot) provides interpretability to the machine learning model. This SHAP plot is from the UK Biobank machine learning model, shown in Fig. 2. In this model, we used the chromosomal-scale length variation on four segments from each chromosome, numbered from 0 to 3. The normalized value represents the value of each parameter, scaled between 0 and 1. For instance, the red points (closer to 1.0) represent the people with the “longest” associated chromosome, while the blue points (closer to 0) represent people with the shortest associated chromosome. This SHAP plot indicates that the top contribution to the model is from Chromosome 22, segment 3 (the top label on the left axis). However, the SHAP contribution plot also indicates that many different chromosomal regions contribute equally to the model. No one segment is responsible for a majority of the predictive value of the model. Thus, one should not ascribe any particular significance to the third segment of Chromosome 22.
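To make the idea behind a SHAP plot concrete, the sketch below computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over all feature orderings. The feature names, weights, and the `toy_model` function are invented for illustration and are not taken from the paper; real SHAP tooling approximates this calculation, because exact enumeration is infeasible for a model with four segments on every chromosome.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values by enumerating every ordering of the features.

    value: function mapping a frozenset of present features to a model output.
    """
    contribution = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            # Marginal gain from adding f to the features already present.
            contribution[f] += value(frozenset(present)) - before
    return {f: total / len(orderings) for f, total in contribution.items()}


# Toy model (made-up numbers): additive weights plus one pairwise interaction.
WEIGHTS = {"chr22_seg3": 0.4, "chr1_seg0": 0.2, "chr8_seg2": 0.2}


def toy_model(present):
    total = sum(WEIGHTS[f] for f in present)
    if {"chr1_seg0", "chr8_seg2"} <= present:
        total += 0.2  # interaction shared equally by the two features
    return total


phi = shapley_values(list(WEIGHTS), toy_model)
shares = {f: v / sum(phi.values()) for f, v in phi.items()}
print(phi)
print(max(shares.values()) < 0.5)
```

In this toy case the top-ranked feature still accounts for less than half of the total attribution, which is the same kind of reading the caption applies to Chromosome 22, segment 3.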
