Front Med (Lausanne). 2025 Jan 29:12:1435428.
doi: 10.3389/fmed.2025.1435428. eCollection 2025.

Diagnostics of lung cancer by fragmentated blood circulating cell-free DNA based on machine learning methods

Ivan O Meshkov et al. Front Med (Lausanne).

Abstract

Introduction: Minimally invasive diagnostics based on liquid biopsy makes early detection of lung cancer (LC) possible. Fragments of blood plasma circulating cell-free DNA (cfDNA) reflect genome and chromatin status and are considered integral cancer biomarkers and biological entities for 'cancer-of-origin' prediction. The aim of this work is to create a method for processing next-generation sequencing (NGS) data and an interpretable binary classification model (CM) that analyzes cfDNA fragmentation features to distinguish healthy subjects from subjects with LC.

Methods: The study included 148 healthy subjects and 138 subjects with LC. cfDNA fractions isolated from blood plasma biospecimens were used for DNA library preparation and NGS on the Illumina NovaSeq 6000 system with a coverage of 100 million reads per sample. Genomic fragmentation was characterized by twelve variables describing the abundance and length distribution of cfDNA fragments within each genomic interval, and by 40 variables based on the values of position-weight matrices describing combinations of 5-bp-long terminal motifs of cfDNA fragments. The classification models of the first phase of machine learning were either logistic regressions with L1- and L2-regularization or probabilistic CMs based on Gaussian processes. The second-phase CM was based on kernel logistic regression.
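As a rough illustration of the kind of fragmentation variables described above, the following minimal Python sketch computes a per-position base-frequency matrix for 5-bp terminal motifs and a simple length summary for a set of fragments. It is not the authors' pipeline: the function names, the toy sequences, the `short_max` cutoff, and the choice to skip motifs containing bases other than A, C, G, or T are all assumptions made for the example.

```python
from collections import Counter

BASES = "ACGT"

def end_motif_matrix(fragments, k=5):
    """Per-position base frequencies of the first k bases of each fragment.

    Hypothetical helper: fragments shorter than k, or whose motif contains
    a base outside A/C/G/T, are skipped (an assumption of this sketch).
    """
    motifs = [seq[:k].upper() for seq in fragments if len(seq) >= k]
    motifs = [m for m in motifs if set(m) <= set(BASES)]
    if not motifs:
        return []
    matrix = []
    for i in range(k):
        column = Counter(m[i] for m in motifs)
        matrix.append({b: column[b] / len(motifs) for b in BASES})
    return matrix

def length_summary(fragments, short_max=150):
    """Fragment count, mean length, and share of fragments <= short_max bp.

    The 150-bp default is an illustrative cutoff, not a value from the study.
    """
    lengths = [len(seq) for seq in fragments]
    if not lengths:
        return {"n": 0, "mean_length": 0.0, "short_share": 0.0}
    short = sum(1 for n in lengths if n <= short_max)
    return {
        "n": len(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "short_share": short / len(lengths),
    }

# Toy fragments (made up for the example).
toy_fragments = ["CCCAGTTTGA", "CCCAGAAAT", "ACGTNACGTA", "TTGACCA", "GG"]
toy_matrix = end_motif_matrix(toy_fragments)
toy_lengths = length_summary(toy_fragments, short_max=8)
```

In a real analysis such values would be computed per genomic interval from aligned reads and then passed to the classifiers; the sketch only shows the shape of the features.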

Results: The final CM distinguishes healthy subjects from subjects with LC with AUC values of 0.872–0.875. The performance of the developed CM was evaluated on the datum and testing sets for each LC stage category. Sensitivity values ranged from 66.7 to 85.7%, from 77.8 to 100%, and from 70 to 80% for LC stages I, II, and III, respectively. Specificity values ranged from 79.3 to 90.0%.
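The sensitivity and specificity figures above depend on the probability cutoff applied to the model output. The sketch below shows, under stated assumptions, how sensitivity, specificity, and balanced accuracy can be computed at a given cutoff and how a cutoff can be chosen from a list of candidates. The labels and probabilities are toy values, the function names are hypothetical, and treating a probability at or above the cutoff as "LC" is an assumption rather than a detail confirmed by the abstract.

```python
def confusion_rates(y_true, p_cancer, threshold):
    """Return (sensitivity, specificity) when p >= threshold is called LC.

    y_true uses 1 for LC and 0 for healthy (an assumption of this sketch).
    """
    tp = sum(1 for y, p in zip(y_true, p_cancer) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, p_cancer) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, p_cancer) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, p_cancer) if y == 0 and p >= threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

def balanced_accuracy(y_true, p_cancer, threshold):
    """Mean of sensitivity and specificity at the given cutoff."""
    sensitivity, specificity = confusion_rates(y_true, p_cancer, threshold)
    return (sensitivity + specificity) / 2

def best_threshold(y_true, p_cancer, candidates):
    """Candidate cutoff with the highest balanced accuracy (first one on ties)."""
    return max(candidates, key=lambda t: balanced_accuracy(y_true, p_cancer, t))

# Toy data (made up): four LC subjects followed by four healthy subjects.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.2, 0.5, 0.15, 0.1, 0.05]

sens, spec = confusion_rates(labels, probabilities, 0.35)
chosen = best_threshold(labels, probabilities, [0.2, 0.35, 0.5])
```

The toy data are not meant to reproduce the reported cutoff; in practice the balanced accuracy would be averaged over resampled or held-out data before a cutoff is fixed.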

Discussion: Thus, the CM has good diagnostic value and does not require clinical or other data on tumor-associated biomarkers. The method has several advantages for future clinical implementation as a decision-support system: the CM requires data exclusively from NGS analysis of blood plasma cfDNA fragmentation; its accuracy does not depend on any additional clinical data; it is highly interpretable and traceable; and it has a modular architecture.

Keywords: cancer early detection; cfDNA; circulating cell-free DNA; diagnostic classification model; fragmentome; lung cancer; machine learning methods.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. A flowchart of the algorithm for bioinformatics processing of NGS data on the cfDNA fragmentome.
Figure 2. A flowchart of the algorithm for statistical processing of data on the cfDNA fragmentome.
Figure 3. Results of the selection of hyperparameters λ and σ² (second phase of machine learning).
Figure 4. Results of the selection of the threshold value (pi-value). The optimal cutoff value is 0.35, which gives the highest average balanced accuracy.
Figure 5. The location of the training dataset observations in the coordinates of the principal components and the result of testing the classification model on the datum and testing datasets. The coordinate axes of the scatter diagram correspond to the principal components obtained during the calculation of the phase II model. The points correspond to patients. Panels from left to right: training, datum, and testing datasets. The top row of panels shows the results obtained by the phase II model without the cutoff value of 0.35. The middle row shows the quality of the phase II model when the threshold value (pthr) is applied. The bottom row shows the performance of the phase II model in diagnosing lung cancer by disease stage.
