Diagnostics of lung cancer by fragmentated blood circulating cell-free DNA based on machine learning methods
- PMID: 39944482
- PMCID: PMC11813899
- DOI: 10.3389/fmed.2025.1435428
Abstract
Introduction: Minimally invasive diagnostics based on liquid biopsy makes early detection of lung cancer (LC) possible. Fragments of circulating cell-free DNA (cfDNA) in blood plasma reflect genome and chromatin status and are considered integral cancer biomarkers and biological entities for 'cancer-of-origin' prediction. The aim of this work is to create a method for processing next-generation sequencing (NGS) data and an interpretable binary classification model (CM) that analyzes cfDNA fragmentation features to distinguish healthy subjects from subjects with LC.
Methods: The study included 148 healthy subjects and 138 subjects with LC. cfDNA fractions isolated from blood plasma biospecimens were used for DNA library preparation and NGS on the Illumina NovaSeq 6000 system with a coverage of 100 million reads/sample. Genomic fragmentation was characterized by 12 variables describing the abundance and length distribution of cfDNA fragments within each genomic interval, and by 40 variables based on the values of position-weight matrices describing combinations of 5-bp-long terminal motifs of cfDNA fragments. The classification models of the first phase of machine learning were based either on logistic regression with L1 and L2 regularization or on Gaussian processes (probabilistic CMs). The second-phase CM was based on kernel logistic regression.
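As an illustration of the terminal-motif features mentioned above, the following is a minimal sketch of how a position-weight matrix over the first 5 bases of each fragment could be computed. The abstract does not specify the authors' exact construction, so the function name `terminal_pwm`, the handling of short or ambiguous reads, and the toy sequences are all assumptions made for this example.

```python
def terminal_pwm(fragments, k=5, alphabet="ACGT"):
    """Per-position base frequencies over the first k bases of each fragment.

    Hypothetical helper for illustration only; not the authors' pipeline.
    Fragments shorter than k or containing bases outside `alphabet` are skipped.
    """
    counts = [{base: 0 for base in alphabet} for _ in range(k)]
    n_used = 0
    for seq in fragments:
        motif = seq[:k].upper()
        if len(motif) < k or any(base not in alphabet for base in motif):
            continue
        for position, base in enumerate(motif):
            counts[position][base] += 1
        n_used += 1
    if n_used == 0:
        return []
    # Convert counts to frequencies so each column sums to 1.
    return [{base: c / n_used for base, c in column.items()} for column in counts]


# Toy fragments, made up for illustration (not study data).
toy_fragments = ["CCCAGTTGA", "CCCAGAAAT", "GGTACCCTA", "ACN"]
pwm = terminal_pwm(toy_fragments)
```

Here `pwm` is a list of 5 dictionaries, one per terminal position, each mapping a base to its frequency among the usable fragments. Values derived from such a matrix could then serve as input variables for a classifier.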
Results: The final CM distinguishes healthy subjects from subjects with LC with AUC values of 0.872-0.875. The performance of the developed CM was evaluated on the validation and testing sets for each LC stage category. Sensitivity values ranged from 66.7 to 85.7%, from 77.8 to 100%, and from 70 to 80% for LC stages I, II, and III, respectively. Specificity values ranged from 79.3 to 90.0%.
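For readers less familiar with the metrics reported above, the sketch below shows how sensitivity and specificity follow from binary labels and predictions. The function name and the toy labels are assumptions for illustration; they do not reproduce the study's data or its evaluation code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = LC and 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # LC correctly detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # LC missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy flagged as LC
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels, made up for illustration (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Computing sensitivity separately for each stage, as in the reported results, would amount to restricting the positive cases to subjects of that stage while specificity is still computed on the healthy subjects.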
Discussion: The CM has good diagnostic value and does not require clinical or other data on tumor-associated biomarkers. The method for LC detection has several advantages for future clinical implementation as a decision-support system: the CM requires data exclusively from NGS analysis of blood plasma cfDNA fragmentation; its accuracy does not depend on any additional clinical data; it is highly interpretable and traceable; and it has an appropriate modular architecture.
Keywords: cancer early detection; cfDNA; circulating cell-free DNA; diagnostic classification model; fragmentome; lung cancer; machine learning methods.
Copyright © 2025 Meshkov, Koturgin, Ershov, Safonova, Remizova, Maksyutina, Maralova, Astafieva, Ivashechkin, Ignatiev, Makhotenko, Snigir, Makarov, Yudin, Keskinov, Yudin, Makarova and Skvortsova.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
