Nat Commun. 2024 Nov 21;15(1):10090.
doi: 10.1038/s41467-024-53851-9.

Deep generative AI models analyzing circulating orphan non-coding RNAs enable detection of early-stage lung cancer

Mehran Karimzadeh et al. Nat Commun.

Abstract

Liquid biopsies have the potential to revolutionize cancer care through non-invasive early detection of tumors. Developing a robust liquid biopsy test requires collecting high-dimensional data from a large number of blood samples across heterogeneous groups of patients. We propose that the generative capability of variational auto-encoders enables learning a robust and generalizable signature of blood-based biomarkers. In this study, we analyze orphan non-coding RNAs (oncRNAs) from serum samples of 1050 individuals diagnosed with non-small cell lung cancer (NSCLC) at various stages, as well as sex-, age-, and BMI-matched controls. We demonstrate that our multi-task generative AI model, Orion, surpasses commonly used methods in both overall performance and generalizability to held-out datasets. Orion achieves an overall sensitivity of 94% (95% CI: 87%-98%) at 87% (95% CI: 81%-93%) specificity for cancer detection across all stages, outperforming the sensitivity of other methods on held-out validation datasets by approximately 30% or more.
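The reported operating point pairs a sensitivity with a specificity, and the Fig. 2 caption notes that threshold-dependent metrics use a cutoff fixed at 90% specificity on the training data and then applied to the held-out set. The following is a minimal sketch of that kind of calculation, not the authors' code. The function names, the toy scores, and the convention that a sample is called positive when its score is strictly above the cutoff are all assumptions made for illustration.

```python
import math


def cutoff_at_specificity(control_scores, target_specificity=0.90):
    """Return a score cutoff such that at least `target_specificity` of the
    given control scores fall at or below it (i.e. are called negative).

    Assumption: higher score means "more likely cancer".
    """
    ordered = sorted(control_scores)
    # Number of controls that must be called negative; round() guards
    # against floating-point noise before taking the ceiling.
    needed = max(1, math.ceil(round(target_specificity * len(ordered), 9)))
    return ordered[needed - 1]


def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN. Labels: 1 = cancer, 0 = control.
    A sample is called positive when its score is strictly above the cutoff."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > cutoff
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, F1 score, and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "mcc": mcc}


# Toy data (hypothetical): scores of training-set controls only.
train_control_scores = [0.05, 0.10, 0.15, 0.20, 0.25,
                        0.30, 0.35, 0.40, 0.45, 0.90]
cutoff = cutoff_at_specificity(train_control_scores, 0.90)

# Toy held-out set (hypothetical): the cutoff is frozen, not re-estimated.
val_scores = [0.20, 0.50, 0.30, 0.10, 0.80, 0.70, 0.40, 0.95]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
counts = confusion_counts(val_scores, val_labels, cutoff)
results = binary_metrics(*counts)
```

Freezing the cutoff on training data is why the held-out specificity can drift away from the nominal 90% target, as it does in the reported 87% figure.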

Conflict of interest statement

Competing interests: The authors are either employees, shareholders, or stock option holders of Exai Bio, Inc. B.A., H.G., F.H., and M.K. have a pending patent application (U.S. Patent “Systems and Methods for Early-Stage Cancer Detection and Subtyping” Application Serial No. 18/636,128 and International Application No. PCT/US24/24682) related to this work.

Figures

Fig. 1. oncRNA-based liquid biopsy platform and Orion architecture.
a We discovered NSCLC oncRNAs from TCGA tissue datasets and investigated them in the blood of patients with NSCLC and non-cancer controls. We showed an analogy depicting NSCLC oncRNA fingerprint as a hand-written digit, serum oncRNA fingerprint as a noisy pattern, and generative AI embeddings as a denoised version. Created in BioRender. Alipanahi, B. (2024) BioRender.com/b61n795. b Orion architecture requires two input count matrices for oncRNAs (x) and endogenous expressed RNAs (r). Each input is fed to a standard VAE where the objective is to learn a joint representation of oncRNA counts under a zero-inflated negative binomial distribution (right). A joint embedding will be used by the cancer inference neural network for classification tasks (bottom right). c Schematic of triplet margin loss application on simulated data. The left panel shows a label-agnostic embedding, and the right panel shows an embedding with a triplet margin loss constraint to minimize technical variations while preserving biological differences. For each sample, we use positive anchors (same phenotype, different dataset) and negative anchors (different phenotype, any dataset) to minimize or maximize the embedding distance, respectively. d Loss convergence plots show convergence of 5 of the losses of Orion as well as classification accuracy during training.
Fig. 2. Model performance on training and validation set.
a The ROC plot on the tuning set of 10 non-overlapping folds of model training for Orion (red), XGBoost (blue), and SVM classifier (green). The vertical blue line shows specificity at 90%. The text shows the area under ROC and sensitivity at 90% specificity with 95% confidence intervals. b Sensitivity of the model for tumors of different cancer stages at 90% specificity for Orion (red), XGBoost (blue), and SVM classifier (green). Error bars indicate the 95% confidence interval. The bar plot shows the number of samples in each category. c Sensitivity of the model stratified by T score (size) similar to (b). d Performance measures of binary classification in the held-out validation dataset. We computed all threshold-dependent metrics (all except area under ROC) based on the cutoff resulting in 90% specificity in the 10-fold cross-validated training dataset. The bar height shows the point estimate of area under ROC, F1 score, Matthews correlation coefficient (MCC), sensitivity, and specificity. e Bar plot shows log1p of SHAP score (x-axis) for the top 20 oncRNAs (y-axis). Y-axis labels indicate the nearest gene to the oncRNA. The first row shows the sum of the next 20 oncRNAs (oncRNAs ranked 21st to 40th by their SHAP score). For gene A, [A] indicates overlap, []A indicates 1 kbp distance, [] − A indicates 10 kbp distance, [] − − A indicates 100 kbp, and [] indicates no genes within 1 Mbp distance.
Fig. 3. Ablation of Orion components.
a Area under the ROC of 5 different models when comparing scores of the control samples with respect to the sample supplier. b Area under ROC (top panel) and cross entropy loss (bottom panel) for cancer detection as a function of the number of samples used during training. Orange shows Orion with generative sampling for computation of cross-entropy loss during training, and purple shows Orion without this feature. c Scatter plots overlaid with kernel density estimates show cancer (blue) and control (orange) samples based on the first two principal components of Orion’s embedding space in 4 different conditions. d Test-set cross entropy loss of the same models.
Fig. 4. Orion allows distinguishing tumor subtypes from the oncRNA profiles of the blood.
a ROC plot of Orion for distinguishing squamous cell carcinoma from adenocarcinoma among stage III/IV NSCLC samples. b Confusion matrix of Orion’s subtype prediction at 70% specificity cutoff.
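The Fig. 1 caption names two building blocks of Orion: a zero-inflated negative binomial (ZINB) distribution over oncRNA counts, and a triplet margin loss that pulls together samples of the same phenotype from different datasets while pushing apart samples of different phenotypes. The sketch below illustrates both in plain Python under stated assumptions; it is not the authors' implementation. The mean and inverse-dispersion parameterization of the negative binomial, the Euclidean distance, the default margin of 1.0, and all function names are assumptions chosen for illustration.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with mean `mu`
    and inverse dispersion `theta` (an assumed parameterization)."""
    return (math.lgamma(k + theta) - math.lgamma(k + 1) - math.lgamma(theta)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def zinb_log_pmf(k, mu, theta, pi):
    """Log-probability of count k under a zero-inflated negative binomial.

    `pi` is the probability of an extra ("structural") zero, with 0 <= pi < 1.
    A zero can come from either the inflation component or the NB itself;
    any positive count can only come from the NB component.
    """
    nb = nb_log_pmf(k, mu, theta)
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    In the caption's terms, `positive` would be a sample with the same
    phenotype from a different dataset, and `negative` a sample with a
    different phenotype from any dataset.
    """
    return max(euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin, 0.0)


# Toy check (hypothetical parameters): ZINB probabilities sum to about 1.
zinb_total = sum(math.exp(zinb_log_pmf(k, mu=2.0, theta=1.5, pi=0.3))
                 for k in range(2000))

# Toy 2-D embeddings (hypothetical).
loss_far_negative = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
loss_near_negative = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, so only triplets that violate the margin contribute to training.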
